Neutron integrated with OpenVSwitch drops packets and fails to plug/unplug interfaces from OVS on router interfaces at scale

Bug #1749425 reported by James Hebden
This bug affects 1 person
Affects                Status     Importance  Assigned to  Milestone
neutron                New        Undecided   Unassigned
openvswitch (Ubuntu)   Invalid    Undecided   Unassigned

Bug Description

Description: Ubuntu 16.04.3 LTS
Release: 16.04
Linux 4.4.0-96-generic on AMD64
Neutron 2:10.0.4-0ubuntu2~cloud0 from Cloud Archive xenial-updates/ocata
OpenVSwitch 2.6.1-0ubuntu5.2~cloud0 from Cloud Archive xenial-updates/ocata

In an environment with three bare-metal Neutron gateway hosts, hosting upwards of 300 routers and approximately the same number of instances (typically one router per instance), we experience packet loss, and sometimes complete connectivity loss, on instances accessed via floating IPs. The problem is exacerbated by enabling L3HA, likely due to the increase in router namespaces to be scheduled and managed and the additional work of bringing up keepalived and monitoring the keepalived VIP.

Reducing the number of routers and rescheduling routers onto new hosts, which causes the routers to undergo a full recreation of their namespace and iptables rules and a replugging of their interfaces into OVS, corrects the packet loss or connectivity loss on impacted routers.

On the Neutron hosts in this environment, we have used systemtap to trace calls to kfree_skb, which reveals that the majority of dropped packets occur in the openvswitch module, notably on the br-int bridge. Inspecting the state of OVS shows many qtap interfaces that are no longer present on the Neutron host but are still plugged into OVS.
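
A minimal sketch of the kind of check used to spot these stale ports (assuming the integration bridge is br-int, as above; this is illustrative rather than captured output):

    # compare the ports OVS still tracks on br-int against devices present on the host
    for port in $(ovs-vsctl list-ports br-int); do
        ip link show "$port" >/dev/null 2>&1 || echo "still plugged into OVS but missing on host: $port"
    done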

Diagnostic outputs in following comments.

Revision history for this message
James Hebden (ec0) wrote :
Revision history for this message
James Hebden (ec0) wrote :
Revision history for this message
James Hebden (ec0) wrote :
Revision history for this message
Chris Sanders (chris.sanders) wrote :

I've subscribed Canonical Field Critical

Revision history for this message
James Page (james-page) wrote :

Would it be possible to get log files from /var/log/neutron and /var/log/openvswitch, as well as the syslog and auth.log files, from an impacted machine? I suspect that, due to the large number of namespaces being managed, something is simply taking too long to parse/sync/refresh, causing this issue.

Revision history for this message
Junien F (axino) wrote :

We're going to provide the logs, but neutron-l3-agent is being spammed with the errors at https://pastebin.canonical.com/p/PBPpJJkRgy/ for dozens of routers.

Does neutron support changing the "ha" property of a router live?

Revision history for this message
Junien F (axino) wrote :

This is happening on all 3 neutron-l3-agents.

Revision history for this message
James Page (james-page) wrote :

@axino

Yes, switching between ha/non-ha is supported so long as the router is in admin state false at the time of the change (which I believe is the case in this issue).

Some further observations: switching a router from ha mode to non-ha mode should remove all existing router agent hosting arrangements and then re-schedule the new non-ha router on an appropriate agent.

In the cloud experiencing this issue, I see only one remaining qrouter namespace, but it still has a keepalived and a state change monitor running, which I don't think is right.

I'm wondering whether, due to the sheer number of routers being hosted on each network node, the router never gets removed before being rescheduled back to the same node, resulting in the teardown being skipped in some way.

Revision history for this message
James Page (james-page) wrote :

Some sequencing:

unbind from agent message inbound:

2018-02-13 14:12:57.931 2065380 DEBUG neutron.agent.l3.agent [req-13456a6c-5583-4147-9b47-94026ea7f3b4 b327544aba2a482b9f12f1e6e615c394 9a4311b33381401fbc835c739981ce03 - - -] Got router removed from agent :{u'router_id': u'213b6544-ab4b-4e46-a5c6-5d8d587a0c6d'} router_removed_from_agent /usr/lib/python2.7/dist-packages/neutron/agent/l3/agent.py:419

some sort of update message:

2018-02-13 14:13:00.667 2065380 DEBUG neutron.agent.l3.agent [req-13456a6c-5583-4147-9b47-94026ea7f3b4 b327544aba2a482b9f12f1e6e615c394 9a4311b33381401fbc835c739981ce03 - - -] Got routers updated notification :[u'213b6544-ab4b-4e46-a5c6-5d8d587a0c6d'] routers_updated /usr/lib/python2.7/dist-packages/neutron/agent/l3/agent.py:409

and then we see a state change on the router (it goes to master), implying that no teardown has occurred since the first message above:

2018-02-13 14:14:33.113 2065380 DEBUG neutron.agent.l3.ha [-] Handling notification for router 213b6544-ab4b-4e46-a5c6-5d8d587a0c6d, state master enqueue /usr/lib/python2.7/dist-packages/neutron/agent/l3/ha.py:50
2018-02-13 14:14:33.113 2065380 INFO neutron.agent.l3.ha [-] Router 213b6544-ab4b-4e46-a5c6-5d8d587a0c6d transitioned to master

and then:

2018-02-13 14:14:40.380 2065380 DEBUG neutron.agent.l3.ha [-] Spawning metadata proxy for router 213b6544-ab4b-4e46-a5c6-5d8d587a0c6d _update_metadata_proxy /usr/lib/python2.7/dist-packages/neutron/agent/l3/ha.py:156

(I think the agent still thinks this is a HA router...)

and then:

2018-02-13 14:14:59.839 2065380 DEBUG neutron.agent.l3.ha [-] Updating server with HA routers states {'28015629-217b-4eec-b557-6f93a2bb0230': 'active', '237d2839-5687-4104-9270-ff974de59800': 'active', '266167d5-39e5-4d97-a3a8-45d4ddad9407': 'active', '213b6544-ab4b-4e46-a5c6-5d8d587a0c6d': 'active', '2367f6bd-02c1-4ef1-a642-fe54d916fe2e': 'active'} notify_server /usr/lib/python2.7/dist-packages/neutron/agent/l3/ha.py:177

and slightly after:

2018-02-13 14:18:08.275 2065380 ERROR neutron.agent.l3.router_info [-] 'NoneType' object has no attribute 'remove_vip_by_ip_address'

This seems to support the theory that the HA router never actually gets torn down before the new non-ha router is scheduled to the same network node, resulting in the agent not really knowing whether the router is Arthur or Martha (i.e. HA or non-HA).

Revision history for this message
James Page (james-page) wrote :

OK and now:

2018-02-13 14:18:08.281 2065380 DEBUG neutron.agent.l3.agent [-] Starting router update for 213b6544-ab4b-4e46-a5c6-5d8d587a0c6d, action 1, priority 0 _process_router_update /usr/lib/python2.7/dist-packages/neutron/agent/l3/agent.py:483

which I think is the actual processing of the message from the work queue from 14:12:57 (wow).

action 1 is a DELETE_ROUTER.

this fails:

2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent [-] Error while deleting router 213b6544-ab4b-4e46-a5c6-5d8d587a0c6d
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent Traceback (most recent call last):
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/agent.py", line 366, in _safe_router_removed
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent self._router_removed(router_id)
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/agent.py", line 385, in _router_removed
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent ri.delete()
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/ha_router.py", line 420, in delete
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent self.destroy_state_change_monitor(self.process_monitor)
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/ha_router.py", line 355, in destroy_state_change_monitor
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent pm = self._get_state_change_monitor_process_manager()
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/ha_router.py", line 326, in _get_state_change_monitor_process_manager
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent default_cmd_callback=self._get_state_change_monitor_callback())
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/ha_router.py", line 330, in _get_state_change_monitor_callback
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent ha_cidr = self._get_primary_vip()
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/ha_router.py", line 166, in _get_primary_vip
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent return self._get_keepalived_instance().get_primary_vip()
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent AttributeError: 'NoneType' object has no attribute 'get_primary_vip'
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent
2018-02-13 14:18:09.960 2065380 DEBUG neutron.agent.l3.agent [-] Finished a router update for 213b6544-ab4b-4e46-a5c6-5d8d587a0c6d _process_router_update /usr/lib/python2.7/dist-packages/neutron/agent/l3/agent.py:513

Revision history for this message
James Page (james-page) wrote :

Supposition might be that ha_vr_id has been unset at this point in time?

    def _get_keepalived_instance(self):
        return self.keepalived_manager.config.get_instance(self.ha_vr_id)

Revision history for this message
James Page (james-page) wrote :

You can see an update message prior to the remove-from-agent processing with:

     "ha_vr_id": null,

Revision history for this message
James Page (james-page) wrote :

OK, so the original bug report was actually about packet loss in OVS; the network nodes have a large number of leftover tap devices tagged with segmentation id 4095, for which OVS reports "no such device".

The theory might be that this is DoSing the ovs-vswitchd process, resulting in some userspace datapath lossiness.
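
A quick way to enumerate the suspect ports (a minimal sketch; 4095 is the dead-VLAN tag mentioned above, and the log path assumes the Ubuntu default, so adjust as needed):

    # ports the OVS DB still carries with the dead VLAN tag
    ovs-vsctl --bare --columns=name find Port tag=4095
    # corresponding complaints from ovs-vswitchd about missing devices
    grep -c 'No such device' /var/log/openvswitch/ovs-vswitchd.log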

Revision history for this message
James Hebden (ec0) wrote :

@james-page, @axino -

Just a +1 that changing the HA property requires the router to be set admin-state down beforehand and brought back up afterwards to start the recreation of the router as HA.

We have seen various other side effects in Neutron/OVS environments, and specifically in the environment in question, such as:
* Missing interfaces inside qrouter namespaces (OVS taps)
* Missing iptables rules
* Missing floating IP aliases on OVS interfaces inside the qrouter namespaces
All of these are tasks performed during bringup of HA routers. We have seen fewer of these issues on non-HA routers, and whether the router is HA or not, rescheduling it or converting it from HA to non-HA (or vice versa) will rebuild it and, as a result, repair it.

I should also point out that at the time of these issues we have rarely observed high system load, but I do agree that the number of routers, and therefore the workload on both Neutron and OVS to orchestrate interface plugging/unplugging and namespace (and associated network stack plumbing) work, is much higher than in a typical environment. Having only three servers doing this work, rather than scaling horizontally, seems like it might be exposing bottlenecks in either Neutron or OVS when it comes to the orchestration of these tasks.

I'm not sure if you are seeing it in the logs provided, but the traceback below has also been common when this issue crops up, and shows an example of a task performed during the bringup of a router (the IPTablesManager initialisation) falling over.

2018-02-14 05:04:32.101 1352665 DEBUG neutron.agent.linux.utils [-] Exit code: 0 execute /usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py:158
2018-02-14 05:04:32.103 1352665 DEBUG neutron.agent.linux.iptables_manager [-] IPTablesManager.apply completed with success. 0 iptables commands were issued _apply_synchronized /usr/lib/python2.7/dist-packages/neutron/agent/linux/iptables_manager.py:576
2018-02-14 05:04:32.103 1352665 DEBUG oslo_concurrency.lockutils [-] Releasing semaphore "iptables-qrouter-43801324-72ce-469f-a628-a5c645041e30" lock /usr/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:228
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info [-] 'NoneType' object has no attribute 'remove_vip_by_ip_address'
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info Traceback (most recent call last):
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/dist-packages/neutron/common/utils.py", line 253, in call
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info return func(*args, **kwargs)
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/router_info.py", line 1115, in process
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info self.process_external()
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/router_info.py", line 890, in process_external
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info self._process_external_gateway...


Revision history for this message
James Hebden (ec0) wrote :

Correction: the `remove_vip_by_ip_address` failure occurs just after the IPTablesManager call completes successfully.

Revision history for this message
James Troup (elmo) wrote :

I've unsubscribed Canonical Field Critical as removing the 4095 tagged ports in OVS on all 3 Neutron Gateway nodes seems to have made the packet loss go away (for now?).
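
For reference, a minimal sketch of how such a cleanup could be done (not necessarily the exact commands that were run; each port should be confirmed as genuinely stale before deletion):

    # remove every br-int port still tagged with the dead VLAN 4095
    for port in $(ovs-vsctl --bare --columns=name find Port tag=4095); do
        ovs-vsctl --if-exists del-port br-int "$port"
    done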

Revision history for this message
Brian Haley (brian-haley) wrote :

You described a lot of issues in comment #14:

* Missing interfaces inside qrouter namespaces (OVS taps)
* Missing iptables rules
* Missing floating IP aliases on OVS interfaces inside the qrouter namespaces

Some of those might be fixed in master, especially the iptables one, and should have been cherry-picked to the stable branches, though probably only to Ocata. The "add floating ip" path should re-queue the message and retry in a second or two; if it doesn't, please see if there is a traceback and put the info here or in another bug.

There could also be something happening with keepalived where it's not getting things done, since it is managing the VIPs when HA is enabled.

Finally, regarding the traceback, I've never seen that before. My first thought is to sprinkle "if instance" checks in all those code paths, but maybe there's something else going on here that we should figure out. For example, if the initial creation of the instance failed and then a message came in to add a floating IP, returning without doing anything (the "not instance" case) isn't what we want to do. This would require some log examination to figure out exactly what happened.

Revision history for this message
Brian Haley (brian-haley) wrote :

And just as a follow-on to my previous comment: is it possible to test this with the latest code from the master branch? I know you're running packages from Ubuntu; I just didn't want to waste cycles tracking things down that might already be fixed and maybe just need a backport to a stable branch. That's not to say there isn't a new bug here, as there most likely is.

Revision history for this message
James Page (james-page) wrote :

Hi Brian

I'm working on a reproducer so this is a little easier to test; once I've reproduced using Ocata, I'll then re-test using the same process with pike and/or queens to see if things have improved.

Revision history for this message
James Page (james-page) wrote :

I've spent the last two days trying to reproduce these problems on an OpenStack Cloud with three gateway/network units, each with 8 cores and 16G of RAM; I have 99 routers configured on the cloud, plumbed into a shared external provider network, and I've been switching them between ha and non-ha configuration continually for the last 24 hours using:

   openstack router set --disable $router_id
   openstack router set --{no-}ha $router_id
   openstack router set --enable $router_id

Routers successfully switched between HA and non-HA mode, with keepalived, netns and ovs ports being tidied as appropriate.

We think we saw issues with OVS performance when the DB contained a large number of dangling OVS interfaces on br-int, so I have loaded one of my units up with 400 dangling OVS interfaces.
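
A minimal sketch of one way to create such dangling entries (the port names and the 4095 tag are illustrative; adding a port whose backing device does not exist leaves a "no such device" entry in the OVS DB):

    for i in $(seq 1 400); do
        ovs-vsctl add-port br-int fake-tap-$i -- set Port fake-tap-$i tag=4095
    done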

I'll run the switching again for a while and see if that exacerbates things sufficiently to cause the same issue again.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Hi folks, while trying to reproduce this behaviour myself, I think I've stumbled upon something interesting. I set up a test as follows and checked for errors at specific points. I have a 4-node setup (24 core/64G RAM) with 3 gateways and 1 compute. Neutron is configured with l3_ha enabled and max_agents_per_router set to 3:

 stage 1: created two projects with 1 router each (which gives two sets of keepalived, each with the same VR_ID (1)) and checked the keepalived logs: system load is minimal, and no re-elections were observed post-create.

 stage 2: scaled horizontally to 200 projects, each with 1 router (giving 200 routers, each with VR_ID 1 within its own network). System load is minimal, no re-elections were observed post-create, and all master-state routers were observed to be on the same host.

 stage 3: scaled one project vertically by creating 200 routers within the same project. As I started to get into the VR_70s, I started to see some of the extant routers get re-elected, e.g. "VRRP_Instance(VR_76) Received higher prio advert". If I run a tcpdump on one of my ha- interfaces inside a qrouter- namespace, I see a flood of "VRRPv2, Advertisement" with each VR_ID being advertised every 2s from the current master (as expected, since that's the default interval in neutron). The consequence of this is that neutron frequently has to catch up with keepalived (by running neutron-keepalived-state-change), which causes more traffic, and all without cause, since there is no need for these failovers to be occurring.

 Since the advert interval is configurable in neutron [1], I am going to go ahead and try changing it to see if that can stop these re-elections, but that seems a little hacky as a fix, so I'm just wondering if there's another way to mitigate these effects. I need to double-check the VRRP spec, but IIRC, since these advertisements are sent out by the master, raising the interval would affect how long it takes for a re-election to occur if the master dies (the spec says "(3 * Advertisement_Interval) + Skew_time"), and during that time VMs would be unreachable, so maybe there's another way.

[1] https://github.com/openstack/neutron/blob/b90ec94dc3f83f63bdb505ace1e4c272435c494b/neutron/conf/agent/l3/ha.py#L35
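
Concretely, the check and the change I plan to try look roughly like this (a minimal sketch: the router UUID and ha- port name are placeholders, the interval value is purely illustrative, and crudini is just one convenient way to edit the ini file; the option name is the one defined at [1]):

    # watch the VRRP advertisements inside a router namespace
    ip netns exec qrouter-<router-uuid> tcpdump -nni ha-<port-id> vrrp
    # raise the advertisement interval on each l3-agent and restart it
    crudini --set /etc/neutron/l3_agent.ini DEFAULT ha_vrrp_advert_int 5
    systemctl restart neutron-l3-agent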

James Page (james-page)
Changed in openvswitch (Ubuntu):
status: New → Incomplete
Revision history for this message
James Page (james-page) wrote :

Marking OVS task as invalid as it appears this is a neutron bug related to configuration of VRRP for HA routers.

Changed in openvswitch (Ubuntu):
status: Incomplete → Invalid