Neutron integrated with OpenVSwitch drops packets and fails to plug/unplug interfaces from OVS on router interfaces at scale

Bug #1749425 reported by James Hebden
This bug affects 1 person
Affects                Status     Importance  Assigned to  Milestone
neutron                New        Undecided   Unassigned
openvswitch (Ubuntu)   Invalid    Undecided   Unassigned

Bug Description

Description: Ubuntu 16.04.3 LTS
Release: 16.04
Linux 4.4.0-96-generic on AMD64
Neutron 2:10.0.4-0ubuntu2~cloud0 from Cloud Archive xenial-updates/ocata
OpenVSwitch 2.6.1-0ubuntu5.2~cloud0 from Cloud Archive xenial-updates/ocata

In an environment with three bare-metal Neutron gateway hosts, hosting upwards of 300 routers and approximately the same number of instances (typically one router per instance), we experience packet loss, and sometimes complete connectivity loss, on instances accessed via floating IPs. The problem is exacerbated by enabling L3HA, likely due to the increase in router namespaces to be scheduled and managed and the additional work of bringing up keepalived and monitoring the keepalived VIP.

Reducing the number of routers and rescheduling routers onto new hosts, which causes the routers to undergo a full recreation of their namespace and iptables rules and a replugging of their interfaces into OVS, corrects the packet loss or connectivity loss on impacted routers.

On the Neutron hosts in this environment, we have used systemtap to trace calls to kfree_skb, which reveals that the majority of dropped packets occur in the openvswitch module, notably on the br-int bridge. Inspecting the state of OVS shows many qtap interfaces that are no longer present on the Neutron host but are still plugged into OVS.
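
A minimal sketch of the kind of check used to spot these stale ports (assuming the integration bridge is br-int, as above; this is illustrative rather than captured output):

    # compare the ports OVS still tracks on br-int against devices present on the host
    for port in $(ovs-vsctl list-ports br-int); do
        ip link show "$port" >/dev/null 2>&1 || echo "still plugged into OVS but missing on host: $port"
    done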

Diagnostic outputs in following comments.

Revision history for this message
James Hebden (ec0) wrote :
Revision history for this message
James Hebden (ec0) wrote :
Revision history for this message
James Hebden (ec0) wrote :
Revision history for this message
Chris Sanders (chris.sanders) wrote :

I've subscribed Canonical Field Critical

Revision history for this message
James Page (james-page) wrote :

Would it be possible to get log files from /var/log/neutron and /var/log/openvswitch, as well as the syslog and auth.log files, from an impacted machine? I suspect that, due to the large number of namespaces being managed, something is simply taking too long to parse/sync/refresh, causing this issue.

Revision history for this message
Junien F (axino) wrote :

We're going to provide the logs, but neutron-l3-agent is being spammed with the errors at https://pastebin.canonical.com/p/PBPpJJkRgy/ for dozens of routers.

Does neutron support changing the "ha" property of a router live?

Revision history for this message
Junien F (axino) wrote :

This is happening on all 3 neutron-l3-agents.

Revision history for this message
James Page (james-page) wrote :

@axino

Yes, switching between ha/non-ha is supported so long as the router is in admin state false at the time of the change (which I believe is the case in this issue).

Some further observations: switching a router from ha mode to non-ha mode should remove all existing router agent hosting arrangements and then re-schedule the new non-ha router on an appropriate agent.

In the cloud experiencing this issue, I see only one remaining qrouter namespace, but it still has a keepalived and a state change monitor running, which I don't think is right.

I'm wondering whether, due to the sheer number of routers being hosted on each network node, the router never gets removed before being rescheduled back to the same node, resulting in the teardown being skipped in some way.

Revision history for this message
James Page (james-page) wrote :

Some sequencing:

unbind from agent message inbound:

2018-02-13 14:12:57.931 2065380 DEBUG neutron.agent.l3.agent [req-13456a6c-5583-4147-9b47-94026ea7f3b4 b327544aba2a482b9f12f1e6e615c394 9a4311b33381401fbc835c739981ce03 - - -] Got router removed from agent :{u'router_id': u'213b6544-ab4b-4e46-a5c6-5d8d587a0c6d'} router_removed_from_agent /usr/lib/python2.7/dist-packages/neutron/agent/l3/agent.py:419

some sort of update message:

2018-02-13 14:13:00.667 2065380 DEBUG neutron.agent.l3.agent [req-13456a6c-5583-4147-9b47-94026ea7f3b4 b327544aba2a482b9f12f1e6e615c394 9a4311b33381401fbc835c739981ce03 - - -] Got routers updated notification :[u'213b6544-ab4b-4e46-a5c6-5d8d587a0c6d'] routers_updated /usr/lib/python2.7/dist-packages/neutron/agent/l3/agent.py:409

and then we see a state change on the router (it goes to master), implying that no teardown has occurred since the first message above:

2018-02-13 14:14:33.113 2065380 DEBUG neutron.agent.l3.ha [-] Handling notification for router 213b6544-ab4b-4e46-a5c6-5d8d587a0c6d, state master enqueue /usr/lib/python2.7/dist-packages/neutron/agent/l3/ha.py:50
2018-02-13 14:14:33.113 2065380 INFO neutron.agent.l3.ha [-] Router 213b6544-ab4b-4e46-a5c6-5d8d587a0c6d transitioned to master

and then:

2018-02-13 14:14:40.380 2065380 DEBUG neutron.agent.l3.ha [-] Spawning metadata proxy for router 213b6544-ab4b-4e46-a5c6-5d8d587a0c6d _update_metadata_proxy /usr/lib/python2.7/dist-packages/neutron/agent/l3/ha.py:156

(I think the agent still thinks this is a HA router...)

and then:

2018-02-13 14:14:59.839 2065380 DEBUG neutron.agent.l3.ha [-] Updating server with HA routers states {'28015629-217b-4eec-b557-6f93a2bb0230': 'active', '237d2839-5687-4104-9270-ff974de59800': 'active', '266167d5-39e5-4d97-a3a8-45d4ddad9407': 'active', '213b6544-ab4b-4e46-a5c6-5d8d587a0c6d': 'active', '2367f6bd-02c1-4ef1-a642-fe54d916fe2e': 'active'} notify_server /usr/lib/python2.7/dist-packages/neutron/agent/l3/ha.py:177

and slightly after:

2018-02-13 14:18:08.275 2065380 ERROR neutron.agent.l3.router_info [-] 'NoneType' object has no attribute 'remove_vip_by_ip_address'

This seems to support the theory that the HA router never actually gets torn down before the new non-ha router is scheduled to the same network node, resulting in the agent not really knowing whether the router is Arthur or Martha (i.e. HA or non-HA).

Revision history for this message
James Page (james-page) wrote :

OK and now:

2018-02-13 14:18:08.281 2065380 DEBUG neutron.agent.l3.agent [-] Starting router update for 213b6544-ab4b-4e46-a5c6-5d8d587a0c6d, action 1, priority 0 _process_router_update /usr/lib/python2.7/dist-packages/neutron/agent/l3/agent.py:483

which I think is the actual processing of the message from the work queue from 14:12:57 (wow).

action 1 is a DELETE_ROUTER.

this fails:

2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent [-] Error while deleting router 213b6544-ab4b-4e46-a5c6-5d8d587a0c6d
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent Traceback (most recent call last):
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/agent.py", line 366, in _safe_router_removed
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent self._router_removed(router_id)
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/agent.py", line 385, in _router_removed
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent ri.delete()
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/ha_router.py", line 420, in delete
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent self.destroy_state_change_monitor(self.process_monitor)
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/ha_router.py", line 355, in destroy_state_change_monitor
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent pm = self._get_state_change_monitor_process_manager()
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/ha_router.py", line 326, in _get_state_change_monitor_process_manager
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent default_cmd_callback=self._get_state_change_monitor_callback())
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/ha_router.py", line 330, in _get_state_change_monitor_callback
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent ha_cidr = self._get_primary_vip()
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/ha_router.py", line 166, in _get_primary_vip
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent return self._get_keepalived_instance().get_primary_vip()
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent AttributeError: 'NoneType' object has no attribute 'get_primary_vip'
2018-02-13 14:18:09.959 2065380 ERROR neutron.agent.l3.agent
2018-02-13 14:18:09.960 2065380 DEBUG neutron.agent.l3.agent [-] Finished a router update for 213b6544-ab4b-4e46-a5c6-5d8d587a0c6d _process_router_update /usr/lib/python2.7/dist-packages/neutron/agent/l3/agent.py:513

Revision history for this message
James Page (james-page) wrote :

Supposition might be that ha_vr_id has been unset at this point in time?

    def _get_keepalived_instance(self):
        return self.keepalived_manager.config.get_instance(self.ha_vr_id)

Revision history for this message
James Page (james-page) wrote :

You can see an update message prior to the remove-from-agent processing with:

     "ha_vr_id": null,

Revision history for this message
James Page (james-page) wrote :

OK, so the original bug report was actually about packet loss in OVS; the network nodes have a large number of leftover tap devices tagged with segmentation id 4095, for which OVS reports "no such device".

The theory might be that this is DoSing the ovs-vswitchd process, resulting in some userspace datapath lossiness.
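
A quick way to enumerate the suspect ports (a minimal sketch; 4095 is the dead-VLAN tag mentioned above, and the log path assumes the Ubuntu default, so adjust as needed):

    # ports the OVS DB still carries with the dead VLAN tag
    ovs-vsctl --bare --columns=name find Port tag=4095
    # corresponding complaints from ovs-vswitchd about missing devices
    grep -c 'No such device' /var/log/openvswitch/ovs-vswitchd.log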

Revision history for this message
James Hebden (ec0) wrote :

@james-page, @axino -

Just a +1 that changing the HA property requires the router to be set admin-state down beforehand and brought back up afterwards to start the recreation of the router as HA.

We have seen various other side effects in Neutron/OVS environments, and specifically in the environment in question, such as:
* Missing interfaces inside qrouter namespaces (OVS taps)
* Missing iptables rules
* Missing floating IP aliases on OVS interfaces inside the qrouter namespaces
All of these are tasks performed during bringup of HA routers. We have seen fewer of these issues on non-HA routers, and whether the router is HA or not, rescheduling it or converting it from HA to non-HA (or vice versa) will rebuild it and, as a result, repair it.

I should also point out that at the time of these issues we have rarely observed high system load, but I do agree that the number of routers, and therefore the workload on both Neutron and OVS to orchestrate interface plugging/unplugging and namespace (and associated network stack plumbing) work, is much higher than in a typical environment. Having only three servers doing this work, rather than scaling horizontally, seems like it might be exposing bottlenecks in either Neutron or OVS when it comes to the orchestration of these tasks.

I'm not sure if you are seeing it in the logs provided, but the traceback below has also been common when this issue crops up, and shows an example of a task performed during the bringup of a router (the IPTablesManager initialisation) falling over.

2018-02-14 05:04:32.101 1352665 DEBUG neutron.agent.linux.utils [-] Exit code: 0 execute /usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py:158
2018-02-14 05:04:32.103 1352665 DEBUG neutron.agent.linux.iptables_manager [-] IPTablesManager.apply completed with success. 0 iptables commands were issued _apply_synchronized /usr/lib/python2.7/dist-packages/neutron/agent/linux/iptables_manager.py:576
2018-02-14 05:04:32.103 1352665 DEBUG oslo_concurrency.lockutils [-] Releasing semaphore "iptables-qrouter-43801324-72ce-469f-a628-a5c645041e30" lock /usr/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:228
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info [-] 'NoneType' object has no attribute 'remove_vip_by_ip_address'
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info Traceback (most recent call last):
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/dist-packages/neutron/common/utils.py", line 253, in call
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info return func(*args, **kwargs)
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/router_info.py", line 1115, in process
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info self.process_external()
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/router_info.py", line 890, in process_external
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info self._process_external_gateway...


Revision history for this message
James Hebden (ec0) wrote :

Correction: the `remove_vip_by_ip_address` failure occurs just after the IPTablesManager call completes successfully.

Revision history for this message
James Troup (elmo) wrote :

I've unsubscribed Canonical Field Critical as removing the 4095 tagged ports in OVS on all 3 Neutron Gateway nodes seems to have made the packet loss go away (for now?).
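
For reference, a minimal sketch of how such a cleanup could be done (not necessarily the exact commands that were run; each port should be confirmed as genuinely stale before deletion):

    # remove every br-int port still tagged with the dead VLAN 4095
    for port in $(ovs-vsctl --bare --columns=name find Port tag=4095); do
        ovs-vsctl --if-exists del-port br-int "$port"
    done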

Revision history for this message
Brian Haley (brian-haley) wrote :

You described a lot of issues in comment #14:

* Missing interfaces inside qrouter namespaces (OVS taps)
* Missing iptables rules
* Missing floating IP aliases on OVS interfaces inside the qrouter namespaces

Some of those might be fixed in master, especially the iptables one, and should have been cherry-picked to the stable branches, though probably only to Ocata. The "add floating ip" path should re-queue the message and retry in a second or two; if it doesn't, please see if there is a traceback and put the info here or in another bug.

There could also be something happening with keepalived where it's not getting things done, since it is managing the VIPs when HA is enabled.

Finally, regarding the traceback, I've never seen that before. My first thought is to sprinkle "if instance" checks in all those code paths, but maybe there's something else going on here that we should figure out. For example, if the initial creation of the instance failed and then a message came in to add a floating IP, returning without doing anything (the "not instance" case) isn't what we want to do. This would require some log examination to figure out exactly what happened.

Revision history for this message
Brian Haley (brian-haley) wrote :

And just as a follow-on to my previous comment: is it possible to test this with the latest code from the master branch? I know you're running packages from Ubuntu; I just didn't want to waste cycles tracking things down that might already be fixed and maybe just need a backport to a stable branch. That's not to say there isn't a new bug here, as there most likely is.

Revision history for this message
James Page (james-page) wrote :

Hi Brian

I'm working on a reproducer so this is a little easier to test; once I've reproduced using Ocata, I'll then re-test using the same process with pike and/or queens to see if things have improved.

Revision history for this message
James Page (james-page) wrote :

I've spent the last two days trying to reproduce these problems on an OpenStack Cloud with three gateway/network units, each with 8 cores and 16G of RAM; I have 99 routers configured on the cloud, plumbed into a shared external provider network, and I've been switching them between ha and non-ha configuration continually for the last 24 hours using:

   openstack router set --disable $router_id
   openstack router set --{no-}ha $router_id
   openstack router set --enable $router_id

Routers successfully switched between HA and non-HA mode, with keepalived, netns and ovs ports being tidied as appropriate.

We think we saw issues with OVS performance when the DB contained a large number of dangling OVS interfaces on br-int, so I have loaded one of my units up with 400 dangling OVS interfaces.
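
A minimal sketch of one way to create such dangling entries (the port names and the 4095 tag are illustrative; adding a port whose backing device does not exist leaves a "no such device" entry in the OVS DB):

    for i in $(seq 1 400); do
        ovs-vsctl add-port br-int fake-tap-$i -- set Port fake-tap-$i tag=4095
    done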

I'll run the switching again for a while and see if that exacerbates things sufficiently to cause the same issue again.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Hi folks, while trying to reproduce this behaviour myself, I think I've stumbled upon something interesting. I set up a test as follows and checked for errors at specific points. I have a 4-node setup (24 core/64G RAM) with 3 gateways and 1 compute. Neutron is configured with l3_ha enabled and max_agents_per_router set to 3:

 stage 1: created two projects with 1 router each (which gives two sets of keepalived, each with the same VR_ID (1)) and checked the keepalived logs: system load is minimal, and no re-elections were observed post-create.

 stage 2: scaled horizontally to 200 projects, each with 1 router (giving 200 routers, each with VR_ID 1 within its own network). System load is minimal, no re-elections were observed post-create, and all master-state routers were observed to be on the same host.

 stage 3: scaled one project vertically by creating 200 routers within the same project. As I started to get into the VR_70s, I started to see some of the extant routers get re-elected, e.g. "VRRP_Instance(VR_76) Received higher prio advert". If I run a tcpdump on one of my ha- interfaces inside a qrouter- namespace, I see a flood of "VRRPv2, Advertisement" with each VR_ID being advertised every 2s from the current master (as expected, since that's the default interval in neutron). The consequence of this is that neutron frequently has to catch up with keepalived (by running neutron-keepalived-state-change), which causes more traffic, and all without cause, since there is no need for these failovers to be occurring.

 Since the advert interval is configurable in neutron [1], I am going to go ahead and try changing it to see if that can stop these re-elections, but that seems a little hacky as a fix, so I'm just wondering if there's another way to mitigate these effects. I need to double-check the VRRP spec, but IIRC, since these advertisements are sent out by the master, raising the interval would affect how long it takes for a re-election to occur if the master dies (the spec says "(3 * Advertisement_Interval) + Skew_time"), and during that time VMs would be unreachable, so maybe there's another way.

[1] https://github.com/openstack/neutron/blob/b90ec94dc3f83f63bdb505ace1e4c272435c494b/neutron/conf/agent/l3/ha.py#L35
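
Concretely, the check and the change I plan to try look roughly like this (a minimal sketch: the router UUID and ha- port name are placeholders, the interval value is purely illustrative, and crudini is just one convenient way to edit the ini file; the option name is the one defined at [1]):

    # watch the VRRP advertisements inside a router namespace
    ip netns exec qrouter-<router-uuid> tcpdump -nni ha-<port-id> vrrp
    # raise the advertisement interval on each l3-agent and restart it
    crudini --set /etc/neutron/l3_agent.ini DEFAULT ha_vrrp_advert_int 5
    systemctl restart neutron-l3-agent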

James Page (james-page)
Changed in openvswitch (Ubuntu):
status: New → Incomplete
Revision history for this message
James Page (james-page) wrote :

Marking OVS task as invalid as it appears this is a neutron bug related to configuration of VRRP for HA routers.

Changed in openvswitch (Ubuntu):
status: Incomplete → Invalid