race leads to the same floating ip being associated to separate ports

Bug #1581220 reported by Armando Migliaccio
Affects: neutron | Status: Expired | Importance: High | Assigned to: Unassigned

Bug Description

It looks like the steps to create a server as outlined in [2] lead to the same FIP being associated to two separate ports [1] (offending FIP: f1169479-7d88-4526-a60a-82f18b617077). It appears that two interleaving calls that associate a FIP by port-id end up clashing. This has been reproduced on Kilo, but we need to find out whether the race is still possible in newer versions.

[1] https://gist.github.com/cloudnull/d90eff7efc524844d8d3504fea8fa419
[2] https://gist.github.com/cloudnull/2f3931e51062dc7b325faf9a24974006
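For illustration, the suspected interleaving boils down to something like the following (a minimal sketch using python-neutronclient directly; the port IDs and credentials are placeholders, not values from the gists, and in the real report the two calls come from separate threads):

    # Two callers racing to associate the same FIP by port-id; both requests
    # are accepted, since the API simply re-points the floating IP.
    from neutronclient.v2_0 import client as neutron_client

    neutron = neutron_client.Client(username='...', password='...',
                                    tenant_name='...', auth_url='...')
    FIP_ID = 'f1169479-7d88-4526-a60a-82f18b617077'

    neutron.update_floatingip(FIP_ID, {'floatingip': {'port_id': 'port-uuid-a'}})
    neutron.update_floatingip(FIP_ID, {'floatingip': {'port_id': 'port-uuid-b'}})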

Tags: neutron
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :
summary: - same floating ip can be associated to separate ports
+ race leads to the same floating ip being associated to separate ports
Changed in neutron:
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Monty Taylor (mordred) wrote :

Looking at the output gist, search for f1169479-7d88-4526-a60a-82f18b617077 and you can see two entries that differ from each other yet carry the same UUID.

description: updated
description: updated
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Provisionally adding Nova until we have a better idea of what's going on.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

We need some more info. The Neutron DB data model has no way to allow two floating IPs with the same UUID, so it must be one floating IP. And a floating IP has only a single column for the associated port ID, so the DB can only hold one association.

Is this using a third-party plugin or anything that bypasses the DB? Can you see this in the direct output of 'neutron floatingip-list', without using shade?
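For reference, a simplified sketch of the relevant part of the FloatingIP model (roughly what Kilo's neutron/db/l3_db.py defines, from memory; status and other columns omitted):

    # The UUID primary key comes from HasId, so two rows cannot share an id,
    # and fixed_port_id is a single column, so a row can only ever point at
    # one associated port.
    import sqlalchemy as sa
    from neutron.db import model_base, models_v2

    class FloatingIP(model_base.BASEV2, models_v2.HasId):
        floating_ip_address = sa.Column(sa.String(64), nullable=False)
        floating_network_id = sa.Column(sa.String(36), nullable=False)
        floating_port_id = sa.Column(sa.String(36), sa.ForeignKey('ports.id'),
                                     nullable=False)
        fixed_port_id = sa.Column(sa.String(36), sa.ForeignKey('ports.id'))
        fixed_ip_address = sa.Column(sa.String(64))
        router_id = sa.Column(sa.String(36), sa.ForeignKey('routers.id'))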

Revision history for this message
Kevin Carter (kevin-carter) wrote :

The environment we tested this with is running OpenStack Kilo from source without any third-party plugins. For Neutron, we're running the following code [0], using the LinuxBridge agent for our L3 services. The deployment has 3 Neutron servers and 3 Neutron agents. The reference architecture follows what is outlined here: [0.1].

While I can recreate this issue using shade with the multithreaded script in the gist [1], I'm not able to do so using Heat [2], which performs similar operations with the same number of instances. It's also worth noting that if I modify the shade script to add a random sleep before the task execution [1.1], all of the instances come online with a floating IP; this is the expected outcome. The fact that a random sleep before the process execution allows everything to complete without issue leads me to believe a race condition exists somewhere in Neutron; however, I've not been able to prove that at this time.

[0] - https://github.com/openstack/openstack-ansible/blob/11.2.15/playbooks/defaults/repo_packages/openstack_services.yml#L76-L91
[0.1] - http://docs.openstack.org/developer/openstack-ansible/install-guide/targethosts-network.html
[1] - https://gist.github.com/cloudnull/2f3931e51062dc7b325faf9a24974006
[1.1] - https://gist.github.com/cloudnull/2f3931e51062dc7b325faf9a24974006#file-test-shade-py-L47
[2] - https://gist.github.com/cloudnull/36c92e0da5e61b13510560ae15227453
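The workaround amounts to something like the following (a rough sketch, not the actual gist; the cloud, image, and flavor names are placeholders):

    # Staggering the threads with a random sleep keeps their floating IP
    # selection steps from interleaving.
    import random
    import threading
    import time

    import shade

    cloud = shade.openstack_cloud(cloud='mycloud')   # placeholder cloud name
    image = cloud.get_image('cirros')                # placeholder image
    flavor = cloud.get_flavor('m1.tiny')             # placeholder flavor

    def build_server(name):
        time.sleep(random.uniform(0, 10))            # the random sleep
        cloud.create_server(name=name, image=image, flavor=flavor,
                            auto_ip=True, wait=True)

    threads = [threading.Thread(target=build_server, args=('vm-%d' % i,))
               for i in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()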

Revision history for this message
Kevin Carter (kevin-carter) wrote :

I created a second entry on the shade script gist to show the random sleep being used, which allows everything to work as expected [ https://gist.github.com/cloudnull/2f3931e51062dc7b325faf9a24974006 ].

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Can you confirm with 'neutron floatingip-list' output that it's duplicated there? I'm wondering if there is something internal in neutronclient that is racy, since shade uses a single instance of it.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Can someone explain to me how shade prevents a race condition when running concurrently? I see that it pulls the first floating IP from a list of available IPs [1], then _neutron_attach_ip_to_server just associates that floating IP with the server's port [2].

If multiple threads are executing concurrently, they will all use the same floating IP that they each think is available, all of their association requests will succeed (a floating IP is simply re-associated if it is already associated to a port), and only one server will end up with the floating IP.

1. https://github.com/openstack-infra/shade/blob/d026cab5c7548517aa15e969fdd336b9fc663e7d/shade/openstackcloud.py#L3843
2. https://github.com/openstack-infra/shade/blob/d026cab5c7548517aa15e969fdd336b9fc663e7d/shade/openstackcloud.py#L3744-L3746
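To make the interleaving concrete, a rough sketch of the selection race (simplified and hypothetical; not shade's actual code, which lives at [1] and [2] above):

    # Every thread reads the same "available" list and picks the same FIP
    # before any of them has had a chance to associate it.
    import threading
    from neutronclient.v2_0 import client as neutron_client

    neutron = neutron_client.Client(username='...', password='...',
                                    tenant_name='...', auth_url='...')

    def attach_first_available_fip(server_port_id):
        fips = neutron.list_floatingips()['floatingips']
        available = [f for f in fips if not f['port_id']]
        fip = available[0]                          # same FIP in every thread
        neutron.update_floatingip(                  # re-association succeeds
            fip['id'], {'floatingip': {'port_id': server_port_id}})

    ports = ['port-uuid-a', 'port-uuid-b']          # hypothetical port IDs
    threads = [threading.Thread(target=attach_first_available_fip, args=(p,))
               for p in ports]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Both requests succeed, but only the last writer keeps the floating IP.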

Changed in neutron:
assignee: nobody → Eugene Nikanorov (enikanorov)
Revision history for this message
Kevin Benton (kevinbenton) wrote :

From what I can tell, shade is the one associating the same floating IP to multiple ports. The Neutron API will happily accept each call as changing which port a floating IP is associated with.

Changed in neutron:
status: Confirmed → Incomplete
tags: added: neutron
Changed in neutron:
assignee: Eugene Nikanorov (enikanorov) → nobody
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

As Kevin pointed out, and as I can also see from the code [1] (both for the methods _add_auto_ip and _add_ip_from_pool), associating a floating IP with a server is a two-step process: an unassociated FIP is obtained, and it is then assigned to a fixed IP via its port UUID (in the Neutron case). The race seems to be where two separate threads get hold of the same FIP UUID, which can happen in two cases (as far as I can see): a) pulling a FIP from the list of available ones; b) creating a fresh one.

I wonder if the source of contention is in the a) code path, and whether there's a chance to protect this critical region on the client side. Two separate client requests should lead to two separate FIPs being created, and thus to separate associations and no contention.

[1] https://github.com/openstack-infra/shade/blob/d026cab5c7548517aa15e969fdd336b9fc663e7d/shade/openstackcloud.py#L3955
[2] https://github.com/openstack-infra/shade/blob/d026cab5c7548517aa15e969fdd336b9fc663e7d/shade/openstackcloud.py#L3296
[3] https://github.com/openstack-infra/shade/blob/d026cab5c7548517aa15e969fdd336b9fc663e7d/shade/openstackcloud.py#L3304
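Client-side protection of the a) path could look roughly like this (a hypothetical sketch, not a proposed shade patch; the helper name, IDs, and credentials are made up):

    # Hold a process-wide lock around "pick or create a FIP" so two threads
    # can never select the same unassociated floating IP.
    import threading
    from neutronclient.v2_0 import client as neutron_client

    neutron = neutron_client.Client(username='...', password='...',
                                    tenant_name='...', auth_url='...')
    _fip_lock = threading.Lock()

    def associate_fip(port_id, floating_network_id):
        with _fip_lock:
            # case a): pull from the pool of available FIPs, but only while
            # holding the lock.
            available = [f for f in neutron.list_floatingips()['floatingips']
                         if not f['port_id']]
            if available:
                fip = available[0]
            else:
                # case b): create a fresh, unassociated FIP for this server.
                fip = neutron.create_floatingip(
                    {'floatingip':
                     {'floating_network_id': floating_network_id}})['floatingip']
            neutron.update_floatingip(
                fip['id'], {'floatingip': {'port_id': port_id}})
        return fip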

no longer affects: nova
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired