Neutron port leak when connection is dropped during port create in instance boot.

Bug #1324934 reported by Matthew Gilliard
This bug affects 4 people
Affects                          Status        Importance  Assigned to        Milestone
OpenStack Compute (nova)         Fix Released  High        Matthew Gilliard
  Icehouse                       Fix Released  Undecided   Unassigned
neutron                          Invalid       Undecided   Unassigned

Bug Description

Sometimes an instance fails to boot because the call to neutron to allocate a port fails. However, we see cases where this happens but the port is actually created and allocated to the now-deleted instance.

The same problem has been reported by hpcloud's internal monitoring tools and by the openstack-infra nodepool tenant. There seems to be a port leak.

Evidence
========

Sometimes instances fail to boot with the following error:

2014-05-27 00:09:23 ERROR : [NOV58] NovaServers.add Failed: OverLimit - Maximum number of ports exceeded (HTTP 413) (Request-ID: req-e05525c3-0876-4da4-8a81-8dcc3432b418) args('('SLAM_META_m1_az1_00_09_NC_TEMP', u'8c096c29-a666-4b82-99c4-c77dc70cfb40', u'100', 'metastmkey_m1_az1', 'metastm_m1_az1', u'ee7d6d37-d855-4d30-a67b-0d88a03e72fc', 'az1'),{}')

How did we run out of ports? Investigating further, starting with the neutron database:

  mysql> select * from ports where device_owner like 'comput%';

This gives a table which shows neutron ports and the instance uuids that they are allocated to (example: http://paste.openstack.org/show/82394/)

Matching neutron's `device_id` with `uuid` in nova's instances table, we found that approximately 50% of the ports were allocated to instances that had been deleted. As far as we know this must be a bug, as there is no way to create a port without linking it to an instance, and deleting an instance should delete its ports atomically.

The effect is that the user has unused ports counting toward their port quota, which will prevent them from booting instances when the quota is fully allocated.
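
For anyone who wants to repeat the cross-check without touching the databases directly, here is a minimal sketch using python-novaclient and python-neutronclient (illustrative only; it assumes credentials in the usual OS_* environment variables and the client signatures of this era):

    # Sketch: list neutron compute ports whose device_id no longer matches
    # any live nova instance. Credentials/endpoints here are assumptions.
    import os
    from neutronclient.v2_0 import client as neutron_client
    from novaclient import client as nova_client

    neutron = neutron_client.Client(username=os.environ['OS_USERNAME'],
                                    password=os.environ['OS_PASSWORD'],
                                    tenant_name=os.environ['OS_TENANT_NAME'],
                                    auth_url=os.environ['OS_AUTH_URL'])
    nova = nova_client.Client('2', os.environ['OS_USERNAME'],
                              os.environ['OS_PASSWORD'],
                              os.environ['OS_TENANT_NAME'],
                              os.environ['OS_AUTH_URL'])

    live_uuids = set(server.id for server in nova.servers.list())

    for port in neutron.list_ports()['ports']:
        # Only consider ports that claim to be bound to an instance.
        if port['device_owner'].startswith('compute:') \
                and port['device_id'] not in live_uuids:
            print('possibly leaked: port %s (device_id %s)'
                  % (port['id'], port['device_id']))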

Logs
====

The nova-compute log for an instance that fails to boot because of port starvation is not interesting here. However, we also have the case where an instance fails to boot with "Neutron error creating port" but a port is actually created:

nova-compute.log:
2014-05-28 08:08:53.413 16699 DEBUG neutronclient.client [-] throwing ConnectionFailed : [Errno 104] Connection reset by peer _cs_request /usr/lib/python2.7/dist-packages/neutronclient/client.py:153
2014-05-28 08:08:53.417 16699 ERROR nova.network.neutronv2.api [-] [instance: 2e479806-d13e-4d11-81c1-cc2244a26ef7] Neutron error creating port on network 63657422-b84f-4d2d-b7d2-765ac560546b

(fuller section of log: http://paste.openstack.org/show/82392/)

0.2s later, nova-compute.log:
2014-05-28 08:08:53.664 16699 DEBUG neutronclient.client [-] RESP:{'date': 'Wed, 28 May 2014 08:08:53 GMT', 'status': '200', 'content-length': '13', 'content-type': 'application/json', 'content-location': 'https://region-b.geo-1.network-internal.hpcloudsvc.com/v2.0/ports.json?tenant_id=10409882459003&device_id=2e479806-d13e-4d11-81c1-cc2244a26ef7'} {"ports": []}

(This is repeated once more 0.2s later. Slightly longer log section: http://paste.openstack.org/show/82395/)

But eventually the port is present in the neutron database:

+----------------+--------------------------------------+--------------------+--------------------------------------+-------------------+----------------+--------+--------------------------------------+--------------+
| tenant_id      | id                                   | name               | network_id                           | mac_address       | admin_state_up | status | device_id                            | device_owner |
+----------------+--------------------------------------+--------------------+--------------------------------------+-------------------+----------------+--------+--------------------------------------+--------------+
| 10409882459003 | 916cba73-8925-45a2-80e9-6e9d03e602c8 |                    | 63657422-b84f-4d2d-b7d2-765ac560546b | fa:16:3e:a8:7d:14 | 1              | ACTIVE | 2e479806-d13e-4d11-81c1-cc2244a26ef7 | compute:az2  |
+----------------+--------------------------------------+--------------------+--------------------------------------+-------------------+----------------+--------+--------------------------------------+--------------+

It looks like this port has been leaked by neutron. Our guess is that the "Failed to create port" error is spurious, caused by the neutronclient's connection being dropped. The port is in fact being created, but it takes some time, and during that time neutron reports that there are no ports on that instance, so nothing is cleaned up when the instance is deleted. Then the port details are actually written to the db and the port is leaked.
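
To make the suspected sequence concrete, this is a rough sketch of the cleanup step as we understand it (not the actual nova code; the function name is made up):

    # Illustrative sketch of the race, not real nova code. After the
    # "Neutron error creating port" the instance is torn down, and cleanup
    # lists the ports bound to the instance and deletes them.
    def cleanup_ports_after_failed_boot(neutron, instance_uuid):
        ports = neutron.list_ports(device_id=instance_uuid)['ports']
        # Race: the port from the failed create_port call may not be
        # visible yet, so this list can be empty even though the port
        # appears in the database moments later.
        for port in ports:
            neutron.delete_port(port['id'])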

OpenStack Infra's nodepool was recently unable to boot instances, and found several hundred ports in this state.

Solutions
=========

Neither nova nor neutron alone has enough information to determine which ports are leaked, so a periodic cleanup task in either of those two services would not be possible.

A user can free up their ports with a script like https://gist.github.com/moorryan/93fa4be65fc5ea60b3ed - and I think an operator could do the same. But there is a risk that instances/ports which are currently being created could be wrongly identified by this script, so care is needed.

Another option would be for neutron to synchronize get_ports calls with create_port (NB: I don't know the neutron codebase well enough to say how feasible this is).

Revision history for this message
Matthew Gilliard (matthew-gilliard-u) wrote :

This script, run as a logged-in user, will clean up ports allocated to deleted instances: https://gist.github.com/moorryan/93fa4be65fc5ea60b3ed But we should be preventing this from happening in the first place.

Revision history for this message
Matthew Gilliard (matthew-gilliard-u) wrote :

In the flow here, an instance that never makes it to ACTIVE is deleted immediately: https://github.com/openstack-infra/nodepool/blob/master/nodepool/nodepool.py#L328-L354

Our suspicion is that neutron reports a failure to create a port (which prevents the instance becoming ACTIVE), but then the port is actually created. The instance it's associated with has been deleted, though, so the port is orphaned at this point.

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

Hi Matthew, is there any evidence that the port was actually created even though neutron reported a failure?
Are any logs available for such a case? That seems quite strange.

Changed in neutron:
status: New → Incomplete
Aaron Rosen (arosen)
no longer affects: neutron
tags: added: network
Revision history for this message
Aaron Rosen (arosen) wrote :

I tried to reproduce this in master by doing:

nova boot --image cirros-0.3.2-x86_64-uec --flavor 1 --nic net-id=5cb38a33-9fca-4422-b19d-47fd1acbc651 asdf; nova delete asdf;

but I was unable to reproduce this with the nova-neutron event mechanism both enabled and disabled. What commit is the code you're running based on?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/96955

Changed in nova:
status: New → In Progress
Revision history for this message
Matthew Gilliard (matthew-gilliard-u) wrote : Re: port leak when instance is deleted before ACTIVE

Hello, sorry I didn't provide much detail in the initial bug report. I will update this morning.

description: updated
description: updated
description: updated
description: updated
description: updated
description: updated
summary: - port leak when instance is deleted before ACTIVE
+ Neutron port leak when connection is dropped during port create in
+ instance boot.
description: updated
description: updated
description: updated
Aaron Rosen (arosen)
Changed in nova:
importance: Undecided → High
assignee: nobody → Aaron Rosen (arosen)
Revision history for this message
Mark McLoughlin (markmc) wrote :

It is important to backport this to Icehouse because an Icehouse compute will start using this code path if you update your conductor to Juno.

tags: added: icehouse-backport-potential
Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

Is this still relevant for neutron?
I'll mark it as Incomplete in the meantime.

Changed in neutron:
status: New → Incomplete
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/98806

Changed in neutron:
assignee: nobody → Matthew Gilliard (matthew-gilliard-u)
status: Incomplete → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/96955
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5a364cb00a39b31a9d5ddb30b5cc24af54e89f4a
Submitter: Jenkins
Branch: master

commit 5a364cb00a39b31a9d5ddb30b5cc24af54e89f4a
Author: Andrew Laski <email address hidden>
Date: Fri May 30 17:52:53 2014 -0400

    Cleanup allocating networks when InstanceNotFound is raised

    Allocated networks were not being cleaned up if InstanceNotFound was
    raised after the allocation request was made. This ensures that the
    cleanup happens for this case.

    Also bypasses exception logging that should not occur in this case.

    Change-Id: Icd544752444415812714d91971e45b3ae2fb68ee
    Closes-Bug: 1324934
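
For readers following along, the shape of the fix is roughly this (a simplified, self-contained sketch, not the merged nova code; 'allocate'/'deallocate' stand in for nova's network allocation helpers):

    # Simplified sketch of the pattern the fix introduces, not the actual
    # nova change.
    class InstanceNotFound(Exception):
        """Stands in for nova.exception.InstanceNotFound."""

    def build_networks_with_cleanup(allocate, deallocate, instance_uuid):
        try:
            return allocate(instance_uuid)
        except InstanceNotFound:
            # The instance was deleted while the network request was in
            # flight; deallocate whatever was already created so the ports
            # are not leaked, then re-raise.
            deallocate(instance_uuid)
            raise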

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → juno-1
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Matthew Gilliard (<email address hidden>) on branch: master
Review: https://review.openstack.org/98806
Reason: Not sure the nova fix will be sufficient but Mark's argument about clustered neutrons is strong enough by itself for me to abandon.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/icehouse)

Fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/100573

Matt Riedemann (mriedem)
tags: removed: icehouse-backport-potential
Changed in neutron:
status: In Progress → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/icehouse)

Reviewed: https://review.openstack.org/100573
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d1c573652d213fe1ef960369a113ebce708e807b
Submitter: Jenkins
Branch: stable/icehouse

commit d1c573652d213fe1ef960369a113ebce708e807b
Author: Andrew Laski <email address hidden>
Date: Fri May 30 17:52:53 2014 -0400

    Cleanup allocating networks when InstanceNotFound is raised

    Allocated networks were not being cleaned up if InstanceNotFound was
    raised after the allocation request was made. This ensures that the
    cleanup happens for this case.

    Also bypasses exception logging that should not occur in this case.

    Conflicts:
     nova/tests/compute/test_compute_mgr.py

    NOTE(mriedem): The conflict is due to commit b1994711 which is on
    master but not stable/icehouse. It's part of a blueprint so that
    will not be backported.

    Change-Id: Icd544752444415812714d91971e45b3ae2fb68ee
    Closes-Bug: 1324934
    (cherry picked from commit 5a364cb00a39b31a9d5ddb30b5cc24af54e89f4a)

tags: added: in-stable-icehouse
Changed in neutron:
assignee: Matthew Gilliard (matthew-gilliard-u) → nobody
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/64769
Reason: this seems abandoned

Thierry Carrez (ttx)
Changed in nova:
milestone: juno-1 → 2014.2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/135260

Changed in nova:
assignee: Aaron Rosen (arosen) → Matthew Gilliard (matthew-gilliard-u)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/135260
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.
