openvswitch agent eating CPU, time spent in ip_conntrack.py

Bug #1750777 reported by Thomas Morin
38
This bug affects 6 people
Affects Status Importance Assigned to Milestone
Ubuntu Cloud Archive
Fix Released
High
Unassigned
Queens
Fix Released
High
Unassigned
neutron
Fix Released
High
Brian Haley
neutron (Ubuntu)
Fix Released
High
Unassigned
Bionic
Fix Released
High
Unassigned
Cosmic
Fix Released
High
Unassigned

Bug Description

We just ran into a case where the openvswitch agent (local dev destack, current master branch) eats 100% of CPU time.

Pyflame profiling show the time being largely spent in neutron.agent.linux.ip_conntrack, line 95.

https://github.com/openstack/neutron/blob/master/neutron/agent/linux/ip_conntrack.py#L95

The code around this line is:

        while True:
            pool.spawn_n(self._process_queue)

The documentation of eventlet.spawn_n says: "The same as spawn(), but it’s not possible to know how the function terminated (i.e. no return value or exceptions). This makes execution faster. See spawn_n for more details." I suspect that GreenPool.spawn_n may behave similarly.

It seems plausible that spawn_n is returning very quickly because of some error, and then all time is quickly spent in a short circuited while loop.

SRU details for Ubuntu:
-----------------------
[Impact]
We're cherry-picking a single bug-fix patch here from the upstream stable/queens branch as there is not currently an upstream stable point release available that includes this fix. We'd like to make sure all of our supported customers have access to this fix as there is a significant performance hit without it.

[Test Case]
The following SRU process was followed:
https://wiki.ubuntu.com/OpenStackUpdates

In order to avoid regression of existing consumers, the OpenStack team will run their continuous integration test against the packages that are in -proposed. A successful run of all available tests will be required before the proposed packages can be let into -updates.

The OpenStack team will be in charge of attaching the output summary of the executed tests. The OpenStack team members will not mark ‘verification-done’ until this has happened.

[Regression Potential]
In order to mitigate the regression potential, the results of the
aforementioned tests are attached to this bug.

description: updated
Revision history for this message
Thomas Morin (tmmorin-orange) wrote :

The bug is in the code introduced by https://review.openstack.org/#/c/537654 .

I believe the importance of this bug should perhaps be raised.

Revision history for this message
Brian Haley (brian-haley) wrote :

I couldn't reproduce this with a quick stack, will try again later. I did create a test patch to test a theory, don't know if it will fix the issue however.

Changed in neutron:
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/548976

Changed in neutron:
assignee: nobody → Brian Haley (brian-haley)
status: New → In Progress
Revision history for this message
Thomas Morin (tmmorin-orange) wrote :

I've looked at the code, and unless I read the code too quickly (which is honestly quite a possibility given the amount of snow and travelling plans clogging in my head), I think we are having a race condition:

(code at [1]:)

        while not self._queue.empty():
            update = self._queue.get()
            yield update

The race is:
- thread A entering the while because queue is not empty
- (thread A hasn't done queue.get() yet)
- thread B entering the while because queue is not empty
- thread B doing queue.get()
- thread A doing queue.get() -> but perhaps now the queue is empty

I find that the design of the multi-threading here could perhaps be simpler: instead of looping on the creation of short-lived threads that compete on emptying the queue, then are recreated by the while loop at [2], I wonder if it wouldn't be better to have long-lived threads that block on a queue.get(True) (True means blocking). Doing a blocking queue.get() solves the race condition above, and since the threads don't die shortly after stopping to work, they don't need to be recreated by a spawn_n inside a loop.

Given than I'm no concurrency issue expert, I'm might as well have missed something important.

[1] https://github.com/openstack/neutron/blob/master/neutron/agent/linux/ip_conntrack.py#L54-L56
[2] https://github.com/openstack/neutron/blob/master/neutron/agent/linux/ip_conntrack.py#L94

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/549331

Revision history for this message
Thomas Morin (tmmorin-orange) wrote :

one detail in the above is wrong: queue.get() is blocking by default
but I think the rest of the comment still holds

I've pushed a change trying to implement what I hope could be a simplification

Revision history for this message
Brian Haley (brian-haley) wrote :

So one thing I think I noticed is that queue.get() can throw an exception, even though it's blocking. The reason is that the Queue (actually LightQueue) class is actually based on collections.queue, and from it's doc page (get is implemented by popleft):

popleft()
    Remove and return an element from the left side of the deque. If no elements are present, raises an IndexError.

So there is a possible race there. I'll update my change based on that.

Revision history for this message
Brian Haley (brian-haley) wrote :

s/collections.queue/collections.deque

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/548976
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=4c8b97eca32c9c2beadf95fef14ed5b7d8981c5a
Submitter: Zuul
Branch: master

commit 4c8b97eca32c9c2beadf95fef14ed5b7d8981c5a
Author: Brian Haley <email address hidden>
Date: Thu Mar 1 15:42:59 2018 +0000

    Do not start conntrack worker thread from __init__

    Instead, start it when the first entry is being added to
    the queue. Also, log any exceptions just in case get()
    throws something so we can do further debugging.

    Changed class from Queue to LightQueue was done after going
    through the eventlet.queue code looking at usage, since
    it's a little smaller and should be faster.

    Change-Id: Ie84be88382f327ebe312bf17ec2dc5c80a8de35f
    Closes-bug: 1750777

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/550598

Revision history for this message
Brian Haley (brian-haley) wrote :

Miguel was able to reproduce this by spawning a VM and then destroying it, top would then show the agent in a tight loop. I was able to do the same and found that even with my fix that got merged, the agent could still go into a tight loop. Thomas' change is better, https://review.openstack.org/#/c/549331/ - I just updated it based on additional testing and have proposed a revert of my change that merged.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/queens)

Change abandoned by Brian Haley (<email address hidden>) on branch: stable/queens
Review: https://review.openstack.org/550598

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/549331
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=35c225aaa37ee685000bfa07a02e83481485138d
Submitter: Zuul
Branch: master

commit 35c225aaa37ee685000bfa07a02e83481485138d
Author: Thomas Morin <email address hidden>
Date: Sat Mar 3 00:00:47 2018 +0000

    Remove race and simplify conntrack state management

    This change:
    * avoids creating lots of short-lived threads
      (because: why would we ?)
    * adds a try/except block to be sure we can log
      any issue that would otherwise be hidden by
      spawn_n.
    * changes class from Queue to LightQueue after
      eventlet.queue code inspection, since it's a
      little smaller and should be faster

    Change-Id: Ic348c08af375099a919116188ae17d2017695ecb
    Closes-Bug: 1750777

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/554258

description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.openstack.org/554258
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=cca870abbc4f8a784c9cbed817ff307ab3094daa
Submitter: Zuul
Branch: stable/queens

commit cca870abbc4f8a784c9cbed817ff307ab3094daa
Author: Thomas Morin <email address hidden>
Date: Sat Mar 3 00:00:47 2018 +0000

    Remove race and simplify conntrack state management

    This change:
    * avoids creating lots of short-lived threads
      (because: why would we ?)
    * adds a try/except block to be sure we can log
      any issue that would otherwise be hidden by
      spawn_n.
    * changes class from Queue to LightQueue after
      eventlet.queue code inspection, since it's a
      little smaller and should be faster

    Change-Id: Ic348c08af375099a919116188ae17d2017695ecb
    Closes-Bug: 1750777
    (cherry picked from commit 35c225aaa37ee685000bfa07a02e83481485138d)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.0.0b1

This issue was fixed in the openstack/neutron 13.0.0.0b1 development milestone.

Revision history for this message
Michal Nasiadka (mnasiadka) wrote :

Would there be a possibility to do a new neutron release in Queens?
This bug is only in stable branch, not the 12.0.1 release - and most deployment project install latest release... which makes most of new deployments useless... (using 100% cpu)

Revision history for this message
Brian Haley (brian-haley) wrote :

Michal - I don't understand, this is fixed in the stable/queens branch now. Where are you getting your packages from?

Revision history for this message
Michal Nasiadka (mnasiadka) wrote :

I'm using kolla-ansible centos/binary which installs neutron 12.0.1 from RDO (latest release, not the stable git repo - released 27 days ago) - which doesn't have the patch for 100% cpu usage (not counting other patches between 12.0.1 release and stable/queens branch)

./bugs-fixed-since.py --repo ../neutron --start 12.0.1 | ./annotate-lp-bugs.py neutron | wc -l
      61

./bugs-fixed-since.py --repo ../neutron --start 12.0.1 | ./annotate-lp-bugs.py neutron | grep 1750777
https://bugs.launchpad.net/bugs/1750777 "openvswitch agent eating CPU, time spent in ip_conntrack.py" (High,Fix Released) [in-stable-queens] [brian-haley]

Revision history for this message
Brian Haley (brian-haley) wrote :

Ok, now it makes sense to me. I have asked to create a new tag for stable/queens, will discuss at next Monday's neutron meeting since we should probably be doing it on a regular cadence, https://wiki.openstack.org/wiki/Network/Meetings#On_Demand_Agenda

Revision history for this message
Brian Haley (brian-haley) wrote :

https://review.openstack.org/#/c/564748/ has been proposed to tag neutron 12.0.2

Changed in neutron (Ubuntu):
status: New → Triaged
importance: Undecided → High
Changed in neutron (Ubuntu Bionic):
importance: Undecided → High
status: New → Triaged
Revision history for this message
Brian Haley (brian-haley) wrote :

Corey - I was going to reply to your upstream announcement to pick this up, we're just waiting on the 12.0.2 tag to merge.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Thanks Brian. I'll cherry-pick it for now and pick up the stable release when that's available.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

New versions of neutron for queens has been uploaded to the cosmic and bionic unapproved queues where they are awaiting acceptance:

https://launchpad.net/ubuntu/cosmic/+queue?queue_state=1&queue_text=
https://launchpad.net/ubuntu/bionic/+queue?queue_state=1&queue_text=

description: updated
Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Thomas, or anyone else affected,

Accepted neutron into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/2:12.0.1-0ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in neutron (Ubuntu Bionic):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-bionic
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello Thomas, or anyone else affected,

Accepted neutron into queens-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:queens-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-queens-needed to verification-queens-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-queens-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-queens-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 2:12.0.1-0ubuntu2

---------------
neutron (2:12.0.1-0ubuntu2) cosmic; urgency=medium

  * d/p/remove-race-and-simplify-conntrack-state-management.patch:
    Cherry-picked from upstream stable/queens branch to prevent
    ovs-agent from eating up CPU (LP: #1750777).

 -- Corey Bryant <email address hidden> Wed, 02 May 2018 15:00:58 -0400

Changed in neutron (Ubuntu Cosmic):
status: Triaged → Fix Released
Revision history for this message
Trent Lloyd (lathiat) wrote :

Verified on xenial-queens; fresh nova-lxd install from conjure-up was exhibiting the behavior, restarting the service did not help. Upgraded package from queens-proposed and the issue no longer occurs.

tags: added: verification-queens-done
removed: verification-queens-needed
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Regression tests have passed successfully for queens-proposed and bionic-proposed:

queens-proposed

======
Totals
======
Ran: 91 tests in 1308.6992 sec.
 - Passed: 83
 - Skipped: 8
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 502.7083 sec.

bionic-proposed:

======
Totals
======
Ran: 91 tests in 1229.6402 sec.
 - Passed: 83
 - Skipped: 8
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 497.9889 sec.

tags: added: verification-done-bionic
removed: verification-needed-bionic
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Thanks for testing Trent!

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.0.2

This issue was fixed in the openstack/neutron 12.0.2 release.

Revision history for this message
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for neutron has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 2:12.0.1-0ubuntu1.1

---------------
neutron (2:12.0.1-0ubuntu1.1) bionic; urgency=medium

  * d/p/remove-race-and-simplify-conntrack-state-management.patch:
    Cherry-picked from upstream stable/queens branch to prevent
    ovs-agent from eating up CPU (LP: #1750777).
  * d/gbp.conf: Create stable/queens branch.

 -- Corey Bryant <email address hidden> Wed, 02 May 2018 15:00:58 -0400

Changed in neutron (Ubuntu Bionic):
status: Fix Committed → Fix Released
Revision history for this message
Corey Bryant (corey.bryant) wrote :

The verification of the Stable Release Update for neutron has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package neutron - 2:12.0.1-0ubuntu1.1~cloud0
---------------

 neutron (2:12.0.1-0ubuntu1.1~cloud0) xenial-queens; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 neutron (2:12.0.1-0ubuntu1.1) bionic; urgency=medium
 .
   * d/p/remove-race-and-simplify-conntrack-state-management.patch:
     Cherry-picked from upstream stable/queens branch to prevent
     ovs-agent from eating up CPU (LP: #1750777).
   * d/gbp.conf: Create stable/queens branch.

tags: added: neutron-proactive-backport-potential
tags: removed: neutron-proactive-backport-potential
Changed in cloud-archive:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.