Race between neutron-server and l3-agent

Bug #1353953 reported by Cian O'Driscoll
This bug affects 4 people

Affects    Status          Importance    Assigned to      Milestone
neutron    Fix Released    Medium        Derek Higgins
tripleo    Fix Released    High          Ben Nemec

Bug Description

http://logs.openstack.org/58/87758/24/check-tripleo/check-tripleo-novabm-overcloud-f20-nonha/848e217/console.html

2014-08-07 10:35:52.753 | + wait_for 30 10 ping -c 1 192.0.2.46
2014-08-07 10:42:23.169 | Timing out after 300 seconds:
2014-08-07 10:42:23.169 | COMMAND=ping -c 1 192.0.2.46
2014-08-07 10:42:23.169 | OUTPUT=PING 192.0.2.46 (192.0.2.46) 56(84) bytes of data.
2014-08-07 10:42:23.169 | From 192.0.2.46 icmp_seq=1 Destination Host Unreachable
2014-08-07 10:42:23.169 |
2014-08-07 10:42:23.169 | --- 192.0.2.46 ping statistics
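
The `wait_for 30 10` call above appears to mean 30 attempts with a 10-second pause between them, which matches the 300-second timeout in the log. A minimal Python sketch of that polling pattern, assuming those semantics; `wait_for` and `ping_once` here are illustrative names, not the actual TripleO script:

```python
import subprocess
import time


def ping_once(host):
    """Return True if a single ICMP echo to `host` succeeds (uses the system ping)."""
    result = subprocess.run(["ping", "-c", "1", host], capture_output=True)
    return result.returncode == 0


def wait_for(loops, sleep_seconds, check, sleep=time.sleep):
    """Poll `check` up to `loops` times, sleeping after each failed attempt.

    Returns True as soon as `check()` is truthy, False once all attempts
    are used up. These semantics are an assumption based on the log output.
    """
    for _ in range(loops):
        if check():
            return True
        sleep(sleep_seconds)
    return False
```

Under these assumptions the failing step corresponds to `wait_for(30, 10, lambda: ping_once("192.0.2.46"))` returning False.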

Looks like neutron-dhcp-agent issues:

http://logs.openstack.org/58/87758/24/check-tripleo/check-tripleo-novabm-overcloud-f20-nonha/848e217/logs/overcloud-controller0_logs/neutron-dhcp-agent.txt.gz

Aug 07 10:31:10 overcloud-controller0-s42ewsqhiswv sudo[14027]: neutron : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf ip netns exec qdhcp-09fcf8a1-ffd3-4f99-869a-8b227de009f6 ip link set tap7e59533d-32 up
Aug 07 10:31:10 overcloud-controller0-s42ewsqhiswv neutron-dhcp-agent[12316]: 2014-08-07 10:31:10.476 12316 ERROR neutron.agent.linux.utils [req-fdebecfd-81d2-48c2-8765-7545e1e9dbb1 None]
Aug 07 10:31:10 overcloud-controller0-s42ewsqhiswv neutron-dhcp-agent[12316]: Command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-09fcf8a1-ffd3-4f99-869a-8b227de009f6', 'ip', 'link', 'set', 'tap7e59533d-32', 'up']
Aug 07 10:31:10 overcloud-controller0-s42ewsqhiswv neutron-dhcp-agent[12316]: Exit code: 1
Aug 07 10:31:10 overcloud-controller0-s42ewsqhiswv neutron-dhcp-agent[12316]: Stdout: ''
Aug 07 10:31:10 overcloud-controller0-s42ewsqhiswv neutron-dhcp-agent[12316]: Stderr: 'Cannot open network namespace "qdhcp-09fcf8a1-ffd3-4f99-869a-8b227de009f6": No such file or directory\n'
Aug 07 10:31:10 overcloud-controller0-s42ewsqhiswv sudo[14032]: neutron : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf ip netns exec qdhcp-09fcf8a1-ffd3-4f99-869a-8b227de009f6 ip -o link show tap7e59533d-32
Aug 07 10:31:10 overcloud-controller0-s42ewsqhiswv neutron-dhcp-agent[12316]: 2014-08-07 10:31:10.596 12316 ERROR neutron.agent.linux.utils [req-fdebecfd-81d2-48c2-8765-7545e1e9dbb1 None]
Aug 07 10:31:10 overcloud-controller0-s42ewsqhiswv neutron-dhcp-agent[12316]: Command: ['sudo', '/usr/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qdhcp-09fcf8a1-ffd3-4f99-869a-8b227de009f6', 'ip', '-o', 'link', 'show', 'tap7e59533d-32']
Aug 07 10:31:10 overcloud-controller0-s42ewsqhiswv neutron-dhcp-agent[12316]: Exit code: 1
Aug 07 10:31:10 overcloud-controller0-s42ewsqhiswv neutron-dhcp-agent[12316]: Stdout: ''
Aug 07 10:31:10 overcloud-controller0-s42ewsqhiswv neutron-dhcp-agent[12316]: Stderr: 'Cannot open network namespace "qdhcp-09fcf8a1-ffd3-4f99-869a-8b227de009f6": No such file or directory\n'

Changed in tripleo:
status: New → Confirmed
Cian O'Driscoll (dricco) wrote :

"No such file or directory" might mean the system is overloaded

Cian O'Driscoll (dricco) wrote :

https://review.openstack.org/#/c/112599/ didn't fix the issue. It could still possibly be due to system load.

Ben Nemec (bnemec) wrote :

I think the namespace issue may be a red herring. I'm seeing that in a successful run too: http://logs.openstack.org/21/110121/6/check-tripleo/check-tripleo-novabm-overcloud-f20-nonha/c07a46b/logs/overcloud-controller0_logs/neutron-dhcp-agent.txt.gz#_Aug_06_22_10_33

It's probably Neutron checking to see if the namespace exists or something.

Ben Nemec (bnemec)
Changed in tripleo:
importance: Undecided → Critical
Ben Nemec (bnemec) wrote :

Okay, on a run where neutron-l3-agent doesn't end up dead, os-collect-config ran twice. The first time it started l3-agent it died, but the second time it started successfully: http://logs.openstack.org/74/100374/3/check-tripleo/check-tripleo-novabm-overcloud-f20-nonha/95605a2/logs/overcloud-controller0_logs/os-collect-config.txt.gz

On a bad run, o-c-c only runs once and l3-agent remains dead: http://logs.openstack.org/69/111369/6/check-tripleo/check-tripleo-novabm-overcloud-f20-nonha/772ef25/logs/overcloud-controller0_logs/os-collect-config.txt.gz

Maybe a race with rabbitmq or one of the other neutron services that it seems to be trying to talk to?

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-image-elements (master)

Fix proposed to branch: master
Review: https://review.openstack.org/112674

Changed in tripleo:
assignee: nobody → Ben Nemec (bnemec)
status: Confirmed → In Progress
Ben Nemec (bnemec) wrote : Re: guest instance ping timeout on overcloud (Upstream CI failure)

Okay, so the sequence of events from what I can see goes like this:

Successful run
* os-collect-config runs
* l3-agent starts, sends rpc, dies because neutron-server isn't available to respond
* neutron-server starts
* os-collect-config runs again
* l3-agent starts, sends rpc, gets response from neutron-server that is still running from previous o-c-c run
* neutron-server restarts with no ill effects
* happiness

Unsuccessful run
* os-collect-config runs
* l3-agent starts, sends rpc, dies same as above
* neutron-server starts
* For whatever reason os-collect-config doesn't run a second time, so l3-agent is never restarted with neutron-server running and thus overcloud networking is hosed
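
The two sequences above can be sketched as a toy simulation. This is a deliberately simplified model, assuming each os-collect-config run starts l3-agent before neutron-server and that l3-agent survives only if the server is already up when it sends its RPC; the function name is hypothetical and the one-minute timeout window is ignored:

```python
def l3_agent_survives(occ_runs):
    """Return True if l3-agent is alive after `occ_runs` os-collect-config runs."""
    server_running = False
    l3_agent_alive = False
    for _ in range(occ_runs):
        # l3-agent is (re)started first and sends its RPC straight away;
        # it only gets an answer if neutron-server is already running.
        l3_agent_alive = server_running
        # neutron-server is started (or restarted) later in the same run.
        server_running = True
    return l3_agent_alive


print(l3_agent_survives(1))  # unsuccessful run: o-c-c only runs once
print(l3_agent_survives(2))  # successful run: o-c-c runs a second time
```

In this model a single run always leaves l3-agent dead, and a second run always revives it, which is the difference between the bad and good CI logs.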

Ben Nemec (bnemec)
summary: - guest instance ping timeout on overcloud (Upstream CI failure)
+ Race between neutron-server and l3-agent
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-image-elements (master)

Reviewed: https://review.openstack.org/112674
Committed: https://git.openstack.org/cgit/openstack/tripleo-image-elements/commit/?id=8322ffdc8a35d7875b78a9e265e8ca70c1ee936e
Submitter: Jenkins
Branch: master

commit 8322ffdc8a35d7875b78a9e265e8ca70c1ee936e
Author: Ben Nemec <email address hidden>
Date: Thu Aug 7 15:06:57 2014 -0500

    Try to start neutron-server first

    neutron-l3-agent fails to start if neutron-server does not respond
    to an rpc message within one minute. Because of the order our
    neutron scripts run in, this often happens. To work around
    the problem, this change moves the neutron-server script from
    the 80 level to 79 so it will be started before the other neutron
    services.

    Ultimately this needs to be solved in neutron-l3-agent, but right
    now this is blocking our CI so we need to address it immediately.

    Change-Id: I604adab055fe9c1b0d7bee9cc30ac79afb2d2315
    Partial-Bug: 1353953
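
The effect of moving a script from the 80 level to 79 can be illustrated with a short sketch. It assumes the scripts are executed in sorted filename order, run-parts style; the script names below are hypothetical stand-ins, not the actual files in tripleo-image-elements:

```python
def run_order(script_names):
    """Return scripts in the order a filename-sorted runner would execute them."""
    return sorted(script_names)


# Hypothetical names: everything at the 80 level, so the server sorts last.
before = ["80-neutron-server", "80-neutron-l3-agent", "80-neutron-dhcp-agent"]
# After the change the server script has the lower prefix and sorts first.
after = ["79-neutron-server", "80-neutron-l3-agent", "80-neutron-dhcp-agent"]

print(run_order(before))
print(run_order(after))
```

Starting the server script first only narrows the window. It does not guarantee that neutron-server is ready to answer RPCs by the time l3-agent starts, which fits the commit message describing this as a workaround.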

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-image-elements (master)

Change abandoned by Ben Nemec (<email address hidden>) on branch: master
Review: https://review.openstack.org/100374

Erik Colnick (erikcolnick) wrote :

If the issue is that the l3-agent is timing out waiting on an rpc-response, have you tried increasing the rpc_response_timeout value in neutron.conf to something higher?

Ben Nemec (bnemec)
Changed in tripleo:
importance: Critical → High
Ben Nemec (bnemec) wrote :

Moving to High as we have a workaround in place, although we are still hitting this on occasion in CI.

I don't think changing the rpc timeout is a solution to this either. We don't know how far apart the l3-agent and server are going to start, so the rpc timeout would have to be extremely high. l3-agent should continue retrying to contact neutron-server until it is successful. There's really no reason I can see for that to be a hard failure like this.

Eugene Nikanorov (enikanorov) wrote :

Folks, what kind of fix are you expecting from the neutron side?
Marking Incomplete for neutron.

Changed in neutron:
status: New → Incomplete
Ben Nemec (bnemec) wrote :

I think what we're looking for is l3-agent to retry the rpc until it succeeds instead of exiting immediately just because the first call timed out. That would basically eliminate any possible timing issues like this between the two services. Is there any reason l3-agent can't do that?
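
A minimal sketch of the behaviour being requested, retrying the call until it succeeds rather than exiting on the first timeout. `RpcTimeout`, `call_until_success` and the backoff values are all assumptions for illustration; none of this is neutron's actual code:

```python
import time


class RpcTimeout(Exception):
    """Stand-in for whatever timeout error the RPC layer raises (hypothetical)."""


def call_until_success(rpc_call, initial_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Keep calling `rpc_call` until it returns, backing off between timeouts.

    The delay doubles after each timeout and is capped at `max_delay`, so
    the caller never gives up no matter how late the server starts.
    """
    delay = initial_delay
    while True:
        try:
            return rpc_call()
        except RpcTimeout:
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

With a loop like this the start order of the two services stops mattering, at the cost of an agent that waits indefinitely if the server never comes up.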

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/121492

Changed in neutron:
assignee: nobody → Derek Higgins (derekh)
status: Incomplete → In Progress
Changed in neutron:
milestone: none → juno-rc1
milestone: juno-rc1 → none
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/121492
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e7f0b56d74fbfbb08a3b7a0d2da4cefb6fe2aa67
Submitter: Jenkins
Branch: master

commit e7f0b56d74fbfbb08a3b7a0d2da4cefb6fe2aa67
Author: Derek Higgins <email address hidden>
Date: Fri Sep 12 16:31:44 2014 +0100

    Retry getting the list of service plugins

    On systems that start both neutron-server and neutron-l3-agent together,
    there is a chance that the first call to neutron will timeout. Retry upto
    4 more times to avoid the l3 agent exiting on startup.

    This should make the l3 agent a little more robust on startup but still
    not ideal, ideally it wouldn't exit and retry periodically.

    Change-Id: I2171a164f3f77bccd89895d73c1c8d67f7190488
    Closes-Bug: #1353953
    Closes-Bug: #1368152
    Closes-Bug: #1368795
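
The bounded retry described in this commit message (one initial call plus up to 4 more) can be sketched as follows. This is an illustration of the pattern only, not the merged neutron patch; `RpcTimeout` and `call_with_retries` are hypothetical names:

```python
class RpcTimeout(Exception):
    """Stand-in for the RPC layer's timeout error (hypothetical name)."""


def call_with_retries(rpc_call, extra_attempts=4):
    """Call `rpc_call`, retrying on timeout up to `extra_attempts` more times.

    The final timeout is re-raised, so the caller still fails (and an
    agent built this way would still exit) if the server never answers.
    """
    total_attempts = 1 + extra_attempts
    for attempt in range(1, total_attempts + 1):
        try:
            return rpc_call()
        except RpcTimeout:
            if attempt == total_attempts:
                raise
```

If each attempt waits the one minute mentioned in the earlier tripleo commit message, five attempts would give the server roughly five minutes to come up. That widens the window but, as the commit message notes, still falls short of retrying periodically without exiting.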

Changed in neutron:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in neutron:
milestone: none → juno-rc2
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (proposed/juno)

Fix proposed to branch: proposed/juno
Review: https://review.openstack.org/126903

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (proposed/juno)

Reviewed: https://review.openstack.org/126903
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=6e79981b7caadbbbb2119461034dfe7b4d1c1a64
Submitter: Jenkins
Branch: proposed/juno

commit 6e79981b7caadbbbb2119461034dfe7b4d1c1a64
Author: Derek Higgins <email address hidden>
Date: Fri Sep 12 16:31:44 2014 +0100

    Retry getting the list of service plugins

    On systems that start both neutron-server and neutron-l3-agent together,
    there is a chance that the first call to neutron will timeout. Retry upto
    4 more times to avoid the l3 agent exiting on startup.

    This should make the l3 agent a little more robust on startup but still
    not ideal, ideally it wouldn't exit and retry periodically.

    Change-Id: I2171a164f3f77bccd89895d73c1c8d67f7190488
    Closes-Bug: #1353953
    Closes-Bug: #1368152
    Closes-Bug: #1368795
    (cherry picked from commit e7f0b56d74fbfbb08a3b7a0d2da4cefb6fe2aa67)

Changed in neutron:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: juno-rc2 → 2014.2
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/128913

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (feature/lbaasv2)

Fix proposed to branch: feature/lbaasv2
Review: https://review.openstack.org/130864

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (feature/lbaasv2)

Reviewed: https://review.openstack.org/130864
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c089154a94e5872efc95eab33d3d0c9de8619fe4
Submitter: Jenkins
Branch: feature/lbaasv2

commit 62588957fbeccfb4f80eaa72bef2b86b6f08dcf8
Author: Kevin Benton <email address hidden>
Date: Wed Oct 22 13:04:03 2014 -0700

    Big Switch: Switch to TLSv1 in server manager

    Switch to TLSv1 for the connections to the backend
    controllers. The default SSLv3 is no longer considered
    secure.

    TLSv1 was chosen over .1 or .2 because the .1 and .2 weren't
    added until python 2.7.9 so TLSv1 is the only compatible option
    for py26.

    Closes-Bug: #1384487
    Change-Id: I68bd72fc4d90a102003d9ce48c47a4a6a3dd6e03

commit 17204e8f02fdad046dabdb8b31397289d72c877b
Author: OpenStack Proposal Bot <email address hidden>
Date: Wed Oct 22 06:20:15 2014 +0000

    Imported Translations from Transifex

    For more information about this automatic import see:
    https://wiki.openstack.org/wiki/Translations/Infrastructure

    Change-Id: I58db0476c810aa901463b07c42182eef0adb5114

commit d712663b99520e6d26269b0ca193527603178742
Author: Carl Baldwin <email address hidden>
Date: Mon Oct 20 21:48:42 2014 +0000

    Move disabling of metadata and ipv6_ra to _destroy_router_namespace

    I noticed that disable_ipv6_ra is called from the wrong place and that
    in some cases it was called with a bogus router_id because the code
    made an incorrect assumption about the context. In other case, it was
    never called because _destroy_router_namespace was being called
    directly. This patch moves the disabling of metadata and ipv6_ra in
    to _destroy_router_namespace to ensure they get called correctly and
    avoid duplication.

    Change-Id: Ia76a5ff4200df072b60481f2ee49286b78ece6c4
    Closes-Bug: #1383495

commit f82a5117f6f484a649eadff4b0e6be9a5a4d18bb
Author: OpenStack Proposal Bot <email address hidden>
Date: Tue Oct 21 12:11:19 2014 +0000

    Updated from global requirements

    Change-Id: Idcbd730f5c781d21ea75e7bfb15959c8f517980f

commit be6bd82d43fbcb8d1512d8eb5b7a106332364c31
Author: Angus Lees <email address hidden>
Date: Mon Aug 25 12:14:29 2014 +1000

    Remove duplicate import of constants module

    .. and enable corresponding pylint check now the only offending instance
    is fixed.

    Change-Id: I35a12ace46c872446b8c87d0aacce45e94d71bae

commit 9902400039018d77aa3034147cfb24ca4b2353f6
Author: rajeev <email address hidden>
Date: Mon Oct 13 16:25:36 2014 -0400

    Fix race condition on processing DVR floating IPs

    Fip namespace and agent gateway port can be shared by multiple dvr routers.
    This change uses a set as the control variable for these shared resources
    and ensures that Test and Set operation on the control variable are
    performed atomically so that race conditions do not occur among
    multiple threads processing floating IPs.
    Limitation: The scope of this change is limited to addressing the race
    condition described in the bug report. It may not address other issues
    such as pre-existing issue wit...

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/128913
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=71df7c80b9efa84f2ef87a2299600066816870b4
Submitter: Jenkins
Branch: master

commit b28eda57223e492924edb731e24c2e4f64cc0de5
Author: Carl Baldwin <email address hidden>
Date: Wed Oct 8 03:22:49 2014 +0000

    Remove two sets that are not referenced

    The code no longer references the updated_routers and removed_routers
    sets. This should have been cleaned up before but was missed.

    Closes-bug: #1232525

    Change-Id: I0396e13d2f7c3789928e0c6a4c0a071b02d5ff17
    (cherry picked from commit edb26bfcddf9d9a0e95955a6590d11fa7245ea2b)

commit 9cce0bfdb713c2b975b289d90de6d57b68ca3854
Author: Mark McClain <email address hidden>
Date: Thu Oct 9 13:29:48 2014 +0000

    Add Juno release milestone

    Change-Id: Iea584b00329d9474c14847db958f8743d4058525
    Closes-Bug: #1378855
    (cherry picked from commit 4e8a5b7de71ba6f8c050c424613c025310498940)

commit 8e76cccb1ed9a248439b1188d1d805649169e46b
Author: Mark McClain <email address hidden>
Date: Wed Oct 8 18:49:20 2014 +0000

    Add database relationship between router and ports

    Add an explicit schema relationship between a router and its ports. This
    change ensures referential integrity among the entities and prevents orphaned
    ports.

    Change-Id: I09e8a694cdff7f64a642a39b45cbd12422132806
    Closes-Bug: #1378866
    (cherry picked from commit 93012915a3445a8ac8a0b30b702df30febbbb728)

commit 5610343d5aab876480cbe15c8d77631e67d6142f
Author: Henry Gessau <email address hidden>
Date: Tue Oct 7 20:38:38 2014 -0400

    Disable PUT for IPv6 subnet attributes

    In Juno we are not ready for allowing the IPv6 attributes on a subnet
    to be updated after the subnet is created, because:
    - The implementation for supporting updates is incomplete.
    - Perceived lack of usefulness, no good use cases known yet.
    - Allowing updates causes more complexity in the code.
    - Have not tested that radvd, dhcp, etc. behave OK after update.

    Therefore, for now, we set 'allow_put' to False for the two IPv6
    attributes, ipv6_ra_mode and ipv6_address_mode. This prevents the
    modes from being updated via the PUT:subnets API.

    Closes-bug: #1378952

    Change-Id: Id6ce894d223c91421b62f82d266cfc15fa63ed0e
    (cherry picked from commit 8a08a3cb47d0dd69d4aa2e8fa661d04054fe95ae)

commit 54be5a9e977ea344cc53addb87635ddba0cfd815
Author: Sean M. Collins <email address hidden>
Date: Mon Oct 6 15:47:24 2014 -0400

    Skip IPv6 Tests in the OpenContrail plugin

    Similar to the way we are skipping tests in the OneConvergence plugin,
    introduced by Kevin Benton in 9294de441e684a81f6e802ba0564083f1ad319d6.

    Partial-Bug: #1378952

    Change-Id: I1650b0708af73ce63e92c55bc842607bb69efe60
    (cherry picked from commit 67962943969bc737a3f680a0defc2fc9df03c429)

commit aefc12ec552afe32f0d1d6f7c8c588afac956988
Author: Ihar Hrachyshka <email address hidden>
Date: Thu Aug 7 22:27:23 2014 +0200

    Removed kombu from requirements

    Since we've replaced oslo-incubator RPC layer with...

Ben Nemec (bnemec)
Changed in tripleo:
status: In Progress → Confirmed
status: Confirmed → Fix Released