Gate failures when DHCP lease cannot be acquired

Bug #1532809 reported by Davanum Srinivas (DIMS)
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
High
Davanum Srinivas (DIMS)
Kilo
Fix Released
High
Sean Dague
Liberty
Fix Released
High
Sean Dague

Bug Description

Example from:
http://logs.openstack.org/97/265697/1/check/gate-grenade-dsvm/6eeced7/console.html#_2016-01-11_07_42_30_838

Logstash query:
message:"No lease, failing" AND voting:1

dhcp_release for an ip/mac does not seem to reach dnsmasq (or it fails to act on it - "unknown lease") as i don't see entries in syslog for it.

Logs from nova network:
dims@dims-mac:~/junk/6eeced7$ grep dhcp_release old/screen-n-net.txt.gz | grep 10.1.0.3 | grep CMD
2016-01-11 07:25:35.548 DEBUG oslo_concurrency.processutils [req-62aaa0b9-093e-4f28-805d-d4bf3008bfe6 tempest-ServersTestJSON-1206086292 tempest-ServersTestJSON-1551541405] CMD "sudo nova-rootwrap /etc/nova/rootwrap.conf dhcp_release br100 10.1.0.3 fa:16:3e:32:51:c3" returned: 0 in 0.117s execute /usr/local/lib/python2.7/dist-packages/oslo_concurrency/processutils.py:297
2016-01-11 07:25:51.259 DEBUG oslo_concurrency.processutils [req-31115ffa-8d43-4621-bb2e-351d6cd4bcef tempest-ServerActionsTestJSON-357128318 tempest-ServerActionsTestJSON-854742699] CMD "sudo nova-rootwrap /etc/nova/rootwrap.conf dhcp_release br100 10.1.0.3 fa:16:3e:a4:f0:11" returned: 0 in 0.108s execute /usr/local/lib/python2.7/dist-packages/oslo_concurrency/processutils.py:297
2016-01-11 07:26:35.357 DEBUG oslo_concurrency.processutils [req-c32a216e-d909-41a0-a0bc-d5eb7a21c048 tempest-TestVolumeBootPattern-46217374 tempest-TestVolumeBootPattern-1056816637] CMD "sudo nova-rootwrap /etc/nova/rootwrap.conf dhcp_release br100 10.1.0.3 fa:16:3e:ed:de:f6" returned: 0 in 0.110s execute /usr/local/lib/python2.7/dist-packages/oslo_concurrency/processutils.py:297

Logs from syslog:
dims@dims-mac:~/junk$ grep 10.1.0.3 syslog.txt.gz
Jan 11 07:25:35 devstack-trusty-rax-iad-7090830 dnsmasq-dhcp[7798]: DHCPRELEASE(br100) 10.1.0.3 fa:16:3e:32:51:c3 unknown lease
Jan 11 07:25:51 devstack-trusty-rax-iad-7090830 dnsmasq-dhcp[7798]: DHCPRELEASE(br100) 10.1.0.3 fa:16:3e:a4:f0:11 unknown lease
Jan 11 07:26:24 devstack-trusty-rax-iad-7090830 dnsmasq-dhcp[7798]: DHCPOFFER(br100) 10.1.0.3 fa:16:3e:ed:de:f6
Jan 11 07:26:24 devstack-trusty-rax-iad-7090830 dnsmasq-dhcp[7798]: DHCPREQUEST(br100) 10.1.0.3 fa:16:3e:ed:de:f6
Jan 11 07:26:24 devstack-trusty-rax-iad-7090830 dnsmasq-dhcp[7798]: DHCPACK(br100) 10.1.0.3 fa:16:3e:ed:de:f6 tempest
Jan 11 07:27:34 devstack-trusty-rax-iad-7090830 object-auditor: Object audit (ALL). Since Mon Jan 11 07:27:34 2016: Locally: 1 passed, 0 quarantined, 0 errors files/sec: 2.03 , bytes/sec: 10119063.16, Total time: 0.49, Auditing time: 0.00, Rate: 0.00
Jan 11 07:39:12 devstack-trusty-rax-iad-7090830 dnsmasq-dhcp[7798]: not using configured address 10.1.0.3 because it is leased to fa:16:3e:ed:de:f6
Jan 11 07:40:12 devstack-trusty-rax-iad-7090830 dnsmasq-dhcp[7798]: not using configured address 10.1.0.3 because it is leased to fa:16:3e:ed:de:f6
Jan 11 07:41:12 devstack-trusty-rax-iad-7090830 dnsmasq-dhcp[7798]: not using configured address 10.1.0.3 because it is leased to fa:16:3e:ed:de:f6
Jan 11 07:42:26 devstack-trusty-rax-iad-7090830 dnsmasq-dhcp[7798]: DHCPRELEASE(br100) 10.1.0.3 fa:16:3e:fe:1f:36 unknown lease

Net, The test that runs the ssh against the vm fails with the "No lease, failing" in its serial console

Revision history for this message
Jordan Pittier (jordan-pittier) wrote :
tags: added: testing
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Here's a lead on the gate-grenade-dsvm - http://paste.openstack.org/show/484637/

We "kill -HUP 21237" the dnsmasq and immediately issue a dhcp_release, you can see in syslog.txt that the HUP reaches dnsmasq while it is reloading. As a result screen-n-dhcp does not show a 'del' at all (nor is there a DHCPRELEASE in syslog)

In fact during the reload dnsmasq sends a 'old' to the dhcp script. So we probably have to wait for the reload to finish (sleep?) before we call dhcp_release.

Note that we can't switch order of HUP and release because it causes other problems :
http://git.openstack.org/cgit/openstack/nova/tree/nova/network/manager.py#n1048

Changed in nova:
assignee: nobody → Davanum Srinivas (DIMS) (dims-v)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/271132

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/liberty)

Change abandoned by Davanum Srinivas (dims) (<email address hidden>) on branch: stable/liberty
Review: https://review.openstack.org/271132

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Davanum Srinivas (dims) (<email address hidden>) on branch: master
Review: https://review.openstack.org/268796

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/273023
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=34e6c0fb15a9954247b0b7d8170a62bd37e95df1
Submitter: Jenkins
Branch: master

commit 34e6c0fb15a9954247b0b7d8170a62bd37e95df1
Author: Sean Dague <email address hidden>
Date: Wed Jan 27 12:53:11 2016 +0000

    fast exit dhcpbridge on 'old'

    When dnsmasq sends a lease change event ('old'), we do
    nothing. However we wait quite a long time before we do nothing, and
    have built up rpc connections to conductor by then. We should exit
    early in these cases instead to cause less backup getting dnsmasq
    lease events processed.

    Related-Bug: #1532809

    Change-Id: I1539fb9eb8570a9efbaa95a1272ac9ac571c365a

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/273042
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=32b4ce0e92b333413163f31f2962d109941ee362
Submitter: Jenkins
Branch: master

commit 32b4ce0e92b333413163f31f2962d109941ee362
Author: Sean Dague <email address hidden>
Date: Wed Jan 27 13:02:35 2016 +0000

    Spread allocations of fixed ips

    The default fixed_ip allocation always does .first() out of the fixed
    ip pool. This leads to a pathological situation that in a rapidly
    changing environment we're constantly recycling the same small number
    of ips out of the pool. Given other racey behavior in dnsmasq around
    processing DHCP Release requests, we can start to get DHCP fails when
    a DHCP release is missed, we immediately recycle the IP and give it to
    a new guest.

    We can mitigate this by sorting the results by updated_at. This
    means we'll favor least recently used fixed ips.

    Depends-On: I27f73c1edf12218818c4d279efbd9fef5cdef672

    Related-Bug: #1532809

    Change-Id: I3c83bd68a0e2bbbcdd6d955722dbc9f9fc528113

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/liberty)

Related fix proposed to branch: stable/liberty
Review: https://review.openstack.org/278991

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/kilo)

Related fix proposed to branch: stable/kilo
Review: https://review.openstack.org/278992

Revision history for this message
Matt Riedemann (mriedem) wrote :

Marked as high priority since it's nova-net specific at this point and that's what the related fixes are for.

Changed in nova:
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/liberty)

Reviewed: https://review.openstack.org/278991
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=65d0ef8b69a03f13267ffdbbc1b3a8c68062208a
Submitter: Jenkins
Branch: stable/liberty

commit 65d0ef8b69a03f13267ffdbbc1b3a8c68062208a
Author: Sean Dague <email address hidden>
Date: Wed Jan 27 13:02:35 2016 +0000

    Spread allocations of fixed ips

    The default fixed_ip allocation always does .first() out of the fixed
    ip pool. This leads to a pathological situation that in a rapidly
    changing environment we're constantly recycling the same small number
    of ips out of the pool. Given other racey behavior in dnsmasq around
    processing DHCP Release requests, we can start to get DHCP fails when
    a DHCP release is missed, we immediately recycle the IP and give it to
    a new guest.

    We can mitigate this by sorting the results by updated_at. This
    means we'll favor least recently used fixed ips.

    Depends-On: I27f73c1edf12218818c4d279efbd9fef5cdef672

    Related-Bug: #1532809

    Change-Id: I3c83bd68a0e2bbbcdd6d955722dbc9f9fc528113
    (cherry picked from commit 32b4ce0e92b333413163f31f2962d109941ee362)

tags: added: in-stable-liberty
Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/kilo)

Reviewed: https://review.openstack.org/278992
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2b3e6e736123d7eaf03b8a13cf90976a75b27877
Submitter: Jenkins
Branch: stable/kilo

commit 2b3e6e736123d7eaf03b8a13cf90976a75b27877
Author: Sean Dague <email address hidden>
Date: Wed Jan 27 13:02:35 2016 +0000

    Spread allocations of fixed ips

    The default fixed_ip allocation always does .first() out of the fixed
    ip pool. This leads to a pathological situation that in a rapidly
    changing environment we're constantly recycling the same small number
    of ips out of the pool. Given other racey behavior in dnsmasq around
    processing DHCP Release requests, we can start to get DHCP fails when
    a DHCP release is missed, we immediately recycle the IP and give it to
    a new guest.

    We can mitigate this by sorting the results by updated_at. This
    means we'll favor least recently used fixed ips.

    Depends-On: I27f73c1edf12218818c4d279efbd9fef5cdef672

    Related-Bug: #1532809

    Change-Id: I3c83bd68a0e2bbbcdd6d955722dbc9f9fc528113
    (cherry picked from commit 32b4ce0e92b333413163f31f2962d109941ee362)
    (cherry picked from commit 65d0ef8b69a03f13267ffdbbc1b3a8c68062208a)

tags: added: in-stable-kilo
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.