gate-tempest-dsvm-large-ops fails to deallocate instance network due to rpc timeout

Bug #1491949 reported by Matt Riedemann
This bug affects 3 people
Affects                    Status        Importance  Assigned to      Milestone
OpenStack Compute (nova)   Won't Fix     Medium      Unassigned
devstack                   Fix Released  Undecided   Matt Riedemann

Bug Description

http://logs.openstack.org/96/219696/4/check/gate-tempest-dsvm-large-ops/158f061/logs/screen-n-cpu-1.txt.gz?level=TRACE

2015-09-03 15:11:10.090 ERROR nova.compute.manager [req-ae96c425-a199-472f-a0db-e1b48147bb4c tempest-TestLargeOpsScenario-1690771764 tempest-TestLargeOpsScenario-1474206998] [instance: 195361d7-95c3-4740-825b-1acab707969e] Failed to deallocate network for instance.
2015-09-03 15:11:11.051 ERROR nova.compute.manager [req-ae96c425-a199-472f-a0db-e1b48147bb4c tempest-TestLargeOpsScenario-1690771764 tempest-TestLargeOpsScenario-1474206998] [instance: 195361d7-95c3-4740-825b-1acab707969e] Setting instance vm_state to ERROR
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] Traceback (most recent call last):
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] File "/opt/stack/new/nova/nova/compute/manager.py", line 2396, in do_terminate_instance
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] self._delete_instance(context, instance, bdms, quotas)
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] File "/opt/stack/new/nova/nova/hooks.py", line 149, in inner
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] rv = f(*args, **kwargs)
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] File "/opt/stack/new/nova/nova/compute/manager.py", line 2375, in _delete_instance
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] quotas.rollback()
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 195, in __exit__
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] six.reraise(self.type_, self.value, self.tb)
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] File "/opt/stack/new/nova/nova/compute/manager.py", line 2338, in _delete_instance
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] self._shutdown_instance(context, instance, bdms)
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] File "/opt/stack/new/nova/nova/compute/manager.py", line 2265, in _shutdown_instance
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] self._try_deallocate_network(context, instance, requested_networks)
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] File "/opt/stack/new/nova/nova/compute/manager.py", line 2194, in _try_deallocate_network
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] self._set_instance_obj_error_state(context, instance)
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 195, in __exit__
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] six.reraise(self.type_, self.value, self.tb)
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] File "/opt/stack/new/nova/nova/compute/manager.py", line 2189, in _try_deallocate_network
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] self._deallocate_network(context, instance, requested_networks)
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] File "/opt/stack/new/nova/nova/compute/manager.py", line 1812, in _deallocate_network
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] context, instance, requested_networks=requested_networks)
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] File "/opt/stack/new/nova/nova/network/api.py", line 49, in wrapped
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] return func(self, context, *args, **kwargs)
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] File "/opt/stack/new/nova/nova/network/api.py", line 297, in deallocate_for_instance
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] requested_networks=requested_networks)
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] File "/opt/stack/new/nova/nova/network/rpcapi.py", line 175, in deallocate_for_instance
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] return cctxt.call(ctxt, 'deallocate_for_instance', **kwargs)
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 158, in call
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] retry=self.retry)
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 90, in _send
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] timeout=timeout, retry=retry)
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 431, in send
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] retry=retry)
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 420, in _send
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] result = self._waiter.wait(msg_id, timeout)
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 318, in wait
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] message = self.waiters.get(msg_id, timeout=timeout)
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 223, in get
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] 'to message ID %s' % msg_id)
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e] MessagingTimeout: Timed out waiting for a reply to message ID 20d504c45f1c45e980c0194ac492d531
2015-09-03 15:11:11.051 22635 ERROR nova.compute.manager [instance: 195361d7-95c3-4740-825b-1acab707969e]
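
The MessagingTimeout at the bottom is oslo.messaging giving up on the reply to the deallocate_for_instance call after rpc_response_timeout (60 seconds by default) elapses. A minimal sketch of that call pattern, with an illustrative transport URL and topic rather than the job's actual config:

    import oslo_messaging as messaging
    from oslo_config import cfg

    # Constructing the transport is lazy; nothing connects to the broker yet.
    transport = messaging.get_transport(
        cfg.CONF, url='rabbit://guest:guest@localhost/')
    target = messaging.Target(topic='network', version='1.0')
    client = messaging.RPCClient(transport, target, timeout=60)

    # call() blocks until a reply arrives or the timeout elapses, then
    # raises MessagingTimeout -- the failure mode in the traceback above.
    # client.call({}, 'deallocate_for_instance', ...)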

I see this in the n-net logs, which I think we only log when something else has failed:

http://logs.openstack.org/96/219696/4/check/gate-tempest-dsvm-large-ops/158f061/logs/screen-n-net.txt.gz#_2015-09-03_15_11_10_664

2015-09-03 15:11:10.664 DEBUG nova.network.manager [req-ae96c425-a199-472f-a0db-e1b48147bb4c tempest-TestLargeOpsScenario-1690771764 tempest-TestLargeOpsScenario-1474206998] [instance: 195361d7-95c3-4740-825b-1acab707969e] Explicitly disassociating fixed IP %s from instance. deallocate_fixed_ip /opt/stack/new/nova/nova/network/manager.py:1077

https://github.com/openstack/nova/blob/master/nova/network/manager.py#L1068

                    # NOTE(yufang521247): This is probably a failed dhcp fixed
                    # ip. DHCPRELEASE packet sent to dnsmasq would not trigger
                    # dhcp-bridge to run. Thus it is better to disassociate
                    # such fixed ip here.
                    fixed_ip_ref = objects.FixedIP.get_by_address(
                        context, address)
                    if (instance_uuid == fixed_ip_ref.instance_uuid and
                            not fixed_ip_ref.leased):
                        LOG.debug('Explicitly disassociating fixed IP %s from '
                                  'instance.', instance_uuid=instance_uuid)
                        fixed_ip_ref.disassociate()
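
As an aside, the literal "%s" in the n-net debug line above is a small logging bug in the quoted block: instance_uuid is passed only as a keyword argument, which oslo.log consumes for the "[instance: ...]" prefix, so nothing is ever interpolated into the format string. The intended call was presumably something like:

    # address is already in scope in the quoted code; passing it
    # positionally fills the %s, while instance_uuid stays a kwarg for
    # oslo.log's [instance: ...] prefix.
    LOG.debug('Explicitly disassociating fixed IP %s from '
              'instance.', address, instance_uuid=instance_uuid)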

Matt Riedemann (mriedem)
Changed in nova:
importance: Undecided → High
Revision history for this message
Matt Riedemann (mriedem) wrote :

Given the logstash hits, this just started in the last 48 hours or so (since 9/3), so looking at what changed:

https://github.com/openstack/nova/commits/master/nova/network

There were 2 neutron changes and then this for nova-network:

https://review.openstack.org/#/c/197608/

^ adds more logging, and it's in the release_fixed_ip method, which is directly related to whether or not we hit the code block in:

https://github.com/openstack/nova/blob/master/nova/network/manager.py#L1068
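
For context on why release_fixed_ip matters: the explicit-disassociate branch quoted above only runs when the fixed IP is still associated with the instance but not leased, and it is normally the dnsmasq dhcp-bridge callback into release_fixed_ip that flips the leased flag. A simplified, hypothetical sketch of that handshake (toy stand-ins, not the actual nova code):

    class FixedIP(object):
        # Toy stand-in for nova's objects.FixedIP.
        def __init__(self, address, instance_uuid):
            self.address = address
            self.instance_uuid = instance_uuid
            self.leased = False  # set True once dnsmasq hands out a lease

        def disassociate(self):
            self.instance_uuid = None

    def release_fixed_ip(fixed_ip):
        # Normally driven by the dhcp-bridge script when dnsmasq sees the
        # DHCPRELEASE for this address.
        fixed_ip.leased = False
        fixed_ip.disassociate()

    def deallocate_fixed_ip(fixed_ip, instance_uuid):
        # If the lease never happened (failed dhcp), no release callback is
        # coming, so disassociate explicitly -- the block quoted above.
        if (instance_uuid == fixed_ip.instance_uuid and
                not fixed_ip.leased):
            fixed_ip.disassociate()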

Revision history for this message
Matt Riedemann (mriedem) wrote :

This devstack change to turn on multihost=true is what triggered the bug:

https://review.openstack.org/#/c/218860/

The revert is here:

https://review.openstack.org/#/c/220525/

Changed in devstack:
status: New → In Progress
assignee: nobody → Matt Riedemann (mriedem)
Revision history for this message
Matt Riedemann (mriedem) wrote :

(9:21:42 AM) sdague: so, honestly, the thing is that we're driving an unrealistic throughput on one node there
(9:21:54 AM) sdague: because we are using fakevirt
(9:22:39 AM) sdague: I suspect starting 175 vms on one node at once is probably not a thing with an actual hypervisor :)

Changed in nova:
status: Confirmed → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to devstack (master)

Reviewed: https://review.openstack.org/220525
Committed: https://git.openstack.org/cgit/openstack-dev/devstack/commit/?id=975243189216561f66ca91520495e0c6e2f747e2
Submitter: Jenkins
Branch: master

commit 975243189216561f66ca91520495e0c6e2f747e2
Author: Matt Riedemann <email address hidden>
Date: Fri Sep 4 14:15:27 2015 +0000

    Revert "turn multi host true for nova network by default"

    This reverts commit 2e1a91c50b73ca7f46871d3a906ade93bbcac6a7

    It looks like this introduced race bug 1491949 in the
    gate-tempest-dsvm-large-ops job causing rpc timeouts when
    deallocating network information for an instance,
    specifically around the dnsmasq callback to release the
    fixed IP that the instance was using which triggers the
    disassociation between the fixed IP and the instance in the
    nova database.

    Change-Id: I163cdeea75e92485f241647c69aea0d7456c3258
    Closes-Bug: #1491949

Changed in devstack:
status: In Progress → Fix Released
Matt Riedemann (mriedem)
Changed in nova:
status: Invalid → Confirmed
importance: High → Medium
Revision history for this message
Matt Riedemann (mriedem) wrote :

This is still happening, or happening again:

http://logs.openstack.org/97/241397/1/check/gate-tempest-dsvm-large-ops/be8ca35/logs/screen-n-cpu-1.txt.gz#_2015-11-04_14_46_57_639

2015-11-04 14:46:57.639 ERROR nova.compute.manager [req-a36cc677-c460-4bae-bac5-93f5b82b8265 tempest-TestLargeOpsScenario-294201607 tempest-TestLargeOpsScenario-519728524] [instance: b0c2c6b3-8638-43d6-8e69-5ec1862bf6de] Setting instance vm_state to ERROR
2015-11-04 14:46:57.639 9190 ERROR nova.compute.manager [instance: b0c2c6b3-8638-43d6-8e69-5ec1862bf6de] Traceback (most recent call last):
2015-11-04 14:46:57.639 9190 ERROR nova.compute.manager [instance: b0c2c6b3-8638-43d6-8e69-5ec1862bf6de] File "/opt/stack/new/nova/nova/compute/manager.py", line 2433, in do_terminate_instance
2015-11-04 14:46:57.639 9190 ERROR nova.compute.manager [instance: b0c2c6b3-8638-43d6-8e69-5ec1862bf6de] self._delete_instance(context, instance, bdms, quotas)
2015-11-04 14:46:57.639 9190 ERROR nova.compute.manager [instance: b0c2c6b3-8638-43d6-8e69-5ec1862bf6de] File "/opt/stack/new/nova/nova/hooks.py", line 149, in inner
2015-11-04 14:46:57.639 9190 ERROR nova.compute.manager [instance: b0c2c6b3-8638-43d6-8e69-5ec1862bf6de] rv = f(*args, **kwargs)
2015-11-04 14:46:57.639 9190 ERROR nova.compute.manager [instance: b0c2c6b3-8638-43d6-8e69-5ec1862bf6de] File "/opt/stack/new/nova/nova/compute/manager.py", line 2412, in _delete_instance
2015-11-04 14:46:57.639 9190 ERROR nova.compute.manager [instance: b0c2c6b3-8638-43d6-8e69-5ec1862bf6de] quotas.rollback()
2015-11-04 14:46:57.639 9190 ERROR nova.compute.manager [instance: b0c2c6b3-8638-43d6-8e69-5ec1862bf6de] File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 197, in __exit__
2015-11-04 14:46:57.639 9190 ERROR nova.compute.manager [instance: b0c2c6b3-8638-43d6-8e69-5ec1862bf6de] six.reraise(self.type_, self.value, self.tb)
2015-11-04 14:46:57.639 9190 ERROR nova.compute.manager [instance: b0c2c6b3-8638-43d6-8e69-5ec1862bf6de] File "/opt/stack/new/nova/nova/compute/manager.py", line 2375, in _delete_instance
2015-11-04 14:46:57.639 9190 ERROR nova.compute.manager [instance: b0c2c6b3-8638-43d6-8e69-5ec1862bf6de] self._shutdown_instance(context, instance, bdms)
2015-11-04 14:46:57.639 9190 ERROR nova.compute.manager [instance: b0c2c6b3-8638-43d6-8e69-5ec1862bf6de] File "/opt/stack/new/nova/nova/compute/manager.py", line 2295, in _shutdown_instance
2015-11-04 14:46:57.639 9190 ERROR nova.compute.manager [instance: b0c2c6b3-8638-43d6-8e69-5ec1862bf6de] self._try_deallocate_network(context, instance, requested_networks)
2015-11-04 14:46:57.639 9190 ERROR nova.compute.manager [instance: b0c2c6b3-8638-43d6-8e69-5ec1862bf6de] File "/opt/stack/new/nova/nova/compute/manager.py", line 2218, in _try_deallocate_network
2015-11-04 14:46:57.639 9190 ERROR nova.compute.manager [instance: b0c2c6b3-8638-43d6-8e69-5ec1862bf6de] self._set_instance_obj_error_state(context, instance)
2015-11-04 14:46:57.639 9190 ERROR nova.compute.manager [instance: b0c2c6b3-8638-43d6-8e69-5ec1862bf6de] File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 197, in __exit__
2015-11-04 14:46:57...


Revision history for this message
Thomas Maddox (thomas-maddox) wrote :

I suspect this isn't just happening for the test noted in this bug; I see a similar problem with a grenade test as well: http://logs.openstack.org/11/215311/6/check/gate-grenade-dsvm/2d9a906/logs/new/screen-n-cpu.txt.gz#_2015-12-09_18_39_40_829

Revision history for this message
Sean Dague (sdague) wrote :

This job has been deleted

Changed in nova:
status: Confirmed → Won't Fix