test_reboot_deleted_server fails with 409 "Cannot 'reboot' instance while it is in vm_state building"

Bug #1660878 reported by Matt Riedemann
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
High
Matt Riedemann

Bug Description

http://logs.openstack.org/91/426991/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/f218227/console.html#_2017-02-01_02_06_33_592237

2017-02-01 02:06:33.592237 | tempest.api.compute.servers.test_servers_negative.ServersNegativeTestJSON.test_reboot_deleted_server[id-581a397d-5eab-486f-9cf9-1014bbd4c984,negative]
2017-02-01 02:06:33.592305 | ------------------------------------------------------------------------------------------------------------------------------------------------------
2017-02-01 02:06:33.592321 |
2017-02-01 02:06:33.592340 | Captured traceback:
2017-02-01 02:06:33.592367 | ~~~~~~~~~~~~~~~~~~~
2017-02-01 02:06:33.592398 | Traceback (most recent call last):
2017-02-01 02:06:33.592453 | File "tempest/api/compute/servers/test_servers_negative.py", line 190, in test_reboot_deleted_server
2017-02-01 02:06:33.593010 | server['id'], type='SOFT')
2017-02-01 02:06:33.593072 | File "/opt/stack/new/tempest/.tox/tempest/local/lib/python2.7/site-packages/testtools/testcase.py", line 485, in assertRaises
2017-02-01 02:06:33.593110 | self.assertThat(our_callable, matcher)
2017-02-01 02:06:33.593162 | File "/opt/stack/new/tempest/.tox/tempest/local/lib/python2.7/site-packages/testtools/testcase.py", line 496, in assertThat
2017-02-01 02:06:33.593205 | mismatch_error = self._matchHelper(matchee, matcher, message, verbose)
2017-02-01 02:06:33.593266 | File "/opt/stack/new/tempest/.tox/tempest/local/lib/python2.7/site-packages/testtools/testcase.py", line 547, in _matchHelper
2017-02-01 02:06:33.593294 | mismatch = matcher.match(matchee)
2017-02-01 02:06:33.593345 | File "/opt/stack/new/tempest/.tox/tempest/local/lib/python2.7/site-packages/testtools/matchers/_exception.py", line 108, in match
2017-02-01 02:06:33.593388 | mismatch = self.exception_matcher.match(exc_info)
2017-02-01 02:06:33.593443 | File "/opt/stack/new/tempest/.tox/tempest/local/lib/python2.7/site-packages/testtools/matchers/_higherorder.py", line 62, in match
2017-02-01 02:06:33.593468 | mismatch = matcher.match(matchee)
2017-02-01 02:06:33.593515 | File "/opt/stack/new/tempest/.tox/tempest/local/lib/python2.7/site-packages/testtools/testcase.py", line 475, in match
2017-02-01 02:06:33.593544 | reraise(*matchee)
2017-02-01 02:06:33.593597 | File "/opt/stack/new/tempest/.tox/tempest/local/lib/python2.7/site-packages/testtools/matchers/_exception.py", line 101, in match
2017-02-01 02:06:33.593618 | result = matchee()
2017-02-01 02:06:33.593667 | File "/opt/stack/new/tempest/.tox/tempest/local/lib/python2.7/site-packages/testtools/testcase.py", line 1049, in __call__
2017-02-01 02:06:33.593699 | return self._callable_object(*self._args, **self._kwargs)
2017-02-01 02:06:33.593736 | File "tempest/lib/services/compute/servers_client.py", line 236, in reboot_server
2017-02-01 02:06:33.593777 | return self.action(server_id, 'reboot', **kwargs)
2017-02-01 02:06:33.593814 | File "tempest/lib/services/compute/servers_client.py", line 186, in action
2017-02-01 02:06:33.594170 | post_body)
2017-02-01 02:06:33.594219 | File "tempest/lib/common/rest_client.py", line 275, in post
2017-02-01 02:06:33.594261 | return self.request('POST', url, extra_headers, headers, body, chunked)
2017-02-01 02:06:33.594298 | File "tempest/lib/services/compute/base_compute_client.py", line 48, in request
2017-02-01 02:06:33.594328 | method, url, extra_headers, headers, body, chunked)
2017-02-01 02:06:33.594360 | File "tempest/lib/common/rest_client.py", line 663, in request
2017-02-01 02:06:33.594392 | self._error_checker(resp, resp_body)
2017-02-01 02:06:33.594449 | File "tempest/lib/common/rest_client.py", line 775, in _error_checker
2017-02-01 02:06:33.594496 | raise exceptions.Conflict(resp_body, resp=resp)
2017-02-01 02:06:33.594555 | tempest.lib.exceptions.Conflict: An object with that identifier already exists
2017-02-01 02:06:33.594641 | Details: {u'code': 409, u'message': u"Cannot 'reboot' instance d9978a80-0349-4cb8-84fa-46aa0b8eaa16 while it is in vm_state building"}

http://logs.openstack.org/91/426991/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/f218227/logs/screen-n-api.txt.gz#_2017-02-01_01_25_02_367

The test creates a server, then deletes the server and waits for it to be gone and then tries to reboot it and expects a 404 but gets a 409 because the instance is building.

We're probably hitting a window of time between when we have a build request and a real instance due to the recent change of moving instance creation from nova-api to nova-conductor.

Looks like this started on 1/27:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Cannot%20'reboot'%20instance%5C%22%20AND%20message%3A%5C%22while%20it%20is%20in%20vm_state%20building%5C%22%20AND%20tags%3A%5C%22screen-n-api.txt%5C%22&from=7d

Which is when this merged: https://review.openstack.org/#/c/319379/

Tags: cells
Matt Riedemann (mriedem)
tags: added: ocata-rc-potential
Revision history for this message
Matt Riedemann (mriedem) wrote :

Looks like it's more than just that test:

http://logs.openstack.org/48/426448/2/gate/gate-tempest-dsvm-neutron-full-ubuntu-xenial/ecb3d0a/console.html#_2017-01-31_14_47_05_827109

Details: {u'code': 409, u'message': u"Cannot 'rebuild' instance b7279ba1-1847-4942-912c-d1f148f00ca3 while it is in vm_state building"}

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/427544

Revision history for this message
Matt Riedemann (mriedem) wrote :

I have a debug patch up which shows that before we create the instance in conductor, the build request is already deleted:

http://logs.openstack.org/44/427544/2/check/gate-tempest-dsvm-neutron-multinode-full-ubuntu-xenial-nv/f50a5ee/logs/screen-n-cond.txt#_2017-02-01_04_49_11_702

Matt Riedemann (mriedem)
Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/427775

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/427544
Reason: We have patches up for the real fix now.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/427782
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=466769e588dc44d11987430b54ca1bd7188abffb
Submitter: Jenkins
Branch: master

commit 466769e588dc44d11987430b54ca1bd7188abffb
Author: Dan Smith <email address hidden>
Date: Wed Feb 1 07:58:25 2017 -0800

    Remove pre-cellsv2 short circuit in instance get

    This removes a couple short circuits around instance cell mapping,
    which were in place for pre-cellsv2 code. Now, they defeat some of
    our atomic logic around what should be done in and around deletes.
    These were TODOs to remove once people had to be on cellsv2, which
    is now the case.

    Note that the api samples tests were depending on these in order to
    work at all (which is a good demonstration of why they need to
    be removed). This patch also extends the SingleCellSimple fixture
    with a couple more mocks to make them happily take the same path
    as the rest of the code. Since the default in that fixture is
    now to make instances look mapped all the time, we also need a flag
    so the tests that check for yet-to-be-mapped instances can still
    function.

    Related-Bug: #1660878
    Change-Id: I7eb4b41fd2f33e8a63e329adbafc47d149e31552

Revision history for this message
Steve Lewis (steve-lewis) wrote :

I'm still seeing a significant failure rate in the gate due to this bug over last 8 hours. Is there a reasonable chance https://review.openstack.org/#/c/427775/ would help?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/427775
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8ba92778fe14b47ad4ff5b53022e0550a93f37d3
Submitter: Jenkins
Branch: master

commit 8ba92778fe14b47ad4ff5b53022e0550a93f37d3
Author: Matt Riedemann <email address hidden>
Date: Wed Feb 1 10:35:32 2017 -0500

    Ensure build request exists before creating instance

    When creating instances in conductor, the build requests are
    coming from the compute API and might be stale by the time
    the instance is created, i.e. the build request might have
    been deleted from the database before the instance is actually
    created in a cell.

    This is trivial to recreate; all you need to do is create a
    server and then immediately delete it, then try to perform
    some kind of action on the server expecting it to be deleted
    but the action might not return a 404 for a missing instance.
    We're seeing this in Tempest runs where the expected 404 for
    the deleted instance is a 409 because the test is trying to
    perform an action on a server while it's building, which is
    generally not allowed.

    This fixes the issue by making a last-second check to make
    sure the build request still exists before the instance is
    created in a cell.

    Change-Id: I6c32d5a4086a227d59ad7b1f6f50e7e532c74c84
    Closes-Bug: #1660878

Changed in nova:
status: In Progress → Fix Released
Matt Riedemann (mriedem)
tags: removed: ocata-rc-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 15.0.0.0rc1

This issue was fixed in the openstack/nova 15.0.0.0rc1 release candidate.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.