test_list_servers_filter_by_error_status intermittently fails with MismatchError on no servers in response

Bug #1660747 reported by Matt Riedemann
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: melanie witt

Affects: tempest
Status: Opinion
Importance: Undecided
Assigned to: Unassigned

Bug Description

Seen here:

http://logs.openstack.org/59/424759/12/gate/gate-tempest-dsvm-cells-ubuntu-xenial/d7b1311/console.html#_2017-01-31_17_48_34_663273

2017-01-31 17:48:34.663337 | Captured traceback:
2017-01-31 17:48:34.663348 | ~~~~~~~~~~~~~~~~~~~
2017-01-31 17:48:34.663363 | Traceback (most recent call last):
2017-01-31 17:48:34.663393 | File "tempest/api/compute/admin/test_servers.py", line 59, in test_list_servers_filter_by_error_status
2017-01-31 17:48:34.663414 | self.assertIn(self.s1_id, map(lambda x: x['id'], servers))
2017-01-31 17:48:34.663448 | File "/opt/stack/new/tempest/.tox/tempest/local/lib/python2.7/site-packages/testtools/testcase.py", line 417, in assertIn
2017-01-31 17:48:34.663468 | self.assertThat(haystack, Contains(needle), message)
2017-01-31 17:48:34.663502 | File "/opt/stack/new/tempest/.tox/tempest/local/lib/python2.7/site-packages/testtools/testcase.py", line 498, in assertThat
2017-01-31 17:48:34.663515 | raise mismatch_error
2017-01-31 17:48:34.663542 | testtools.matchers._impl.MismatchError: u'108b4797-74fd-4a00-912a-b7fe0e142888' not in []

This test resets the state on a server to ERROR:

2017-01-31 17:48:34.663649 | 2017-01-31 17:28:39,375 504 INFO [tempest.lib.common.rest_client] Request (ServersAdminTestJSON:test_list_servers_filter_by_error_status): 202 POST http://10.23.154.32:8774/v2.1/servers/108b4797-74fd-4a00-912a-b7fe0e142888/action 0.142s
2017-01-31 17:48:34.663695 | 2017-01-31 17:28:39,376 504 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'X-Auth-Token': '<omitted>', 'Accept': 'application/json', 'Content-Type': 'application/json'}
2017-01-31 17:48:34.663714 | Body: {"os-resetState": {"state": "error"}}

Then the test lists servers filtered by that status and expects to get that server back:

2017-01-31 17:48:34.663883 | 2017-01-31 17:28:39,556 504 INFO [tempest.lib.common.rest_client] Request (ServersAdminTestJSON:test_list_servers_filter_by_error_status): 200 GET http://10.23.154.32:8774/v2.1/servers?status=error 0.179s
2017-01-31 17:48:34.663955 | 2017-01-31 17:28:39,556 504 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'X-Auth-Token': '<omitted>', 'Accept': 'application/json', 'Content-Type': 'application/json'}
2017-01-31 17:48:34.663969 | Body: None
2017-01-31 17:48:34.664078 | Response - Headers: {u'x-openstack-nova-api-version': '2.1', u'vary': 'X-OpenStack-Nova-API-Version', u'content-length': '15', 'status': '200', u'content-type': 'application/json', u'x-compute-request-id': 'req-91ef16ab-28c3-47c5-b823-6a321bde5c01', u'date': 'Tue, 31 Jan 2017 17:28:39 GMT', 'content-location': 'http://10.23.154.32:8774/v2.1/servers?status=error', u'openstack-api-version': 'compute 2.1', u'connection': 'close'}
2017-01-31 17:48:34.664094 | Body: {"servers": []}

And with cells v1 the list is intermittently coming back empty, so there is probably some vm_state change race between the state change in the child cell and it being reported back up to the parent API cell.
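
For reference, the failing check boils down to roughly the following sequence (a condensed sketch, not the literal test code; admin_client and s1_id are illustrative stand-ins for the actual Tempest fixtures):

    # Force the server into ERROR via os-resetState; the API returns 202
    # immediately.
    admin_client.reset_state(s1_id, state='error')

    # List servers filtered by status=error and expect the reset server
    # to show up in the result.
    body = admin_client.list_servers(status='error')
    servers = body['servers']
    assert s1_id in [srv['id'] for srv in servers]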

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Body%3A%20%7B%5C%5C%5C%22servers%5C%5C%5C%22%3A%20%5B%5D%7D%5C%22%20AND%20tags%3A%5C%22console%5C%22%20AND%20build_name%3A%5C%22gate-tempest-dsvm-cells-ubuntu-xenial%5C%22&from=7d

16 hits in 7 days, check and gate, all failures.

Tags: cells
Revision history for this message
Matt Riedemann (mriedem) wrote :

This is potentially a Tempest bug: since the os-resetState API in Nova returns a 202 response, Tempest should probably wait for the server to go into ERROR state before doing the GET query with the status filter. A sketch of what that would look like follows.
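
If the 202 were treated as asynchronous, the test could wait on the server state first, roughly like this (a sketch assuming the standard tempest.common.waiters helper; admin_client and s1_id are illustrative names):

    from tempest.common import waiters

    admin_client.reset_state(s1_id, state='error')
    # Poll until the server actually reports ERROR before relying on the
    # status filter in the list call.
    waiters.wait_for_server_status(admin_client, s1_id, 'ERROR')

    body = admin_client.list_servers(status='error')
    assert s1_id in [srv['id'] for srv in body['servers']]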

Revision history for this message
melanie witt (melwitt) wrote :

As mriedem pointed out in IRC, the reset-state API simply changes the vm_state, does an instance.save() immediately, and returns, so we shouldn't need to wait for the error state. In cells v1, the instance.save() triggers a sync to the child cell to update the vm_state. That should be fine since the API reads from the top/parent cell, but I think we have a bug in our compute/api get_all cells v2 code where we're reading instances from the child cells instead of the top/parent cell. We have conditionals for "if cells v1, read from the top", and it looks like get_all was missed.
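
Simplified, the reset-state handler the comment describes does something like the following (a rough sketch of the behavior, not the exact nova code):

    from nova.compute import vm_states

    def reset_state(instance, state='error'):
        # Set the new vm_state and clear the task state ...
        instance.vm_state = (vm_states.ERROR if state == 'error'
                             else vm_states.ACTIVE)
        instance.task_state = None
        # ... then save immediately; with cells v1 this save also triggers
        # the sync of the change down to the child cell. The API returns
        # 202 right after this, with nothing left pending.
        instance.save()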

Changed in nova:
assignee: nobody → melanie witt (melwitt)
Matt Riedemann (mriedem)
tags: added: ocata-rc-potential
removed: testing v1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/427394

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/427394
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=826df45a40599490dfa2f2209a2cd9586fbd6c89
Submitter: Jenkins
Branch: master

commit 826df45a40599490dfa2f2209a2cd9586fbd6c89
Author: melanie witt <email address hidden>
Date: Tue Jan 31 20:20:17 2017 +0000

    Read instances from API cell for cells v1

    Recent work on cells v2 has us reading instances from compute cells
    via target_cell, but that breaks the existing behavior of cells v1
    because of the syncing between API cell and compute cells. When
    state changes are effected in the API cell, there is a slight delay
    before they are reflected in the compute cell. So, reading from the
    compute cell instead of the API cell results in behavior changes.

    This adds a conditional in compute/api get_all to read instances
    from the API cell if cells v1 is enabled. We are already doing this
    for compute/api get, and get_all was missed.

    Closes-Bug: #1660747

    Change-Id: I7df5c4616ef386216c7bd7efea2be68173c61be0
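
The shape of the change is a cells v1 conditional in compute/api get_all, roughly as sketched below; _get_instances_from_cells is a hypothetical placeholder for the existing cells v2 listing path, and this is not the literal diff:

    def get_all(context, search_opts):
        if CONF.cells.enable:
            # cells v1 is enabled: read instances from the API (top) cell
            # database, matching what get() already does. Child cells sync
            # their state up to the API cell, so this view is current.
            return objects.InstanceList.get_by_filters(context, search_opts)
        # Otherwise fall through to the cells v2 path, which targets each
        # compute cell before reading instances.
        return _get_instances_from_cells(context, search_opts)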

Changed in nova:
status: In Progress → Fix Released
Matt Riedemann (mriedem)
tags: removed: ocata-rc-potential
Revision history for this message
Ken'ichi Ohmichi (oomichi) wrote :

As Matt said, HTTP 202 just means the request was "Accepted"; it doesn't mean the operation has already completed.
However, with cells v1 the API behaves as if the operation is completed already, and the Tempest test relies on that.
So I guess work on the Tempest side is unnecessary now.

Changed in tempest:
status: New → Opinion
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 15.0.0.0rc1

This issue was fixed in the openstack/nova 15.0.0.0rc1 release candidate.
