test_resize_with_reschedule_then_live_migrate intermittently failing; migration is not yet complete

Bug #1762876 reported by Matt Riedemann
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
High
Matt Riedemann

Bug Description

I noticed this in a stable/pike functional test job run:

http://logs.openstack.org/46/560146/2/check/nova-tox-functional/4a9d1fd/job-output.txt.gz#_2018-04-10_21_37_20_943583

2018-04-10 21:37:20.944928 | ubuntu-xenial | Captured stderr:
2018-04-10 21:37:20.944966 | ubuntu-xenial | ~~~~~~~~~~~~~~~~
2018-04-10 21:37:20.945029 | ubuntu-xenial | Traceback (most recent call last):
2018-04-10 21:37:20.945231 | ubuntu-xenial | File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/functional/local/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 457, in fire_timers
2018-04-10 21:37:20.945268 | ubuntu-xenial | timer()
2018-04-10 21:37:20.945467 | ubuntu-xenial | File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/functional/local/lib/python2.7/site-packages/eventlet/hubs/timer.py", line 58, in __call__
2018-04-10 21:37:20.945513 | ubuntu-xenial | cb(*args, **kw)
2018-04-10 21:37:20.945598 | ubuntu-xenial | File "nova/utils.py", line 1030, in context_wrapper
2018-04-10 21:37:20.945650 | ubuntu-xenial | func(*args, **kwargs)
2018-04-10 21:37:20.945756 | ubuntu-xenial | File "nova/compute/manager.py", line 5620, in dispatch_live_migration
2018-04-10 21:37:20.945839 | ubuntu-xenial | self._do_live_migration(*args, **kwargs)
2018-04-10 21:37:20.945939 | ubuntu-xenial | File "nova/compute/manager.py", line 5599, in _do_live_migration
2018-04-10 21:37:20.945993 | ubuntu-xenial | clean_task_state=True)
2018-04-10 21:37:20.946194 | ubuntu-xenial | File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/functional/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-04-10 21:37:20.946246 | ubuntu-xenial | self.force_reraise()
2018-04-10 21:37:20.946452 | ubuntu-xenial | File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/functional/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-04-10 21:37:20.946532 | ubuntu-xenial | six.reraise(self.type_, self.value, self.tb)
2018-04-10 21:37:20.946679 | ubuntu-xenial | File "nova/compute/manager.py", line 5588, in _do_live_migration
2018-04-10 21:37:20.946764 | ubuntu-xenial | block_migration, migrate_data)
2018-04-10 21:37:20.946856 | ubuntu-xenial | File "nova/virt/fake.py", line 497, in live_migration
2018-04-10 21:37:20.946901 | ubuntu-xenial | migrate_data)
2018-04-10 21:37:20.947003 | ubuntu-xenial | File "nova/exception_wrapper.py", line 76, in wrapped
2018-04-10 21:37:20.947069 | ubuntu-xenial | function_name, call_dict, binary)
2018-04-10 21:37:20.947270 | ubuntu-xenial | File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/functional/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-04-10 21:37:20.947322 | ubuntu-xenial | self.force_reraise()
2018-04-10 21:37:20.947536 | ubuntu-xenial | File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/functional/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-04-10 21:37:20.947619 | ubuntu-xenial | six.reraise(self.type_, self.value, self.tb)
2018-04-10 21:37:20.947707 | ubuntu-xenial | File "nova/exception_wrapper.py", line 67, in wrapped
2018-04-10 21:37:20.947779 | ubuntu-xenial | return f(self, context, *args, **kw)
2018-04-10 21:37:20.947878 | ubuntu-xenial | File "nova/compute/manager.py", line 218, in decorated_function
2018-04-10 21:37:20.947950 | ubuntu-xenial | kwargs['instance'], e, sys.exc_info())
2018-04-10 21:37:20.948160 | ubuntu-xenial | File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/functional/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-04-10 21:37:20.948214 | ubuntu-xenial | self.force_reraise()
2018-04-10 21:37:20.948433 | ubuntu-xenial | File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/functional/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-04-10 21:37:20.948510 | ubuntu-xenial | six.reraise(self.type_, self.value, self.tb)
2018-04-10 21:37:20.948607 | ubuntu-xenial | File "nova/compute/manager.py", line 206, in decorated_function
2018-04-10 21:37:20.948687 | ubuntu-xenial | return function(self, context, *args, **kwargs)
2018-04-10 21:37:20.948802 | ubuntu-xenial | File "nova/compute/manager.py", line 5824, in _post_live_migration
2018-04-10 21:37:20.948856 | ubuntu-xenial | instance, source_node)
2018-04-10 21:37:20.949000 | ubuntu-xenial | File "nova/compute/resource_tracker.py", line 1294, in delete_allocation_for_migrated_instance
2018-04-10 21:37:20.949111 | ubuntu-xenial | self._delete_allocation_for_moved_instance(instance, node, 'migrated')
2018-04-10 21:37:20.949244 | ubuntu-xenial | File "nova/compute/resource_tracker.py", line 1302, in _delete_allocation_for_moved_instance
2018-04-10 21:37:20.949319 | ubuntu-xenial | cn_uuid = self.compute_nodes[node].uuid
2018-04-10 21:37:20.949363 | ubuntu-xenial | KeyError: u'host3'

But I see it's also happening on master:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22cn_uuid%20%3D%20self.compute_nodes%5Bnode%5D.uuid%5C%22%20AND%20tags%3A%5C%22console%5C%22&from=7d

Revision history for this message
Matt Riedemann (mriedem) wrote :

Seeing it here too on master but without a KeyError:

http://logs.openstack.org/10/557810/1/gate/nova-tox-functional/8032a6b/testr_results.html.gz

Traceback (most recent call last):
  File "nova/tests/functional/regressions/test_bug_1718512.py", line 150, in test_resize_with_reschedule_then_live_migrate
    self.assertEqual('completed', migrations[0]['status'])
  File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/functional/local/lib/python2.7/site-packages/testtools/testcase.py", line 411, in assertEqual
    self.assertThat(observed, matcher, message)
  File "/home/zuul/src/git.openstack.org/openstack/nova/.tox/functional/local/lib/python2.7/site-packages/testtools/testcase.py", line 498, in assertThat
    raise mismatch_error
testtools.matchers._impl.MismatchError: 'completed' != u'running'

Revision history for this message
Matt Riedemann (mriedem) wrote :
Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
importance: Medium → High
summary: - KeyError during move operation functional tests
+ test_resize_with_reschedule_then_live_migrate intermittently failing;
+ migration is not yet complete
Revision history for this message
Matt Riedemann (mriedem) wrote :

The race is that post_live_migration_at_destination sets the instance task_state to None which effectively puts it back into ACTIVE status before the migration status marked 'completed' on the source node in _post_live_migration, hence the race. I seem to remember a similar race in another live migration functional test where gibi added a waiter for a notification before checking the migration status.

Revision history for this message
Matt Riedemann (mriedem) wrote :

We can't rely on the post live migration end notification for fixing this, see bug 1763051.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/560454

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/560454
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4afa8c2b97929faaf8bbf2bbaa72235fab7d3d13
Submitter: Zuul
Branch: master

commit 4afa8c2b97929faaf8bbf2bbaa72235fab7d3d13
Author: Matt Riedemann <email address hidden>
Date: Wed Apr 11 10:43:34 2018 -0400

    Fix race fail in test_resize_with_reschedule_then_live_migrate

    The assertion in the test that the migration status is 'completed'
    is flawed in that it assumes when the instance status is 'ACTIVE'
    the migration is completed, which isn't True, since the instance
    status gets changed before the migration is completed, but they are
    very close in time so there is a race, which is how this test slipped
    by. This fixes the issue by polling the migration status until it
    is actually completed or we timeout.

    Change-Id: I61f745667f4c003d7e3ca6f2f9a99194930ac892
    Closes-Bug: #1762876

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/560955

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/ocata)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: stable/ocata
Review: https://review.openstack.org/560955
Reason: rebase squash failure

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.0.0.0b1

This issue was fixed in the openstack/nova 18.0.0.0b1 development milestone.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.