live migration failing when nova-conductor is started before other services

Bug #1598860 reported by Dave Walker
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
High
John Garbutt

Bug Description

Deploying OpenStack current master via kolla source, with kvm and ceph/rbd enabled and attempting to perform a live-migration, it becomes stuck in state migrating with this appearing in the nova-compute log.

Looking at nova git log, i'm seeing a bunch of live-migration changes in the last few days. I suspect there might have been a regression.

2016-07-04 10:00:15.643 1 INFO nova.compute.resource_tracker [req-1f6df1d8-a88a-4f27-8ce2-609ea25da4e2 - - - - -] Compute_service record updated for compute01:compute01
2016-07-04 10:00:56.731 1 ERROR root [req-a2b97b2c-f479-4b7c-b3d9-142c1ab1b25f b3bedc85b7674ce2b7e0589a7427dc81 22ab55c3ffe343639d872b1e4db6abb4 - - -] Original exception being dropped: ['Traceback (most recent call last):\n', ' File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/exception_wrapper.py", line 66, in wrapped\n return f(self, context, *args, **kw)\n', ' File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/compute/utils.py", line 608, in decorated_function\n *args, **kwargs)\n', ' File "/usr/lib64/python2.7/inspect.py", line 980, in getcallargs\n \'arguments\' if num_required > 1 else \'argument\', num_total))\n', 'TypeError: live_migration() takes exactly 7 arguments (6 given)\n']
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server [req-a2b97b2c-f479-4b7c-b3d9-142c1ab1b25f b3bedc85b7674ce2b7e0589a7427dc81 22ab55c3ffe343639d872b1e4db6abb4 - - -] Exception during message handling
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 133, in _process_incoming
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 150, in dispatch
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 121, in _do_dispatch
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/exception_wrapper.py", line 71, in wrapped
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server f, self, context, *args, **kw)
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python2.7/site-packages/nova/exception_wrapper.py", line 85, in _get_call_dict
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server context, *args, **kw)
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server File "/usr/lib64/python2.7/inspect.py", line 980, in getcallargs
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server 'arguments' if num_required > 1 else 'argument', num_total))
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server TypeError: live_migration() takes exactly 7 arguments (6 given)
2016-07-04 10:00:56.732 1 ERROR oslo_messaging.rpc.server

Revision history for this message
Daniel Berrange (berrange) wrote :

Can you get the original exception, not the re-wrapped exception, as this stack trace is useless.

Revision history for this message
Dave Walker (davewalker) wrote :

This seems to be a racey nova-conductor negotiating RPC version.

...
Fresh deployment
nova boot ..
nova live-migration ..
<wedged at migrating, with trace from above>

2016-07-04 09:40:43.989 1 INFO nova.compute.rpcapi [req-b498fc12-4264-4422-b7f6-6d8567787ac1 - - - - -] Automatically selected compute RPC version 4.0 from minimum service version 0
...

...
nova reset-state --active ..
restart nova-conductor(s)
nova live-migration ..
<success>

2016-07-04 11:47:29.647 1 INFO nova.compute.rpcapi [req-3d408658-67d2-401e-9f33-00bdd80f3c31 - - - - -] Automatically selected compute RPC version 4.13 from minimum service version 14
...

summary: - live migration failing
+ live migration failing when nova-conductor is started before other
+ services
Revision history for this message
John Garbutt (johngarbutt) wrote :

So I think this happens because the RPC version goes very low.

tags: added: live-migration
Revision history for this message
John Garbutt (johngarbutt) wrote :

I should be clearer, I think this is where we call with a really old version of live-migrate that no longer works, because nova-conductor thinks no min_service version is 0, so uses a version of RPC that is too old for the current version of Nova.

When nova-compute starts up layer, we should now have an up to date min_version, but looks like we cached the older one somewhere.

Although looking at the code, it doesn't seem like this should be possible. Need some more digging.

Changed in nova:
importance: Undecided → High
status: New → Triaged
assignee: nobody → John Garbutt (johngarbutt)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/339072

Changed in nova:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/339072
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dc6b3ab6ef59b4d3f9e540f1cc07326494cd8565
Submitter: Jenkins
Branch: master

commit dc6b3ab6ef59b4d3f9e540f1cc07326494cd8565
Author: John Garbutt <email address hidden>
Date: Thu Jul 7 16:20:27 2016 +0100

    Don't cache RPC pin when service_version is 0

    Its possible for RPC calls to be made before a nova-compute process
    starts up. When this happens we cache the RPC pin of kilo. This causes
    breakages in live-migrate, but in Mitaka we only work with Liberty.

    Change-Id: I5f1e0904c34c4fe26fc75998441602925d2f0593
    Closes-Bug: #1598860

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 14.0.0.0b3

This issue was fixed in the openstack/nova 14.0.0.0b3 development milestone.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.