Comment 7 for bug 1649586

Revision history for this message
Pawel Koniszewski (pawel-koniszewski) wrote :

I did some investigation, here is what I found. IMHO it fails because of rabbit issues. Let's analyze it on an instance with ID af219925-a0a1-4e1c-92e0-ff6f510d7cd1 using this run of grenade http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/78de48c/logs/

The first request of a VM happened at 03:01:15.451, the request ID is req-e0927ad1-3d21-42b8-a8ec-986f1ae67c65

http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/78de48c/logs/new/screen-n-api.txt.gz#_2016-12-13_03_01_15_451

This request ends up with a 500 error

http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/78de48c/logs/new/screen-n-api.txt.gz#_2016-12-13_03_02_15_537

But few milliseconds before this error we can see another request to live migrate the same VM for the second time (which is not what live migration test should do):

http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/78de48c/logs/new/screen-n-api.txt.gz#_2016-12-13_03_02_15_446

Look at the time - 03:02:15.446, it's more or less 1 minute after the first request. When calling urllib3 and creating retry mechanism we do not override 'total retries', we just create Retry object:

https://github.com/openstack/tempest/blob/071e94c31ce2d1549ecb34df1fea7f5a6cd210a6/tempest/lib/common/http.py#L55

And here is the commit that changed tempest behavior, so that tempest creates PoolManager that timeouts exactly after 60 seconds:

https://github.com/openstack/tempest/commit/071e94c31ce2d1549ecb34df1fea7f5a6cd210a6

So due to this timeout behavior and that we do not change retry logic while creating Retry object, urllib3 retries the live migration call, therefore we can see 409 CONFLICT error and failing test. This is also why we can see request ids that do not come from tempest, as those are requests generated by urllib3.

The real problem there is Rabbit and that it timeouts some requests and this is what we should fix. We can find the same timeouts in q-agt in gate-grenade-dsvm-neutron-multinode-ubuntu-xenial, but this gate is not vulnerable to this error:

http://logs.openstack.org/02/408002/2/check/gate-grenade-dsvm-neutron-multinode-ubuntu-xenial/0443a47/logs/subnode-2/old/screen-q-agt.txt.gz?level=TRACE