I did some investigation; here is what I found. IMHO it fails because of RabbitMQ issues. Let's analyze it for the instance with ID af219925-a0a1-4e1c-92e0-ff6f510d7cd1 using this run of grenade: http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/78de48c/logs/
The first live migration request for the VM happened at 03:01:15.451; the request ID is req-e0927ad1-3d21-42b8-a8ec-986f1ae67c65:
http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/78de48c/logs/new/screen-n-api.txt.gz#_2016-12-13_03_01_15_451
This request ends up with a 500 error:
http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/78de48c/logs/new/screen-n-api.txt.gz#_2016-12-13_03_02_15_537
But a few milliseconds before this error we can see another request to live-migrate the same VM for a second time (which is not what the live migration test should do):
http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/78de48c/logs/new/screen-n-api.txt.gz#_2016-12-13_03_02_15_446
Look at the time: 03:02:15.446, roughly one minute after the first request. When calling urllib3 and setting up the retry mechanism we do not override 'total' retries; we just create a Retry object:
https://github.com/openstack/tempest/blob/071e94c31ce2d1549ecb34df1fea7f5a6cd210a6/tempest/lib/common/http.py#L55
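A minimal sketch of the problem (the constructor arguments here are illustrative, not tempest's exact code): a urllib3 Retry object created without overriding 'total' keeps the constructor default, so retries remain allowed and a timed-out request can be re-sent behind tempest's back.

```python
from urllib3.util.retry import Retry

# Creating a Retry object without overriding 'total' keeps urllib3's
# constructor default (total=10), so a request that times out can be
# silently sent again.
retry = Retry(raise_on_redirect=False)  # illustrative arguments
print(retry.total)  # 10 -- retries still permitted

# To make a timeout surface as an error instead of a silent re-send,
# 'total' would have to be disabled explicitly:
no_retry = Retry(total=0)
```

With total=0 the first failure raises MaxRetryError instead of re-issuing the call, which is what you want for a non-idempotent operation like live migration.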
And here is the commit that changed tempest's behavior, so that tempest creates a PoolManager that times out after exactly 60 seconds:
https://github.com/openstack/tempest/commit/071e94c31ce2d1549ecb34df1fea7f5a6cd210a6
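The resulting behavior can be sketched like this (a sketch under assumptions, not tempest's exact wiring):

```python
import urllib3

# A PoolManager built with a 60-second timeout and a Retry object that
# still permits retries (argument names here are assumptions).
http = urllib3.PoolManager(
    timeout=urllib3.util.timeout.Timeout(60),  # give up after 60 s
    retries=urllib3.util.retry.Retry(raise_on_redirect=False),
)

# A live migration call that takes longer than 60 s to return makes
# urllib3 raise a timeout internally and, with retries still permitted,
# re-send the same request -- producing the second request in the log.
```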
So, because of this timeout behavior, and because we do not change the retry logic when creating the Retry object, urllib3 retries the live migration call; that is why we see the 409 CONFLICT error and the failing test. This also explains why we see request IDs that do not come from tempest: those requests are generated by urllib3.
The real problem here is RabbitMQ timing out some requests, and that is what we should fix. We can find the same timeouts in q-agt in gate-grenade-dsvm-neutron-multinode-ubuntu-xenial, but that gate is not vulnerable to this error:
http://logs.openstack.org/02/408002/2/check/gate-grenade-dsvm-neutron-multinode-ubuntu-xenial/0443a47/logs/subnode-2/old/screen-q-agt.txt.gz?level=TRACE