leaked threads run after test case teardown causing intermittent functional test failures

Bug #1946339 reported by Lee Yarwood
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: Medium
Assigned to: Balazs Gibizer

Bug Description

https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_4d3/812392/2/check/nova-tox-functional-py38/4d32a96/testr_results.html

2021-10-07 00:06:06,980 ERROR [nova.compute.manager] Instance failed to spawn
Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 6651, in _unshelve_instance
    update(
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/utils.py", line 1531, in update_pci_request_spec_with_allocated_interface_name
    raise ex(
nova.exception.UnexpectedResourceProviderNameForPCIRequest: Resource provider 125ffd8a-3163-425a-b397-f915c602b3eb used to allocate resources for the pci request 7059503b-a648-40fd-a561-5ca769304bee does not have a properly formatted name. Expected name format is <hostname>:<agentname>:<interfacename>, but got invalid-device-rp-name
2021-10-07 00:06:06,994 INFO [placement.requestlog] 127.0.0.1 "DELETE /placement/allocations/718d8d5b-8d34-4c11-9080-1952cc824086" status: 204 len: 0 microversion: 1.0
2021-10-07 00:06:06,995 INFO [nova.scheduler.client.report] Deleted allocations for instance 718d8d5b-8d34-4c11-9080-1952cc824086
2021-10-07 00:06:07,029 INFO [nova.compute.manager] Successfully reverted task state from spawning on failure for instance.
2021-10-07 00:06:07,052 ERROR [oslo_messaging.rpc.server] Exception during message handling
Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
    res = self.dispatcher.dispatch(message)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
    return self._do_dispatch(endpoint, method, ctxt, args)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
    result = func(ctxt, **new_args)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/exception_wrapper.py", line 71, in wrapped
    _emit_versioned_exception_notification(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/home/zuul/src/opendev.org/openstack/nova/nova/exception_wrapper.py", line 63, in wrapped
    return f(self, context, *args, **kw)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 183, in decorated_function
    LOG.warning("Failed to revert task state for instance. "
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 154, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/utils.py", line 1433, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 211, in decorated_function
    compute_utils.add_instance_fault_from_exc(context,
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 200, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 6595, in unshelve_instance
    do_unshelve_instance()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_concurrency/lockutils.py", line 360, in inner
    return f(*args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 6592, in do_unshelve_instance
    self._unshelve_instance(
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 6697, in _unshelve_instance
    self._nil_out_instance_obj_host_and_node(instance)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 6651, in _unshelve_instance
    update(
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/utils.py", line 1531, in update_pci_request_spec_with_allocated_interface_name
    raise ex(
nova.exception.UnexpectedResourceProviderNameForPCIRequest: Resource provider 125ffd8a-3163-425a-b397-f915c602b3eb used to allocate resources for the pci request 7059503b-a648-40fd-a561-5ca769304bee does not have a properly formatted name. Expected name format is <hostname>:<agentname>:<interfacename>, but got invalid-device-rp-name
2021-10-07 00:06:07,183 INFO [oslo_db.sqlalchemy.provision] DROP ALL OBJECTS, BACKEND sqlite://
2021-10-07 00:06:07,201 INFO [oslo_db.sqlalchemy.provision] DISPOSE ENGINE Engine(sqlite://)
}}}

stderr: {{{
Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/_drivers/impl_fake.py", line 207, in _send
    reply, failure = reply_q.get(timeout=timeout)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/eventlet/queue.py", line 322, in get
    return waiter.wait()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/eventlet/queue.py", line 141, in wait
    return get_hub().switch()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 313, in switch
    return self.greenlet.switch()
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 200, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 2263, in _do_build_and_run_instance
    self.compute_task_api.build_instances(context, [instance],
  File "/home/zuul/src/opendev.org/openstack/nova/nova/conductor/api.py", line 119, in build_instances
    self.conductor_compute_rpcapi.build_instances(context,
  File "/home/zuul/src/opendev.org/openstack/nova/nova/conductor/rpcapi.py", line 390, in build_instances
    cctxt.cast(context, 'build_instances', **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/fixtures/_fixtures/monkeypatch.py", line 89, in avoid_get
    return captured_method(*args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/rpc/client.py", line 189, in call
    result = self.transport._send(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/transport.py", line 123, in _send
    return self._driver.send(target, ctxt, message,
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/_drivers/impl_fake.py", line 222, in send
    return self._send(target, ctxt, message, wait_for_reply, timeout,
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/_drivers/impl_fake.py", line 213, in _send
    raise oslo_messaging.MessagingTimeout(
oslo_messaging.exceptions.MessagingTimeout: No reply on topic conductor

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1799, in _execute_context
    self.dialect.do_execute(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 717, in do_execute
    cursor.execute(statement, parameters)
sqlite3.OperationalError: no such table: instance_faults

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/utils.py", line 1433, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 211, in decorated_function
    compute_utils.add_instance_fault_from_exc(context,
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/utils.py", line 153, in add_instance_fault_from_exc
    fault_obj.create()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_versionedobjects/base.py", line 226, in wrapper
    return fn(self, *args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/objects/instance_fault.py", line 76, in create
    db_fault = db.instance_fault_create(self._context, values)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/db/main/api.py", line 190, in wrapper
    return f(context, *args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/db/main/api.py", line 3881, in instance_fault_create
    fault_ref.save(context.session)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_db/sqlalchemy/models.py", line 48, in save
    session.flush()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 3339, in flush
    self._flush(objects)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 3479, in _flush
    transaction.rollback(_capture_exception=True)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 3439, in _flush
    flush_context.execute()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 456, in execute
    rec.execute(self)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 630, in execute
    util.preloaded.orm_persistence.save_obj(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 242, in save_obj
    _emit_insert_statements(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 1219, in _emit_insert_statements
    result = connection._execute_20(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1611, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 323, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1478, in _execute_clauseelement
    ret = self._execute_context(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1842, in _execute_context
    self._handle_dbapi_exception(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2021, in _handle_dbapi_exception
    util.raise_(newraise, with_traceback=exc_info[2], from_=e)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1799, in _execute_context
    self.dialect.do_execute(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 717, in do_execute
    cursor.execute(statement, parameters)
oslo_db.exception.DBNonExistentTable: (sqlite3.OperationalError) no such table: instance_faults
[SQL: INSERT INTO instance_faults (created_at, updated_at, deleted_at, deleted, instance_uuid, code, message, details, host) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)]
[parameters: ('2021-10-07 00:06:06.747150', None, None, 0, '5903406a-6fff-4179-99bf-8bab9c33f0cb', 500, 'MessagingTimeout', 'Traceback (most recent call last):\n File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging ... (2344 characters truncated) ... l_fake.py", line 213, in _send\n raise oslo_messaging.MessagingTimeout(\noslo_messaging.exceptions.MessagingTimeout: No reply on topic conductor\n', 'host2')]
(Background on this error at: https://sqlalche.me/e/14/e3q8)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1799, in _execute_context
    self.dialect.do_execute(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 717, in do_execute
    cursor.execute(statement, parameters)
sqlite3.OperationalError: no such table: instance_actions

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 476, in fire_timers
    timer()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/eventlet/hubs/timer.py", line 59, in __call__
    cb(*args, **kw)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/utils.py", line 680, in context_wrapper
    func(*args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_concurrency/lockutils.py", line 360, in inner
    return f(*args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 2115, in _locked_do_build_and_run_instance
    result = self._do_build_and_run_instance(*args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/exception_wrapper.py", line 71, in wrapped
    _emit_versioned_exception_notification(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/home/zuul/src/opendev.org/openstack/nova/nova/exception_wrapper.py", line 63, in wrapped
    return f(self, context, *args, **kw)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 183, in decorated_function
    LOG.warning("Failed to revert task state for instance. "
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 154, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/utils.py", line 1433, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/utils.py", line 1394, in __exit__
    objects.InstanceActionEvent.event_finish_with_failure(
  File "/home/zuul/src/opendev.org/openstack/nova/nova/objects/base.py", line 355, in wrapper
    return fn.__get__(None, obj)(*args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_versionedobjects/base.py", line 184, in wrapper
    result = fn(cls, context, *args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/objects/instance_action.py", line 227, in event_finish_with_failure
    db_event = db.action_event_finish(context, values)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_db/api.py", line 154, in wrapper
    ectxt.value = e.inner_exc
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_db/api.py", line 142, in wrapper
    return f(*args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/db/main/api.py", line 190, in wrapper
    return f(context, *args, **kwargs)
  File "/home/zuul/src/opendev.org/openstack/nova/nova/db/main/api.py", line 4078, in action_event_finish
    action = _action_get_by_request_id(context, values['instance_uuid'],
  File "/home/zuul/src/opendev.org/openstack/nova/nova/db/main/api.py", line 4014, in _action_get_by_request_id
    result = model_query(context, models.InstanceAction).\
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 2762, in first
    return self.limit(1)._iter().first()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 2846, in _iter
    result = self.session.execute(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 1689, in execute
    result = conn._execute_20(statement, params or {}, execution_options)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1611, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 323, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1478, in _execute_clauseelement
    ret = self._execute_context(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1842, in _execute_context
    self._handle_dbapi_exception(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2021, in _handle_dbapi_exception
    util.raise_(newraise, with_traceback=exc_info[2], from_=e)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1799, in _execute_context
    self.dialect.do_execute(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 717, in do_execute
    cursor.execute(statement, parameters)
oslo_db.exception.DBNonExistentTable: (sqlite3.OperationalError) no such table: instance_actions
[SQL: SELECT instance_actions.created_at AS instance_actions_created_at, instance_actions.updated_at AS instance_actions_updated_at, instance_actions.deleted_at AS instance_actions_deleted_at, instance_actions.deleted AS instance_actions_deleted, instance_actions.id AS instance_actions_id, instance_actions.action AS instance_actions_action, instance_actions.instance_uuid AS instance_actions_instance_uuid, instance_actions.request_id AS instance_actions_request_id, instance_actions.user_id AS instance_actions_user_id, instance_actions.project_id AS instance_actions_project_id, instance_actions.start_time AS instance_actions_start_time, instance_actions.finish_time AS instance_actions_finish_time, instance_actions.message AS instance_actions_message
FROM instance_actions
WHERE instance_actions.deleted = ? AND instance_actions.instance_uuid = ? AND instance_actions.request_id = ? ORDER BY instance_actions.created_at DESC, instance_actions.id DESC
 LIMIT ? OFFSET ?]
[parameters: (0, '5903406a-6fff-4179-99bf-8bab9c33f0cb', 'req-14ab570f-ce9a-4a2e-a753-db18915fbe14', 1, 0)]
(Background on this error at: https://sqlalche.me/e/14/e3q8)
}}}

Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/nova/nova/tests/functional/test_servers_resource_request.py", line 2491, in test_unshelve_offloaded_server_with_qos_port_pci_update_fails
    self.assertEqual(
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/testtools/testcase.py", line 393, in assertEqual
    self.assertThat(observed, matcher, message)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/testtools/testcase.py", line 480, in assertThat
    raise mismatch_error
testtools.matchers._impl.MismatchError: 'UnexpectedResourceProviderNameForPCIRequest' != 'DBNonExistentTable'

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

It seems this happens pretty frequently on the gate [1]. I ran the whole functional suite locally for days with random ordering but was not able to reproduce it, so it is really hard to figure out what happens.

The first exception, nova.exception.UnexpectedResourceProviderNameForPCIRequest, is expected; it is part of the test case. The logs then show that we reverted the allocation in placement as the VM creation failed due to the first exception.

Then I see the following in the stack trace:

LOG.warning("Failed to revert task state for instance.

So we end up at [2]. Then the compute thread seems to end, and the parallel conductor thread times out in self.conductor_compute_rpcapi.build_instances:

File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/_drivers/impl_fake.py", line 213, in _send
    raise oslo_messaging.MessagingTimeout(
oslo_messaging.exceptions.MessagingTimeout: No reply on topic conductor

But it seems that at that point the DB had already been deleted by the test env, leading to errors like:

sqlite3.OperationalError: no such table: instance_faults

I don't see how this can be timing related, but other than some kind of race condition between the compute and the conductor I have no other idea.

[1] https://paste.opendev.org/show/809926/
[2] https://github.com/openstack/nova/blob/fdfdba265833d237e22676f9a223ab8ca0fe1e03/nova/compute/manager.py#L183

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

On the timing:
The successful runs on the gate take 3-7 seconds, the failed runs take 4-6 seconds, so there is no big timing difference.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote (last edit ):

There is a reschedule in the stack trace [1]:

File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 2263, in _do_build_and_run_instance

That does not happen in a successful test case run, so I have to figure out why and how we hit the reschedule.

[1] https://github.com/openstack/nova/blob/fdfdba265833d237e22676f9a223ab8ca0fe1e03/nova/compute/manager.py#L2263-L2268

Revision history for this message
sean mooney (sean-k-mooney) wrote :

Looking at the logs, we booted on host1, then shelved, and then shelve offloaded.

2021-10-08 01:20:33.107782 | ubuntu-focal | 2021-10-08 01:20:30,613 INFO [nova.scheduler.host_manager] Host filter only checking host host1 and node host1
2021-10-08 01:20:33.107801 | ubuntu-focal | 2021-10-08 01:20:30,628 INFO [placement.requestlog] 127.0.0.1 "GET /placement/allocations/06aab374-ae34-4223-b31d-ba1558efa2ac" status: 200 len: 19 microversion: 1.28
2021-10-08 01:20:33.107819 | ubuntu-focal | 2021-10-08 01:20:30,676 INFO [placement.requestlog] 127.0.0.1 "PUT /placement/allocations/06aab374-ae34-4223-b31d-ba1558efa2ac" status: 204 len: 0 microversion: 1.36
2021-10-08 01:20:33.107837 | ubuntu-focal | 2021-10-08 01:20:30,873 INFO [placement.requestlog] 127.0.0.1 "GET /placement/resource_providers/e11c9de0-4624-49ef-835c-a56da97a5f68" status: 200 len: 890 microversion: 1.14
2021-10-08 01:20:33.107855 | ubuntu-focal | 2021-10-08 01:20:30,881 INFO [placement.requestlog] 127.0.0.1 "GET /placement/allocations/06aab374-ae34-4223-b31d-ba1558efa2ac" status: 200 len: 644 microversion: 1.28
2021-10-08 01:20:33.107886 | ubuntu-focal | 2021-10-08 01:20:30,885 INFO [nova.compute.claims] Claim successful on node host1
2021-10-08 01:20:33.107905 | ubuntu-focal | 2021-10-08 01:20:31,138 INFO [nova.api.openstack.requestlog] 127.0.0.1 "GET /v2.1/servers/06aab374-ae34-4223-b31d-ba1558efa2ac" status: 200 len: 1942 microversion: 2.90 time: 0.146510
2021-10-08 01:20:33.107924 | ubuntu-focal | 2021-10-08 01:20:31,145 INFO [nova.compute.manager] Took 0.00 seconds to spawn the instance on the hypervisor.
2021-10-08 01:20:33.107942 | ubuntu-focal | 2021-10-08 01:20:31,187 INFO [nova.compute.manager] Took 0.33 seconds to build instance.
2021-10-08 01:20:33.107960 | ubuntu-focal | 2021-10-08 01:20:31,231 INFO [nova.scheduler.host_manager] Received an update from an unknown host 'host1'. Re-created its InstanceList.
2021-10-08 01:20:33.107979 | ubuntu-focal | 2021-10-08 01:20:31,677 INFO [nova.api.openstack.requestlog] 127.0.0.1 "GET /v2.1/servers/06aab374-ae34-4223-b31d-ba1558efa2ac" status: 200 len: 2436 microversion: 2.90 time: 0.033292
2021-10-08 01:20:33.107997 | ubuntu-focal | 2021-10-08 01:20:31,688 INFO [placement.requestlog] 127.0.0.1 "GET /placement/allocations/06aab374-ae34-4223-b31d-ba1558efa2ac" status: 200 len: 548 microversion: 1.0
2021-10-08 01:20:33.108015 | ubuntu-focal | 2021-10-08 01:20:31,789 INFO [nova.api.openstack.requestlog] 127.0.0.1 "POST /v2.1/servers/06aab374-ae34-4223-b31d-ba1558efa2ac/action" status: 202 len: 0 microversion: 2.90 time: 0.097403
2021-10-08 01:20:33.108033 | ubuntu-focal | 2021-10-08 01:20:31,825 INFO [nova.compute.manager] Shelving
2021-10-08 01:20:33.108060 | ubuntu-focal | 2021-10-08 01:20:31,828 INFO [nova.api.openstack.requestlog] 127.0.0.1 "GET /v2.1/servers/06aab374-ae34-4223-b31d-ba1558efa2ac" status: 200 len: 2442 microversion: 2.90 time: 0.034799
2021-10-08 01:20:33.108080 | ubuntu-focal | 2021-10-08 01:20:31,885 INFO [nova.compute.manager] Shelve offloading
2021-10-08 01:20:33.108098 | ubuntu-focal | 2021-10-08 01:20:31,911 INFO [placement.requestlog] 127.0.0.1 "DELETE /placement/allocations/06aab374-ae34-4223-b31d-ba1558efa...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/813674

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote (last edit ): Re: test_unshelve_offloaded_server_with_qos_port_pci_update_fails

I managed to create a stable reproduction locally. \o/

1) Duplicate the failing test:

diff --git a/nova/tests/functional/test_servers_resource_request.py b/nova/tests/functional/test_servers_resource_request.py
index e2746b3669..d678067c18 100644
--- a/nova/tests/functional/test_servers_resource_request.py
+++ b/nova/tests/functional/test_servers_resource_request.py
@@ -2572,6 +2572,18 @@ class ServerMoveWithPortResourceRequestTest(
         self._delete_server_and_check_allocations(
             server, qos_normal_port, qos_sriov_port)

+    def test_unshelve_offloaded_server_with_qos_port_pci_update_fails0(self):
+        self.test_unshelve_offloaded_server_with_qos_port_pci_update_fails()
+
+    def test_unshelve_offloaded_server_with_qos_port_pci_update_fails1(self):
+        self.test_unshelve_offloaded_server_with_qos_port_pci_update_fails()
+
+    def test_unshelve_offloaded_server_with_qos_port_pci_update_fails2(self):
+        self.test_unshelve_offloaded_server_with_qos_port_pci_update_fails()
+
+    def test_unshelve_offloaded_server_with_qos_port_pci_update_fails3(self):
+        self.test_unshelve_offloaded_server_with_qos_port_pci_update_fails()
+

2) Run: tox -e functional-py38 "test_description_errors|test_unshelve_offloaded_server_with_qos_port_pci_update_fails" -- --serial

This makes one of the test_unshelve_offloaded_server_with_qos_port_pci_update_fails test cases fail with the reported error.

What I know so far:
* nova.tests.functional.test_servers.ServersTestV219.test_description_errors() starts a new instance but does not wait for it to become ACTIVE. The test case passes and finishes.

* But the build_and_run_instance RPC call still runs in the compute service in a greenlet, building up the instance. The service.kill at the end of the test case does not kill the running / waiting greenlets. I proved this by dumping a GMR at the end of the test run, after service.kill was called by the Fixture cleanup: the build_and_run_instance greenlet was still visible there (a rough way to list such leaked greenlets is sketched after this list).

* Still, other fixture cleanups drop the database out from under the compute service, which leads to a ComputeNodeNotFound error. You can see this by simply adding a self.fail() at the end of the test_description_errors test case and running it; there will be ComputeNodeNotFound in the logs.

* After test_description_errors passes, the test executor runs the next test cases.

* After ~60 seconds an RPC timeout happens in the original greenlet from test_description_errors() and it *somehow* interferes with the currently running test, making it fail.

What I don't know yet is the mechanism of the interference.
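
A rough way to confirm such leaks (not the GMR dump used above, just a hypothetical helper that walks the garbage collector for live greenlets) is sketched below:

import gc
import traceback

import greenlet


def dump_live_greenlets():
    # List greenlets that are still alive, e.g. at the end of a test run.
    for obj in gc.get_objects():
        if isinstance(obj, greenlet.greenlet) and not obj.dead:
            print('--- live greenlet: %r ---' % obj)
            if obj.gr_frame is not None:
                # Show where the leaked greenlet is currently blocked,
                # e.g. waiting on an RPC reply queue.
                traceback.print_stack(obj.gr_frame)

Calling dump_live_greenlets() from a test cleanup would show any leaked greenlets (the hub and the main greenlet will also show up).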

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I now know which global facilitates the effect between the two test cases.

When a versioned notification is emitted, the code in [1] calls nova.rpc.get_versioned_notifier(), which uses the global NOTIFIER [2] from the rpc module. This global is properly reset between tests, but if the first test case only reaches [1] 60 seconds after that test case has finished, another test case will already have re-initialized NOTIFIER with a valid FakeVersionedNotifier that can be used to deliver notifications. In that case the first test case delivers a notification to the later test case, making it fail.

[1] https://github.com/openstack/nova/blob/50fdbc752a9ca9c31488140ef2997ed59d861a41/nova/notifications/objects/base.py#L211
[2] https://github.com/openstack/nova/blob/50fdbc752a9ca9c31488140ef2997ed59d861a41/nova/rpc.py#L239
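
A minimal, self-contained sketch of that interaction (illustrative names only, not nova's actual code):

import eventlet


class FakeNotifier:
    def __init__(self, owner):
        self.owner = owner
        self.received = []

    def notify(self, payload):
        self.received.append(payload)


class fake_rpc:
    NOTIFIER = None  # stands in for the module-level global nova.rpc.NOTIFIER


def run_test_a():
    fake_rpc.NOTIFIER = FakeNotifier(owner='test_a')

    def late_worker():
        eventlet.sleep(0.1)  # stands in for the 60 second RPC timeout
        # Uses whatever notifier happens to be the current global.
        fake_rpc.NOTIFIER.notify('error notification from test_a')

    eventlet.spawn(late_worker)
    # test_a finishes here without waiting for late_worker.


def run_test_b():
    # The fixture of the next test re-initializes the global.
    fake_rpc.NOTIFIER = FakeNotifier(owner='test_b')
    eventlet.sleep(0.2)  # test_b is still running when late_worker fires
    # test_b now sees a notification it never triggered:
    print(fake_rpc.NOTIFIER.received)  # ['error notification from test_a']


run_test_a()
run_test_b()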

Changed in nova:
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/814036

Changed in nova:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by "Balazs Gibizer <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/813674
Reason: fix is up in https://review.opendev.org/c/openstack/nova/+/814036

Changed in nova:
assignee: nobody → Balazs Gibizer (balazs-gibizer)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/814036
Committed: https://opendev.org/openstack/nova/commit/61fc81a6761d34afdfc4a6d1c4c953802fd8a179
Submitter: "Zuul (22348)"
Branch: master

commit 61fc81a6761d34afdfc4a6d1c4c953802fd8a179
Author: Balazs Gibizer <email address hidden>
Date: Thu Oct 14 18:09:18 2021 +0200

    Prevent leaked eventlets to send notifications

    In our functional tests we run nova services as eventlets. Those
    services can also spawn their own eventlets for RPC or other parallel
    processing. The test case executor only sees and tracks the main
    eventlet where the code of the test case is running. When that
    finishes, the test executor considers the test case to be finished
    regardless of the other spawned eventlets. This can leave leaked
    eventlets running in parallel with later test cases.

    One way that this can cause trouble is via the global variables in the
    nova.rpc module. Those globals are re-initialized for each test case, so
    they do not directly leak information between test cases. However, if
    a late eventlet calls nova.rpc.get_versioned_notifier() it will get a
    perfectly usable FakeVersionedNotifier object regardless of which test
    case the notifier belongs to or which test case the eventlet belongs
    to. This way the late eventlet can send a notification to the currently
    running test case and therefore make it fail.

    The current case we saw is the following:

    1) The test case
      nova.tests.functional.test_servers.ServersTestV219.test_description_errors
      creates a server but does not wait for it to reach a terminal state (ACTIVE
      / ERROR). This test case finishes quickly but leaks running eventlets
      in the background waiting for some RPC call to return.
    2) As the test case finished the cleanup code deletes the test case
       specific setup, including the DB.
    3) The test executor moves forward and starts running another test case
    4) 60 seconds later the leaked eventlet times out waiting for the RPC
       call to return and tries doing things, but fails as the DB is already
       gone. Then it tries to report this as an error notification. It calls
       nova.rpc.get_versioned_notifier() and gets a fresh notifier that is
       connected to the currently running test case. Then emits the error
       notification there.
    5) The currently running test case also waits for an error notification
       to be triggered by the currently running test code. But it gets the
       notification from the late eventlet first. As the content of the
       notification does not match the expectations, the currently
       running test case fails. The late eventlet also prints a lot of
       errors about the DB being gone, making troubleshooting pretty hard.

    This patch proposes a way to fix this by marking each eventlet at spawn
    time with the id of the test case that directly or indirectly
    started it.

    Then when the NotificationFixture gets a notification it compares the
    test case id stored in the calling eventlet with the id of the test case
    in...

Read more...
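
A rough sketch of that approach, with purely illustrative names (the real implementation is in the review linked above):

import eventlet
from eventlet import corolocal

# Greenthread-local storage holding the id of the owning test case.
_test_id_store = corolocal.local()


def spawn_tagged(func, *args, **kwargs):
    # Spawn a greenthread that inherits the caller's test case id.
    caller_test_id = getattr(_test_id_store, 'test_id', None)

    def _inner():
        _test_id_store.test_id = caller_test_id
        return func(*args, **kwargs)

    return eventlet.spawn(_inner)


class NotificationCollector:
    # Fixture-like collector that ignores notifications coming from
    # greenthreads that belong to a different (already finished) test.

    def __init__(self, test_id):
        self.test_id = test_id
        self.notifications = []
        _test_id_store.test_id = test_id  # mark the test's main greenthread

    def notify(self, payload):
        sender = getattr(_test_id_store, 'test_id', None)
        if sender != self.test_id:
            return  # a leaked greenthread from a previous test; drop it
        self.notifications.append(payload)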

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/nova/+/816487

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/nova/+/816487
Committed: https://opendev.org/openstack/nova/commit/f24e0c1da2c099609388555091d987b8807b5f0d
Submitter: "Zuul (22348)"
Branch: stable/xena

commit f24e0c1da2c099609388555091d987b8807b5f0d
Author: Balazs Gibizer <email address hidden>
Date: Thu Oct 14 18:09:18 2021 +0200

    Prevent leaked eventlets to send notifications

    (commit message identical to the master commit above)

tags: added: in-stable-xena
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 24.1.0

This issue was fixed in the openstack/nova 24.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 25.0.0.0rc1

This issue was fixed in the openstack/nova 25.0.0.0rc1 release candidate.

Revision history for this message
melanie witt (melwitt) wrote : Re: test_unshelve_offloaded_server_with_qos_port_pci_update_fails

I'm not totally sure but I think I might have seen this again today in the nova-tox-functional-py38 job[1]:

2022-03-24 05:08:04,163 INFO [nova.api.openstack.requestlog] 127.0.0.1 "GET /v2.1/servers/0606b03e-f05f-407c-9fd8-33a8d673e2f4" status: 200 len: 2147 microversion: 2.90 time: 0.028787
2022-03-24 05:08:04,193 ERROR [root] Original exception being dropped: ['Traceback (most recent call last):\n', ' File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/_drivers/impl_fake.py", line 207, in _send\n reply, failure = reply_q.get(timeout=timeout)\n', ' File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/eventlet/queue.py", line 322, in get\n return waiter.wait()\n', ' File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/eventlet/queue.py", line 141, in wait\n return get_hub().switch()\n', ' File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 313, in switch\n return self.greenlet.switch()\n', '_queue.Empty\n', '\nDuring handling of the above exception, another exception occurred:\n\n', 'Traceback (most recent call last):\n', ' File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 201, in decorated_function\n return function(self, context, *args, **kwargs)\n', ' File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 8969, in _post_live_migration\n self._update_scheduler_instance_info(ctxt, instance)\n', ' File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 2109, in _update_scheduler_instance_info\n self.query_client.update_instance_info(context, self.host,\n', ' File "/home/zuul/src/opendev.org/openstack/nova/nova/scheduler/client/query.py", line 69, in update_instance_info\n self.scheduler_rpcapi.update_instance_info(context, host_name,\n', ' File "/home/zuul/src/opendev.org/openstack/nova/nova/scheduler/rpcapi.py", line 174, in update_instance_info\n return cctxt.cast(ctxt, \'update_instance_info\', host_name=host_name,\n', ' File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/fixtures/_fixtures/monkeypatch.py", line 89, in avoid_get\n return captured_method(*args, **kwargs)\n', ' File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/rpc/client.py", line 189, in call\n result = self.transport._send(\n', ' File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/transport.py", line 123, in _send\n return self._driver.send(target, ctxt, message,\n', ' File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/_drivers/impl_fake.py", line 222, in send\n return self._send(target, ctxt, message, wait_for_reply, timeout,\n', ' File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py38/lib/python3.8/site-packages/oslo_messaging/_drivers/impl_fake.py", line 213, in _send\n raise oslo_messa...

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote (last edit ):

@Melanie: nice catch. It is similar: it seems a long(er)-running thread, the _live_migration_monitor, keeps running after the actual test that started it has already finished. Then when the monitor wakes up and tries to record an instance fault, it fails because the test cleanup has already removed the DB under it. At least that much I was able to decode from the stack trace so far. I will keep digging...

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I managed to reproduce the issue once (in ~20 runs) locally with some extra print statements but no breakthrough so far.

[1] run log https://paste.ubuntu.com/p/rf3mpgXhtc/
[2] test case list used in the run https://paste.opendev.org/show/bITh68wTY3XNq3jdLy3h/

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

One more reproduction: https://paste.ubuntu.com/p/svZTwCWT25/. It shows that the test_case_id propagation does not work in this case, so the original fix is not effective here: https://review.opendev.org/c/openstack/nova/+/814036/1/nova/tests/fixtures/nova.py#1527

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Another hit in https://review.opendev.org/c/openstack/nova/+/823229 with logs: https://zuul.opendev.org/t/openstack/build/956bd217bbf04d97bfed9479ee7bbc75/log/job-output.txt#3497

The common thing in the new hits is that the libvirt monitoring thread is leaked. Unfortunately I have no time at the moment to dig further, so I made a TODO to look at this later.

Revision history for this message
melanie witt (melwitt) wrote (last edit ):

I looked at this for a while, trying to find some way we could reduce the number of leaked threads, since killing them (a naked greenlet) is not always possible, as evidenced by gibi's abandoned patch [1] and some similar things I have tried locally since.

I noticed that when the nova-compute service is stopped by calling stop(), the part that cleans up live migrations in the thread pool uses wait=False [2]:

    def cleanup_host(self):
        self.driver.register_event_listener(None)
        self.instance_events.cancel_all_events()
        self.driver.cleanup_host(host=self.host)
        self._cleanup_live_migrations_in_pool()

    def _cleanup_live_migrations_in_pool(self):
        # Shutdown the pool so we don't get new requests.
        self._live_migration_executor.shutdown(wait=False)

which "If True then shutdown will not return until all running futures have finished executing and the resources used by the executor have been reclaimed." [3]. So wait=False will move on even while running futures are still executing.

I wondered if a patch like the following might reduce the chances of a leaked greenlet from a previous test running during a new test:

diff --git a/nova/test.py b/nova/test.py
index a6449c01f0..69b9d06cdc 100644
--- a/nova/test.py
+++ b/nova/test.py
@@ -37,6 +37,7 @@ import pprint
 import sys

 import fixtures
+import futurist
 import mock
 from oslo_cache import core as cache
 from oslo_concurrency import lockutils
@@ -290,6 +291,15 @@ class TestCase(base.BaseTestCase):
         # instead of only once initialized for test worker
         wsgi_app.init_global_data.reset()

+        orig_shutdown = futurist.GreenThreadPoolExecutor.shutdown
+
+        def wrap_shutdown(*a, **kw):
+            kw['wait'] = True
+            return orig_shutdown(*a, **kw)
+
+        self.useFixture(fixtures.MonkeyPatch(
+            'futurist.GreenThreadPoolExecutor.shutdown', wrap_shutdown))
+

but I haven't been able to reproduce the failure to tell if it might help.

[1] https://review.opendev.org/c/openstack/nova/+/815017
[2] https://github.com/openstack/nova/blob/b8cc5704558d3c08fda9db2f1bb7fecb2bcd985d/nova/compute/manager.py#L1627
[3] https://docs.openstack.org/futurist/latest/reference/index.html#futurist.GreenThreadPoolExecutor.shutdown

melanie witt (melwitt)
summary: - test_unshelve_offloaded_server_with_qos_port_pci_update_fails
+ test_unshelve_offloaded_server_with_qos_port_pci_update_fails (leaked eventlet threads run after test teardown)
Revision history for this message
melanie witt (melwitt) wrote : Re: test_unshelve_offloaded_server_with_qos_port_pci_update_fails (leaked eventlet threads run after test teardown)

Reopening this because (1) we're still seeing it in the gate (albeit less often since the notifications-related workaround) and (2) when the bug is closed it won't show up in the gate bug list https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure, which makes it easy to find for recheck comments.

Changed in nova:
status: Fix Released → Confirmed
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

FWIW, I'm in favor of changing the bug report title to make it broader. There is evidence that this problem isn't related to a single test but spans our functional tests (and isn't dependent on a single Python version).

summary: - test_unshelve_offloaded_server_with_qos_port_pci_update_fails (leaked eventlet threads run after test teardown)
+ leaked threads run after test case teardown causing intermittent functional test failures
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

OK, let me wrap up a few findings here.
First, the number of occurrences is way lower than what I said in c#25 (fortunately):
something around 19 occurrences over the last 20 days (which is still pretty high, though).

Secondly, I think I identified the real exception leading to the failed test condition:

2023-01-24 18:29:03.537384 | ubuntu-jammy | During handling of the above exception, another exception occurred:
2023-01-24 18:29:03.537389 | ubuntu-jammy |
2023-01-24 18:29:03.537394 | ubuntu-jammy | Traceback (most recent call last):
2023-01-24 18:29:03.537400 | ubuntu-jammy | File "/home/zuul/src/opendev.org/openstack/nova/nova/virt/libvirt/driver.py", line 10619, in _live_migration
2023-01-24 18:29:03.537405 | ubuntu-jammy | self.live_migration_abort(instance)
2023-01-24 18:29:03.537411 | ubuntu-jammy | File "/home/zuul/src/opendev.org/openstack/nova/nova/virt/libvirt/driver.py", line 10087, in live_migration_abort
2023-01-24 18:29:03.537416 | ubuntu-jammy | guest = self._host.get_guest(instance)
2023-01-24 18:29:03.537422 | ubuntu-jammy | File "/home/zuul/src/opendev.org/openstack/nova/nova/virt/libvirt/host.py", line 673, in get_guest
2023-01-24 18:29:03.537427 | ubuntu-jammy | return libvirt_guest.Guest(self._get_domain(instance))
2023-01-24 18:29:03.537432 | ubuntu-jammy | File "/home/zuul/src/opendev.org/openstack/nova/nova/virt/libvirt/guest.py", line 112, in __init__
2023-01-24 18:29:03.537438 | ubuntu-jammy | libvirt = importutils.import_module('libvirt')
2023-01-24 18:29:03.537482 | ubuntu-jammy | File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py310/lib/python3.10/site-packages/oslo_utils/importutils.py", line 73, in import_module
2023-01-24 18:29:03.537494 | ubuntu-jammy | __import__(import_str)
2023-01-24 18:29:03.537500 | ubuntu-jammy | File
2023-01-24 18:29:03.537505 | ubuntu-jammy | "/home/zuul/src/opendev.org/openstack/nova/nova/tests/fixtures/nova.py", line 1829, in find_spec
2023-01-24 18:29:03.537511 | ubuntu-jammy | raise ImportError(fullname)
2023-01-24 18:29:03.537516 | ubuntu-jammy | ImportError: libvirt
2023-01-24 18:29:03.537521 | ubuntu-jammy |
2023-01-24 18:29:03.537527 | ubuntu-jammy | During handling of the above exception, another exception occurred:
2023-01-24 18:29:03.537532 | ubuntu-jammy |
2023-01-24 18:29:03.537537 | ubuntu-jammy | Traceback (most recent call last):
2023-01-24 18:29:03.537543 | ubuntu-jammy | File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 8854, in _do_live_migration
2023-01-24 18:29:03.537548 | ubuntu-jammy | self.driver.live_migration(context, instance, dest,
2023-01-24 18:29:03.537553 | ubuntu-jammy | File "/home/zuul/src/opendev.org/openstack/nova/nova/virt/libvirt/driver.py", line 10076, in live_migration
2023-01-24 18:29:03.537559 | ubuntu-jammy | self._live_migration(context, instance, dest,
2023-01-24 18:29:03.537564 | ubuntu-jammy | File "/home/zuul/src/opendev.org/openstack/nova/nova/virt/libvirt/driver.py", line 10620, in _live_migration
2023-01-24 18:29:03.537569 | ubuntu-jammy | except libvirt.libvirtError:
2023-01-24 18:29:03.537575 | ubuntu-jammy | AttributeError: 'NoneType' o...

Read more...
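
For context on the ImportError above: nova's functional test fixtures poison the libvirt import so tests never load the real library. A minimal sketch of such a poison (illustrative only; the real finder lives in nova/tests/fixtures/nova.py):

import importlib.abc
import sys


class PoisonedImportFinder(importlib.abc.MetaPathFinder):
    # Fail any attempt to import the named module during a test.

    def __init__(self, blocked_name):
        self.blocked_name = blocked_name

    def find_spec(self, fullname, path, target=None):
        if fullname == self.blocked_name:
            raise ImportError(fullname)
        return None  # let the normal finders handle everything else


sys.meta_path.insert(0, PoisonedImportFinder('libvirt'))

So when the leaked migration thread from an earlier libvirt functional test later executes importutils.import_module('libvirt') under a test that poisons the import, it gets the ImportError above, and the module-level libvirt name it relies on is still None, which is what produces the AttributeError on 'NoneType'.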

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/872975

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/872975
Committed: https://opendev.org/openstack/nova/commit/1778a9c589cf24e17b44f556680b17af9577df11
Submitter: "Zuul (22348)"
Branch: master

commit 1778a9c589cf24e17b44f556680b17af9577df11
Author: Sylvain Bauza <email address hidden>
Date: Tue Feb 7 16:30:59 2023 +0100

    Add logging to find test cases leaking libvirt threads

    We see functional test failures due to a leaked libvirt event handling
    thread waking up after its original test finished and importing
    libvirt. If that happens while the libvirt package import is poisoned,
    the currently executing test will fail. This patch logs the name of the
    test case that leaked the libvirt event handling thread.

    We will revert this before RC1.

    Change-Id: I3146e9afb411056d004fc118ccfa31126a3c6b15
    Related-Bug: #1946339

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

After a week, here is a paste of the added logging:

https://paste.opendev.org/show/bfvZX0XeKsELzY54EGb8/

It looks like each of the failures we got was caused by a different functional test.

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

New paste:

Search call is "logsearch log --job-group nova-functional --result FAILURE --branch master '!!!---!!!' --days 7"
https://paste.opendev.org/show/bzFdND2j4YbJRK1JLfcY/

It looks like most of the failing tests inherit from the nova.tests.functional.libvirt base, except for this failing test:
https://github.com/openstack/nova/blob/00299dcaad9a42c7f81be62a3966c59b299dff1d/nova/tests/functional/test_cross_az_attach.py#L69

More to come, investigating.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/883317

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/883317
Committed: https://opendev.org/openstack/nova/commit/c095cfe04e2c71efcfbfdd95948af080a98065e6
Submitter: "Zuul (22348)"
Branch: master

commit c095cfe04e2c71efcfbfdd95948af080a98065e6
Author: melanie witt <email address hidden>
Date: Thu May 11 19:23:52 2023 +0000

    tests: Use GreenThreadPoolExecutor.shutdown(wait=True)

    We are still having some issues in the gate where greenlets from
    previous tests continue to run while the next test starts, causing
    false negative failures in unit or functional test jobs.

    This adds a new fixture that will ensure
    GreenThreadPoolExecutor.shutdown() is called with wait=True, to wait
    for greenlets in the pool to finish running before moving on.

    In local testing, doing this does not appear to adversely affect test
    run times, which was my primary concern.

    As a baseline, I ran a subset of functional tests in a loop
    until failure without the patch and after 11 hours, I got a failure
    reproducing the bug. With the patch, running the same subset of
    functional tests in a loop has been running for 24 hours and has not
    failed yet.

    Based on this, I think it may be worth trying this out to see if it
    will help stability of our unit and functional test jobs. And if it
    ends up impacting test run times or causes other issues, we can
    revert it.

    Partial-Bug: #1946339

    Change-Id: Ia916310522b007061660172fa4d63d0fde9a55ac
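
A fixture in the spirit of that change could look roughly like this (sketch only, assuming callers pass wait as a keyword argument the way nova's compute manager does; the merged implementation is in the review linked above):

import fixtures
import futurist


class GreenPoolWaitFixture(fixtures.Fixture):
    # Force GreenThreadPoolExecutor.shutdown() to wait for running greenlets.

    def _setUp(self):
        original_shutdown = futurist.GreenThreadPoolExecutor.shutdown

        def shutdown_and_wait(executor, *args, **kwargs):
            kwargs['wait'] = True
            return original_shutdown(executor, *args, **kwargs)

        self.useFixture(fixtures.MonkeyPatch(
            'futurist.GreenThreadPoolExecutor.shutdown', shutdown_and_wait))

A test base class would enable it with self.useFixture(GreenPoolWaitFixture()) in setUp().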

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/2023.1)

Fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/nova/+/908806

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/2023.1)

Change abandoned by "Balazs Gibizer <email address hidden>" on branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/nova/+/908806

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/nova/+/908806
Committed: https://opendev.org/openstack/nova/commit/22cccb9bd9b5e1eb27f7e94855e734ed3863ab8a
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 22cccb9bd9b5e1eb27f7e94855e734ed3863ab8a
Author: melanie witt <email address hidden>
Date: Thu May 11 19:23:52 2023 +0000

    tests: Use GreenThreadPoolExecutor.shutdown(wait=True)

    (commit message identical to the master commit above)

    Partial-Bug: #1946339

    Change-Id: Ia916310522b007061660172fa4d63d0fde9a55ac
    (cherry picked from commit c095cfe04e2c71efcfbfdd95948af080a98065e6)
