Comment 1 for bug 1545675

Nikola Đipanov (ndipanov) wrote:

So, just by looking at the code, I can see that we probably want to call resource_tracker.update_usage() in the shelve_offload method, so that we unpin CPUs immediately when the instance is offloaded rather than waiting for the RT periodic task, which in the case of a speedy tempest test likely never runs.
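
A minimal sketch of what I mean (hypothetical placement and names; it assumes shelve_offload can reach the resource tracker the same way the delete path does):

    # Hypothetical sketch, not a tested patch: drop the offloaded instance's
    # usage (including its pinned CPUs) from the resource tracker as soon as
    # the offload happens, instead of waiting for the RT periodic task.
    def _shelve_offload_instance(self, context, instance, clean_shutdown):
        # ... existing power-off / destroy / host-clearing logic ...

        # Assumed call: the same kind of update the delete path triggers, so
        # the host NumaTopology stops accounting this instance's pinned CPUs.
        rt = self._get_resource_tracker(instance.node)
        rt.update_usage(context, instance)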

This _might_ be the cause of the stack trace in the cleanup delete, for example when the spawn fails during unshelving.

Here's what would happen:

https://github.com/openstack/nova/blob/7616c88ad3a2769e9c9ee8a51ac55ddeed0bfd84/nova/compute/manager.py#L4368

When unshelving a shelve-offloaded instance, we first run the instance claim, which does a new claim against the current state of the NumaTopology of the compute host (keep in mind that we never dropped the usage when we shelve-offloaded, so we are actually leaking resources here). A successful claim means the instance and compute node are updated with the new pinning information (and we keep leaking resources on the compute node until the next RT periodic update).
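
For context, the unshelve path linked above follows roughly this shape (paraphrased, not a verbatim quote of manager.py):

    # Paraphrased shape of the unshelve path: the claim is taken as a context
    # manager and the spawn happens inside it ("rt" is the resource tracker).
    with rt.instance_claim(context, instance, limits):
        # ... network and block device setup ...
        self.driver.spawn(context, instance, image_meta,
                          injected_files=[], admin_password=None,
                          network_info=network_info,
                          block_device_info=block_device_info)
    # If spawn() raises, the claim's __exit__ runs and aborts the claim.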

Now suppose the spawn on the next line fails, for a reason that may or may not be related to a CPU pinning bug. That triggers the __exit__ method of the claim, which in turn calls the claim's abort() method, unpinning the CPUs that were pinned during the claim (see: https://github.com/openstack/nova/blob/7616c88ad3a2769e9c9ee8a51ac55ddeed0bfd84/nova/compute/claims.py#L121).
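
For reference, the context-manager mechanics look roughly like this (generic pattern, not a verbatim copy of claims.py):

    # Generic shape of the claim context manager: an exception raised inside
    # the "with" block (e.g. a failed spawn) reaches __exit__, which aborts.
    class Claim(object):
        def __enter__(self):
            return self

        def __exit__(self, exc_type, exc_val, exc_tb):
            if exc_type is not None:
                self.abort()

        def abort(self):
            # releases the host-side usage (pinned CPUs, memory, etc.)
            pass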

This will unpin the CPUs as tracked by the host NumaTopology, but it will not clear the mapping to host CPUs in the Instance object.

Finally, the test cleanup attempts to delete the instance, which then tries to unpin the already unpinned CPUs and fails.
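
To make the failure mode concrete, here is a self-contained toy model of the book-keeping (not Nova code; names are made up):

    # Toy model of the asymmetry: the abort releases the pins on the *host*
    # side, but the instance still carries its (now stale) CPU pinning, so a
    # later delete tries to unpin the same CPUs a second time.
    class HostCell(object):
        def __init__(self, cpus):
            self.cpuset = set(cpus)
            self.pinned_cpus = set()

        def pin(self, cpus):
            assert cpus <= self.cpuset - self.pinned_cpus
            self.pinned_cpus |= cpus

        def unpin(self, cpus):
            if not cpus <= self.pinned_cpus:
                # this is the kind of error the cleanup delete trips over
                raise ValueError('CPUs %s are not pinned' % sorted(cpus))
            self.pinned_cpus -= cpus

    host = HostCell({0, 1, 2, 3})
    instance_pinning = {0: 1, 1: 2}             # instance vCPU -> host pCPU

    host.pin(set(instance_pinning.values()))    # claim during unshelve
    host.unpin(set(instance_pinning.values()))  # claim.abort() after spawn fails
    # instance_pinning is never cleared, so the delete repeats the unpin:
    host.unpin(set(instance_pinning.values()))  # raises, like the stack trace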

If the above makes sense, I think we need to do two things:

* Make sure offloading an instance immediately updates the resource_tracker, in the same manner that deleting it does.
* Make sure that aborting the claim clears both the host field and the NUMA information of the Instance (the host field is the problematic one here, as it is why the delete request ends up being RPCed to the host instead of being handled locally in the API, even though the instance is clearly not there since the claim failed and was aborted).
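
A rough sketch of the second point (field names are my assumptions, and the existing abort behaviour is elided):

    # Hypothetical sketch of the second fix: when a claim is aborted, also
    # clear the instance-side view so later cleanup paths don't act on state
    # that the abort has already rolled back.
    def abort(self):
        # ... existing behaviour: release the host-side usage via the tracker ...

        # Proposed addition (field names assumed):
        self.instance.numa_topology = None   # drop the stale CPU pinning
        self.instance.host = None            # so the API can delete locally
        self.instance.node = None
        self.instance.save()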

I propose we start there and see if it fixes things.

PS - it is worth noting (if it was not clear from the text) that the above bugs mostly impact the shelve functionality; the normal nova spawn path clears these things as part of the retry process, so it is not affected.