The allocation of VGPU has race problem

Bug #1836204 reported by Alex Xu
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Triaged
Importance: High
Assigned to: Alex Xu
Milestone: (none listed)

Bug Description

VGPUs are allocated by this method: https://github.com/openstack/nova/blob/8260979b71b29ce2666d37b3adc7c256482aa16d/nova/virt/libvirt/driver.py#L3235

That method finds the already-assigned mdevs by listing the libvirt domains.

But if two concurrent requests reach this method, they will both see the same set of assigned mdevs, so they may both pick the same free mdev.

So there is a race window between the allocation:
https://github.com/openstack/nova/blob/8260979b71b29ce2666d37b3adc7c256482aa16d/nova/virt/libvirt/driver.py#L3235

and the point where we create the domain in libvirt:
https://github.com/openstack/nova/blob/8260979b71b29ce2666d37b3adc7c256482aa16d/nova/virt/libvirt/driver.py#L3241
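The check-then-act pattern described above can be illustrated with a minimal, deterministic sketch. This is not nova code: `pick_free_mdev`, the `available` and `assigned` sets, and the mdev names are hypothetical stand-ins, and the two "requests" are interleaved by hand rather than run as real threads.

```python
def pick_free_mdev(available, assigned):
    """Return the first mdev not in `assigned`, or None (hypothetical helper)."""
    for mdev in sorted(available):
        if mdev not in assigned:
            return mdev
    return None


available = {"mdev-a", "mdev-b"}
# Stands in for the set of mdevs found by listing the libvirt domains.
assigned = set()

# Both requests choose an mdev before either domain has been created,
# so both see the same (empty) set of assigned mdevs.
choice_1 = pick_free_mdev(available, assigned)
choice_2 = pick_free_mdev(available, assigned)

# Only when the domains are created do the mdevs show up as assigned.
assigned.add(choice_1)
assigned.add(choice_2)
```

In this interleaving both requests end up with the same mdev, which is the double allocation the report is about.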

Tags: libvirt
Alex Xu (xuhj)
Changed in nova:
assignee: nobody → Alex Xu (xuhj)
Eric Fried (efried)
Changed in nova:
status: New → Triaged
importance: Undecided → High
Eric Fried (efried) wrote :

This is of high importance not because the race is particularly likely in current code, but because we need to establish the framework to fix it so we can reuse that framework for other similar types of hardware.

In general, the fix is to claim (earmark for use by a specific instance) specific hardware artifacts [1] on the compute node in instance_claim, which is under COMPUTE_RESOURCE_SEMAPHORE. But only the virt driver can know what needs to be done to effect that claim for its specific hypervisor. And today instance_claim doesn't talk to the virt driver at all.

So the solution discussed in IRC [2] is to establish a new ComputeDriver interface, working title claim_for_instance() (and possibly a corresponding unclaim_for_instance() for rollbacks), which will be invoked from instance_claim (and _move_claim).

Using VGPUs-in-libvirt as an example, claim_for_instance would use an in-memory dict to associate a specific mdev with the specific instance for each VGPU in the allocation. This mapping could then be deleted during spawn, since the information can subsequently be gleaned from the domain XML.

[1] where "hardware" encompasses things like VFs - don't get pedantic on me
[2] http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2019-07-11.log.html#t2019-07-11T12:39:18
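A minimal sketch of the in-memory claim idea, for illustration only. It is not the code from the proposed reviews and not an existing nova interface: the `MdevClaims` class, its `assigned_in_domains` parameter, and the use of a plain `threading.Lock` in place of COMPUTE_RESOURCE_SEMAPHORE are all assumptions. Only the names `claim_for_instance` and `unclaim_for_instance` come from the discussion above, where they are working titles.

```python
import threading


class MdevClaims:
    """Hypothetical in-memory mapping of mdev -> instance that claimed it."""

    def __init__(self, available):
        self._available = set(available)
        self._claims = {}  # mdev name -> instance UUID
        # Assumption: stands in for COMPUTE_RESOURCE_SEMAPHORE.
        self._lock = threading.Lock()

    def claim_for_instance(self, instance_uuid, count, assigned_in_domains=()):
        """Earmark `count` free mdevs for the instance and return them."""
        with self._lock:
            # An mdev is taken if a domain already uses it or it is claimed.
            taken = set(assigned_in_domains) | set(self._claims)
            free = sorted(self._available - taken)
            if len(free) < count:
                raise RuntimeError("not enough free mdevs")
            chosen = free[:count]
            for mdev in chosen:
                self._claims[mdev] = instance_uuid
            return chosen

    def unclaim_for_instance(self, instance_uuid):
        """Drop the instance's claims (rollback, or after spawn)."""
        with self._lock:
            mine = [m for m, i in self._claims.items() if i == instance_uuid]
            for mdev in mine:
                del self._claims[mdev]


claims = MdevClaims(["mdev-a", "mdev-b"])
first = claims.claim_for_instance("inst-1", 1)
second = claims.claim_for_instance("inst-2", 1)
claims.unclaim_for_instance("inst-1")
third = claims.claim_for_instance("inst-3", 1)
```

Because the check and the earmark happen under one lock, two requests can no longer be handed the same mdev. Dropping the entry after spawn matches the point above that the mapping can then be read back from the domain XML.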

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/670782

OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/670783

OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/670784

OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/670785

OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/670786

melanie witt (melwitt)
tags: added: libvirt
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/671222

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Alex Xu (<email address hidden>) on branch: master
Review: https://review.opendev.org/670786

OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Alex Xu (<email address hidden>) on branch: master
Review: https://review.opendev.org/670785

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/671388

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Alex Xu (<email address hidden>) on branch: master
Review: https://review.opendev.org/670782

OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Alex Xu (<email address hidden>) on branch: master
Review: https://review.opendev.org/670783

OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Alex Xu (<email address hidden>) on branch: master
Review: https://review.opendev.org/670784

OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Alex Xu (<email address hidden>) on branch: master
Review: https://review.opendev.org/671388

OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Alex Xu (<email address hidden>) on branch: master
Review: https://review.opendev.org/670787

OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Alex Xu (<email address hidden>) on branch: master
Review: https://review.opendev.org/671222
