Allocations are "doubled up" on same host resize even though there is only 1 server on the host

Bug #1790204 reported by Matt Riedemann
This bug affects 4 people
Affects: OpenStack Compute (nova)
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

This is a long-standing known issue going back to at least Pike, when the nova FilterScheduler started using placement to create allocations during server create and move (e.g. resize) operations.

In Pike, resize to the same host resulted in allocations against the compute node provider in placement coming from both the old and new flavors, with both tied to the instance as the resource consumer.

Allocation handling for move operations was improved in Queens with this blueprint:

https://specs.openstack.org/openstack/nova-specs/specs/queens/implemented/migration-allocations.html

With that change, the source node allocations are moved to the migration record as the consumer, and the target node allocations are made with the instance as the consumer.

That is also true of resize to the same host. However, we still have the issue that the compute node resource provider usage is effectively "doubled up" during the resize, because it shows usage for two flavors in total when really only one is being used.

The reported resource usage on the compute node provider during a same-host resize should be the *maximum* of the old and new flavor per resource class, not the sum of both.
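To make the expected accounting concrete, here is a minimal sketch of the per-resource-class arithmetic, using the m1.tiny and m1.small flavors from the recreate below (illustrative Python, not nova code):

old = {'VCPU': 1, 'MEMORY_MB': 512, 'DISK_GB': 1}    # m1.tiny (old flavor)
new = {'VCPU': 1, 'MEMORY_MB': 2048, 'DISK_GB': 20}  # m1.small (new flavor)

# What the scheduler currently claims on the single provider (sum of both flavors):
summed = {rc: old[rc] + new[rc] for rc in old}
# {'VCPU': 2, 'MEMORY_MB': 2560, 'DISK_GB': 21}

# What the usage should be during the resize (max per resource class):
expected = {rc: max(old[rc], new[rc]) for rc in old}
# {'VCPU': 1, 'MEMORY_MB': 2048, 'DISK_GB': 20}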

Here is a simple recreate with devstack (created from master today):

1. we start with no resource usage on the single node provider

stack@stein:~$ openstack resource provider usage show e2bc5091-b7fd-4d18-80a8-aeecb87b0fd0
+----------------+-------+
| resource_class | usage |
+----------------+-------+
| VCPU | 0 |
| MEMORY_MB | 0 |
| DISK_GB | 0 |
+----------------+-------+

2. create a server and show there is usage:

stack@stein:~$ openstack flavor list
+----+-----------+-------+------+-----------+-------+-----------+
| ID | Name | RAM | Disk | Ephemeral | VCPUs | Is Public |
+----+-----------+-------+------+-----------+-------+-----------+
| 1 | m1.tiny | 512 | 1 | 0 | 1 | True |
| 2 | m1.small | 2048 | 20 | 0 | 1 | True |
| 3 | m1.medium | 4096 | 40 | 0 | 2 | True |
| 4 | m1.large | 8192 | 80 | 0 | 4 | True |
| 5 | m1.xlarge | 16384 | 160 | 0 | 8 | True |
| c1 | cirros256 | 256 | 0 | 0 | 1 | True |
| d1 | ds512M | 512 | 5 | 0 | 1 | True |
| d2 | ds1G | 1024 | 10 | 0 | 1 | True |
| d3 | ds2G | 2048 | 10 | 0 | 2 | True |
| d4 | ds4G | 4096 | 20 | 0 | 4 | True |
+----+-----------+-------+------+-----------+-------+-----------+

stack@stein:~$ openstack server create --flavor m1.tiny --image cirros-0.3.5-x86_64-disk resize-same-host

stack@stein:~$ openstack resource provider usage show e2bc5091-b7fd-4d18-80a8-aeecb87b0fd0
+----------------+-------+
| resource_class | usage |
+----------------+-------+
| VCPU | 1 |
| MEMORY_MB | 512 |
| DISK_GB | 1 |
+----------------+-------+

3. resize the server and check usage:

stack@stein:~$ openstack server resize resize-same-host --flavor m1.small
stack@stein:~$ openstack server list
+--------------------------------------+------------------+---------------+--------------------------------------------------------+--------------------------+----------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+------------------+---------------+--------------------------------------------------------+--------------------------+----------+
| d7d743d8-7561-4c9c-a7bf-e9fe1e89dea1 | resize-same-host | VERIFY_RESIZE | private=fdde:1239:d41d:0:f816:3eff:fe1f:a19, 10.0.0.13 | cirros-0.3.5-x86_64-disk | m1.small |
+--------------------------------------+------------------+---------------+--------------------------------------------------------+--------------------------+----------+
stack@stein:~$ openstack resource provider usage show e2bc5091-b7fd-4d18-80a8-aeecb87b0fd0
+----------------+-------+
| resource_class | usage |
+----------------+-------+
| VCPU | 2 |
| MEMORY_MB | 2560 |
| DISK_GB | 21 |
+----------------+-------+

And here we see that the old and new flavor usage is cumulative on the single node provider.

4. confirm the resize and verify the usage is just the new m1.small flavor:

stack@stein:~$ openstack server resize resize-same-host --confirm
stack@stein:~$ openstack server list
+--------------------------------------+------------------+--------+--------------------------------------------------------+--------------------------+----------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+------------------+--------+--------------------------------------------------------+--------------------------+----------+
| d7d743d8-7561-4c9c-a7bf-e9fe1e89dea1 | resize-same-host | ACTIVE | private=fdde:1239:d41d:0:f816:3eff:fe1f:a19, 10.0.0.13 | cirros-0.3.5-x86_64-disk | m1.small |
+--------------------------------------+------------------+--------+--------------------------------------------------------+--------------------------+----------+
stack@stein:~$ openstack resource provider usage show e2bc5091-b7fd-4d18-80a8-aeecb87b0fd0
+----------------+-------+
| resource_class | usage |
+----------------+-------+
| VCPU | 1 |
| MEMORY_MB | 2048 |
| DISK_GB | 20 |
+----------------+-------+
stack@stein:~$

===

Same-host resize is disabled by default (see the nova.conf snippet after this list) but can be important in at least two cases:

1. Servers in an affinity (same-host) group cannot be resized at all if they are not allowed to resize on the same host.

2. "Edge" deployment scenarios where there are 1 or 2 compute hosts means being able to resize on the same host is critical - and probably what's more critical in those edge scenarios is not reporting resource usage that is not really there, since it could result in scheduling failures to that host which otherwise would have fit.

Revision history for this message
Matt Riedemann (mriedem) wrote :

One hacky way we could handle this is in conductor: after we've moved the instance allocations for the old_flavor to the migration record, if the selected host is the same host the instance is already on, we just fix the allocations so they are the max of the two flavors. We'd need to sort out whether we still use 2 consumers or only 1 - it might make sense to only have the instance consumer for the same-host resize case. However, there is logic in the nova-compute service since Queens that expects the source node allocations to be tracked by the migration record consumer, so that code would have to be audited so it doesn't blow up depending on what conductor does.

Another wrinkle we have to worry about is that a resize can reschedule if the selected host fails the resize. So we could have a case where the scheduler picks 3 hosts:

1. first selected host is the same host, but fails, so we reschedule to host 2
2. second host fails, we reschedule to host 3
3. the resize passes on the 3rd host (2nd alternate)

In those cases, the alternate hosts are *not* the same host, so how would we deal with the allocations then? The old flavor allocations still need to be on the source host and the new flavor allocations need to be on the destination host.

Changed in nova:
assignee: nobody → Zhenyu Zheng (zhengzhenyu)
Revision history for this message
s10 (vlad-esten) wrote :

Same host migration is important in another case:
A big VM with a large root disk (1000 GB or more, i.e. 50% or more of the total available disk) can't be resized to the same host, leading to an unnecessary cold migration of a large amount of disk data to another host, even if we only increase the memory/vCPU count for the instance.

Matt Riedemann (mriedem)
Changed in nova:
assignee: Zhenyu Zheng (zhengzhenyu) → nobody
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/619123

Revision history for this message
Matt Riedemann (mriedem) wrote :

Looking back over the history on this, it looks like it's my fault:

https://review.openstack.org/#/c/490085/

It also looks like we had some conversation about doing a max of the allocations rather than a sum:

https://review.openstack.org/#/c/490085/1/nova/tests/unit/scheduler/client/test_report.py@533

And decided "do the sum since it's simpler". Well, that's great. :)

There was also a note added to the code that dealt with this in Pike:

https://review.openstack.org/#/c/490085/7/nova/scheduler/client/report.py@224

"""
# Note that we sum the allocations rather than take the max per
# resource class between the current and new allocations because
# the compute node/resource tracker is going to adjust for
# decrementing any old allocations as necessary, the scheduler
# shouldn't make assumptions about that.
"""

I believe that's referring to the fact that in Pike, if you still had at least one Ocata compute in the deployment, the resource tracker in the nova-compute service would delete the old allocations, or at least overwrite the allocations using the new flavor, so it sort of healed itself. However, once all of your computes are upgraded that is no longer true (the compute doesn't PUT allocations anymore since that's the job of the scheduler). Additionally, we have since dropped that old compute compatibility code.

Changed in nova:
importance: Medium → High
Revision history for this message
Matt Riedemann (mriedem) wrote :

Also, the hack idea I had in comment 1:

"One hacky way we could handle this is in conductor, after we've moved the instance allocations for the old_flavor to the migration record, if the selected host is the same host the instance is already one, we just fix the allocations so they are the max of the two flavors"

Doesn't really work for what I was thinking. I was thinking conductor could swap the allocations for the old flavor from the instance to the migration consumer, get the dest host from the scheduler (which would also create the allocations for the instance against the dest host using the new flavor), and then in conductor if we see that the selected host is the same host, we'd fix the allocations. However, fixing the allocations in conductor is too late if we've already failed with a NoValidHost because placement thinks there are no resources available...

So we either need to fix the allocations that we swap to the migration record *before* calling the scheduler to allocate for the new flavor, or somehow hack something into RequestSpec to tell the scheduler the actual resource amount we want to allocate for the instance.

Given that the end result we want from the resize is for the new flavor allocations to be held by the instance, we probably don't want to mess with those. So that leaves fixing the allocations we move to the migration record before calling the scheduler - but then we have to adjust those if we are not actually resizing to the same host...yuck.

For example, if we start with old flavor:

VCPU = 4, MEMORY_MB = 2048, DISK_GB = 20

And resize with new flavor:

VCPU = 2, MEMORY_MB = 4096, DISK_GB = 20

We want to end up with the max of those values allocated against the provider, so:

VCPU = 4, MEMORY_MB = 4096, DISK_GB = 20

And we want the new flavor to be on the instance consumer, which leaves these amounts on the migration consumer before scheduling (max(old - new, 0) per resource class):

VCPU = 2, MEMORY_MB = 0, DISK_GB = 0

If the scheduler returns a host that is not the same host that the instance is on, then we need to fix the migration allocations back to the old_flavor (which is risky if something took that space in the meantime...).

So probably the grossest but safest way to fix this is at the point of claiming the dest host allocations for the instance with the new flavor, in the scheduler...which means passing something through on the request spec for that case to adjust the allocations if and only if the source and dest are the same provider.
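To tie the numbers above together, here is a tiny illustrative sketch of the proposed consumer split (plain Python with an assumed structure, not actual nova code):

old = {'VCPU': 4, 'MEMORY_MB': 2048, 'DISK_GB': 20}  # old flavor
new = {'VCPU': 2, 'MEMORY_MB': 4096, 'DISK_GB': 20}  # new flavor

# Desired total usage on the single provider during the resize:
total = {rc: max(old[rc], new[rc]) for rc in old}
# {'VCPU': 4, 'MEMORY_MB': 4096, 'DISK_GB': 20}

# The new flavor stays on the instance consumer, so the migration consumer
# holds the remainder, max(old - new, 0) per resource class:
migration = {rc: max(old[rc] - new[rc], 0) for rc in old}
# {'VCPU': 2, 'MEMORY_MB': 0, 'DISK_GB': 0}

assert all(new[rc] + migration[rc] == total[rc] for rc in old)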

Revision history for this message
Matt Riedemann (mriedem) wrote :

Another idea is to simply not keep any allocations on the migration record if we know we're doing a same-host resize, and instead leave the max allocations on the instance record. That would have implications for how the allocations are cleaned up on revert or failure in the same-host resize case, and older computes wouldn't have that logic...

Revision history for this message
Matt Riedemann (mriedem) wrote :

Another tricky thing to keep in mind is that the new flavor is used to update the instance.vcpus and instance.memory_mb values:

https://github.com/openstack/nova/blob/905e25a63d3ba25cfbdf492891ac8864fed609ab/nova/compute/manager.py#L4419

https://github.com/openstack/nova/blob/905e25a63d3ba25cfbdf492891ac8864fed609ab/nova/compute/manager.py#L4472

And those are used when counting quotas:

https://github.com/openstack/nova/blob/905e25a63d3ba25cfbdf492891ac8864fed609ab/nova/objects/instance.py#L1527

I'm not sure that directly affects the allocations, really, since it's different data in a different database, but it reminds me that this might have implications if we start counting quota usage from placement:

https://review.openstack.org/#/c/509042

Although with that proposal we're counting usage from the project_id, not the consumer, so as long as the migration/instance consumer allocations for that project are accurate (max of old/new flavors) we should be OK.

Revision history for this message
Matt Riedemann (mriedem) wrote :

I was starting to hack up a solution where conductor would detect the possibility of doing a same-host resize and *not* swap the source node old_flavor allocations to the migration record, which I thought would then get us to the point where the scheduler calls the _move_operation_alloc_request method during claim_resources. However, we don't even get that far because the GET /allocation_candidates request to placement filters out the only host (in this case the provider has DISK_GB=1028 and that's also what is currently allocated to the server with the old_flavor, and even though the new_flavor disk size doesn't change, the scheduler is requesting 1028 more).

So clearly we have to do something before we ever call GET /allocation_candidates to either (1) change the amount of resource we request, or (2) free up resources on the source host before calling the scheduler.

1. For the former, I think that would mean mutating the RequestSpec.flavor to be the max of the old/new flavor values, so using the example from comment 5 that would be:

VCPU = 2, MEMORY_MB = 0, DISK_GB = 0

However, if the scheduler picks a host that is *not* the same host, we'd be claiming the wrong thing on that host. So that won't work, unless we stash off that override in the request spec and *only* use it if we determine, from the scheduler before doing the claim, that we're going to be claiming on the same host. That's pretty gross.

2. For the latter, conductor could drop the old_flavor allocations held by the server on the source compute node before calling the scheduler. Then once we get the selected destination, if it's the same host, we're good. If it's *not* the same host, we need to put the old_flavor allocations on the source host, tracked by the migration record. This could fail if something claimed resources on the source node while scheduling the resize, but that seems like a pretty small window to happen. Additionally, if we did fail to re-claim the old_flavor resources on the source host, we could potentially ignore that host (same host) and pick the next alternate (which we'd know isn't the same host). If the only host available was the same host, then I guess we lose and just have to fail the resize until more resources are freed up - which seems natural and OK.
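Roughly, option 2 would look something like the following in conductor. This is only a sketch under the assumptions in this comment; the helper callables (drop_allocations, select_destination, claim_on_source) are hypothetical stand-ins, not real nova APIs:

def resize_allocations_option2(instance_uuid, migration_uuid, source_host,
                               old_flavor_resources, drop_allocations,
                               select_destination, claim_on_source):
    # 1. Free the old_flavor allocations the instance holds on the source node
    #    so GET /allocation_candidates can return the same host.
    drop_allocations(instance_uuid)

    # 2. Let the scheduler pick a destination; it claims new_flavor there.
    dest_host = select_destination()

    # 3. If the destination is a different host, re-claim the old_flavor on the
    #    source node under the migration consumer to keep our spot there.
    if dest_host != source_host:
        if not claim_on_source(migration_uuid, old_flavor_resources):
            # Something consumed the source node while we were scheduling;
            # fall back to an alternate host or fail the resize.
            raise RuntimeError('lost the source-node allocations during scheduling')
    return dest_host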

Option 2 there sounds pretty good, because we:

a) would only be claiming new_flavor values on the dest host (which is already what we do today)
b) would be able to re-claim the allocations for the old flavor potentially if the scheduler picked another host (keep our spot on the source host)
c) if we are resizing to the same host, and the migration record doesn't hold allocations, I believe the compute code before Stein will still handle that case gracefully and not try to swap anything back from the migration to the instance on revert (or failure):

https://github.com/openstack/nova/blob/bc0a5d0355311641daa87b46e311ae101f1817ad/nova/compute/manager.py#L4189

Things might get hairy with reschedules so that might need to be dealt with separately in conductor but we'll see - need to add tests for those wrinkles (selected host is same host and resize fails so we reschedule to another host, are allo...


Revision history for this message
Jay Pipes (jaypipes) wrote :

Why don't we just change the whole resize logic to always try resizing to the same host first (without ever touching the scheduler) and then only call the scheduler if the resize-on-same-host failed?

That way, we try the easy route first -- without ever touching allocation candidates or any of that mess, and we simply update the host's allocations for that instance using PUT /allocations/{instance_uuid}. That will return a 409 Conflict if the results of that updated allocation would exceed the provider's inventory (accounting for allocation ratios) and therefore we wouldn't need to involve the scheduler at all for same-host resizes.
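As a rough sketch of that flow (placeholder endpoint, token, and UUIDs; the exact payload shape depends on the placement microversion - the dict form below needs at least 1.12, and consumer_generation needs 1.28):

import requests

PLACEMENT = 'http://placement.example/placement'  # placeholder endpoint
HEADERS = {
    'X-Auth-Token': '<service token>',            # placeholder credentials
    'OpenStack-API-Version': 'placement 1.28',
}

def try_same_host_resize(instance_uuid, rp_uuid, resources,
                         project_id, user_id, consumer_generation):
    """Try to rewrite the instance's allocations on its current provider.

    resources would be the per-resource-class max of the old and new flavors
    (per this bug). 204 = it fits, skip the scheduler; 409 = over capacity
    (or stale consumer generation), fall back to the scheduler.
    """
    payload = {
        'allocations': {rp_uuid: {'resources': resources}},
        'project_id': project_id,
        'user_id': user_id,
        'consumer_generation': consumer_generation,
    }
    resp = requests.put('%s/allocations/%s' % (PLACEMENT, instance_uuid),
                        json=payload, headers=HEADERS)
    if resp.status_code == 204:
        return True
    if resp.status_code == 409:
        return False
    resp.raise_for_status()
    return False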

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/619123
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=0ce581369838dbe6909bdd8847aefac2aff1177e
Submitter: Zuul
Branch: master

commit 0ce581369838dbe6909bdd8847aefac2aff1177e
Author: Matt Riedemann <email address hidden>
Date: Tue Nov 20 19:04:50 2018 -0500

    Add functional regression recreate test for bug 1790204

    Since Pike, the FilterScheduler "claims" resources in
    placement via allocations based on the flavor during
    scheduling. For a same-host resize, the old and new flavor
    are summed to get the overall allocation against the single
    compute node resource provider, which can cause NoValidHost
    failures during scheduling because the sum is over capacity
    for the resource provider even though the new flavor alone
    may be OK.

    This adds a functional regression test to recreate the bug.

    Change-Id: I036a5ceabe88dcc1fd85c09472481de7d02edf5f
    Related-Bug: #1790204

Revision history for this message
Matt Riedemann (mriedem) wrote :

@Jay, that's not a bad idea. While working through the various scenarios, possible solutions, and issues with them, I thought about something like that as well, but for whatever reason figured it would be too big of a change, since historically same-host resize has been written off as a test-only thing and not something people should be using in production. But clearly there are production cases for it as noted (resize to the same host for affinity server groups and edge scenarios).

Having said all that, I have not had the time to get back to trying to prototype a solution for any of this.

Revision history for this message
Matt Riedemann (mriedem) wrote :

For my own notes, I was wondering how the ResourceTracker claims code in the nova-compute service handles a same-host resize (before placement), and this is the code that calculates the usage for a same-host resize during the resize_claim on the host:

https://github.com/openstack/nova/blob/e3c24da89aa3e6462f1b07e00659c87f252ba4ba/nova/compute/resource_tracker.py#L1048-L1073

The key part is this:

https://github.com/openstack/nova/blob/e3c24da89aa3e6462f1b07e00659c87f252ba4ba/nova/compute/resource_tracker.py#L1053

That means usage for that instance on that host is reported using the new_flavor only. However, during the update_available_resource periodic task, if the instance has already been resized and is sitting in VERIFY_RESIZE status, the old_flavor will also be accounted for:

https://github.com/openstack/nova/blob/e3c24da89aa3e6462f1b07e00659c87f252ba4ba/nova/compute/resource_tracker.py#L1070

That means, essentially, the resource tracker claims code would also "double up" the resource usage from both flavors on the same host, which could over-commit that host and make the scheduler skip it for new instances even though it's not really at capacity - thus making this a latent issue (not a regression from Pike when the scheduler started creating allocations in placement). To be sure, I'd need to run the functional regression test from I036a5ceabe88dcc1fd85c09472481de7d02edf5f against stable/ocata (without placement in the scheduler) or maybe even Mitaka.
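In other words, the legacy (pre-placement) accounting looks roughly like this - a simplified sketch with an assumed structure, not the real resource tracker code:

def legacy_tracked_usage(old_flavor, new_flavor, vm_state):
    """resize_claim tracks new_flavor; the update_available_resource periodic
    also adds old_flavor while the instance sits in VERIFY_RESIZE (vm_state
    'resized') on the same host - i.e. the same doubling, pre-placement."""
    usage = dict(new_flavor)
    if vm_state == 'resized':
        for rc, amount in old_flavor.items():
            usage[rc] = usage.get(rc, 0) + amount
    return usage

# With the m1.tiny -> m1.small example from the bug description:
# legacy_tracked_usage({'VCPU': 1, 'MEMORY_MB': 512, 'DISK_GB': 1},
#                      {'VCPU': 1, 'MEMORY_MB': 2048, 'DISK_GB': 20},
#                      'resized')
# -> {'VCPU': 2, 'MEMORY_MB': 2560, 'DISK_GB': 21}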

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/638700

Revision history for this message
Matt Riedemann (mriedem) wrote :

Another issue with the approach in comment 9 is that we need to get alternative hosts from the scheduler in case the resize to the same host fails (assuming there would be alternative hosts).

So I think the flow would be something like this in conductor:

1. Are we doing a resize (based on Migration.migration_type) and if so, are we allowed to resize to the same host? We determine the latter from the RequestSpec.ignore_hosts field which is populated by the API if we can't do a resize to the same host:

https://github.com/openstack/nova/blob/f58cdcd58dc555b5b8635907987510f4970eae58/nova/compute/api.py#L3552

2. If the conditions in #1 are True, then we get the max of the allocations for the resources in the old and new flavor and try to PUT /allocations/{consumer_id} for the current instance.node compute node provider.

2a. If that PUT allocations call works, then we check whether we can get alternates (based on the max_attempts config) from the scheduler (adjusting the RequestSpec to ignore the host we already claimed resources for). This could result in NoValidHost if there are no more hosts available, like in a small edge site, in which case we just don't have any alternates. Also, in this case we would not swap the existing instance old_flavor allocations to the migration record; the instance would continue to hold the max() allocations for the old/new flavor on the same host. That would have implications for the revert allocations code in the compute service, which expects there to be allocations on the migration record (maybe fixed here though? https://review.openstack.org/#/c/636412/), and would also have implications for rescheduling if the same-host resize fails and we need to reschedule to an alternate - in that case we need to swap the allocations (but it looks like conductor already does that, so maybe that's not an issue).

2b. If the PUT allocations call fails, then we know we can't resize on that host, so we just add it to the RequestSpec.ignore_hosts field when we call the scheduler (we might not even need to do that, since the scheduler would sum the old/new flavors and the allocation would fail for that host anyway).

One issue with doing this and bypassing the scheduler for the same host is that anything in the new flavor that is not accounted for in placement, like NUMA or PCI requests, could fail during the resize_claim on the compute rather than failing earlier during scheduling. Maybe that's a trade-off we have to live with for the time being.
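Putting steps 1-2b above into a rough control-flow sketch (hypothetical helpers, and node_rp_uuid is an assumed attribute; RequestSpec.ignore_hosts and Migration.migration_type are the real fields referenced above):

def same_host_first_resize(request_spec, instance, migration, old, new,
                           put_allocations, get_alternates, schedule):
    # 1. Only try this for a resize that is allowed to land on the same host.
    if (migration.migration_type != 'resize'
            or instance.host in (request_spec.ignore_hosts or [])):
        return schedule(request_spec)

    # 2. Try to claim max(old, new) per resource class for the instance
    #    consumer on the current compute node provider (no migration swap).
    wanted = {rc: max(old.get(rc, 0), new.get(rc, 0)) for rc in set(old) | set(new)}
    if put_allocations(instance.uuid, instance.node_rp_uuid, wanted):
        # 2a. Same-host claim worked; fetch alternates in case the resize
        #     fails, ignoring the host we already claimed.
        request_spec.ignore_hosts = [instance.host]
        return [instance.host] + get_alternates(request_spec)

    # 2b. The claim failed, so this host can't fit the resize; ignore it and
    #     fall back to normal scheduling.
    request_spec.ignore_hosts = [instance.host]
    return schedule(request_spec)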

--

An alternative to this is to simply drop that "double up" code in the scheduler and always claim using the new flavor allocations. But at this point I'm not even sure _move_operation_alloc_request does anything: since Queens we have moved the source node allocations to the migration record, and the instance does not have allocations while scheduling to a dest host. That means we could still fail the claim, since migration (old_flavor) + instance (new_flavor) on the same host is a sum and could exceed capacity.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Oh, another thing: the new flavor could have new required traits which might not fit the current host, so bypassing GET /allocation_candidates could mean we resize to the same host and violate the required traits on the new flavor...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/638791

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/638700
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4363b10f5b9eaa7be2df36a94b6bbad5f4674c57
Submitter: Zuul
Branch: master

commit 4363b10f5b9eaa7be2df36a94b6bbad5f4674c57
Author: Matt Riedemann <email address hidden>
Date: Fri Feb 22 11:04:14 2019 -0500

    Remove misleading code from _move_operation_alloc_request()

    Change I1c9442eed850a3eb7ac9871fafcb0ae93ba8117c in Pike
    made the scheduler "double up" the old_flavor and new_flavor
    allocations when resizing to the same host. If there were
    Ocata computes still in the deployment at the time, which would
    be normal in Pike for rolling upgrades, the ResourceTracker
    would overwrite the instance allocations in placement based on
    the instance.flavor when the update_available_resource periodic
    task would run.

    If that periodic ran after scheduling but before finish_resize(),
    the old_flavor would be used to report allocations. If the periodic
    ran after finish_resize(), the new_flavor would be used to report
    allocations.

    That Ocata-compute auto-heal code was removed in Rocky with change
    I39d93dbf8552605e34b9f146e3613e6af62a1774, but should have effectively
    been vestigial since Queens when nova-compute should be at most N-1 so
    there should be no Ocata compute services.

    This change removes the misleading Pike-era code in the
    _move_operation_alloc_request() which sums the allocations for the
    old and new flavor when resizing to the same host since:

    1. The compute service no longer does what the comment says.
    2. Since Queens, conductor swaps the instance-held allocations
       on the source node to the migration record and the
       _move_operation_alloc_request method is only called if the
       instance has allocations, which it won't during resize. So the
       only time _move_operation_alloc_request is called now is during
       an evacuate because conductor doesn't do the allocation swap in
       that case. And since you can't evacuate to the same host, the
       elif block in _move_operation_alloc_request is dead code.

    Note that change I1c9442eed850a3eb7ac9871fafcb0ae93ba8117c was
    effectively a change in behavior for resize to the same host
    because the scheduler sums the old/new flavor resource allocations
    which could result in a false NoValidHost error when in reality
    the total allocation for the instance is going to be the maximum
    of the resource class allocations between the old and new flavor,
    e.g. if the compute has 8 VCPUs total and the instance is using
    6 VCPUs with the old flavor, and then resized to a new flavor with
    8 VCPUs, the scheduler is trying to "claim" 14 VCPUs when really the
    instance will only use 8 and that causes a NoValidHost error.
    Comparing to the pre-Pike scheduling and ResourceTracker.resize_claim
    behavior for resize to the same host, the scheduler would only filter
    on the new flavor resources and the resize_claim would only claim
    based on the new_flavor resources as well. The update_available_resource
    periodic would account for o...


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/645954

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/645954
Reason: I'm not actively working on this. There are ideas in here and comments along with https://review.opendev.org/#/c/638791/ which could probably be formed into a solution.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/638791
Reason: I'm not actively working on this. There are ideas in here and comments along with https://review.opendev.org/#/c/645954/ which could probably be formed into a solution.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Per comment 12, this is a latent issue and a duplicate of bug 1609193, which goes back to at least Newton, but I'm sure it was always an issue for resize on the same host.
