Check compute_id existence when nova-compute reports info to placement

Bug #1817833 reported by xulei
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: Medium
Assigned to: Matt Riedemann

Bug Description

Description
===========
According to https://bugs.launchpad.net/nova/+bug/1756179, deleting a nova-compute service currently deletes the compute_node records, resource provider records, and host mapping records in the DB. I found that if the service is deleted while the nova-compute service is active, the compute_node and resource_provider records are deleted from the DB without any problem, but nova-compute continues to report the old resource_provider uuid. So when we restart nova-compute to recover the service, it raises ResourceProviderCreationFailed.

Steps to reproduce
==================
1. Check the environment and the resource_provider table.
# nova service-list | grep 'nova-compute'
| 3d9092b0-e164-4094-8672-1c855971218d | nova-compute | devstack-q | nova | enabled | up |
MariaDB [placement]> select uuid,name from resource_providers;
+--------------------------------------+------------+
| uuid | name |
+--------------------------------------+------------+
| edfff022-c19f-4720-85f9-fd947ae36b07 | devstack-q |
+--------------------------------------+------------+

2. Delete the compute service while the nova-compute process is running, then check the resource_provider table.
# nova service-delete 3d9092b0-e164-4094-8672-1c855971218d
MariaDB [placement]> select * from resource_providers;
Empty set (0.00 sec)

3. Wait a minute, then restart the nova-compute process.
# systemctl restart devstack@n-cpu

Expected result
===============
nova-compute works properly and reports to resource_provider with a new uuid.

Actual result
===============
nova-compute raises a 409 when creating a resource_provider with a new uuid, and reports 'No resource provider with uuid 52943fd2-d700-416f-9e16-7fe4744979b3 found'.

I found that if nova-compute is running, it restores the old uuid to resource_providers when that uuid is gone. So the current resource_provider uuid in the DB is still 'edfff022-c19f-4720-85f9-fd947ae36b07'. Then, after the restart, nova-compute tries to create a new resource provider with the name 'devstack-q'. Unfortunately, the name column in the table is unique.
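
The collision described above can be illustrated with a minimal sketch. It assumes only that both uuid and name are unique in the resource_providers table; the table layout, the create_provider helper, and the returned status codes are toy stand-ins for illustration, not the placement implementation.

```python
import sqlite3

# Toy stand-in for the placement resource_providers table. The real
# schema has more columns; only the uniqueness of uuid and name matters here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resource_providers ("
    " uuid TEXT NOT NULL UNIQUE,"
    " name TEXT NOT NULL UNIQUE)"
)


def create_provider(uuid, name):
    """Hypothetical helper: return 201 on success, 409 on a uniqueness conflict."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO resource_providers (uuid, name) VALUES (?, ?)",
                (uuid, name),
            )
        return 201
    except sqlite3.IntegrityError:
        return 409


# The still-running nova-compute re-creates the provider under the old uuid.
old_status = create_provider("edfff022-c19f-4720-85f9-fd947ae36b07", "devstack-q")
# After the restart, a new uuid is used with the same hypervisor hostname.
new_status = create_provider("52943fd2-d700-416f-9e16-7fe4744979b3", "devstack-q")
```

In this toy model the second insert is rejected because the name 'devstack-q' is already taken by the row with the old uuid, which matches the 409 reported above.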

So I think we should check that the compute_id exists first, and only then update the resource_provider_tree. If it does not exist, raise ComputeHostNotFound instead of reporting.
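
A rough sketch of the proposed check, under stated assumptions: FakeComputeNodeStore, FakeReportClient, and report_if_compute_node_exists are hypothetical names invented for this illustration, and ComputeHostNotFound is a local stand-in for the nova exception of the same name. This is not the code from the proposed review.

```python
class ComputeHostNotFound(Exception):
    """Local stand-in for the nova exception of the same name."""


class FakeComputeNodeStore:
    """Hypothetical in-memory stand-in for the compute_nodes table."""

    def __init__(self):
        self._ids = set()

    def add(self, compute_id):
        self._ids.add(compute_id)

    def delete(self, compute_id):
        self._ids.discard(compute_id)

    def exists(self, compute_id):
        return compute_id in self._ids


class FakeReportClient:
    """Hypothetical stand-in that records what would be sent to placement."""

    def __init__(self):
        self.reported = []

    def report(self, compute_id):
        self.reported.append(compute_id)


def report_if_compute_node_exists(store, report_client, compute_id, host):
    """Report to placement only if the compute node record still exists."""
    if not store.exists(compute_id):
        # The service was deleted underneath the running process:
        # fail loudly instead of re-creating the old resource provider.
        raise ComputeHostNotFound(host)
    report_client.report(compute_id)
```

With this ordering, a compute node whose record has been deleted never reaches the reporting step, so the old resource provider would not be restored by the periodic update.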

xulei (605423512-j)
Changed in nova:
assignee: nobody → xulei (605423512-j)
tags: added: placement
Matt Riedemann (mriedem) wrote :

Which release are you testing this against? master (stein)?

xulei (605423512-j) wrote :

I found the problem in Pike (our product is based on Pike), and it also affects the master branch.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/641899

Changed in nova:
status: New → In Progress
Matt Riedemann (mriedem) wrote :

The docs explicitly say that the nova-compute service needs to be stopped before you delete the resource:

https://developer.openstack.org/api-ref/compute/?expanded=delete-compute-service-detail#delete-compute-service

Otherwise the running compute service will try to recreate the compute_nodes table records and the resource provider records.

Matt Riedemann (mriedem) wrote :

Having said that, I see something missed in the fix for bug 1756179 here:

https://github.com/openstack/nova/blob/b9bcbab86b8314fbaaeb2d2af6282d4a612aeb8d/nova/api/openstack/compute/services.py#L270

That does not account for ironic where the compute service could be managing more than one node and will only delete the resource provider in placement for the first compute node in the list:

https://github.com/openstack/nova/blob/b9bcbab86b8314fbaaeb2d2af6282d4a612aeb8d/nova/objects/service.py#L313
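
To make the ironic gap concrete, here is a minimal sketch contrasting "first compute node only" with "every compute node managed by the service". The ComputeNode dataclass and both helper functions are hypothetical names for this illustration and do not mirror the nova objects or the linked code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeNode:
    """Hypothetical, simplified view of a compute node record."""
    uuid: str
    hypervisor_hostname: str


def providers_to_delete_first_only(nodes):
    """Behavior described above: only the first node's provider is removed."""
    return [nodes[0].uuid] if nodes else []


def providers_to_delete_all(nodes):
    """Alternative: remove the provider of every node the service manages."""
    return [node.uuid for node in nodes]


# An ironic compute service can manage several nodes at once.
ironic_nodes = [
    ComputeNode("uuid-a", "ironic-node-a"),
    ComputeNode("uuid-b", "ironic-node-b"),
    ComputeNode("uuid-c", "ironic-node-c"),
]

first_only = providers_to_delete_first_only(ironic_nodes)
all_nodes = providers_to_delete_all(ironic_nodes)
# Providers that would be left behind by the first-only approach.
left_behind = [uuid for uuid in all_nodes if uuid not in first_only]
```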

Matt Riedemann (mriedem) wrote :

Ah there is a bug for the issue I mentioned in comment 5:

https://bugs.launchpad.net/nova/+bug/1811726

Matt Riedemann (mriedem) wrote :

Bug 1829479 might be related somehow.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/663737

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/663737
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2629d65fbc15d8698f98117e0d6072810f70da03
Submitter: Zuul
Branch: master

commit 2629d65fbc15d8698f98117e0d6072810f70da03
Author: Matt Riedemann <email address hidden>
Date: Thu Jun 6 13:41:09 2019 -0400

    Add functional recreate test for bug 1829479 and bug 1817833

    Change I7b8622b178d5043ed1556d7bdceaf60f47e5ac80 started deleting
    the associated resource provider when a compute service is deleted.
    However, the delete_resource_provider cascade=True logic only looks
    for instances on the given compute service host being deleted which
    will miss (1) allocations remaining from evacuated servers and
    (2) unconfirmed migrations.

    Attempting to delete the resource provider results in an
    ResourceProviderInUse error which delete_resource_provider ignores
    for legacy reasons. This results in the compute service being
    deleted but the resource provider being orphaned. What's more,
    attempting to restart the now-deleted compute service will fail
    because nova-compute will try to create a new resource provider
    with a new uuid but with the same name (based on the hypervisor
    hostname). That failure is actually reported in bug 1817833.

    Change-Id: I69f52f1282c8361c9cdf90a523f3612139cb8423
    Related-Bug: #1829479
    Related-Bug: #1817833
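
The orphaning scenario in the commit message above can be modeled with a small in-memory sketch. ToyPlacement and the simplified delete_resource_provider below are assumptions made for illustration; they imitate the described behavior (cascade over instances on the host, ResourceProviderInUse ignored) and are not the nova or placement code.

```python
class ResourceProviderInUse(Exception):
    """Local stand-in for the error named in the commit message."""


class ToyPlacement:
    """Hypothetical in-memory model of providers and allocations."""

    def __init__(self):
        self.providers = {}    # provider uuid -> name
        self.allocations = {}  # consumer uuid -> provider uuid

    def delete_allocation(self, consumer_uuid):
        self.allocations.pop(consumer_uuid, None)

    def delete_provider(self, rp_uuid):
        if rp_uuid in self.allocations.values():
            raise ResourceProviderInUse(rp_uuid)
        del self.providers[rp_uuid]


def delete_resource_provider(placement, rp_uuid, instances_on_host):
    """Cascade delete that only knows about instances currently on the host."""
    for instance_uuid in instances_on_host:
        placement.delete_allocation(instance_uuid)
    try:
        placement.delete_provider(rp_uuid)
    except ResourceProviderInUse:
        # The error is swallowed, so the provider stays behind (orphaned).
        pass


placement = ToyPlacement()
placement.providers["rp-1"] = "host1"
placement.allocations["server-on-host"] = "rp-1"
# Allocation left over from a server that was evacuated to another host.
placement.allocations["evacuated-server"] = "rp-1"

delete_resource_provider(placement, "rp-1", ["server-on-host"])
orphaned = "rp-1" in placement.providers
```

In this model the provider survives the service deletion because the leftover allocation is never cleaned up, so a later attempt to create a provider with the same name would conflict.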

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/678100

Changed in nova:
assignee: xulei (605423512-j) → Matt Riedemann (mriedem)
Matt Riedemann (mriedem)
Changed in nova:
importance: Undecided → Medium
Matt Riedemann (mriedem) wrote :
Matt Riedemann (mriedem) wrote :
melanie witt (melwitt) wrote :

FWIW I just tried the recreate steps in QUEENS and did not run into the reported error:

http://paste.openstack.org/show/786525

TL;DR I deleted a compute service while the nova-compute service was running. There were errors while it was still running (ServiceNotFound: Service 5 could not be found.) but as soon as I restarted the nova-compute service, everything recovered 100%. There was no failure to create the resource provider.

melanie witt (melwitt) wrote :

Note that during my recreate, I did not have any instances or old evacuations or unconfirmed migrations related to the compute service I deleted.

melanie witt (melwitt) wrote :

Another note on my QUEENS recreate: after restarting the nova-compute process, I tried to boot an instance and it went to ERROR state with:

{"message": "Host 'ubuntu-xenial' is not mapped to any cell", "code": 400, "created": "2019-11-22T04:10:57Z"}

So then I did:

$ nova-manage cell_v2 discover_hosts

and tried to boot an instance again, and it worked.

melanie witt (melwitt) wrote :

The more I re-read the bug report in comment 0, the more I don't understand how the bug report could be correct. I also couldn't reproduce the reported behavior on stable/queens.

The bug reporter shows that the resource_providers table is empty after the service deletion (even while the nova-compute process is running).

So, when starting the nova-compute process again, how could it possibly get a collision in the resource_providers table and raise ResourceProviderCreationFailed?

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/695932

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/695932
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b18e42d20bd7d341e713292bdb179ae8e5530d33
Submitter: Zuul
Branch: stable/stein

commit b18e42d20bd7d341e713292bdb179ae8e5530d33
Author: Matt Riedemann <email address hidden>
Date: Thu Jun 6 13:41:09 2019 -0400

    Add functional recreate test for bug 1829479 and bug 1817833

    Change I7b8622b178d5043ed1556d7bdceaf60f47e5ac80 started deleting
    the associated resource provider when a compute service is deleted.
    However, the delete_resource_provider cascade=True logic only looks
    for instances on the given compute service host being deleted which
    will miss (1) allocations remaining from evacuated servers and
    (2) unconfirmed migrations.

    Attempting to delete the resource provider results in an
    ResourceProviderInUse error which delete_resource_provider ignores
    for legacy reasons. This results in the compute service being
    deleted but the resource provider being orphaned. What's more,
    attempting to restart the now-deleted compute service will fail
    because nova-compute will try to create a new resource provider
    with a new uuid but with the same name (based on the hypervisor
    hostname). That failure is actually reported in bug 1817833.

    Change-Id: I69f52f1282c8361c9cdf90a523f3612139cb8423
    Related-Bug: #1829479
    Related-Bug: #1817833
    (cherry picked from commit 2629d65fbc15d8698f98117e0d6072810f70da03)

tags: added: in-stable-stein
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.opendev.org/698106

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/698106
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6eda7409fff75449c97843b2d6ead0b3267a1099
Submitter: Zuul
Branch: stable/rocky

commit 6eda7409fff75449c97843b2d6ead0b3267a1099
Author: Matt Riedemann <email address hidden>
Date: Thu Jun 6 13:41:09 2019 -0400

    Add functional recreate test for bug 1829479 and bug 1817833

    Change I7b8622b178d5043ed1556d7bdceaf60f47e5ac80 started deleting
    the associated resource provider when a compute service is deleted.
    However, the delete_resource_provider cascade=True logic only looks
    for instances on the given compute service host being deleted which
    will miss (1) allocations remaining from evacuated servers and
    (2) unconfirmed migrations.

    Attempting to delete the resource provider results in an
    ResourceProviderInUse error which delete_resource_provider ignores
    for legacy reasons. This results in the compute service being
    deleted but the resource provider being orphaned. What's more,
    attempting to restart the now-deleted compute service will fail
    because nova-compute will try to create a new resource provider
    with a new uuid but with the same name (based on the hypervisor
    hostname). That failure is actually reported in bug 1817833.

    NOTE(mriedem): Note that in this backport a simple version of
    assertFlavorMatchesUsage is added since the original version from
    change If6aa37d9b6b48791e070799ab026c816fda4441c is not in Rocky.

    Change-Id: I69f52f1282c8361c9cdf90a523f3612139cb8423
    Related-Bug: #1829479
    Related-Bug: #1817833
    (cherry picked from commit 2629d65fbc15d8698f98117e0d6072810f70da03)
    (cherry picked from commit b18e42d20bd7d341e713292bdb179ae8e5530d33)

tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/699538

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.opendev.org/699698

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/queens)

Reviewed: https://review.opendev.org/699698
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=23ca5e5ac9b90ff45074ae9171f63ca060ebcedd
Submitter: Zuul
Branch: stable/queens

commit 23ca5e5ac9b90ff45074ae9171f63ca060ebcedd
Author: Matt Riedemann <email address hidden>
Date: Thu Jun 6 13:41:09 2019 -0400

    Add functional recreate test for bug 1829479 and bug 1817833

    Change I7b8622b178d5043ed1556d7bdceaf60f47e5ac80 started deleting
    the associated resource provider when a compute service is deleted.
    However, the delete_resource_provider cascade=True logic only looks
    for instances on the given compute service host being deleted which
    will miss (1) allocations remaining from evacuated servers and
    (2) unconfirmed migrations.

    Attempting to delete the resource provider results in an
    ResourceProviderInUse error which delete_resource_provider ignores
    for legacy reasons. This results in the compute service being
    deleted but the resource provider being orphaned. What's more,
    attempting to restart the now-deleted compute service will fail
    because nova-compute will try to create a new resource provider
    with a new uuid but with the same name (based on the hypervisor
    hostname). That failure is actually reported in bug 1817833.

    Conflicts:
          nova/tests/functional/integrated_helpers.py

    NOTE(mriedem): The conflict is due to not having change
    Iea283322124cb35fc0bc6d25f35548621e8c8c2f in Queens so the
    change to ProviderUsageBaseTestCase is made in test_servers.py
    rather than integrated_helpers.py.

    Change-Id: I69f52f1282c8361c9cdf90a523f3612139cb8423
    Related-Bug: #1829479
    Related-Bug: #1817833
    (cherry picked from commit 2629d65fbc15d8698f98117e0d6072810f70da03)
    (cherry picked from commit b18e42d20bd7d341e713292bdb179ae8e5530d33)
    (cherry picked from commit 6eda7409fff75449c97843b2d6ead0b3267a1099)

tags: added: in-stable-queens