placement api fails when nova tries to delete resource allocation after failed evacuation

Bug #1714924 reported by Balazs Gibizer
Affects: OpenStack Compute (nova)
Status: Invalid
Importance: Medium
Assigned to: Unassigned

Bug Description

During the investigation of bug #1713783 (After failed evacuation the recovered source compute tries to delete the instance) we noticed that nova tries to PUT an empty '{}' allocation for an instance, and that fails on the placement side. Bug #1713783 now seems to be solved by keeping the existing behavior, i.e. deleting the instance after a failed evacuation, so we have to fix nova not to PUT the allocation but to actually DELETE the allocation of the instance in placement.
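As a sketch of the intended fix (a hypothetical helper, not nova's actual SchedulerReportClient code), the cleanup path should issue a DELETE against the consumer's allocations instead of PUT-ing an empty document:

```python
def cleanup_allocation(session, placement_url, consumer_uuid):
    """Delete an instance's allocation in placement (sketch only).

    `session` is assumed to expose a requests-style delete();
    nova's real code goes through the scheduler report client.
    """
    url = '%s/allocations/%s' % (placement_url, consumer_uuid)
    # DELETE removes the whole allocation record; a PUT with an
    # empty '{}' body is rejected by placement's schema validation.
    resp = session.delete(url)
    # Placement returns 204 on success; 404 (nothing left to
    # delete) is also acceptable for a cleanup path.
    return resp.status_code in (204, 404)
```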

The problem can be reproduced with the regression test proposed in https://review.openstack.org/#/c/498482/ which produces the following stack trace:

    2017-09-04 11:33:24,989 INFO [nova.service] Starting compute node (version 16.0.1)
    2017-09-04 11:33:25,005 INFO [nova.compute.manager] Deleting instance as it has been evacuated from this host
    2017-09-04 11:33:25,060 INFO [nova.api.openstack.placement.requestlog] 127.0.0.1 "GET /placement/allocations/4d056cc0-b227-4f79-baab-f3e3cd1a6d00" status: 200 len: 134 microversion: 1.0
    2017-09-04 11:33:25,066 INFO [nova.api.openstack.placement.requestlog] 127.0.0.1 "PUT /placement/allocations/4d056cc0-b227-4f79-baab-f3e3cd1a6d00" status: 400 len: 651 microversion: 1.10
    2017-09-04 11:33:25,067 WARNING [nova.scheduler.client.report] Failed to save allocation for 4d056cc0-b227-4f79-baab-f3e3cd1a6d00. Got HTTP 400: <html>
     <head>
      <title>400 Bad Request</title>
     </head>
     <body>
      <h1>400 Bad Request</h1>
      The server could not comply with the request since it is either malformed or otherwise incorrect.<br /><br />
    JSON does not validate: {} does not have enough properties

    Failed validating 'minProperties' in schema['properties']['allocations']['items']['properties']['resources']:
        {'additionalProperties': False,
         'minProperties': 1,
         'patternProperties': {'^[0-9A-Z_]+$': {'minimum': 1,
                                                'type': 'integer'}},
         'type': 'object'}

    On instance['allocations'][0]['resources']:
        {}

     </body>
    </html>
    2017-09-04 11:33:25,067 ERROR [nova.compute.resource_tracker] Failed to clean allocation of evacuated instance on the source node bc78aa7f-07df-4e06-bb77-71aec7d92e5c
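The 400 comes from placement's JSON schema check on the 'resources' object shown in the trace. A pure-Python approximation of that check (a sketch, not placement's actual validator) shows why '{}' is rejected:

```python
import re

def validate_resources(resources):
    """Approximate placement's schema check for the 'resources'
    object (sketch only, not placement's real code)."""
    if not isinstance(resources, dict) or len(resources) < 1:
        # minProperties: 1 -> "{} does not have enough properties"
        return False
    for key, value in resources.items():
        if not re.match(r'^[0-9A-Z_]+$', key):
            return False  # additionalProperties: False
        if not isinstance(value, int) or value < 1:
            return False  # type: integer, minimum: 1
    return True
```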

tags: added: evac
tags: added: evacuate placement
removed: evac
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :
Sean Dague (sdague)
Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
Matt Riedemann (mriedem) wrote :

Yeah this used to cause a 500 in the placement API but that was changed to a 400:

https://review.openstack.org/#/c/499270/

Good spot on where we're doing this incorrectly though during evacuation.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Looking at jobs where this shows up in CI:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Failed%20to%20clean%20allocation%20of%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d

For example:

http://logs.openstack.org/26/577926/1/check/legacy-grenade-dsvm-neutron-multinode-live-migration/13d3684/logs/new/screen-n-cpu.txt#_2018-06-25_21_31_21_929

It's only showing up in stable/pike live migration + grenade jobs, likely due to a race: when the resource tracker runs and sees there is an Ocata compute in the deployment, it auto-heals allocations and removes the allocations from the other host:

http://logs.openstack.org/26/577926/1/check/legacy-grenade-dsvm-neutron-multinode-live-migration/13d3684/logs/new/screen-n-cpu.txt#_2018-06-25_21_31_21_201

2018-06-25 21:31:21.201 21212 DEBUG nova.compute.resource_tracker [req-e64b2af6-e08b-49c6-b9c5-82d6b9f2454a tempest-LiveMigrationRemoteConsolesV26Test-1813726258 tempest-LiveMigrationRemoteConsolesV26Test-1813726258] We're on a compute host from Nova version >=16 (Pike or later) in a deployment with at least one compute host version <16 (Ocata or earlier). Will auto-correct allocations to handle Ocata-style assumptions. _update_usage_from_instances /opt/stack/new/nova/nova/compute/resource_tracker.py:1204
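The auto-heal trigger described in that log line can be approximated like this (a hypothetical function, not the real _update_usage_from_instances logic):

```python
def should_auto_heal_allocations(local_version, peer_versions, pike=16):
    """True when this compute is Pike or later (service version >= 16)
    but at least one compute in the deployment is Ocata or earlier
    (< 16) -- a sketch of the condition the resource tracker logs."""
    return local_version >= pike and any(v < pike for v in peer_versions)
```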

The only time the error shows up on master branch jobs is in actual failures, so it's probably not worth investigating this on master for live migration since we have migration-based allocations since Queens:

https://specs.openstack.org/openstack/nova-specs/specs/queens/implemented/migration-allocations.html

And in stable/queens jobs we wouldn't have any <Pike computes to trigger the auto-heal code because the scheduler in pike handles allocations, not the computes/resource tracker.

I'm not sure if this is still an issue for evacuate and restoring an evacuated node.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

At least the test test_evacuate_with_no_compute that showed this stack trace does not produce the same trace any more. So I think the evacuate part is also OK on master.

Changed in nova:
status: Confirmed → Invalid