nova-live-migration and nova-grenade-multinode fail due to n-cpu restarting slowly after being reconfigured for ceph

Bug #1867380 reported by Lee Yarwood
Affects                    Status        Importance  Assigned to   Milestone
OpenStack Compute (nova)   Fix Released  High        Lee Yarwood
Pike                       Fix Released  Medium      Lee Yarwood
Queens                     Fix Released  Medium      Lee Yarwood
Rocky                      Fix Released  Medium      Lee Yarwood
Stein                      Fix Released  Medium      Lee Yarwood
Train                      Fix Released  Medium      Lee Yarwood

Bug Description

Description
===========

As per $subject, the current check of using grep to find active n-cpu processes isn't sufficient; we actually need to wait for the services to report as UP before starting to run Tempest.

In the following logs we can see Tempest starting at 2020-03-13 13:01:19.528 while n-cpu on the nodes isn't marked as UP for another ~20 seconds:

https://zuul.opendev.org/t/openstack/build/5c213f869f324b69a423a983034d4539/log/job-output.txt#6305

https://zuul.opendev.org/t/openstack/build/5c213f869f324b69a423a983034d4539/log/logs/screen-n-cpu.txt#3825

https://zuul.opendev.org/t/openstack/build/5c213f869f324b69a423a983034d4539/log/logs/subnode-2/screen-n-cpu.txt#3534

I've only seen this on stable/pike at present but it could potentially hit all branches with slow enough CI nodes.
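
The expected behaviour could be sketched as a polling loop like the one below. This is a minimal illustration, not the actual ceph.sh change: the function name, flags, and poll interval are assumptions, and it presumes the `openstack` CLI is available with admin credentials.

```shell
#!/bin/sh
# Hypothetical sketch: instead of grepping for a process, poll until every
# nova-compute service reports its state as "up" before starting Tempest.

wait_for_computes_up() {
    # $1: expected number of nova-compute services, $2: timeout in seconds
    expected=$1
    timeout=${2:-120}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # Count services whose State column reads "up"
        up=$(openstack compute service list --service nova-compute \
                 -f value -c State | grep -c '^up$')
        [ "$up" -ge "$expected" ] && return 0
        sleep 5
        elapsed=$((elapsed + 5))
    done
    return 1
}
```

A job hook could then call e.g. `wait_for_computes_up 2 300 || exit 1` after restarting the services, so Tempest never starts against a compute host that libvirt has not yet reported in to.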

Steps to reproduce
==================
Run nova-live-migration on slow CI nodes.

Expected result
===============
nova/tests/live_migration/hooks/ceph.sh waits until hosts are marked as UP before running Tempest.

Actual result
=============
nova/tests/live_migration/hooks/ceph.sh checks for running n-cpu processes and then immediately starts Tempest.

Environment
===========
1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/

   stable/pike, but this may be present on other branches with slow enough CI nodes.

2. Which hypervisor did you use?
   (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
   What's the version of that?

   Libvirt / KVM.

3. Which storage type did you use?
   (For example: Ceph, LVM, GPFS, ...)
   What's the version of that?

   N/A

4. Which networking type did you use?
   (For example: nova-network, Neutron with OpenVSwitch, ...)

   N/A

Logs & Configs
==============

Mar 13 13:01:39.170201 ubuntu-xenial-rax-iad-0015199005 nova-compute[30153]: DEBUG nova.compute.manager [None req-beafe617-34df-4bec-9ff6-4a0b7bebb15f None None] [instance: 74932102-3737-4f8f-9002-763b2d580c3a] Instance spawn was interrupted before instance_claim, setting instance to ERROR state {{(pid=30153) _error_out_instances_whose_build_was_interrupted /opt/stack/new/nova/nova/compute/manager.py:1323}}
Mar 13 13:01:39.255008 ubuntu-xenial-rax-iad-0015199005 nova-compute[30153]: DEBUG nova.compute.manager [None req-beafe617-34df-4bec-9ff6-4a0b7bebb15f None None] [instance: 042afab0-fbef-4506-84e2-1f54cb9d67ca] Instance spawn was interrupted before instance_claim, setting instance to ERROR state {{(pid=30153) _error_out_instances_whose_build_was_interrupted /opt/stack/new/nova/nova/compute/manager.py:1323}}
Mar 13 13:01:39.322508 ubuntu-xenial-rax-iad-0015199005 nova-compute[30153]: DEBUG nova.compute.manager [None req-beafe617-34df-4bec-9ff6-4a0b7bebb15f None None] [instance: cc293f53-7428-4e66-9841-20cce219e24f] Instance spawn was interrupted before instance_claim, setting instance to ERROR state {{(pid=30153) _error_out_instances_whose_build_was_interrupted /opt/stack/new/nova/nova/compute/manager.py:1323}}

Changed in nova:
assignee: nobody → Lee Yarwood (lyarwood)
status: New → In Progress
melanie witt (melwitt)
tags: added: live-migration testing
Changed in nova:
importance: Undecided → Medium
melanie witt (melwitt)
tags: added: gate-failure
Revision history for this message
melanie witt (melwitt) wrote :

I think I just hit this on the master branch on the nova-grenade-multinode job [1].

The error in job-output.txt was:

tempest.api.compute.admin.test_live_migration_negative.LiveMigrationNegativeTest.test_invalid_host_for_migration [7.432917s] ... FAILED

and

tempest.exceptions.BuildErrorException: Server e6d27a14-ee54-47b0-b44e-3d8db0d99e85 failed to build and is in ERROR status

I traced the server and found it was scheduled in screen-n-sch.txt:

Mar 18 21:02:56.722990 ubuntu-bionic-ovh-gra1-0015309365 nova-scheduler[15403]: DEBUG nova.scheduler.manager [None req-58795901-f2c5-4175-a590-c487e68f209d tempest-LiveMigrationNegativeTest-1769972548 tempest-LiveMigrationNegativeTest-1769972548] Starting to schedule for instances: ['e6d27a14-ee54-47b0-b44e-3d8db0d99e85'] {{(pid=16667) select_destinations /opt/stack/new/nova/nova/scheduler/manager.py:134}}

Mar 18 21:02:57.031615 ubuntu-bionic-ovh-gra1-0015309365 nova-scheduler[15403]: DEBUG nova.scheduler.utils [None req-58795901-f2c5-4175-a590-c487e68f209d tempest-LiveMigrationNegativeTest-1769972548 tempest-LiveMigrationNegativeTest-1769972548] Attempting to claim resources in the placement API for instance e6d27a14-ee54-47b0-b44e-3d8db0d99e85 {{(pid=16667) claim_resources /opt/stack/new/nova/nova/scheduler/utils.py:1175}}

Mar 18 21:02:57.490996 ubuntu-bionic-ovh-gra1-0015309365 nova-scheduler[15403]: DEBUG nova.scheduler.filter_scheduler [None req-58795901-f2c5-4175-a590-c487e68f209d tempest-LiveMigrationNegativeTest-1769972548 tempest-LiveMigrationNegativeTest-1769972548] [instance: e6d27a14-ee54-47b0-b44e-3d8db0d99e85] Selected host: (ubuntu-bionic-ovh-gra1-0015309367, ubuntu-bionic-ovh-gra1-0015309367) ram: 7273MB disk: 51200MB io_ops: 0 instances: 0 {{(pid=16667) _consume_selected_host /opt/stack/new/nova/nova/scheduler/filter_scheduler.py:354}}

But then when I went to go find it in nova-compute, I found this in screen-n-cpu.txt on the subnode:

Mar 18 21:03:01.566901 ubuntu-bionic-ovh-gra1-0015309367 nova-compute[3783]: DEBUG nova.compute.manager [None req-001a485d-3f4a-43fa-8719-77d0f433b609 None None] [instance: e6d27a14-ee54-47b0-b44e-3d8db0d99e85] Instance spawn was interrupted before instance_claim, setting instance to ERROR state {{(pid=3783) _error_out_instances_whose_build_was_interrupted /opt/stack/old/nova/nova/compute/manager.py:1441}}

The server never got a chance to finish building because nova-compute was starting up (init_host) (!!) right in the middle of the build.

Looking back at job-output.txt, I see the last messages were about checking and restarting nova-compute:

2020-03-18 21:02:32.510701 | primary | 2020-03-18 21:02:32.510 | check compute processes before restart

So it's trying to run the test before nova-compute has finished starting and come back up.

[1] https://zuul.opendev.org/t/openstack/build/2caa70137d4f438b90cdd679d99ebe05

Changed in nova:
importance: Medium → High
summary: - nova-live-migration fails due to n-cpu restarting slowly after being
- reconfigured for ceph
+ nova-live-migration and nova-grenade-multinode fail due to n-cpu
+ restarting slowly after being reconfigured for ceph
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/713035
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e23c3c2c8df3843c5853c87ef684bd21c4af95d8
Submitter: Zuul
Branch: master

commit e23c3c2c8df3843c5853c87ef684bd21c4af95d8
Author: Lee Yarwood <email address hidden>
Date: Fri Mar 13 16:51:01 2020 +0000

    nova-live-migration: Wait for n-cpu services to come up after configuring Ceph

    Previously the ceph.sh script used during the nova-live-migration job
    would only grep for a `compute` process when checking if the services
    had been restarted. This check was bogus and would always return 0 as it
    would always match itself. For example:

    2020-03-13 21:06:47.682073 | primary | 2020-03-13 21:06:47.681 | root
    29529 0.0 0.0 4500 736 pts/0 S+ 21:06 0:00 /bin/sh -c ps
           aux | grep compute
    2020-03-13 21:06:47.683964 | primary | 2020-03-13 21:06:47.683 | root
    29531 0.0 0.0 14616 944 pts/0 S+ 21:06 0:00 grep compute

    Failures of this job were seen on the stable/pike branch where slower CI
    nodes appeared to struggle to allow Libvirt to report to n-cpu in time
    before Tempest was started. This in-turn caused instance build failures
    and the overall failure of the job.

    This change resolves this issue by switching to pgrep and ensuring
    n-cpu services are reported as fully up after a cold restart before
    starting the Tempest test run.

    Closes-Bug: 1867380
    Change-Id: Icd7ab2ca4ddbed92c7e883a63a23245920d961e7
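
The self-match failure mode described in the commit can be reproduced in a couple of lines. This is an illustrative sketch, not code from ceph.sh: the `sh -c` wrapper's argument list contains the string "compute", so it appears in the `ps aux` output and grep always finds a match, whereas pgrep matches on the process name and excludes itself.

```shell
#!/bin/sh
# Bogus check: always exits 0, because the `sh -c 'ps aux | grep compute'`
# process itself contains "compute" in its arguments and so shows up in ps.
check_with_grep() {
    sh -c 'ps aux | grep compute' > /dev/null
}

# Fixed check: pgrep matches process names and never matches itself, so it
# fails when no compute process is actually running.
check_with_pgrep() {
    pgrep compute > /dev/null
}

check_with_grep && echo "grep check: passes even with no compute process"
check_with_pgrep || echo "pgrep check: correctly reports nothing running"
```

This is why the fix both switches to pgrep and, separately, waits for the services to report as UP: pgrep only proves a process exists, not that libvirt has reported in and the service is usable.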

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/713836

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/713837

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/713840

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/713845

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/713836
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=70447bca2f4f33c6872eaf94a2e4351bb257c22a
Submitter: Zuul
Branch: stable/train

commit 70447bca2f4f33c6872eaf94a2e4351bb257c22a
Author: Lee Yarwood <email address hidden>
Date: Fri Mar 13 16:51:01 2020 +0000

    nova-live-migration: Wait for n-cpu services to come up after configuring Ceph

    Previously the ceph.sh script used during the nova-live-migration job
    would only grep for a `compute` process when checking if the services
    had been restarted. This check was bogus and would always return 0 as it
    would always match itself. For example:

    2020-03-13 21:06:47.682073 | primary | 2020-03-13 21:06:47.681 | root
    29529 0.0 0.0 4500 736 pts/0 S+ 21:06 0:00 /bin/sh -c ps
           aux | grep compute
    2020-03-13 21:06:47.683964 | primary | 2020-03-13 21:06:47.683 | root
    29531 0.0 0.0 14616 944 pts/0 S+ 21:06 0:00 grep compute

    Failures of this job were seen on the stable/pike branch where slower CI
    nodes appeared to struggle to allow Libvirt to report to n-cpu in time
    before Tempest was started. This in-turn caused instance build failures
    and the overall failure of the job.

    This change resolves this issue by switching to pgrep and ensuring
    n-cpu services are reported as fully up after a cold restart before
    starting the Tempest test run.

    Closes-Bug: 1867380
    Change-Id: Icd7ab2ca4ddbed92c7e883a63a23245920d961e7
    (cherry picked from commit e23c3c2c8df3843c5853c87ef684bd21c4af95d8)

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/713837
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=373c4ffde2053c7ff11bd38339b88d144cd442f2
Submitter: Zuul
Branch: stable/stein

commit 373c4ffde2053c7ff11bd38339b88d144cd442f2
Author: Lee Yarwood <email address hidden>
Date: Fri Mar 13 16:51:01 2020 +0000

    nova-live-migration: Wait for n-cpu services to come up after configuring Ceph

    Previously the ceph.sh script used during the nova-live-migration job
    would only grep for a `compute` process when checking if the services
    had been restarted. This check was bogus and would always return 0 as it
    would always match itself. For example:

    2020-03-13 21:06:47.682073 | primary | 2020-03-13 21:06:47.681 | root
    29529 0.0 0.0 4500 736 pts/0 S+ 21:06 0:00 /bin/sh -c ps
           aux | grep compute
    2020-03-13 21:06:47.683964 | primary | 2020-03-13 21:06:47.683 | root
    29531 0.0 0.0 14616 944 pts/0 S+ 21:06 0:00 grep compute

    Failures of this job were seen on the stable/pike branch where slower CI
    nodes appeared to struggle to allow Libvirt to report to n-cpu in time
    before Tempest was started. This in-turn caused instance build failures
    and the overall failure of the job.

    This change resolves this issue by switching to pgrep and ensuring
    n-cpu services are reported as fully up after a cold restart before
    starting the Tempest test run.

    Closes-Bug: 1867380
    Change-Id: Icd7ab2ca4ddbed92c7e883a63a23245920d961e7
    (cherry picked from commit e23c3c2c8df3843c5853c87ef684bd21c4af95d8)
    (cherry picked from commit 70447bca2f4f33c6872eaf94a2e4351bb257c22a)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/713840
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=63ed32ef49adcb6830ef3b5329a561542bddf656
Submitter: Zuul
Branch: stable/rocky

commit 63ed32ef49adcb6830ef3b5329a561542bddf656
Author: Lee Yarwood <email address hidden>
Date: Fri Mar 13 16:51:01 2020 +0000

    nova-live-migration: Wait for n-cpu services to come up after configuring Ceph

    Previously the ceph.sh script used during the nova-live-migration job
    would only grep for a `compute` process when checking if the services
    had been restarted. This check was bogus and would always return 0 as it
    would always match itself. For example:

    2020-03-13 21:06:47.682073 | primary | 2020-03-13 21:06:47.681 | root
    29529 0.0 0.0 4500 736 pts/0 S+ 21:06 0:00 /bin/sh -c ps
           aux | grep compute
    2020-03-13 21:06:47.683964 | primary | 2020-03-13 21:06:47.683 | root
    29531 0.0 0.0 14616 944 pts/0 S+ 21:06 0:00 grep compute

    Failures of this job were seen on the stable/pike branch where slower CI
    nodes appeared to struggle to allow Libvirt to report to n-cpu in time
    before Tempest was started. This in-turn caused instance build failures
    and the overall failure of the job.

    This change resolves this issue by switching to pgrep and ensuring
    n-cpu services are reported as fully up after a cold restart before
    starting the Tempest test run.

    Closes-Bug: 1867380
    Change-Id: Icd7ab2ca4ddbed92c7e883a63a23245920d961e7
    (cherry picked from commit e23c3c2c8df3843c5853c87ef684bd21c4af95d8)
    (cherry picked from commit 70447bca2f4f33c6872eaf94a2e4351bb257c22a)
    (cherry picked from commit 373c4ffde2053c7ff11bd38339b88d144cd442f2)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.opendev.org/713845
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=0718015f3fd2899720613bfef789f7023f112e30
Submitter: Zuul
Branch: stable/queens

commit 0718015f3fd2899720613bfef789f7023f112e30
Author: Lee Yarwood <email address hidden>
Date: Fri Mar 13 16:51:01 2020 +0000

    nova-live-migration: Wait for n-cpu services to come up after configuring Ceph

    Previously the ceph.sh script used during the nova-live-migration job
    would only grep for a `compute` process when checking if the services
    had been restarted. This check was bogus and would always return 0 as it
    would always match itself. For example:

    2020-03-13 21:06:47.682073 | primary | 2020-03-13 21:06:47.681 | root
    29529 0.0 0.0 4500 736 pts/0 S+ 21:06 0:00 /bin/sh -c ps
           aux | grep compute
    2020-03-13 21:06:47.683964 | primary | 2020-03-13 21:06:47.683 | root
    29531 0.0 0.0 14616 944 pts/0 S+ 21:06 0:00 grep compute

    Failures of this job were seen on the stable/pike branch where slower CI
    nodes appeared to struggle to allow Libvirt to report to n-cpu in time
    before Tempest was started. This in-turn caused instance build failures
    and the overall failure of the job.

    This change resolves this issue by switching to pgrep and ensuring
    n-cpu services are reported as fully up after a cold restart before
    starting the Tempest test run.

    Closes-Bug: 1867380
    Change-Id: Icd7ab2ca4ddbed92c7e883a63a23245920d961e7
    (cherry picked from commit e23c3c2c8df3843c5853c87ef684bd21c4af95d8)
    (cherry picked from commit 70447bca2f4f33c6872eaf94a2e4351bb257c22a)
    (cherry picked from commit 373c4ffde2053c7ff11bd38339b88d144cd442f2)
    (cherry picked from commit 63ed32ef49adcb6830ef3b5329a561542bddf656)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.opendev.org/713036
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=58cee69b2ff3e5451bbb4c500b4a985012020df1
Submitter: Zuul
Branch: stable/pike

commit 58cee69b2ff3e5451bbb4c500b4a985012020df1
Author: Matt Riedemann <email address hidden>
Date: Wed Feb 6 15:50:58 2019 -0500

    nova-live-migration: Wait for n-cpu services to come up after configuring Ceph

    Previously the ceph.sh script used during the nova-live-migration job
    would only grep for a `compute` process when checking if the services
    had been restarted. This check was bogus and would always return 0 as it
    would always match itself. For example:

    2020-03-13 21:06:47.682073 | primary | 2020-03-13 21:06:47.681 | root
    29529 0.0 0.0 4500 736 pts/0 S+ 21:06 0:00 /bin/sh -c ps
           aux | grep compute
    2020-03-13 21:06:47.683964 | primary | 2020-03-13 21:06:47.683 | root
    29531 0.0 0.0 14616 944 pts/0 S+ 21:06 0:00 grep compute

    Failures of this job were seen on the stable/pike branch where slower CI
    nodes appeared to struggle to allow Libvirt to report to n-cpu in time
    before Tempest was started. This in-turn caused instance build failures
    and the overall failure of the job.

    This change resolves this issue by switching to pgrep and ensuring
    n-cpu services are reported as fully up after a cold restart before
    starting the Tempest test run.

    NOTE(lyarwood): The following change is squashed here to avoid endless
    retries in the gate due to bug #1867380.

    Replace ansible --sudo with --become in live_migration/hooks scripts

    Ansible deprecated --sudo in 1.9 so this change replaces
    it with --become.

    NOTE(lyarwood): Conflict due to
    Ifbadce909393268b340b7a08c78a6faa2d7888b2 not being present in
    stable/pike.

    Conflicts:
        nova/tests/live_migration/hooks/ceph.sh

    Change-Id: I40f40766a7b84423c1dcf9d5ed58476b86d61cc4
    (cherry picked from commit 7f16800f71f6124736382be51d9da234800f7618)
    (cherry picked from commit 18931544d8a57953c6ce9ee4bf4bcc7a4e9e4295)
    (cherry picked from commit 1a09f753559aa7ed617192853215c5b0ace7756a)

    Closes-Bug: 1867380
    Change-Id: Icd7ab2ca4ddbed92c7e883a63a23245920d961e7
    (cherry picked from commit e23c3c2c8df3843c5853c87ef684bd21c4af95d8)
    (cherry picked from commit 70447bca2f4f33c6872eaf94a2e4351bb257c22a)
    (cherry picked from commit 373c4ffde2053c7ff11bd38339b88d144cd442f2)
    (cherry picked from commit 63ed32ef49adcb6830ef3b5329a561542bddf656)
    (cherry picked from commit 0718015f3fd2899720613bfef789f7023f112e30)

tags: added: in-stable-pike
Revision history for this message
melanie witt (melwitt) wrote :

I realized we never had an e-r query for this, so if we ever need one, this might work (need to double check that it wouldn't return any successful runs before using):

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Instance%20spawn%20was%20interrupted%20before%20instance_claim%2C%20setting%20instance%20to%20ERROR%20state%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova pike-eol

This issue was fixed in the openstack/nova pike-eol release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova queens-eol

This issue was fixed in the openstack/nova queens-eol release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova rocky-eol

This issue was fixed in the openstack/nova rocky-eol release.
