nova-live-migration and nova-grenade-multinode fail due to n-cpu restarting slowly after being reconfigured for ceph

Bug #1867380 reported by Lee Yarwood
Affects                    Status        Importance  Assigned to   Milestone
OpenStack Compute (nova)   Fix Released  High        Lee Yarwood
Pike                       Fix Released  Medium      Lee Yarwood
Queens                     Fix Released  Medium      Lee Yarwood
Rocky                      Fix Released  Medium      Lee Yarwood
Stein                      Fix Released  Medium      Lee Yarwood
Train                      Fix Released  Medium      Lee Yarwood

Bug Description

Description
===========

As per $subject, the current check of using grep to find active n-cpu processes isn't sufficient; we actually need to wait for the services to report as UP before starting to run Tempest.

In the following logs we can see Tempest starting at 2020-03-13 13:01:19.528 while n-cpu on the nodes isn't marked as UP for another ~20 seconds:

https://zuul.opendev.org/t/openstack/build/5c213f869f324b69a423a983034d4539/log/job-output.txt#6305

https://zuul.opendev.org/t/openstack/build/5c213f869f324b69a423a983034d4539/log/logs/screen-n-cpu.txt#3825

https://zuul.opendev.org/t/openstack/build/5c213f869f324b69a423a983034d4539/log/logs/subnode-2/screen-n-cpu.txt#3534

I've only seen this on stable/pike at present but it could potentially hit all branches with slow enough CI nodes.
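
The expected behaviour could be sketched as a polling loop like the one below. This is a minimal illustration, not the actual ceph.sh change: the function name, flags, and poll interval are assumptions, and it presumes the `openstack` CLI is available with admin credentials.

```shell
#!/bin/sh
# Hypothetical sketch: instead of grepping for a process, poll until every
# nova-compute service reports its state as "up" before starting Tempest.

wait_for_computes_up() {
    # $1: expected number of nova-compute services, $2: timeout in seconds
    expected=$1
    timeout=${2:-120}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # Count services whose State column reads "up"
        up=$(openstack compute service list --service nova-compute \
                 -f value -c State | grep -c '^up$')
        [ "$up" -ge "$expected" ] && return 0
        sleep 5
        elapsed=$((elapsed + 5))
    done
    return 1
}
```

A job hook could then call e.g. `wait_for_computes_up 2 300 || exit 1` after restarting the services, so Tempest never starts against a compute host that libvirt has not yet reported in to.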

Steps to reproduce
==================
Run nova-live-migration on slow CI nodes.

Expected result
===============
nova/tests/live_migration/hooks/ceph.sh waits until hosts are marked as UP before running Tempest.

Actual result
=============
nova/tests/live_migration/hooks/ceph.sh checks for running n-cpu processes and then immediately starts Tempest.

Environment
===========
1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/

   stable/pike, but this may be present on other branches with slow enough CI nodes.

2. Which hypervisor did you use?
   (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
   What's the version of that?

   Libvirt / KVM.

3. Which storage type did you use?
   (For example: Ceph, LVM, GPFS, ...)
   What's the version of that?

   N/A

4. Which networking type did you use?
   (For example: nova-network, Neutron with OpenVSwitch, ...)

   N/A

Logs & Configs
==============

Mar 13 13:01:39.170201 ubuntu-xenial-rax-iad-0015199005 nova-compute[30153]: DEBUG nova.compute.manager [None req-beafe617-34df-4bec-9ff6-4a0b7bebb15f None None] [instance: 74932102-3737-4f8f-9002-763b2d580c3a] Instance spawn was interrupted before instance_claim, setting instance to ERROR state {{(pid=30153) _error_out_instances_whose_build_was_interrupted /opt/stack/new/nova/nova/compute/manager.py:1323}}
Mar 13 13:01:39.255008 ubuntu-xenial-rax-iad-0015199005 nova-compute[30153]: DEBUG nova.compute.manager [None req-beafe617-34df-4bec-9ff6-4a0b7bebb15f None None] [instance: 042afab0-fbef-4506-84e2-1f54cb9d67ca] Instance spawn was interrupted before instance_claim, setting instance to ERROR state {{(pid=30153) _error_out_instances_whose_build_was_interrupted /opt/stack/new/nova/nova/compute/manager.py:1323}}
Mar 13 13:01:39.322508 ubuntu-xenial-rax-iad-0015199005 nova-compute[30153]: DEBUG nova.compute.manager [None req-beafe617-34df-4bec-9ff6-4a0b7bebb15f None None] [instance: cc293f53-7428-4e66-9841-20cce219e24f] Instance spawn was interrupted before instance_claim, setting instance to ERROR state {{(pid=30153) _error_out_instances_whose_build_was_interrupted /opt/stack/new/nova/nova/compute/manager.py:1323}}

Changed in nova:
assignee: nobody → Lee Yarwood (lyarwood)
status: New → In Progress
melanie witt (melwitt)
tags: added: live-migration testing
Changed in nova:
importance: Undecided → Medium
melanie witt (melwitt)
tags: added: gate-failure
Revision history for this message
melanie witt (melwitt) wrote :

I think I just hit this on the master branch on the nova-grenade-multinode job [1].

The error in job-output.txt was:

tempest.api.compute.admin.test_live_migration_negative.LiveMigrationNegativeTest.test_invalid_host_for_migration [7.432917s] ... FAILED

and

tempest.exceptions.BuildErrorException: Server e6d27a14-ee54-47b0-b44e-3d8db0d99e85 failed to build and is in ERROR status

I traced the server and found it was scheduled in screen-n-sch.txt:

Mar 18 21:02:56.722990 ubuntu-bionic-ovh-gra1-0015309365 nova-scheduler[15403]: DEBUG nova.scheduler.manager [None req-58795901-f2c5-4175-a590-c487e68f209d tempest-LiveMigrationNegativeTest-1769972548 tempest-LiveMigrationNegativeTest-1769972548] Starting to schedule for instances: ['e6d27a14-ee54-47b0-b44e-3d8db0d99e85'] {{(pid=16667) select_destinations /opt/stack/new/nova/nova/scheduler/manager.py:134}}

Mar 18 21:02:57.031615 ubuntu-bionic-ovh-gra1-0015309365 nova-scheduler[15403]: DEBUG nova.scheduler.utils [None req-58795901-f2c5-4175-a590-c487e68f209d tempest-LiveMigrationNegativeTest-1769972548 tempest-LiveMigrationNegativeTest-1769972548] Attempting to claim resources in the placement API for instance e6d27a14-ee54-47b0-b44e-3d8db0d99e85 {{(pid=16667) claim_resources /opt/stack/new/nova/nova/scheduler/utils.py:1175}}

Mar 18 21:02:57.490996 ubuntu-bionic-ovh-gra1-0015309365 nova-scheduler[15403]: DEBUG nova.scheduler.filter_scheduler [None req-58795901-f2c5-4175-a590-c487e68f209d tempest-LiveMigrationNegativeTest-1769972548 tempest-LiveMigrationNegativeTest-1769972548] [instance: e6d27a14-ee54-47b0-b44e-3d8db0d99e85] Selected host: (ubuntu-bionic-ovh-gra1-0015309367, ubuntu-bionic-ovh-gra1-0015309367) ram: 7273MB disk: 51200MB io_ops: 0 instances: 0 {{(pid=16667) _consume_selected_host /opt/stack/new/nova/nova/scheduler/filter_scheduler.py:354}}

But then when I went to go find it in nova-compute, I found this in screen-n-cpu.txt on the subnode:

Mar 18 21:03:01.566901 ubuntu-bionic-ovh-gra1-0015309367 nova-compute[3783]: DEBUG nova.compute.manager [None req-001a485d-3f4a-43fa-8719-77d0f433b609 None None] [instance: e6d27a14-ee54-47b0-b44e-3d8db0d99e85] Instance spawn was interrupted before instance_claim, setting instance to ERROR state {{(pid=3783) _error_out_instances_whose_build_was_interrupted /opt/stack/old/nova/nova/compute/manager.py:1441}}

The server never got a chance to finish building because nova-compute was starting up (init_host) (!!) right in the middle of the build.

Looking back at job-output.txt, I see the last messages were about checking and restarting nova-compute:

2020-03-18 21:02:32.510701 | primary | 2020-03-18 21:02:32.510 | check compute processes before restart

So it's trying to run the test before nova-compute has finished starting and come back up.

[1] https://zuul.opendev.org/t/openstack/build/2caa70137d4f438b90cdd679d99ebe05

Changed in nova:
importance: Medium → High
summary: - nova-live-migration fails due to n-cpu restarting slowly after being
- reconfigured for ceph
+ nova-live-migration and nova-grenade-multinode fail due to n-cpu
+ restarting slowly after being reconfigured for ceph
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/713035
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e23c3c2c8df3843c5853c87ef684bd21c4af95d8
Submitter: Zuul
Branch: master

commit e23c3c2c8df3843c5853c87ef684bd21c4af95d8
Author: Lee Yarwood <email address hidden>
Date: Fri Mar 13 16:51:01 2020 +0000

    nova-live-migration: Wait for n-cpu services to come up after configuring Ceph

    Previously the ceph.sh script used during the nova-live-migration job
    would only grep for a `compute` process when checking if the services
    had been restarted. This check was bogus and would always return 0 as it
    would always match itself. For example:

    2020-03-13 21:06:47.682073 | primary | 2020-03-13 21:06:47.681 | root
    29529 0.0 0.0 4500 736 pts/0 S+ 21:06 0:00 /bin/sh -c ps
           aux | grep compute
    2020-03-13 21:06:47.683964 | primary | 2020-03-13 21:06:47.683 | root
    29531 0.0 0.0 14616 944 pts/0 S+ 21:06 0:00 grep compute

    Failures of this job were seen on the stable/pike branch where slower CI
    nodes appeared to struggle to allow Libvirt to report to n-cpu in time
    before Tempest was started. This in-turn caused instance build failures
    and the overall failure of the job.

    This change resolves this issue by switching to pgrep and ensuring
    n-cpu services are reported as fully up after a cold restart before
    starting the Tempest test run.

    Closes-Bug: 1867380
    Change-Id: Icd7ab2ca4ddbed92c7e883a63a23245920d961e7
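
The self-match failure mode described in the commit can be reproduced in a couple of lines. This is an illustrative sketch, not code from ceph.sh: the `sh -c` wrapper's argument list contains the string "compute", so it appears in the `ps aux` output and grep always finds a match, whereas pgrep matches on the process name and excludes itself.

```shell
#!/bin/sh
# Bogus check: always exits 0, because the `sh -c 'ps aux | grep compute'`
# process itself contains "compute" in its arguments and so shows up in ps.
check_with_grep() {
    sh -c 'ps aux | grep compute' > /dev/null
}

# Fixed check: pgrep matches process names and never matches itself, so it
# fails when no compute process is actually running.
check_with_pgrep() {
    pgrep compute > /dev/null
}

check_with_grep && echo "grep check: passes even with no compute process"
check_with_pgrep || echo "pgrep check: correctly reports nothing running"
```

This is why the fix both switches to pgrep and, separately, waits for the services to report as UP: pgrep only proves a process exists, not that libvirt has reported in and the service is usable.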

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/713836

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/713837

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/713840

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/713845

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/713836
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=70447bca2f4f33c6872eaf94a2e4351bb257c22a
Submitter: Zuul
Branch: stable/train

commit 70447bca2f4f33c6872eaf94a2e4351bb257c22a
Author: Lee Yarwood <email address hidden>
Date: Fri Mar 13 16:51:01 2020 +0000

    nova-live-migration: Wait for n-cpu services to come up after configuring Ceph

    Previously the ceph.sh script used during the nova-live-migration job
    would only grep for a `compute` process when checking if the services
    had been restarted. This check was bogus and would always return 0 as it
    would always match itself. For example:

    2020-03-13 21:06:47.682073 | primary | 2020-03-13 21:06:47.681 | root
    29529 0.0 0.0 4500 736 pts/0 S+ 21:06 0:00 /bin/sh -c ps
           aux | grep compute
    2020-03-13 21:06:47.683964 | primary | 2020-03-13 21:06:47.683 | root
    29531 0.0 0.0 14616 944 pts/0 S+ 21:06 0:00 grep compute

    Failures of this job were seen on the stable/pike branch where slower CI
    nodes appeared to struggle to allow Libvirt to report to n-cpu in time
    before Tempest was started. This in-turn caused instance build failures
    and the overall failure of the job.

    This change resolves this issue by switching to pgrep and ensuring
    n-cpu services are reported as fully up after a cold restart before
    starting the Tempest test run.

    Closes-Bug: 1867380
    Change-Id: Icd7ab2ca4ddbed92c7e883a63a23245920d961e7
    (cherry picked from commit e23c3c2c8df3843c5853c87ef684bd21c4af95d8)

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/713837
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=373c4ffde2053c7ff11bd38339b88d144cd442f2
Submitter: Zuul
Branch: stable/stein

commit 373c4ffde2053c7ff11bd38339b88d144cd442f2
Author: Lee Yarwood <email address hidden>
Date: Fri Mar 13 16:51:01 2020 +0000

    nova-live-migration: Wait for n-cpu services to come up after configuring Ceph

    Previously the ceph.sh script used during the nova-live-migration job
    would only grep for a `compute` process when checking if the services
    had been restarted. This check was bogus and would always return 0 as it
    would always match itself. For example:

    2020-03-13 21:06:47.682073 | primary | 2020-03-13 21:06:47.681 | root
    29529 0.0 0.0 4500 736 pts/0 S+ 21:06 0:00 /bin/sh -c ps
           aux | grep compute
    2020-03-13 21:06:47.683964 | primary | 2020-03-13 21:06:47.683 | root
    29531 0.0 0.0 14616 944 pts/0 S+ 21:06 0:00 grep compute

    Failures of this job were seen on the stable/pike branch where slower CI
    nodes appeared to struggle to allow Libvirt to report to n-cpu in time
    before Tempest was started. This in-turn caused instance build failures
    and the overall failure of the job.

    This change resolves this issue by switching to pgrep and ensuring
    n-cpu services are reported as fully up after a cold restart before
    starting the Tempest test run.

    Closes-Bug: 1867380
    Change-Id: Icd7ab2ca4ddbed92c7e883a63a23245920d961e7
    (cherry picked from commit e23c3c2c8df3843c5853c87ef684bd21c4af95d8)
    (cherry picked from commit 70447bca2f4f33c6872eaf94a2e4351bb257c22a)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/713840
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=63ed32ef49adcb6830ef3b5329a561542bddf656
Submitter: Zuul
Branch: stable/rocky

commit 63ed32ef49adcb6830ef3b5329a561542bddf656
Author: Lee Yarwood <email address hidden>
Date: Fri Mar 13 16:51:01 2020 +0000

    nova-live-migration: Wait for n-cpu services to come up after configuring Ceph

    Previously the ceph.sh script used during the nova-live-migration job
    would only grep for a `compute` process when checking if the services
    had been restarted. This check was bogus and would always return 0 as it
    would always match itself. For example:

    2020-03-13 21:06:47.682073 | primary | 2020-03-13 21:06:47.681 | root
    29529 0.0 0.0 4500 736 pts/0 S+ 21:06 0:00 /bin/sh -c ps
           aux | grep compute
    2020-03-13 21:06:47.683964 | primary | 2020-03-13 21:06:47.683 | root
    29531 0.0 0.0 14616 944 pts/0 S+ 21:06 0:00 grep compute

    Failures of this job were seen on the stable/pike branch where slower CI
    nodes appeared to struggle to allow Libvirt to report to n-cpu in time
    before Tempest was started. This in-turn caused instance build failures
    and the overall failure of the job.

    This change resolves this issue by switching to pgrep and ensuring
    n-cpu services are reported as fully up after a cold restart before
    starting the Tempest test run.

    Closes-Bug: 1867380
    Change-Id: Icd7ab2ca4ddbed92c7e883a63a23245920d961e7
    (cherry picked from commit e23c3c2c8df3843c5853c87ef684bd21c4af95d8)
    (cherry picked from commit 70447bca2f4f33c6872eaf94a2e4351bb257c22a)
    (cherry picked from commit 373c4ffde2053c7ff11bd38339b88d144cd442f2)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.opendev.org/713845
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=0718015f3fd2899720613bfef789f7023f112e30
Submitter: Zuul
Branch: stable/queens

commit 0718015f3fd2899720613bfef789f7023f112e30
Author: Lee Yarwood <email address hidden>
Date: Fri Mar 13 16:51:01 2020 +0000

    nova-live-migration: Wait for n-cpu services to come up after configuring Ceph

    Previously the ceph.sh script used during the nova-live-migration job
    would only grep for a `compute` process when checking if the services
    had been restarted. This check was bogus and would always return 0 as it
    would always match itself. For example:

    2020-03-13 21:06:47.682073 | primary | 2020-03-13 21:06:47.681 | root
    29529 0.0 0.0 4500 736 pts/0 S+ 21:06 0:00 /bin/sh -c ps
           aux | grep compute
    2020-03-13 21:06:47.683964 | primary | 2020-03-13 21:06:47.683 | root
    29531 0.0 0.0 14616 944 pts/0 S+ 21:06 0:00 grep compute

    Failures of this job were seen on the stable/pike branch where slower CI
    nodes appeared to struggle to allow Libvirt to report to n-cpu in time
    before Tempest was started. This in-turn caused instance build failures
    and the overall failure of the job.

    This change resolves this issue by switching to pgrep and ensuring
    n-cpu services are reported as fully up after a cold restart before
    starting the Tempest test run.

    Closes-Bug: 1867380
    Change-Id: Icd7ab2ca4ddbed92c7e883a63a23245920d961e7
    (cherry picked from commit e23c3c2c8df3843c5853c87ef684bd21c4af95d8)
    (cherry picked from commit 70447bca2f4f33c6872eaf94a2e4351bb257c22a)
    (cherry picked from commit 373c4ffde2053c7ff11bd38339b88d144cd442f2)
    (cherry picked from commit 63ed32ef49adcb6830ef3b5329a561542bddf656)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.opendev.org/713036
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=58cee69b2ff3e5451bbb4c500b4a985012020df1
Submitter: Zuul
Branch: stable/pike

commit 58cee69b2ff3e5451bbb4c500b4a985012020df1
Author: Matt Riedemann <email address hidden>
Date: Wed Feb 6 15:50:58 2019 -0500

    nova-live-migration: Wait for n-cpu services to come up after configuring Ceph

    Previously the ceph.sh script used during the nova-live-migration job
    would only grep for a `compute` process when checking if the services
    had been restarted. This check was bogus and would always return 0 as it
    would always match itself. For example:

    2020-03-13 21:06:47.682073 | primary | 2020-03-13 21:06:47.681 | root
    29529 0.0 0.0 4500 736 pts/0 S+ 21:06 0:00 /bin/sh -c ps
           aux | grep compute
    2020-03-13 21:06:47.683964 | primary | 2020-03-13 21:06:47.683 | root
    29531 0.0 0.0 14616 944 pts/0 S+ 21:06 0:00 grep compute

    Failures of this job were seen on the stable/pike branch where slower CI
    nodes appeared to struggle to allow Libvirt to report to n-cpu in time
    before Tempest was started. This in-turn caused instance build failures
    and the overall failure of the job.

    This change resolves this issue by switching to pgrep and ensuring
    n-cpu services are reported as fully up after a cold restart before
    starting the Tempest test run.

    NOTE(lyarwood): The following change is squashed here to avoid endless
    retries in the gate due to bug #1867380.

    Replace ansible --sudo with --become in live_migration/hooks scripts

    Ansible deprecated --sudo in 1.9 so this change replaces
    it with --become.

    NOTE(lyarwood): Conflict due to
    Ifbadce909393268b340b7a08c78a6faa2d7888b2 not being present in
    stable/pike.

    Conflicts:
        nova/tests/live_migration/hooks/ceph.sh

    Change-Id: I40f40766a7b84423c1dcf9d5ed58476b86d61cc4
    (cherry picked from commit 7f16800f71f6124736382be51d9da234800f7618)
    (cherry picked from commit 18931544d8a57953c6ce9ee4bf4bcc7a4e9e4295)
    (cherry picked from commit 1a09f753559aa7ed617192853215c5b0ace7756a)

    Closes-Bug: 1867380
    Change-Id: Icd7ab2ca4ddbed92c7e883a63a23245920d961e7
    (cherry picked from commit e23c3c2c8df3843c5853c87ef684bd21c4af95d8)
    (cherry picked from commit 70447bca2f4f33c6872eaf94a2e4351bb257c22a)
    (cherry picked from commit 373c4ffde2053c7ff11bd38339b88d144cd442f2)
    (cherry picked from commit 63ed32ef49adcb6830ef3b5329a561542bddf656)
    (cherry picked from commit 0718015f3fd2899720613bfef789f7023f112e30)

tags: added: in-stable-pike
Revision history for this message
melanie witt (melwitt) wrote :

I realized we never had an e-r query for this, so if we ever need one, this might work (need to double check that it wouldn't return any successful runs before using):

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Instance%20spawn%20was%20interrupted%20before%20instance_claim%2C%20setting%20instance%20to%20ERROR%20state%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova pike-eol

This issue was fixed in the openstack/nova pike-eol release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova queens-eol

This issue was fixed in the openstack/nova queens-eol release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova rocky-eol

This issue was fixed in the openstack/nova rocky-eol release.
