functional tests intermittently fail with "ReadOnlyFieldError: Cannot modify readonly field uuid"

Bug #1859766 reported by Balazs Gibizer
Affects                   Status        Importance  Assigned to  Milestone
OpenStack Compute (nova)  Invalid       Undecided   Unassigned
Rocky                     Fix Released  Medium      Elod Illes
Stein                     Fix Released  Medium      Elod Illes

Bug Description

On stable/stein and stable/rocky multiple functional tests fail randomly with the following stack trace [1]:

Traceback (most recent call last):
  File "nova/compute/manager.py", line 2322, in _build_and_run_instance
    with self.rt.instance_claim(context, instance, node, limits):
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional/local/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 328, in inner
    return f(*args, **kwargs)
  File "nova/compute/resource_tracker.py", line 235, in instance_claim
    self._update(elevated, cn)
  File "nova/compute/resource_tracker.py", line 1034, in _update
    self.old_resources[nodename] = old_compute
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "nova/compute/resource_tracker.py", line 1028, in _update
    compute_node.save()
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional/local/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 226, in wrapper
    return fn(self, *args, **kwargs)
  File "nova/objects/compute_node.py", line 341, in save
    self._from_db_object(self._context, self, db_compute)
  File "nova/objects/compute_node.py", line 214, in _from_db_object
    setattr(compute, key, value)
  File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional/local/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 77, in setter
    raise exception.ReadOnlyFieldError(field=name)
ReadOnlyFieldError: Cannot modify readonly field uuid

logstash signature:
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22ReadOnlyFieldError%3A%20Cannot%20modify%20readonly%20field%20uuid%5C%22

[1] https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_234/702181/1/check/nova-tox-functional/2341192/testr_results.html
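
For context on the error itself: in oslo.versionedobjects a field can be declared read_only, and the generated setter raises ReadOnlyFieldError when such a field is re-assigned a different value (re-assigning the same value is tolerated, which is why _from_db_object normally survives setting uuid on every save). A minimal sketch of that mechanism; FakeComputeNode is an illustrative stand-in, not nova's real ComputeNode object:

    from oslo_versionedobjects import base, fields

    @base.VersionedObjectRegistry.register
    class FakeComputeNode(base.VersionedObject):
        # uuid is read-only, mirroring nova.objects.ComputeNode.uuid
        fields = {'uuid': fields.UUIDField(read_only=True)}

    node = FakeComputeNode()
    node.uuid = '11111111-1111-1111-1111-111111111111'  # first set: fine
    node.uuid = '11111111-1111-1111-1111-111111111111'  # same value: fine
    # A different value raises ReadOnlyFieldError, matching the traceback
    # above: the saved ComputeNode already carries a uuid that differs
    # from the one just read back from the DB.
    node.uuid = '22222222-2222-2222-2222-222222222222'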

tags: added: testing
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I suspect that the problem is twofold. First, we have a functional test change [1] on train and master that was not backported to older branches; this change removes a global state hack from the fake virt driver. Second, we most probably backported tests from master (and train) to stein and rocky that do not properly set up the global state: on master it is not needed, but it is needed on stein and older branches.

I haven't found the offending backport yet. But from the failed test log I see that a test case starts up a compute service on host1 [2] and then later the nova code tries to access a compute service on host2 [3] that was never started.

[1] https://review.opendev.org/#/c/656709/
[2] https://zuul.opendev.org/t/openstack/build/23411923cde347efb03906d0537237bf/log/job-output.txt#3137
[3] https://zuul.opendev.org/t/openstack/build/23411923cde347efb03906d0537237bf/log/job-output.txt#3203
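
To illustrate the hazard [1] removes, here is a simplified sketch (not nova's actual FakeDriver) of a fake virt driver configured through module-level globals; any test that changes the globals and forgets to restore them leaks its host/node setup into later tests:

    # Illustrative module-level state, similar in spirit to the old
    # set_nodes()/restore_nodes() helpers in the fake virt driver.
    _NODES = ['host1']

    def set_nodes(nodes):
        global _NODES
        _NODES = nodes

    class LeakyFakeDriver(object):
        def get_available_nodes(self, refresh=False):
            # Reads global, not per-instance, state: every driver
            # instance in the process sees whatever the last caller
            # of set_nodes() configured.
            return _NODES

    # Test A calls set_nodes(['host2']) and never restores it.
    # Test B then starts a compute on 'host1' but its driver reports
    # 'host2', a node whose compute service test B never started.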

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/702700

Revision history for this message
Matt Riedemann (mriedem) wrote :

Maybe related to https://review.opendev.org/#/c/676280/ somehow?

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

As far as I can see there is nothing in the logs about a failed compute node create, and without such a failure I don't see how https://review.opendev.org/#/c/676280/ could affect us.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I managed to get one reproduction (in 27 full functional test runs) with some extra logging [1]. It seems that a ResourceTracker from an earlier test leaks into a later test, as both the Python object id of the ResourceTracker and the uuid of the compute node are the same in both tests.

Based on the fact that the leaking test case calls nova.test.TestCase.restart_compute_service(), I suspect both [2] and [3] need to be backported to get rid of the problem. But I'm only about 75% sure, as the problem is really hard to reproduce.

[1] https://paste.ubuntu.com/p/Jw9bc3yKxW/
[2] https://review.opendev.org/#/c/656709/
[3] https://review.opendev.org/#/c/512552/
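
The extra logging in [1] boils down to printing the tracker's identity; a hypothetical sketch of that idea, assuming the tracker exposes its per-node ComputeNode objects via a compute_nodes dict as nova's ResourceTracker does on these branches:

    import logging

    LOG = logging.getLogger(__name__)

    def log_rt_identity(rt, nodename):
        # id(rt) is stable for the lifetime of the object, so seeing the
        # same id (and the same compute node uuid) logged from two
        # different test cases means the tracker leaked between tests.
        cn = rt.compute_nodes.get(nodename)
        LOG.debug('rt id=%x compute node uuid=%s',
                  id(rt), getattr(cn, 'uuid', None))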

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

One more reproduction with more tracing (it took around 200 functional test runs to reproduce): http://paste.openstack.org/show/788538/

Now I'm pretty confident that the ResourceTracker created by the test case test_migrate_disk_and_power_off_crash_finish_revert_migration during the restart_compute_service() call [1] survives the test case and gets activated later by another test case (look at the rt object ids in the logs: they are the same, and the compute uuid is the same too). This upgrades my 75% confidence to 95% regarding the need for [3].

[1] https://github.com/openstack/nova/blob/2fceddc451bd021f137c6756c5314e08eb4443de/nova/test.py#L425
[2] https://review.opendev.org/#/c/656709/
[3] https://review.opendev.org/#/c/512552

Changed in nova:
status: New → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/703102

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/703103

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/703102
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=23d3b8fbc9d793b4faba472ec6a0f7b95be1e70f
Submitter: Zuul
Branch: stable/stein

commit 23d3b8fbc9d793b4faba472ec6a0f7b95be1e70f
Author: Balazs Gibizer <email address hidden>
Date: Wed May 1 23:06:52 2019 +0200

    Remove global state from the FakeDriver

    The virt driver FakeDriver, used in both the functional and the unit
    tests, used global state to configure the host and node names the
    driver reports. This was hard to use when more than one compute
    service is started. Also, global state is dangerous.

    It turned out that only a set of unit tests use multiple nodes per
    compute; the rest of the tests can simply use a host=<hostname>,
    nodes=[<hostname>] setup.

    So this removes the global state.

    On stable/stein:
    * Changes in functional/regressions/test_bug_1764556.py just remove
      the usage of the global state, as that test was backported before
      the patch being backported here was introduced

    * Changes in functional/regressions/test_bug_1839560.py are a
      re-backport of If822509e906d5094f13a8700b2b9ed3c40580431 adapted
      to the current patch.

    Related-Bug: #1859766

    Change-Id: I2cf2fcbaebc706f897ce5dfbff47d32117064f9c
    (cherry picked from commit b5666fb49239dbed6777481be63276db35ed8118)
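
The host=<hostname>, nodes=[<hostname>] setup the commit refers to amounts to keeping the reported node names as per-instance state; roughly (an illustrative sketch, not the exact patch):

    class FakeDriver(object):
        def __init__(self, host='compute'):
            # Per-instance state: each started compute service gets its
            # own driver reporting exactly one node named after its host,
            # so nothing survives the driver instance itself.
            self._host = host
            self._nodes = [host]

        def get_available_nodes(self, refresh=False):
            return self._nodes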

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/stein)

Change abandoned by Balazs Gibizer (<email address hidden>) on branch: stable/stein
Review: https://review.opendev.org/702700

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/703103
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b874c409c11b5d83508d2f0276a9a648f72192a4
Submitter: Zuul
Branch: stable/stein

commit b874c409c11b5d83508d2f0276a9a648f72192a4
Author: Balazs Gibizer <email address hidden>
Date: Wed May 1 23:38:40 2019 +0200

    Enhance service restart in functional env

    Bugfix Icaf1bae8cb040b939f916a19ce026031ddb84af7 showed that restarting
    a compute service in the functional env is unrealistic, causing faults
    to slip through. During that bug fix only the minimal change was made
    in the functional env regarding compute service restart, to reproduce
    the reported fault. However, the restart of the compute service could
    be made even more realistic.

    This patch simulates a compute service restart in the functional env
    by stopping the original compute service and starting a totally new
    compute service for the same host and node. This way we can make sure
    that we get a brand new ComputeManager in the new service and no
    state can leak between the old and the new service.

    This change revealed another shortcoming of the functional env.
    In the real world the nova-compute service can be restarted without
    losing any running servers on the compute host. But with the naive
    implementation of this change the compute service is re-created. This
    means that a new ComputeManager is instantiated that loads a new
    FakeDriver instance as well. That new FakeDriver instance then reports
    an empty hypervisor. This behavior is not totally unrealistic, as it
    simulates a compute host restart that wipes the hypervisor state as
    well (e.g. a compute host redeployment). However, this type of restart
    exposes another bug in the code path that destroys and deallocates
    evacuated instances from the source host. Therefore this patch
    implements the compute service restart in a way that simulates only a
    service restart and not a full compute host restart. A subsequent
    patch will add a test that uses the clean hypervisor case to reproduce
    the revealed bug.

    Related-Bug: #1724172

    On stable/stein:

    Closes-Bug: #1859766

    Note: mock package import added to nova/test.py (due to not having patch
    Ibe7cb29620f06d31059f2a5f94ca180b8671046e in stable/stein)

    Change-Id: I9d6cd6259659a35383c0c9c21db72a9434ba86b1
    (cherry picked from commit 2794748d9c58623045023f34c7793c58ce41447c)
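
A condensed sketch of the restart strategy the commit describes (hypothetical helper shape; the real implementation lives in nova/test.py and nova's fixtures): stop the old service, start a brand new one for the same host, and hand over only the fake hypervisor's instance list so it looks like a service restart rather than a host redeployment:

    def restart_compute_service(self, old_service):
        host = old_service.host
        old_service.stop()
        # Carry the fake hypervisor state over; otherwise the new
        # FakeDriver would report an empty hypervisor (i.e. a host
        # redeployment instead of a mere service restart).
        saved_instances = old_service.manager.driver.instances
        new_service = self.start_service('compute', host=host)
        new_service.manager.driver.instances = saved_instances
        return new_service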

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

This is still happening on stable/rocky; here is a recent occurrence [1]. I'm going to backport the fixes.

[1] https://b317bafa0db642d61e9b-babe6b68bec526aecbe80deed799da2b.ssl.cf2.rackcdn.com/710248/1/check/cross-nova-functional/b02641a/testr_results.html

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.opendev.org/712751

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/713033

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/712751
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=930bf0ae1be07744183c40c6a20d0518a352947f
Submitter: Zuul
Branch: stable/rocky

commit 930bf0ae1be07744183c40c6a20d0518a352947f
Author: Balazs Gibizer <email address hidden>
Date: Wed May 1 23:06:52 2019 +0200

    Remove global state from the FakeDriver

    The virt driver FakeDriver, used in both the functional and the unit
    tests, used global state to configure the host and node names the
    driver reports. This was hard to use when more than one compute
    service is started. Also, global state is dangerous.

    It turned out that only a set of unit tests use multiple nodes per
    compute; the rest of the tests can simply use a host=<hostname>,
    nodes=[<hostname>] setup.

    So this removes the global state.

    Conflicts:
        nova/tests/functional/compute/test_live_migration.py
        nova/tests/functional/integrated_helpers.py
        nova/tests/functional/regressions/test_bug_1671648.py
        nova/tests/functional/regressions/test_bug_1702454.py
        nova/tests/functional/regressions/test_bug_1718455.py
        nova/tests/functional/regressions/test_bug_1718512.py
        nova/tests/functional/regressions/test_bug_1719730.py
        nova/tests/functional/regressions/test_bug_1735407.py
        nova/tests/functional/regressions/test_bug_1741307.py
        nova/tests/functional/regressions/test_bug_1746483.py
        nova/tests/functional/regressions/test_bug_1764883.py
        nova/tests/functional/regressions/test_bug_1781710.py
        nova/tests/functional/regressions/test_bug_1784353.py
        nova/tests/functional/regressions/test_bug_1797580.py
        nova/tests/functional/test_aggregates.py
        nova/tests/functional/test_server_group.py
        nova/tests/functional/test_servers.py
        nova/tests/unit/compute/test_compute_mgr.py

    NOTE(elod.illes): causes of the conflicts in the files:
    * test_live_migration.py:
        Patch Icb0bdaf454935b3713c35339394d260b33520de5 had to be
        refactored during backport to Rocky as uuidsentinel moved to oslo
        in Stein.
    * integrated_helpers.py:
        Patches I4c12502c86c7ac27369d119e0f97768cf41695b5 and
        I7f5f08691ca3f73073c66c29dddb996fb2c2b266 were only added in Stein.
    * test_bug_1671648.py, test_bug_1702454.py, test_bug_1718455.py,
      test_bug_1718512.py, test_bug_1719730.py, test_bug_1735407.py,
      test_bug_1741307.py, test_bug_1746483.py, test_bug_1764883.py,
      test_bug_1781710.py, test_bug_1784353.py, test_bug_1797580.py,
      test_servers.py:
        Duplicate cleanup removal patch
        Iaae6fc4a66145576f4a4fc1cea452ef6acbadb15 is not backported to
        Rocky.
    * test_aggregates.py:
        Patches Ic55b88e7ad21ab5b7ad063eac743ff9406aae559 and
        Ic55b88e7ad21ab5b7ad063eac743ff9406aae559 were not backported to
        Rocky.
    * test_server_group.py:
        Iaae6fc4a66145576f4a4fc1cea452ef6acbadb15 (Duplicate cleanup
        removal) and Ie07b419732e0832a9b9d16565f6c9d00ba85d654 (Add
        functional test for live migrate with anti-affinity group) pat...


tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/713033
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=53a893f7c97e35de3e9ac26101827cdb43ed35cc
Submitter: Zuul
Branch: stable/rocky

commit 53a893f7c97e35de3e9ac26101827cdb43ed35cc
Author: Balazs Gibizer <email address hidden>
Date: Wed May 1 23:38:40 2019 +0200

    Enhance service restart in functional env

    Bugfix Icaf1bae8cb040b939f916a19ce026031ddb84af7 showed that restarting
    a compute service in the functional env is unrealistic, causing faults
    to slip through. During that bug fix only the minimal change was made
    in the functional env regarding compute service restart, to reproduce
    the reported fault. However, the restart of the compute service could
    be made even more realistic.

    This patch simulates a compute service restart in the functional env
    by stopping the original compute service and starting a totally new
    compute service for the same host and node. This way we can make sure
    that we get a brand new ComputeManager in the new service and no
    state can leak between the old and the new service.

    This change revealed another shortcoming of the functional env.
    In the real world the nova-compute service can be restarted without
    losing any running servers on the compute host. But with the naive
    implementation of this change the compute service is re-created. This
    means that a new ComputeManager is instantiated that loads a new
    FakeDriver instance as well. That new FakeDriver instance then reports
    an empty hypervisor. This behavior is not totally unrealistic, as it
    simulates a compute host restart that wipes the hypervisor state as
    well (e.g. a compute host redeployment). However, this type of restart
    exposes another bug in the code path that destroys and deallocates
    evacuated instances from the source host. Therefore this patch
    implements the compute service restart in a way that simulates only a
    service restart and not a full compute host restart. A subsequent
    patch will add a test that uses the clean hypervisor case to reproduce
    the revealed bug.

    Related-Bug: #1724172

    On stable/stein:

    Closes-Bug: #1859766

    Conflicts:
        doc/notification_samples/libvirt-connect-error.json
        nova/test.py
        nova/tests/functional/libvirt/test_reshape.py
        nova/tests/functional/test_servers.py

    NOTE(elod.illes): file conflict details:
    * libvirt-connect-error.json:
      File added only in Stein with libvirt.error notification
      transformation patch I7d2287ce06d77c0afdef0ea8bdfb70f6c52d3c50
    * test.py:
      Patches Iecf4dcf8e648c9191bf8846428683ec81812c026 (Remove patching
      the mock lib) and Ibb8c12fb2799bb5ceb9e3d72a2b86dbb4f14451e (Use a
      static resource tracker in compute manager) were not backported to
      Rocky
    * test_reshape.py:
      File added only in Stein in the frame of 'Handling Reshaped Provider
      Trees' feature, with patch Ide797ebf7790d69042ae275ebec6ced3fa4787b6
    * test_servers.py:
      Patch I7cbd5d9fb875ebf72995362e0b6693492ce32051 (Reject forced move
      wit...


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova rocky-eol

This issue was fixed in the openstack/nova rocky-eol release.
