test_live_migration_with_trunk tempest test fails due to port remains in down state

Bug #1940425 reported by Balazs Gibizer
This bug affects 3 people
Affects                   Status      Importance  Assigned to  Milestone
OpenStack Compute (nova)  Invalid     High        Unassigned
neutron                   Confirmed   Critical    Unassigned
os-vif                    Incomplete  Undecided   Unassigned

Bug Description

Example failure is in [1]:

2021-08-18 10:40:52,334 124842 DEBUG [tempest.lib.common.utils.test_utils] Call _is_port_status_active returns false in 60.000000 seconds

Traceback (most recent call last):
  File "/opt/stack/tempest/tempest/common/utils/__init__.py", line 89, in wrapper
    return func(*func_args, **func_kwargs)
  File "/opt/stack/tempest/tempest/common/utils/__init__.py", line 70, in wrapper
    return f(*func_args, **func_kwargs)
  File "/opt/stack/tempest/tempest/api/compute/admin/test_live_migration.py", line 281, in test_live_migration_with_trunk
    self.assertTrue(
  File "/usr/lib/python3.8/unittest/case.py", line 765, in assertTrue
    raise self.failureException(msg)
AssertionError: False is not true
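
The failing assertion wraps tempest's call_until_true helper, which polls _is_port_status_active until the port reports ACTIVE or the 60-second timeout expires. A minimal self-contained sketch of that pattern (not the actual tempest code; the ports client and names below are illustrative assumptions):

import time


def call_until_true(func, duration, sleep_for):
    # Polls func() until it returns True or 'duration' seconds elapse;
    # returns False on timeout (this mirrors the tempest.lib helper that
    # produced the "returns false in 60.000000 seconds" message above).
    deadline = time.time() + duration
    while time.time() < deadline:
        if func():
            return True
        time.sleep(sleep_for)
    return False


def is_port_status_active(ports_client, port_id):
    # Hypothetical client object; the real test uses the admin ports
    # client, whose show_port() returns a {'port': {...}} body.
    return ports_client.show_port(port_id)['port']['status'] == 'ACTIVE'


# Roughly what the failing assertion checks after the live migration:
# self.assertTrue(call_until_true(
#     lambda: is_port_status_active(ports_client, subport_id),
#     duration=60, sleep_for=1))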

Please note that a similar bug was reported and fixed previously: https://bugs.launchpad.net/tempest/+bug/1924258. It seems that fix did not fully solve the issue.

It is not super frequent; I saw 4 occurrences in the last 30 days [2].

[1] https://zuul.opendev.org/t/openstack/build/fdbda223dc10456db58f922b6435f680/logs
[2] https://paste.opendev.org/show/808166/

tags: added: tempest trunk
tags: added: gate-failure
Changed in neutron:
importance: Undecided → High
Changed in nova:
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Miguel Lavalle (minsel) wrote (last edit ):

I did some initial fact finding:

1) The failing job is https://zuul.opendev.org/t/openstack/builds?job_name=nova-ovs-hybrid-plug

2) The failing assertion in the Tempest test case is https://github.com/openstack/tempest/blob/f08cc686ae5267d0dc71341d6e6b4c1514193856/tempest/api/compute/admin/test_live_migration.py#L286. After the instance migration, the associated trunk subport fails to become ACTIVE.

3) The failing job was succeeding most of the time up until 2022-07-21. After that day, the job fails most of the time.

4) Doing a quick search in Gerrit, nothing obviously related merged in Neutron master just before the above-mentioned date. Did something related to live migrations change in Nova?

Changed in neutron:
status: New → Confirmed
Changed in neutron:
importance: High → Critical
Revision history for this message
Slawek Kaplonski (slaweq) wrote :

We have a lot of failures like that in the neutron check queue as well:

    https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_89d/850223/3/check/neutron-ovs-tempest-multinode-full/89d1378/testr_results.html

    https://b56ae66d8d1228f47a8b-8be9667f5530aa8ec0c6e8c86da1b76d.ssl.cf1.rackcdn.com/850226/6/check/neutron-ovs-tempest-dvr-ha-multinode-full/f738635/testr_results.html

    https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_43f/850609/3/check/neutron-ovs-tempest-multinode-full/43f10e5/testr_results.html

    https://27bd7d97843089d69a7c-16390f05fde22eb723929856dcc38fcd.ssl.cf1.rackcdn.com/840416/24/check/neutron-ovs-tempest-multinode-full/26ed361/testr_results.html

    https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_4d5/840415/24/check/neutron-ovs-tempest-multinode-full/4d56614/testr_results.html

    https://0102a3b5a62ee8d841df-584a141890d4c711c7adef07d7bdf32d.ssl.cf5.rackcdn.com/840415/24/check/neutron-ovs-tempest-dvr-ha-multinode-full/329b407/testr_results.html

    https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_480/839479/24/check/neutron-ovs-tempest-multinode-full/480d4eb/testr_results.html

    https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_432/840417/24/check/neutron-ovs-tempest-multinode-full/4323d88/testr_results.html

    https://4e41c5437696205a2f2d-98cbd17e2930261ab1c83c6342d64f26.ssl.cf5.rackcdn.com/840420/24/check/neutron-ovs-tempest-multinode-full/f1509c0/testr_results.html

    https://6ad887810b840559f848-66ce5f117c645ca152390f12473225b2.ssl.cf2.rackcdn.com/840420/24/check/neutron-ovs-tempest-dvr-ha-multinode-full/4b92bdd/testr_results.html

    https://ec212fcfd4eeb40bf2b7-31f75471aa3a64e7f8f76bc64a08f268.ssl.cf2.rackcdn.com/840419/24/check/neutron-ovs-tempest-multinode-full/fc8df74/testr_results.html

    https://492ab3cad961fd6f54a2-d71f4126f88f4263fd488933444cea49.ssl.cf2.rackcdn.com/840419/24/check/neutron-ovs-tempest-dvr-ha-multinode-full/0266a0d/testr_results.html

Revision history for this message
Lajos Katona (lajos-katona) wrote :

Just checking what was released, and os-vif 3.0.0 is suspicious:
https://review.opendev.org/c/openstack/releases/+/849544

$ git log --oneline --no-merges 2.8.0..3.0.0
771dfff update ci since linuxbridge is now experimental
1651a73 Drop lower-constraints.txt and its testing
75b290f Delete trunk bridges to avoid race with Neutron
a12edbf update job template to zed
9ace551 Check for hybrid plugging in OVS
95fbe6a Change minversion of tox to 3.18.0

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

> 3) The failing job was succeeding most of the time up until 2022-07-21.
> After that day, the job fails most of the time

The last thing we merged in nova happened on the 19th of July, so that rules out a nova change making this more frequent.

Revision history for this message
Lajos Katona (lajos-katona) wrote :

Perhaps we have to focus on https://review.opendev.org/c/openstack/neutron/+/837780, as that patch was waiting for the os-vif release; see the relevant Neutron meeting discussion:
https://meetings.opendev.org/meetings/networking/2022/networking.2022-07-12-14.00.log.html#l-84

Changed in neutron:
assignee: nobody → Slawek Kaplonski (slaweq)
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

On the Nova side, the last patch merged on July 19th, as Gibi said, and it wasn't related to live migrations.
https://review.opendev.org/c/openstack/nova/+/830645

I'm currently preparing a DNM patch that will test with os-vif 2.8.0.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/850998

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/851003

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

As we have proof that the issue is due to the os-vif 3.0.0 release, I'm changing the Nova status to Invalid.

Changed in nova:
status: Confirmed → Invalid
Revision history for this message
sean mooney (sean-k-mooney) wrote :

Setting this to incomplete for os-vif while we discuss this more.

From my perspective this looks like a neutron bug.

os-vif is now deleting the bridge, but it only does that after unplugging the ports from OVS.
So at the time we delete the bridge, the trunk port (the parent and the subports) has already been unplugged from OVS
by os-vif, as it always was. Neutron should therefore not be managing those ports or depending on the bridge anymore.

The L2 agent should be tolerant of the bridge not existing and handle that gracefully.

So to me this looks like a pre-existing bug in neutron which os-vif is now exposing,
not an error in the logic of the os-vif change or a failure to consider upgrades.

We could perhaps work around this via a more complicated arrangement, but if we make any change to os-vif for this I would prefer to keep it as simple as possible, meaning no smarts, just a config option.

That said, my preference would be to fix this in neutron only.
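
To illustrate the tolerance described above, here is a minimal sketch, assuming the agent shells out to ovs-vsctl (the helper names are hypothetical, not the actual neutron agent code): if os-vif has already removed the trunk bridge after unplugging the parent and subports, the cleanup simply becomes a no-op instead of failing.

import subprocess


def bridge_exists(bridge_name):
    # 'ovs-vsctl br-exists' exits 0 when the bridge exists and 2 when it
    # does not, so the return code tells us whether there is work to do.
    return subprocess.run(['ovs-vsctl', 'br-exists', bridge_name]).returncode == 0


def cleanup_trunk_bridge(bridge_name):
    # Hypothetical agent-side cleanup that tolerates a missing bridge:
    # if os-vif (>= 3.0.0) already deleted it, just return gracefully.
    if not bridge_exists(bridge_name):
        return
    subprocess.run(['ovs-vsctl', '--if-exists', 'del-br', bridge_name], check=True)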

Changed in os-vif:
status: New → Incomplete
Revision history for this message
Miguel Lavalle (minsel) wrote :

I agree with Sean in #14. In fact, I've been working on the Neutron patch https://review.opendev.org/c/openstack/neutron/+/837780 to do exactly what Sean proposes: Neutron should be able to handle the fact that os-vif now deletes the trunk bridge. I even think that we don't need this bug in Neutron; the work being done is covered by https://bugs.launchpad.net/neutron/+bug/1869244

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by "Sylvain Bauza <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/850998
Reason: not needed.

Changed in neutron:
assignee: Slawek Kaplonski (slaweq) → nobody
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/851003

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/865658

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

FWIW, even if the problem is a bit different, https://review.opendev.org/c/openstack/neutron/+/865424 which closes https://bugs.launchpad.net/neutron/+bug/1997025 seems to also fix this other bug.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by "Sylvain Bauza <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/865658
Reason: Given https://bugs.launchpad.net/neutron/+bug/1997025 was fixed, I did a recheck on https://review.opendev.org/c/openstack/nova/+/838976/9#message-be9d569065efd62272182fda830cf8b02e6c1641 and now I get success for both jobs :

* nova-next https://zuul.opendev.org/t/openstack/build/478d0f159b774bbe9848e8c44d97137b : SUCCESS in 1h 32m 09s
* nova-ovs-hybrid-plug https://zuul.opendev.org/t/openstack/build/1176147b5a9247b49b4e0e455e7c6483 : SUCCESS in 51m 44s

Accordingly, dropping this patch.

Changed in neutron:
status: Confirmed → Fix Released
Revision history for this message
Slawek Kaplonski (slaweq) wrote :

New occurrence of the issue: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_558/867769/11/gate/neutron-ovs-tempest-multinode-full/558cfa3/testr_results.html

Traceback (most recent call last):
  File "/opt/stack/tempest/tempest/common/utils/__init__.py", line 89, in wrapper
    return func(*func_args, **func_kwargs)
  File "/opt/stack/tempest/tempest/common/utils/__init__.py", line 70, in wrapper
    return f(*func_args, **func_kwargs)
  File "/opt/stack/tempest/tempest/api/compute/admin/test_live_migration.py", line 286, in test_live_migration_with_trunk
    self.assertTrue(
  File "/usr/lib/python3.10/unittest/case.py", line 687, in assertTrue
    raise self.failureException(msg)
AssertionError: False is not true

Changed in neutron:
status: Fix Released → Confirmed
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Slawek:

In those new logs, the destination port binding, which should be updated by Nova, is not sent [1]. We can see all events related to the OVS agents; for example, when the port is deleted from the source agent:
11816:Jan 11 18:30:16.756385 ubuntu-jammy-inmotion-iad3-0032701749 neutron-server[55411]: DEBUG neutron.plugins.ml2.rpc [None req-f8d8130a-4979-4abc-b93b-bf722c7985de None None] Device 9f5bf1ec-480c-412b-9a4e-b2aa0356b4fa no longer exists at agent ovs-agent-ubuntu-jammy-inmotion-iad3-0032701750 {{(pid=55411) update_device_down /opt/stack/neutron/neutron/plugins/ml2/rpc.py:259}}
11823:Jan 11 18:30:16.836648 ubuntu-jammy-inmotion-iad3-0032701749 neutron-server[55411]: DEBUG neutron.plugins.ml2.plugin [None req-f8d8130a-4979-4abc-b93b-bf722c7985de None None] Current status of the port 9f5bf1ec-480c-412b-9a4e-b2aa0356b4fa is: ACTIVE; New status is: DOWN {{(pid=55411) _update_individual_port_db_status /opt/stack/neutron/neutron/plugins/ml2/plugin.py:2338}}

Or when the same port is bound to the destination OVS agent, just after the previous messages:
11825:Jan 11 18:30:16.854061 ubuntu-jammy-inmotion-iad3-0032701749 neutron-server[55411]: DEBUG neutron.plugins.ml2.rpc [None req-b9822962-cf31-4229-81ea-d1d4b008a142 None None] Device 9f5bf1ec-480c-412b-9a4e-b2aa0356b4fa up at agent ovs-agent-ubuntu-jammy-inmotion-iad3-0032701749 {{(pid=55411) update_device_up /opt/stack/neutron/neutron/plugins/ml2/rpc.py:296}}
11827:Jan 11 18:30:16.883051 ubuntu-jammy-inmotion-iad3-0032701749 neutron-server[55411]: DEBUG neutron.notifiers.nova [None req-f8d8130a-4979-4abc-b93b-bf722c7985de None None] device_id is not set on port 9f5bf1ec-480c-412b-9a4e-b2aa0356b4fa yet. {{(pid=55411) _can_notify /opt/stack/neutron/neutron/notifiers/nova.py:201}}
11829:Jan 11 18:30:16.892256 ubuntu-jammy-inmotion-iad3-0032701749 neutron-server[55411]: DEBUG neutron.plugins.ml2.plugin [None req-b9822962-cf31-4229-81ea-d1d4b008a142 None None] The host ubuntu-jammy-inmotion-iad3-0032701749 is not matching for port 9f5bf1ec-480c-412b-9a4e-b2aa0356b4fa host ubuntu-jammy-inmotion-iad3-0032701750! {{(pid=55411) port_bound_to_host /opt/stack/neutron/neutron/plugins/ml2/plugin.py:2436}}
11830:Jan 11 18:30:16.892256 ubuntu-jammy-inmotion-iad3-0032701749 neutron-server[55411]: DEBUG neutron.plugins.ml2.rpc [None req-b9822962-cf31-4229-81ea-d1d4b008a142 None None] Device 9f5bf1ec-480c-412b-9a4e-b2aa0356b4fa not bound to the agent host ubuntu-jammy-inmotion-iad3-0032701749 {{(pid=55411) update_device_up /opt/stack/neutron/neutron/plugins/ml2/rpc.py:304}}
11832:Jan 11 18:30:16.912446 ubuntu-jammy-inmotion-iad3-0032701749 neutron-server[55411]: DEBUG neutron.plugins.ml2.db [None req-f8d8130a-4979-4abc-b93b-bf722c7985de None None] For port 9f5bf1ec-480c-412b-9a4e-b2aa0356b4fa, host ubuntu-jammy-inmotion-iad3-0032701750, got binding levels [PortBindingLevel(driver='openvswitch',host='ubuntu-jammy-inmotion-iad3-0032701750',level=0,port_id=9f5bf1ec-480c-412b-9a4e-b2aa0356b4fa,segment=NetworkSegment(b71f2156-82ee-4e1c-ab74-d5cbdae3d781),segment_id=b71f2156-82ee-4e1c-ab74-d5cbdae3d781)] {{(pid=55411) get_binding_level_objs /opt/stack/neutron/neutron/plugins/ml2/db.py:74}}

However the...


Revision history for this message
sean mooney (sean-k-mooney) wrote :

https://review.opendev.org/c/openstack/neutron/+/865424 just disabled the test, so that didn't fix anything.
https://review.opendev.org/c/openstack/neutron/+/837780 was the actual patch that should have addressed the race between the bridge delete and the neutron agent.

Regarding Rodolfo's latest comment:

The destination port binding is created, but not activated, in pre-live-migration; it is activated only if the migration succeeds, in post-live-migration.

The trunk port should be in the ACTIVE state in pre-live-migration.

The neutron L2 agent is expected to wire up the trunk on any host that has a port binding associated with it, active or inactive. There should be no dependency between the port binding being active and the port status being set to ACTIVE.

So if the neutron L2 agent is not setting the port status to ACTIVE and sending the network-vif-plugged event after the port binding is created and we plug the port in pre-live-migration, that is a neutron bug.
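
For reference, the flow described above maps to the Neutron port-bindings API that Nova drives during live migration: an inactive binding is created for the destination host in pre-live-migration, and it is activated only after the migration succeeds. A rough sketch using openstacksdk's raw HTTP calls (the cloud name is an assumption, the IDs are taken from the logs quoted earlier, and depending on the catalog the /v2.0 prefix may need to be added to the paths):

import openstack

# Assumes a clouds.yaml entry named 'devstack-admin'; purely illustrative.
conn = openstack.connect(cloud='devstack-admin')

port_id = '9f5bf1ec-480c-412b-9a4e-b2aa0356b4fa'     # port from the quoted logs
dest_host = 'ubuntu-jammy-inmotion-iad3-0032701749'  # destination compute host

# Pre-live-migration: create an (inactive) binding on the destination host.
conn.network.post('/ports/%s/bindings' % port_id,
                  json={'binding': {'host': dest_host, 'vnic_type': 'normal'}})

# Post-live-migration: activate the destination binding; the source host's
# binding becomes inactive as a side effect.
conn.network.put('/ports/%s/bindings/%s/activate' % (port_id, dest_host))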

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/895655

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/895655
Committed: https://opendev.org/openstack/nova/commit/c290a6ed75c9a2dc54bd348faebbc5dcacf90fa6
Submitter: "Zuul (22348)"
Branch: master

commit c290a6ed75c9a2dc54bd348faebbc5dcacf90fa6
Author: Sean Mooney <email address hidden>
Date: Mon Sep 18 13:47:01 2023 +0100

    disable ovn based testing of test_live_migration_with_trunk

    Due to bug #1940425, where ml2/ovn is not correctly configuring the
    active status on trunk ports, we see test_live_migration_with_trunk
    fail more often than not.

    This was previously disabled in tempest and then fixed in neutron.
    The tempest skip was then reverted, and so was the neutron fix, as
    it broke something else, so this is failing in our gate again.

    This change skips test_live_migration_with_trunk on all jobs that
    are using ml2/ovn but keeps it enabled on the hybrid plug job
    which uses ml2/ovs.

    Related-Bug: #1940425
    Change-Id: I0a8dd6e6e30526aa2841b4db67ed9affed166fd8
