Live-migration double binding doesn't work with OVN

Bug #1834045 reported by Maciej Jozefczyk
This bug affects 1 person
Affects                   Status        Importance  Assigned to     Milestone
OpenStack Compute (nova)  Incomplete    Undecided   Unassigned
networking-ovn            Fix Released  Undecided   Unassigned
neutron                   Fix Released  Medium      Miguel Lavalle
neutron (Ubuntu)          Fix Released  Undecided   Unassigned

Bug Description

Live-migration doesn't work with ml2/OVN. After spending some time debugging this issue, I found that it's potentially more complicated and not related to OVN itself.

Here is the full story behind live-migration not working with OVN on the latest upstream master.

To speed up live-migration, double binding was introduced in neutron [1] and nova [2]. It implements this blueprint [3]. In short, it creates two bindings (ACTIVE and INACTIVE) to verify that the port can be bound on the destination host before live-migration starts (so that no time is wasted in case of a rollback).
This mechanism became the default in Stein [4]. So before the actual qemu live-migration, neutron should send 'network-vif-plugged' to nova, and only then is the migration run.
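
As a rough illustration of the double-binding idea described above, here is a minimal toy model. It is not Neutron code: the class ToyPort, its method names, the host names, and the rule that exactly one binding is ACTIVE at a time are all assumptions made for this sketch, and the point at which the destination binding gets activated is likewise only illustrative.

```python
from dataclasses import dataclass, field

ACTIVE = "ACTIVE"
INACTIVE = "INACTIVE"


@dataclass
class ToyPort:
    """Hypothetical stand-in for a Neutron port; not real Neutron code."""
    port_id: str
    bindings: dict = field(default_factory=dict)  # host -> binding status

    def create_inactive_binding(self, host):
        # Before migration: check that the port can be bound on the destination.
        if host in self.bindings:
            raise ValueError(f"binding for {host} already exists")
        self.bindings[host] = INACTIVE

    def activate_binding(self, host):
        # Switch-over: in this toy, exactly one binding is ACTIVE at a time.
        if host not in self.bindings:
            raise KeyError(host)
        for other in self.bindings:
            self.bindings[other] = INACTIVE
        self.bindings[host] = ACTIVE

    def delete_binding(self, host):
        # Cleanup after success (source) or after a rollback (destination).
        del self.bindings[host]


port = ToyPort("3704a567", {"source-host": ACTIVE})
port.create_inactive_binding("dest-host")
before = dict(port.bindings)   # both bindings exist, destination is INACTIVE
port.activate_binding("dest-host")
port.delete_binding("source-host")
after = dict(port.bindings)    # only the destination binding is left
```

If creating the inactive binding fails, the migration can be aborted before any guest memory is copied, which is the time saving the blueprint is after.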

With OVN this mechanism doesn't work. The 'network-vif-plugged' notification is never sent, so live-migration gets stuck at the beginning.

Let's check how those notifications are sent. On every change of the 'status' field in a neutron.ports row (an sqlalchemy event) [5], function [6] is executed; it is responsible for sending the 'network-vif-unplugged' and 'network-vif-plugged' notifications.
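
The consequence of tying notifications to status changes can be sketched as below. This is a simplified guess at the rule, not the code behind [6]: the function name event_for_status_change and the exact transition table are assumptions, and ACTIVE is used here as the port status that the report calls UP.

```python
def event_for_status_change(previous, current):
    """Return the nova event name for a port status change, or None.

    Toy rule, assumed for illustration: a change to ACTIVE means the VIF was
    plugged, and a change from ACTIVE to DOWN means it was unplugged.
    """
    if previous == current:
        return None  # no status change, so nothing is sent
    if current == "ACTIVE":
        return "network-vif-plugged"
    if previous == "ACTIVE" and current == "DOWN":
        return "network-vif-unplugged"
    return None


transitions = [("DOWN", "ACTIVE"), ("ACTIVE", "DOWN"), ("DOWN", "DOWN")]
events = [event_for_status_change(p, c) for p, c in transitions]
```

The last transition is the interesting one: a port whose status stays DOWN never produces 'network-vif-plugged', no matter how many bindings were created for it.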

During the pre_live_migration tasks, two bindings and binding levels are created. At the end of this process commit_port_binding() is executed [7]. At this point the neutron port status in the DB is DOWN.
I found that with ml2/ovs, at the end of commit_port_binding() [8], after the neutron_lib.callbacks.registry notification is sent, the port status moves to UP. For ml2/OVN it stays DOWN. This is the first difference that I found between ml2/ovs and ml2/ovn.

After a bit of digging I figured out how 'network-vif-plugged' is triggered in ml2/ovs.
Let's see how this is done.

1. The list of registered callbacks in ml2/ovs [8] includes a callback from the class ovo_rpc._ObjectChangeHandler [9], and this callback is invoked at the end of commit_port_binding().

-------------------------------------------------------------
neutron.plugins.ml2.ovo_rpc._ObjectChangeHandler.handle_event
-------------------------------------------------------------

2. It is responsible for pushing new port object revisions to agents, like:

----------------------------------------------------------------------------
Jun 24 10:01:01 test-migrate-1 neutron-server[3685]: DEBUG neutron.api.rpc.handlers.resources_rpc [None req-1430f349-d644-4d33-8833-90fad0124dcd service neutron] Pushing event updated for resources: {'Port': ['ID=3704a567-ef4c-4f6d-9557-a1191de07c4a,revision_number=10']} {{(pid=3697) push /opt/stack/neutron/neutron/api/rpc/handlers/resources_rpc.py:243}}
----------------------------------------------------------------------------

3. The OVS agent consumes it and sends an RPC back to the neutron server saying that the port is actually UP (on the source node!):
------------------------------------------------------------------------------------------------------------
Jun 24 10:01:01 test-migrate-1 neutron-openvswitch-agent[18660]: DEBUG neutron.agent.resource_cache [None req-1430f349-d644-4d33-8833-90fad0124dcd service neutron] Resource Port 3704a567-ef4c-4f6d-9557-a1191de07c4a updated (revision_number 8->10). Old fields: {'status': u'ACTIVE', 'bindings': [PortBinding(host='test-migrate-1',port_id=3704a567-ef4c-4f6d-9557-a1191de07c4a,profile={},status='INACTIVE',vif_details={"port_filter": true, "bridge_name": "br-int", "datapath_type": "system", "ovs_hybrid_plug": false},vif_type='ovs',vnic_type='normal'), PortBinding(host='test-migrate-2',port_id=3704a567-ef4c-4f6d-9557-a1191de07c4a,profile={"migrating_to": "test-migrate-1"},status='ACTIVE',vif_details={"port_filter": true, "bridge_name": "br-int", "datapath_type": "system", "ovs_hybrid_plug": false},vif_type='ovs',vnic_type='normal')], 'binding_levels': [PortBindingLevel(driver='openvswitch',host='test-migrate-1',level=0,port_id=3704a567-ef4c-4f6d-9557-a1191de07c4a,segment=NetworkSegment(c6866834-4577-497f-a6c8-ff9724a82e59),segment_id=c6866834-4577-497f-a6c8-ff9724a82e59), PortBindingLevel(driver='openvswitch',host='test-migrate-2',level=0,port_id=3704a567-ef4c-4f6d-9557-a1191de07c4a,segment=NetworkSegment(c6866834-4577-497f-a6c8-ff9724a82e59),segment_id=c6866834-4577-497f-a6c8-ff9724a82e59)]} New fields: {'status': u'DOWN', 'bindings': [PortBinding(host='test-migrate-1',port_id=3704a567-ef4c-4f6d-9557-a1191de07c4a,profile={},status='ACTIVE',vif_details={"port_filter": true, "bridge_name": "br-int", "datapath_type": "system", "ovs_hybrid_plug": false},vif_type='ovs',vnic_type='normal'), PortBinding(host='test-migrate-2',port_id=3704a567-ef4c-4f6d-9557-a1191de07c4a,profile={"migrating_to": "test-migrate-1"},status='INACTIVE',vif_details=None,vif_type='unbound',vnic_type='normal')], 'binding_levels': 
[PortBindingLevel(driver='openvswitch',host='test-migrate-1',level=0,port_id=3704a567-ef4c-4f6d-9557-a1191de07c4a,segment=NetworkSegment(c6866834-4577-497f-a6c8-ff9724a82e59),segment_id=c6866834-4577-497f-a6c8-ff9724a82e59)]} {{(pid=18660) record_resource_update /opt/stack/neutron/neutron/agent/resource_cache.py:186}}
...

Jun 24 10:01:02 test-migrate-1 neutron-openvswitch-agent[18660]: DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None req-9daaf112-57f4-49bb-8390-4b65a5c5e674 None None] Setting status for 3704a567-ef4c-4f6d-9557-a1191de07c4a to UP {{(pid=18660) _bind_devices /opt/stack/neutron/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1088}}
------------------------------------------------------------------------------------------------------------

4. Neutron server consumes it:
------------------------------------------------------------------------------------------------------------
Jun 24 10:01:02 test-migrate-1 neutron-server[3685]: DEBUG neutron.plugins.ml2.rpc [None req-62e69669-fa7e-4f70-9e38-38cb3e2c30a7 None None] Device 3704a567-ef4c-4f6d-9557-a1191de07c4a up at agent ovs-agent-test-migrate-1 {{(pid=3698) update_device_up /opt/stack/neutron/neutron/plugins/ml2/rpc.py:269}}
...
Jun 24 10:01:02 test-migrate-1 neutron-server[3685]: DEBUG neutron.db.provisioning_blocks [None req-62e69669-fa7e-4f70-9e38-38cb3e2c30a7 None None] Provisioning for port 3704a567-ef4c-4f6d-9557-a1191de07c4a completed by entity L2. {{(pid=3698) provisioning_complete /opt/stack/neutron/neutron/db/provisioning_blocks.py:133}}
...
Jun 24 10:01:02 test-migrate-1 neutron-server[3685]: DEBUG neutron.db.provisioning_blocks [None req-62e69669-fa7e-4f70-9e38-38cb3e2c30a7 None None] Provisioning complete for port 3704a567-ef4c-4f6d-9557-a1191de07c4a triggered by entity L2. {{(pid=3698) provisioning_complete /opt/stack/neutron/neutron/db/provisioning_blocks.py:140}}
------------------------------------------------------------------------------------------------------------

and then generates the internal event "PROVISIONING_COMPLETE" [10]. This event is consumed by [11], and port_provisioned() updates the port status in the DB to UP [12]. At the end it emits the 'network-vif-plugged' notification and nova continues the migration.

In ml2/ovn we don't have agents, so we don't use ovo_rpc. That's why migration for ml2/ovn doesn't work.
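
Steps 1 to 4 above, and the reason the chain never starts without an agent, can be condensed into a toy simulation. Everything here is hypothetical and heavily simplified: ToyServer and ToyOvsAgent are made-up classes, the RPC round trip is a plain method call, and the provisioning-block machinery is collapsed into a single status update.

```python
class ToyServer:
    """Hypothetical, heavily simplified neutron-server; not real code."""

    def __init__(self, agent=None):
        self.agent = agent
        self.port_status = "DOWN"  # status when commit_port_binding() ends
        self.sent_to_nova = []

    def commit_port_binding(self):
        # Steps 1-2: push the new port revision to agents, if there are any.
        if self.agent is not None:
            self.agent.on_port_updated(self)

    def update_device_up(self):
        # Step 4: an agent reported the device up, provisioning completes.
        self._set_status("ACTIVE")

    def _set_status(self, new_status):
        # Only a real status change produces a notification for nova.
        if new_status != self.port_status:
            self.port_status = new_status
            if new_status == "ACTIVE":
                self.sent_to_nova.append("network-vif-plugged")


class ToyOvsAgent:
    def on_port_updated(self, server):
        # Step 3: the agent sees the update and reports the port as up.
        server.update_device_up()


ovs = ToyServer(agent=ToyOvsAgent())
ovs.commit_port_binding()

ovn = ToyServer(agent=None)  # ml2/ovn: nobody answers the push
ovn.commit_port_binding()
```

In the agent-backed run nova receives 'network-vif-plugged'; in the agentless run the port stays DOWN and the list of sent events stays empty, which matches the stuck migration described in this report.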

This looks like a general bug somewhere between nova and neutron. Neutron shouldn't send the 'network-vif-plugged' notification from the source host while the double binding is being set up, as it does now (step 3 above).
Maybe we could consider using a more specific event name, like 'neutron-vif-inactive-binding-set'?
Maybe nova could watch for the inactive binding being created [13] and then start live-migration instead of waiting for a neutron notification?
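
The second suggestion could look roughly like the sketch below. It is only an illustration of the idea, not a proposal for nova's code: wait_for_inactive_binding and its parameters are made up, and list_bindings is a hypothetical callable standing in for the list-bindings call from [13], assumed to return dicts with 'host' and 'status' keys.

```python
import time


def wait_for_inactive_binding(list_bindings, port_id, dest_host,
                              timeout=5.0, interval=0.1):
    """Poll until dest_host has an INACTIVE binding; return True or False.

    list_bindings is a hypothetical callable (an assumption of this sketch)
    that returns a list of dicts with 'host' and 'status' keys.
    """
    deadline = time.monotonic() + timeout
    while True:
        for binding in list_bindings(port_id):
            if binding["host"] == dest_host and binding["status"] == "INACTIVE":
                return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def fake_list_bindings(port_id):
    # Canned data shaped like the two bindings visible in the agent log.
    return [
        {"host": "test-migrate-2", "status": "ACTIVE"},
        {"host": "test-migrate-1", "status": "INACTIVE"},
    ]


found = wait_for_inactive_binding(fake_list_bindings, "3704a567", "test-migrate-1")
missing = wait_for_inactive_binding(lambda _port_id: [], "3704a567",
                                    "test-migrate-1", timeout=0.2)
```

Polling like this would not depend on the port status ever changing, so it would behave the same with or without agents; whether it is preferable to a dedicated event is exactly the open question raised above.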

Thanks,
Maciej

[1] https://review.opendev.org/#/q/topic:bp/live-migration-portbinding+(status:open+OR+status:merged)
[2] https://review.opendev.org/#/c/558001/
[3] https://blueprints.launchpad.net/nova/+spec/neutron-new-port-binding-api
[4] https://review.opendev.org/#/c/635360/
[5] https://github.com/openstack/neutron/blob/0e2508c8b1a3706a2ade0517f5c5359af2f8bc78/neutron/db/db_base_plugin_v2.py#L173
[6] https://github.com/openstack/neutron/blob/0e2508c8b1a3706a2ade0517f5c5359af2f8bc78/neutron/notifiers/nova.py#L182
[7] https://github.com/openstack/neutron/blob/0e2508c8b1a3706a2ade0517f5c5359af2f8bc78/neutron/plugins/ml2/plugin.py#L505
[8] https://github.com/openstack/neutron/blob/0e2508c8b1a3706a2ade0517f5c5359af2f8bc78/neutron/plugins/ml2/plugin.py#L713
[9] https://github.com/openstack/neutron/blob/0e2508c8b1a3706a2ade0517f5c5359af2f8bc78/neutron/plugins/ml2/ovo_rpc.py#L51
[10] https://github.com/openstack/neutron/blob/0e2508c8b1a3706a2ade0517f5c5359af2f8bc78/neutron/db/provisioning_blocks.py#L140
[11] https://github.com/openstack/neutron/blob/0e2508c8b1a3706a2ade0517f5c5359af2f8bc78/neutron/plugins/ml2/plugin.py#L285
[12] https://github.com/openstack/neutron/blob/0e2508c8b1a3706a2ade0517f5c5359af2f8bc78/neutron/plugins/ml2/plugin.py#L316
[13] https://specs.openstack.org/openstack/neutron-specs/specs/backlog/pike/portbinding_information_for_nova.html#list-bindings

Revision history for this message
Miguel Lavalle (minsel) wrote :

1) I am very glad multiple port binding is being considered for use by networking-ovn.

2) It is true that the existence of agents was assumed during implementation. That wasn't an oversight. A careful read of the spec shows that the functionality was specified for an agent-based implementation. To confirm this, just look at the state diagram here: https://specs.openstack.org/openstack/neutron-specs/specs/backlog/pike/portbinding_information_for_nova.html#activate-rpc-port-update-delete. In that sense, instead of a bug, this should be seen as "Multiple Port Bindings Phase II".

3) From the Neutron implementation perspective, it was never assumed that Nova was going to use the 'network-vif-plugged' event to move to the actual migration stage. That is a decision that was made on the Nova side. In Neutron the only assumption that was made (according to the spec) was that Nova would request the creation of an inactive binding and that upon the completion of it, Nova would proceed with the migration. I agree that the chosen way seems odd.

4) It may be the case that the Neutron agent on the source host is sending a port UP message. That is more an oversight than anything else. During implementation of multiple port binding, the strategy was to be as unintrusive as possible with what was already in place. This strategy was adopted given the fact that port binding is such a fundamental Neutron functionality. Having said that, I don't think the agent sending port UP is the major issue in this bug report. While we may optimize it, it is irrelevant from the point of view of OVN, since OVN doesn't have agents. The core of the issue is how to communicate with Nova properly once the inactive binding has been created, so the migration can continue.

Changed in neutron (Ubuntu):
assignee: nobody → Miguel Lavalle (minsel)
tags: added: live-migration
Miguel Lavalle (minsel)
Changed in neutron (Ubuntu):
assignee: Miguel Lavalle (minsel) → nobody
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to networking-ovn (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/673803

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/673884

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to networking-ovn (master)

Reviewed: https://review.opendev.org/673803
Committed: https://git.openstack.org/cgit/openstack/networking-ovn/commit/?id=229f894ee6aff6b5da8cc399575537feed1c6a49
Submitter: Zuul
Branch: master

commit 229f894ee6aff6b5da8cc399575537feed1c6a49
Author: Maciej Józefczyk <email address hidden>
Date: Wed Jul 31 12:21:42 2019 +0000

    Update port_status to ACTIVE during live-migration

    During live-migration nova waits for 'network-vif-plugged' event
    in order to proceeed with live-migration [1]. While using OVN
    it never happened because the original design assumes usage
    of neutron agents.

    This patch updates port status from DOWN to ACTIVE during
    pre_live_migration state, which in fact for Neutron/ML2
    is done by neutron-ovs-agent, just only to emit
    'network-vif-plugged' notification and allow nova
    to perform live-migration.

    [1] https://review.opendev.org/#/c/558001/

    Change-Id: Ib9fe6e1bfea1d5f62b2f2b6fdb12d16878108c3f
    Related-Bug: 1834045

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/673884
Committed: https://git.openstack.org/cgit/openstack/networking-ovn/commit/?id=6466771aec7c4171ff96f43cdb9a5a506a8e302b
Submitter: Zuul
Branch: master

commit 6466771aec7c4171ff96f43cdb9a5a506a8e302b
Author: Maciej Józefczyk <email address hidden>
Date: Wed Jul 31 19:42:55 2019 +0200

    Enable live-migration tempest test for OVN

    Change-Id: If23d22d6ae5aa6a0e62bae12b2340f3c5c6d798e
    Related-Bug: 1834045

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to networking-ovn (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/675830

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to networking-ovn (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.opendev.org/675831

Miguel Lavalle (minsel)
Changed in neutron:
assignee: nobody → Miguel Lavalle (minsel)
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to networking-ovn (stable/stein)

Reviewed: https://review.opendev.org/675830
Committed: https://git.openstack.org/cgit/openstack/networking-ovn/commit/?id=43a83fb76d15abbef4a8b50cd142e2bae8e168f7
Submitter: Zuul
Branch: stable/stein

commit 43a83fb76d15abbef4a8b50cd142e2bae8e168f7
Author: Maciej Józefczyk <email address hidden>
Date: Wed Jul 31 12:21:42 2019 +0000

    Update port_status to ACTIVE during live-migration

    During live-migration nova waits for 'network-vif-plugged' event
    in order to proceeed with live-migration [1]. While using OVN
    it never happened because the original design assumes usage
    of neutron agents.

    This patch updates port status from DOWN to ACTIVE during
    pre_live_migration state, which in fact for Neutron/ML2
    is done by neutron-ovs-agent, just only to emit
    'network-vif-plugged' notification and allow nova
    to perform live-migration.

    [1] https://review.opendev.org/#/c/558001/

    Change-Id: Ib9fe6e1bfea1d5f62b2f2b6fdb12d16878108c3f
    Related-Bug: 1834045
    (cherry picked from commit 229f894ee6aff6b5da8cc399575537feed1c6a49)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to networking-ovn (stable/rocky)

Reviewed: https://review.opendev.org/675831
Committed: https://git.openstack.org/cgit/openstack/networking-ovn/commit/?id=8e2fd40035467d1fe05a58ce6812436fe99b5040
Submitter: Zuul
Branch: stable/rocky

commit 8e2fd40035467d1fe05a58ce6812436fe99b5040
Author: Maciej Józefczyk <email address hidden>
Date: Wed Jul 31 12:21:42 2019 +0000

    Update port_status to ACTIVE during live-migration

    During live-migration nova waits for 'network-vif-plugged' event
    in order to proceeed with live-migration [1]. While using OVN
    it never happened because the original design assumes usage
    of neutron agents.

    This patch updates port status from DOWN to ACTIVE during
    pre_live_migration state, which in fact for Neutron/ML2
    is done by neutron-ovs-agent, just only to emit
    'network-vif-plugged' notification and allow nova
    to perform live-migration.

    [1] https://review.opendev.org/#/c/558001/

    Change-Id: Ib9fe6e1bfea1d5f62b2f2b6fdb12d16878108c3f
    Related-Bug: 1834045
    (cherry picked from commit 229f894ee6aff6b5da8cc399575537feed1c6a49)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to networking-ovn (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.opendev.org/677698

tags: added: networking-ovn-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to networking-ovn (stable/queens)

Reviewed: https://review.opendev.org/677698
Committed: https://git.openstack.org/cgit/openstack/networking-ovn/commit/?id=3166e8206c7b5f5e2ae64f66c960c5800aa65e75
Submitter: Zuul
Branch: stable/queens

commit 3166e8206c7b5f5e2ae64f66c960c5800aa65e75
Author: Maciej Józefczyk <email address hidden>
Date: Wed Jul 31 12:21:42 2019 +0000

    Update port_status to ACTIVE during live-migration

    During live-migration nova waits for 'network-vif-plugged' event
    in order to proceeed with live-migration [1]. While using OVN
    it never happened because the original design assumes usage
    of neutron agents.

    This patch updates port status from DOWN to ACTIVE during
    pre_live_migration state, which in fact for Neutron/ML2
    is done by neutron-ovs-agent, just only to emit
    'network-vif-plugged' notification and allow nova
    to perform live-migration.

    [1] https://review.opendev.org/#/c/558001/

    Change-Id: Ib9fe6e1bfea1d5f62b2f2b6fdb12d16878108c3f
    Related-Bug: 1834045
    (cherry picked from commit 229f894ee6aff6b5da8cc399575537feed1c6a49)

tags: added: in-stable-queens
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Do we need anything else on top of https://review.opendev.org/673803 to make live migration with OVN work? Reading the commit message for the patch, it feels that the issue is resolved. Please put this back to New if you disagree.

Changed in nova:
status: New → Incomplete
Revision history for this message
Maciej Jozefczyk (maciejjozefczyk) wrote :

This bug can be closed, as https://review.opendev.org/673803 works around the issue for OVN.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in neutron (Ubuntu):
status: New → Confirmed
Revision history for this message
Bartosz Bezak (bbezak) wrote :

please close this bug, as it creates unnecessary confusion ;)

Revision history for this message
Maciej Jozefczyk (maciejjozefczyk) wrote :
Changed in neutron (Ubuntu):
status: Confirmed → Fix Released
Changed in neutron:
status: New → Fix Committed
Revision history for this message
Maciej Jozefczyk (maciejjozefczyk) wrote :
Changed in networking-ovn:
status: New → Fix Released
Changed in neutron:
status: Fix Committed → Fix Released