HA network tenant network fails upon router delete

Bug #1732543 reported by Steven Davis
This bug affects 10 people
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Unassigned

Bug Description

OpenStack version: Pike
Open vSwitch version: 2.7

Let's say I have an OpenStack project where I've created 2 routers (R1 & R2). Both routers are configured as L3-HA on a pair of network nodes. Each of the 2 routers has an Active namespace on one network node and a Passive namespace on the other. Neutron creates a unique HA network for each project that allows the Active router to send VRRP messages to the Passive router. When using vxlan for the "tenant_network_type", a vxlan vni is assigned to the HA network so that the VRRP east/west traffic can make it between the 2 network nodes.

The assigned vni is discovered using "openstack network show UUID"
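
The same lookup can be scripted. The following is a minimal sketch, assuming the `openstack` CLI is installed, admin credentials are loaded (provider attributes are normally hidden from non-admin users), and the JSON output exposes the VNI under the `provider:segmentation_id` key. The helper names are hypothetical.

```python
import json
import subprocess


def segmentation_id_from_json(network_json: str):
    """Return the segmentation ID (VNI) from `network show` JSON, or None."""
    network = json.loads(network_json)
    # A missing or null value means no VNI is assigned to the network.
    return network.get("provider:segmentation_id")


def show_segmentation_id(network_id: str):
    """Run `openstack network show <id> -f json` and extract the VNI."""
    result = subprocess.run(
        ["openstack", "network", "show", network_id, "-f", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return segmentation_id_from_json(result.stdout)
```

On an affected deployment, `show_segmentation_id` would be expected to return `None` for the HA network after one of the routers has been deleted.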

Now, if we delete, for example, router R2, R1 will still need the HA network with its associated vni so VRRP communication can continue to work. The bug is that if 1 router (either R1 or R2, it doesn't matter) is deleted, the vni gets removed from the HA network entirely. At this point, the remaining router (R1) will continue to work, despite the HA network not functioning any more.

After the network nodes get restarted, though, the broken config is loaded and the routers cease to function because the HA network lacks a vni assignment.

This problem didn't start happening until we upgraded to Pike.

See attached for proposed fix.

Steven Davis (sterdnotshaken) wrote :
Brian Haley (brian-haley) wrote :

Thanks for the bug report. Can you send out a formal patch?

https://docs.openstack.org/neutron/pike/contributor/effective_neutron.html

Changed in neutron:
importance: Undecided → High
status: New → Confirmed
tags: added: l3-ha
Steven Davis (sterdnotshaken) wrote :

Hey Brian,

I'm pretty new to bug reporting. Having read through your link, I'm not quite clear on what you mean by sending out a formal patch. Could you please clarify?

Thanks!

Steve

description: updated
Brian Haley (brian-haley) wrote :

In neutron, as with most (all?) OpenStack components, developers make changes in a git repository, commit them, then run 'git review' to send the change out for review. An automated testing infrastructure will verify it, and other developers can give comments and/or approve the change.

Steven Davis (sterdnotshaken) wrote :
Brian Haley (brian-haley) wrote :

I'm hoping you saw the response to your pull request? Below...

Thank you for contributing to openstack/neutron!

openstack/neutron uses Gerrit for code review.

If you have never contributed to OpenStack before make sure you have read the
getting started documentation:
http://docs.openstack.org/infra/manual/developers.html#getting-started

Otherwise please visit
http://docs.openstack.org/infra/manual/developers.html#development-workflow
and follow the instructions there to upload your change to Gerrit.

sunzuohua (zuohuasun) wrote :

@Steven Davis, the patch you provided will fix this bug, but it will break another function: when an external network is deleted, floating IPs that are not associated should be cleaned up automatically.

Changed in neutron:
assignee: nobody → Steven Davis (sterdnotshaken)
Florian Haas (fghaas) wrote :

I'd like to give this a bump because this is a truly debilitating issue for those dealing with it.

With access to the Neutron database, admins can easily verify whether they are affected by this issue.

First, verify that no HA router network exists for your tenant:

MariaDB [neutron]> SELECT n.id, n.name, ns.network_type, ns.segmentation_id FROM networks n LEFT JOIN networksegments ns ON n.id=ns.network_id WHERE n.name LIKE 'HA network tenant 262ad652ee434a4aa95a77dd588ccaae';
Empty set (0.00 sec)

Next, create a router. If Neutron is configured with HA routers enabled, this will create entries in both the networks and networksegments tables:

MariaDB [neutron]> SELECT n.id, n.name, ns.network_type, ns.segmentation_id FROM networks n LEFT JOIN networksegments ns ON n.id=ns.network_id WHERE n.name LIKE 'HA network tenant 262ad652ee434a4aa95a77dd588ccaae';
+--------------------------------------+----------------------------------------------------+--------------+-----------------+
| id                                   | name                                               | network_type | segmentation_id |
+--------------------------------------+----------------------------------------------------+--------------+-----------------+
| 35974f3d-2328-44e7-bb76-3ff833b59810 | HA network tenant 262ad652ee434a4aa95a77dd588ccaae | vxlan        |           65635 |
+--------------------------------------+----------------------------------------------------+--------------+-----------------+

When you create another router, nothing changes:

MariaDB [neutron]> SELECT n.id, n.name, ns.network_type, ns.segmentation_id FROM networks n LEFT JOIN networksegments ns ON n.id=ns.network_id WHERE n.name LIKE 'HA network tenant 262ad652ee434a4aa95a77dd588ccaae';
+--------------------------------------+----------------------------------------------------+--------------+-----------------+
| id                                   | name                                               | network_type | segmentation_id |
+--------------------------------------+----------------------------------------------------+--------------+-----------------+
| 35974f3d-2328-44e7-bb76-3ff833b59810 | HA network tenant 262ad652ee434a4aa95a77dd588ccaae | vxlan        |           65635 |
+--------------------------------------+----------------------------------------------------+--------------+-----------------+

However, if you delete only one of the routers, the network's record in the networksegments table disappears:

MariaDB [neutron]> SELECT n.id, n.name, ns.network_type, ns.segmentation_id FROM networks n LEFT JOIN networksegments ns ON n.id=ns.network_id WHERE n.name LIKE 'HA network tenant 262ad652ee434a4aa95a77dd588ccaae';
+--------------------------------------+----------------------------------------------------+--------------+-----------------+
| id                                   | name                                               | network_type | segmentation_id |
+--------------------------------------+----------------------------------------------------+--------------+-----------------+
| 35974f3d-2328-44e7-bb76-3ff833b59810 | HA network tenant 262ad652ee434a4aa95a77dd588c...

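
The check described in this comment can be turned into a scan over all tenants. What follows is only a sketch: it reuses the `networks` and `networksegments` table and column names from the queries above, but runs against an in-memory SQLite database with made-up rows (`net-healthy`, `net-broken`, `tenant-a`, `tenant-b` are all hypothetical) so that it is self-contained. Against a real deployment the same SELECT would be run on the Neutron MariaDB database instead.

```python
import sqlite3

# Same join as the query above, widened to all tenants and narrowed to
# HA networks that have no matching row in networksegments.
ORPHAN_QUERY = """
SELECT n.id, n.name, ns.network_type, ns.segmentation_id
FROM networks n
LEFT JOIN networksegments ns ON n.id = ns.network_id
WHERE n.name LIKE 'HA network tenant %'
  AND ns.network_id IS NULL
ORDER BY n.id
"""


def find_ha_networks_without_segment(conn):
    """Return HA networks whose segment record has disappeared."""
    return conn.execute(ORPHAN_QUERY).fetchall()


def build_demo_db():
    """Create a tiny stand-in for the two Neutron tables (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE networks (id TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE networksegments (
            id TEXT PRIMARY KEY,
            network_id TEXT,
            network_type TEXT,
            segmentation_id INTEGER
        );
        INSERT INTO networks VALUES
            ('net-healthy', 'HA network tenant tenant-a'),
            ('net-broken', 'HA network tenant tenant-b'),
            ('net-other', 'private');
        INSERT INTO networksegments VALUES
            ('seg-1', 'net-healthy', 'vxlan', 1001),
            ('seg-2', 'net-other', 'vxlan', 1002);
        """
    )
    return conn


if __name__ == "__main__":
    for row in find_ha_networks_without_segment(build_demo_db()):
        print(row)
```

Any row returned by the scan is an HA network that still exists but has lost its segment, which is the state this bug leaves behind.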

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/525697

Changed in neutron:
assignee: Steven Davis (sterdnotshaken) → Brian Haley (brian-haley)
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/525737

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/526102

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/475955
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=eaf7e65469d38156b2a38f62cf75d9f8015aaa0c
Submitter: Zuul
Branch: master

commit eaf7e65469d38156b2a38f62cf75d9f8015aaa0c
Author: Miguel Lavalle <email address hidden>
Date: Tue Jun 20 23:25:24 2017 +0000

    Move segment deletion back to PRECOMMIT_DELETE

    This essentially reverts commit 12d24abba75ab3b926edbac389437bacc23914dd.

    Making the callback _delete_segments_for_network respond to
    BEFORE_DELETE network event has created some bugs. In one of them,
    it is not possible to delete a routed network, because the segments
    cannot be deleted due to the fact that the associated subnets still
    exist.

    Making _delete_segments_for_network respond to PRECOMMIT_DELETE
    introduces a StaleDataError with the standard attributes of the
    deleted segments. To work around that, network_db is expired and
    read again after notifying the PRECOMMIT_DELETE event in
    delete_network in the DB core plug-in.

    This also fixes an issue where we could delete the segment ID
    of the l3-ha network when deleting a router, leaving all other
    routers non-functioning. Moving this to PRECOMMIT_DELETE fixes
    it since it is done after we have checked that the network is
    not in use and can be deleted.

    Closes-Bug: #1697324
    Closes-Bug: #1732543

    Change-Id: I7c3c4654f183b317647a28d599a538fe460db68f
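
The ordering problem described in this commit message can be illustrated with a toy model. This is not Neutron code: `ToyPlugin`, `NetworkInUse`, and the event strings are hypothetical stand-ins, and the sketch assumes that deleting a router triggers an attempt to delete the HA network, which is refused while another router still has ports on it.

```python
class NetworkInUse(Exception):
    """Raised when a network still has ports attached."""


class ToyPlugin:
    def __init__(self, segment_event):
        # Event on which the segment-cleanup callback is registered:
        # "before_delete" (buggy) or "precommit_delete" (fixed).
        self.segment_event = segment_event
        self.segments = {}  # network_id -> segmentation ID
        self.ports = {}     # network_id -> number of ports still attached

    def _notify(self, event, network_id):
        # Stand-in for the segment-deletion callback.
        if event == self.segment_event:
            self.segments.pop(network_id, None)

    def delete_network(self, network_id):
        # Fires before any validation has happened.
        self._notify("before_delete", network_id)
        if self.ports.get(network_id, 0) > 0:
            raise NetworkInUse(network_id)
        # Fires only once the network is known to be deletable.
        self._notify("precommit_delete", network_id)
        self.ports.pop(network_id, None)


def delete_one_of_two_routers(segment_event):
    """Return the HA network's segmentation ID after deleting router R2."""
    plugin = ToyPlugin(segment_event)
    plugin.segments["ha-net"] = 65635
    plugin.ports["ha-net"] = 2  # HA ports of the remaining router R1
    try:
        plugin.delete_network("ha-net")  # cleanup attempt after R2 is gone
    except NetworkInUse:
        pass  # expected: R1 still uses the HA network
    return plugin.segments.get("ha-net")
```

With the callback on `"before_delete"` the segment is gone even though the network deletion was refused; with `"precommit_delete"` the in-use check runs first and the segment survives.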

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/526102
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9dff53ce65e44a75d953ed9b6c6859184b0995a8
Submitter: Zuul
Branch: stable/pike

commit 9dff53ce65e44a75d953ed9b6c6859184b0995a8
Author: Miguel Lavalle <email address hidden>
Date: Tue Jun 20 23:25:24 2017 +0000

    Move segment deletion back to PRECOMMIT_DELETE

    This essentially reverts commit 12d24abba75ab3b926edbac389437bacc23914dd.

    Making the callback _delete_segments_for_network respond to
    BEFORE_DELETE network event has created some bugs. In one of them,
    it is not possible to delete a routed network, because the segments
    cannot be deleted due to the fact that the associated subnets still
    exist.

    Making _delete_segments_for_network respond to PRECOMMIT_DELETE
    introduces a StaleDataError with the standard attributes of the
    deleted segments. To work around that, network_db is expired and
    read again after notifying the PRECOMMIT_DELETE event in
    delete_network in the DB core plug-in.

    This also fixes an issue where we could delete the segment ID
    of the l3-ha network when deleting a router, leaving all other
    routers non-functioning. Moving this to PRECOMMIT_DELETE fixes
    it since it is done after we have checked that the network is
    not in use and can be deleted.

    Closes-Bug: #1697324
    Closes-Bug: #1732543

    Change-Id: I7c3c4654f183b317647a28d599a538fe460db68f

tags: added: in-stable-pike
Florian Haas (fghaas) wrote :

To everyone involved in fixing this — you really made a difference. Thanks a lot!

Steven Davis (sterdnotshaken) wrote : Re: [Bug 1732543] Re: HA network tenant network fails upon router delete

This is great news!

Thank you!

On Thu, Dec 14, 2017 at 2:41 PM Florian Haas <email address hidden> wrote:

> To everyone involved in fixing this — you really made a difference.
> Thanks a lot!

tags: added: neutron-proactive-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.0.0.0b3

This issue was fixed in the openstack/neutron 12.0.0.0b3 development milestone.

tags: removed: neutron-proactive-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.3

This issue was fixed in the openstack/neutron 11.0.3 release.

Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

So this bug can be closed.

Slawek Kaplonski (slaweq) wrote : auto-abandon-script

This bug has had a related patch abandoned and has been automatically un-assigned due to inactivity. Please re-assign yourself if you are continuing work or adjust the state as appropriate if it is no longer valid.

Changed in neutron:
assignee: Brian Haley (brian-haley) → nobody
tags: added: timeout-abandon
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.openstack.org/525697
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Slawek Kaplonski (slaweq) wrote : auto-abandon-script

This bug has had a related patch abandoned and has been automatically un-assigned due to inactivity. Please re-assign yourself if you are continuing work or adjust the state as appropriate if it is no longer valid.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.openstack.org/525737
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.
