Comment 8 for bug 1732543

Florian Haas (fghaas) wrote :

I'd like to give this a bump because this is a truly debilitating issue for those dealing with it.

With access to the Neutron database, admins can easily verify whether they are affected by this issue.

First, verify that no HA router network exists for your tenant:

MariaDB [neutron]> SELECT n.id, n.name, ns.network_type, ns.segmentation_id FROM networks n LEFT JOIN networksegments ns ON n.id=ns.network_id WHERE n.name LIKE 'HA network tenant 262ad652ee434a4aa95a77dd588ccaae';
Empty set (0.00 sec)

Next, create a router. If Neutron is configured with HA routers enabled, this will create entries in both the networks and networksegments tables:

MariaDB [neutron]> SELECT n.id, n.name, ns.network_type, ns.segmentation_id FROM networks n LEFT JOIN networksegments ns ON n.id=ns.network_id WHERE n.name LIKE 'HA network tenant 262ad652ee434a4aa95a77dd588ccaae';
+--------------------------------------+----------------------------------------------------+--------------+-----------------+
| id                                   | name                                               | network_type | segmentation_id |
+--------------------------------------+----------------------------------------------------+--------------+-----------------+
| 35974f3d-2328-44e7-bb76-3ff833b59810 | HA network tenant 262ad652ee434a4aa95a77dd588ccaae | vxlan        |           65635 |
+--------------------------------------+----------------------------------------------------+--------------+-----------------+

When you create another router, nothing changes:

MariaDB [neutron]> SELECT n.id, n.name, ns.network_type, ns.segmentation_id FROM networks n LEFT JOIN networksegments ns ON n.id=ns.network_id WHERE n.name LIKE 'HA network tenant 262ad652ee434a4aa95a77dd588ccaae';
+--------------------------------------+----------------------------------------------------+--------------+-----------------+
| id                                   | name                                               | network_type | segmentation_id |
+--------------------------------------+----------------------------------------------------+--------------+-----------------+
| 35974f3d-2328-44e7-bb76-3ff833b59810 | HA network tenant 262ad652ee434a4aa95a77dd588ccaae | vxlan        |           65635 |
+--------------------------------------+----------------------------------------------------+--------------+-----------------+

However, if you delete only one of the routers, the network's record in the networksegments table disappears:

MariaDB [neutron]> SELECT n.id, n.name, ns.network_type, ns.segmentation_id FROM networks n LEFT JOIN networksegments ns ON n.id=ns.network_id WHERE n.name LIKE 'HA network tenant 262ad652ee434a4aa95a77dd588ccaae';
+--------------------------------------+----------------------------------------------------+--------------+-----------------+
| id                                   | name                                               | network_type | segmentation_id |
+--------------------------------------+----------------------------------------------------+--------------+-----------------+
| 35974f3d-2328-44e7-bb76-3ff833b59810 | HA network tenant 262ad652ee434a4aa95a77dd588ccaae | NULL         |            NULL |
+--------------------------------------+----------------------------------------------------+--------------+-----------------+

So while the network still exists in the database, it has no segments left, which means it's no longer mapped to a GRE key, VXLAN VNI, or VLAN ID. It's effectively dead.
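
To check every tenant at once rather than one at a time, the same join can be turned into an anti-join. This is only a sketch: it assumes that all HA networks follow the 'HA network tenant <tenant ID>' naming pattern seen above, and it uses only the tables and columns from the queries in this comment. I have not run it against other Neutron releases.

```sql
-- List HA networks that still exist but have lost their segment record.
-- Assumption: HA networks are named 'HA network tenant <tenant ID>'.
SELECT n.id, n.name
FROM networks n
LEFT JOIN networksegments ns ON n.id = ns.network_id
WHERE n.name LIKE 'HA network tenant %'
  AND ns.network_id IS NULL;  -- no matching row in networksegments
```

An empty result would suggest that no tenant is currently in the broken state. Any row returned is a network like the one shown above, with NULL for network_type and segmentation_id.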

At this point, no new routers in the tenant can be made to function. Users can't even work around this by disabling HA on a router, because doing so requires admin privileges (per Neutron's default policy.json).

Then, only when the last router is deleted, the HA network disappears as well:

MariaDB [neutron]> SELECT n.id, n.name, ns.network_type, ns.segmentation_id FROM networks n LEFT JOIN networksegments ns ON n.id=ns.network_id WHERE n.name LIKE 'HA network tenant 262ad652ee434a4aa95a77dd588ccaae';
Empty set (0.00 sec)

While cleaning up the network when the last router is deleted makes complete sense, removing the networksegments record at any point before that is almost certainly unintentional.

What makes this particularly painful is that a user cannot recover from this error without deleting (and then recreating) *all* routers in their tenant. In other words, it's impossible to fix this without downtime.

For operators currently affected by this, the only workaround seems to be to temporarily disable HA routers altogether (in neutron.conf):

[DEFAULT]
l3_ha = False