[RFE] Unable to create a router that's both HA and distributed

Bug #1365473 reported by Assaf Muller
This bug affects 6 people
Affects   Status         Importance   Assigned to    Milestone
neutron   Fix Released   Wishlist     Carl Baldwin

Bug Description

An etherpad that summarizes the current progress of this bug:
https://etherpad.openstack.org/p/DVR_HA_Routers

There are several issues with the L3 schedulers and L3 agent that need to be addressed.

This bug is dependent on:
https://bugs.launchpad.net/neutron/+bug/1365476

Agent side patch:
https://review.openstack.org/#/c/196893/

Server side patch:
https://review.openstack.org/#/c/143169/

Changed in neutron:
importance: Undecided → High
Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

This looks more like a blueprint for Kilo than a bug.
For now I think it makes sense to explicitly forbid such an operation via the API.

tags: added: api
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/122024

Changed in neutron:
assignee: nobody → Mike Smith (michael-smith6)
status: New → In Progress
Revision history for this message
Assaf Muller (amuller) wrote : Re: Unable to create a router that's both HA and distributed

Eugene, it is blocked at the L3 plugin level.

Changed in neutron:
assignee: Mike Smith (michael-smith6) → nobody
status: In Progress → New
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Not sure we can get this into Juno

Changed in neutron:
status: New → Confirmed
importance: High → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/139686

Changed in neutron:
assignee: nobody → Mike Smith (michael-smith6)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/143169

Changed in neutron:
assignee: Mike Smith (michael-smith6) → Rajeev Grover (rajeev-grover)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/143719

Changed in neutron:
assignee: Rajeev Grover (rajeev-grover) → Mike Smith (michael-smith6)
Changed in neutron:
assignee: Mike Smith (michael-smith6) → Rajeev Grover (rajeev-grover)
Changed in neutron:
assignee: Rajeev Grover (rajeev-grover) → Mike Smith (michael-smith6)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/143719
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=6ee81d35fc269863f1fe08c57f3672fdc48197cf
Submitter: Jenkins
Branch: master

commit 6ee81d35fc269863f1fe08c57f3672fdc48197cf
Author: rajeev <email address hidden>
Date: Tue Dec 23 13:49:19 2014 -0500

    HA for DVR - schema migration and change

    To support HA for DVR SNAT, default SNAT has to be schedulable
    on multiple L3 agents. The csnat_l3_agent_bindings table is being
    modified to include l3_agent_id in the primary key.
    The migration script and Class definition update is included in
    this patch. For modularity and code management, HA/DVR methods
    that would make use of this change will be included in a different
    patch.

    Partial-bug: #1365473
    Change-Id: Idfe93cace0c1b633be6e786206fbec6e1f3c13cd
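The effect of the key change described in this commit message can be sketched with an in-memory SQLite table. This is only an illustration under stated assumptions, not the actual migration: the commit names `csnat_l3_agent_bindings` and `l3_agent_id`, while the `router_id` column, the helper names and the sample IDs are assumed here.

```python
import sqlite3

# Simplified stand-in for the binding table. Only l3_agent_id is named in the
# commit message; router_id is an assumption made for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE csnat_l3_agent_bindings (
        router_id   TEXT NOT NULL,
        l3_agent_id TEXT NOT NULL,
        PRIMARY KEY (router_id, l3_agent_id)
    )
    """
)


def bind_snat(router_id, agent_id):
    """Bind the router's SNAT service to one more L3 agent (hypothetical helper)."""
    conn.execute(
        "INSERT INTO csnat_l3_agent_bindings VALUES (?, ?)",
        (router_id, agent_id),
    )


def agents_for(router_id):
    """Return the agents hosting SNAT for a router, sorted by agent id."""
    rows = conn.execute(
        "SELECT l3_agent_id FROM csnat_l3_agent_bindings "
        "WHERE router_id = ? ORDER BY l3_agent_id",
        (router_id,),
    )
    return [row[0] for row in rows]


bind_snat("router-1", "agent-a")
bind_snat("router-1", "agent-b")  # accepted: the key is the (router, agent) pair
```

With the agent ID in the key, one router can have several SNAT bindings, while the same router/agent pair still cannot be inserted twice.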

Changed in neutron:
assignee: Mike Smith (michael-smith6) → Rajeev Grover (rajeev-grover)
Changed in neutron:
assignee: Rajeev Grover (rajeev-grover) → Mike Smith (michael-smith6)
Changed in neutron:
assignee: Mike Smith (michael-smith6) → Rajeev Grover (rajeev-grover)
Changed in neutron:
assignee: Rajeev Grover (rajeev-grover) → Mike Smith (michael-smith6)
Changed in neutron:
assignee: Mike Smith (michael-smith6) → Rajeev Grover (rajeev-grover)
Changed in neutron:
assignee: Rajeev Grover (rajeev-grover) → Mike Smith (michael-smith6)
Changed in neutron:
assignee: Mike Smith (michael-smith6) → Rajeev Grover (rajeev-grover)
Changed in neutron:
assignee: Rajeev Grover (rajeev-grover) → Mike Smith (michael-smith6)
Changed in neutron:
assignee: Mike Smith (michael-smith6) → Rajeev Grover (rajeev-grover)
Changed in neutron:
assignee: Rajeev Grover (rajeev-grover) → Mike Smith (michael-smith6)
Changed in neutron:
assignee: Mike Smith (michael-smith6) → Rajeev Grover (rajeev-grover)
Changed in neutron:
assignee: Rajeev Grover (rajeev-grover) → Mike Smith (michael-smith6)
Changed in neutron:
assignee: Mike Smith (michael-smith6) → Rajeev Grover (rajeev-grover)
Changed in neutron:
assignee: Rajeev Grover (rajeev-grover) → Mike Smith (michael-smith6)
Changed in neutron:
assignee: Mike Smith (michael-smith6) → Rajeev Grover (rajeev-grover)
Revision history for this message
Assaf Muller (amuller) wrote : Re: Unable to create a router that's both HA and distributed

There's operator buy-in for this, bumping priority to high.

I think that both DVR and L3-HA made good progress with respect to stabilization during Kilo and this is reasonable to push for Liberty.

Changed in neutron:
importance: Medium → High
description: updated
Changed in neutron:
assignee: Rajeev Grover (rajeev-grover) → Adolfo Duarte (adolfo-duarte)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Assaf Muller (<email address hidden>) on branch: master
Review: https://review.openstack.org/139686

Assaf Muller (amuller)
Changed in neutron:
milestone: none → liberty-3
description: updated
tags: added: rfe
Revision history for this message
Kyle Mestery (mestery) wrote : Re: Unable to create a router that's both HA and distributed

What is the current status of this bug? We'd like this to land in Liberty as operators clearly want this.

Revision history for this message
Adolfo Duarte (adolfo-duarte) wrote :

This defect is in its final phase of review.
Hopefully it will be ready for merging in the next few days.

Revision history for this message
Assaf Muller (amuller) wrote :

We're closer now that bug 1365476 (L3 HA + l2pop) has (Finally) been fixed today. I think that the expectation to merge this over the next few days is perhaps optimistic.

I would like to see a fullstack test for L3 HA + DVR. The reason I haven't communicated this earlier is that fullstack isn't there yet, but it's close. The first patch in the series is very close:
https://review.openstack.org/#/c/188221/

We will need some of the other patches in the series before we can add DVR support (Or a DVR + L3 HA test). I'm working on fullstack as my main priority because it's blocking a bunch of different efforts: L3 HA + DVR, distributed DHCP and QoS. We don't have coverage for L3 HA in Tempest, and we have nothing in the current L3 HA + DVR patches that shows that the integration is actually working, which is why I think that a fullstack test is a requirement.

Adolfo, you will need to create a dependency between the two L3 HA + DVR patches, and in the later patch add a full stack test. I'm doing everything I can to speed fullstack along.

Revision history for this message
Adolfo Duarte (adolfo-duarte) wrote :

You are correct, Assaf, but I am always an optimist. Let me rephrase what I wanted to say. The basic functionality of the server and agent is pretty far along. Currently the agent side can handle receiving a router with distributed=True and ha=True, and it goes ahead and sets up the router in the service node (the SNAT node), i.e. it brings up the external gateway port (qg) and the internal SNAT gateway (sg) in the correct namespaces. It also sets up the keepalived process to monitor them and configures them with the correct L3 information. My testing so far shows that it fails over correctly.
It can also handle the different combinations of create router, set gateway, add interface, remove interface, repeat in a different order, etc.

The server side is obviously sending the correct information for the routers as well.

In other words, the two patches currently set up the router interfaces correctly and allow for the manipulation of the router. Whether or not the full system can correctly utilize the routers is a different story. For that, as you say, we need a few other fixes to line up.

We do need to test this on a full stack. Internally I have been testing on a setup with three nodes (two service and one compute), but I agree with you that there needs to be a fullstack test for it so we can make sure it works. Currently I am frankly focusing on making sure it does not break any of the current functionality of DVR and HA by themselves. In other words, "do no evil". We need to make sure this does not break any current functionality.

I will create the dependency of the server patch on the agent patch, as the agent patch can actually exist on its own: it won't do anything without the server patch allowing for the creation of a DVR/HA router.

Revision history for this message
Adolfo Duarte (adolfo-duarte) wrote :

I have been doing some testing, and although the fix for 1365476 (L3 HA + l2pop) is merged (https://review.openstack.org/#/c/141114/), on the master branch currently, if I do an "ifconfig ... down" on the HA interfaces of the SNAT nodes, the ha_router_agent_port_bindings database does not fail over, so the datapath switch does not happen.

Is "ifconfig ... down" on the HA network interfaces not enough to trigger a failover? How should I test this?

Revision history for this message
Assaf Muller (amuller) wrote :

@Adolfo: I don't know about DVR + L3 HA integration, but if you use vanilla L3 HA, and you ifconfig .. down the HA interface of the *master* instance of the router, it should failover correctly with l2pop on. If you look at the L3 agent logs on that machine, after roughly 8-9 seconds it should log that router %s became standby, and around 10 seconds after that it should log that it's updating the server that it became standby. On the backup node you should see the same logs but with 'active'. If you then execute neutron l3-agent-list-hosting-router %s, you should see the ha_state column update.
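Since the state change takes on the order of 20 seconds to show up, a check for it has to poll rather than look once. A minimal sketch of such a helper follows; `wait_for_ha_state` and its parameters are hypothetical names, and `get_state` is a caller-supplied callable that in a real test might wrap `neutron l3-agent-list-hosting-router` and pick out the ha_state column.

```python
import time


def wait_for_ha_state(get_state, expected, timeout=30.0, interval=1.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns `expected` or `timeout` seconds pass.

    get_state is supplied by the caller (hypothetical here); it should return
    the current ha_state of the router on the agent under test.
    """
    deadline = clock() + timeout
    while True:
        if get_state() == expected:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The default timeout of 30 seconds is an assumption chosen to leave some margin over the roughly 20 seconds described above.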

If this isn't happening with the DVR + L3 HA integration patches then there's an issue with one of those two patches.

Revision history for this message
Adolfo Duarte (adolfo-duarte) wrote :

Thanks. I was testing it wrong. I was not waiting long enough for the failover to register in the database. It does after a few seconds. I was also reaching the VM from the DHCP namespace, and I guess when it fails over (from master to active) the DHCP namespace lost connectivity to the VM. So all in all it was probably user error.

Thanks for the insight

Revision history for this message
Assaf Muller (amuller) wrote :

Can you update https://bugs.launchpad.net/neutron/+bug/1365476 with this conclusion? I don't want Googlers to get the wrong impression. Also you can always pop on IRC to ask these sorts of questions :)

Changed in neutron:
assignee: Adolfo Duarte (adolfo-duarte) → Swaminathan Vasudevan (swaminathan-vasudevan)
Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Adolfo Duarte (adolfo-duarte)
Revision history for this message
John Schwarz (jschwarz) wrote :

I've checked out the latest patchset and tried to get a DVR+HA router to work on my setup last week. Overall it works (2 nodes, connectivity established in the snat- namespaces, instances can ping external gateway, etc), however there is a problem where sometimes, upon setting an external gateway to the router, some ovs flows are deleted or one/more of the vxlan- devices get deleted, or both of these situations occur at the same time.

As said, I've been working on this for the last week so I'm deep in the code now. Will update once I find the root cause of this.

Revision history for this message
Adolfo Duarte (adolfo-duarte) wrote :

Hi John, thanks for the testing. I updated the agent and server patches with the latest review comments yesterday. As you mentioned, it seems to be working.
About the issue you are seeing: do the vxlan tunnels disappear when you add an external gateway?
I have noticed a similar behavior, but what I have seen might be outside of the DVR/HA/OpenStack domain.
What I have seen is that when I am using virtual hosts to do my testing, if *ANY* interface gets added to the br-ex bridge, it drops the br-tun interfaces. I can't figure out why it happens. It just does. I have been able to work around it by adding a "dummy" interface to the br-ex bridge before I start doing anything, like so: "sudo ovs-vsctl add-port br-ex dummy-interface"
I said I am not sure if it has anything to do with OpenStack at all because I can reproduce the problem by adding any interface *after* the br-tuns are up. Just go to the command line and do something like "ovs-vsctl add-port br-ex some-test-interface" and then the tunnels drop. I thought it was something different in my setup. I am using OpenStack in OpenStack to do my testing, so I thought that perhaps STP or something was dropping my tunnels.
The funny thing is that, like I said, the problem goes away if I add dummy interfaces *before* any stacking.

I'll open a bug on what I have seen to track it, because for me it actually happens on ANY OpenStack deployment using vxlan (legacy, DVR, or HA). Please let me know if this is the same problem you are seeing. Like I said, all you have to do is add an interface to br-ex by hand, and the br-tuns will drop.

Thanks.

Revision history for this message
John Schwarz (jschwarz) wrote :

@Adolfo, what you're saying sounds very interesting, and I'll have a deep look at it on Sunday to see if the two cases are related. Even if it's some other Neutron bug and we find a fix for it, I feel it's blocking this effort, since if I've been able to steadily reproduce this quite easily (without knowing much about DVR+HA), others will be able to run into this in a heartbeat as well (this might have been the error that was reported on the ML a few weeks back).

I'm reproducing the vxlan dropping by clearing and re-setting the gateway a few times. After a few times it's bound to happen that the vxlan tunnel is missing from one/both of the nodes. Also (I forgot to mention this), when this happens I'm noticing the tags of the sg- devices are set to 4095.

Revision history for this message
Assaf Muller (amuller) wrote :

> Also (I forgot to mention this), when this happens I'm noticing the tags of the sg- devices are set to 4095.

I'm glad you said that John. This is an educated guess, but that is probably the root cause, and not the other way around. When the OVS agent tags a port with the VLAN tag of 4095, it will also put the Neutron port status from ACTIVE to DOWN. L2pop in turn will tear down tunnels if there are no remaining ACTIVE (That's the important part) ports on a host.
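The chain of events described here (tag 4095, port goes DOWN, no ACTIVE ports left on the host, tunnels torn down) can be written out as a toy model. This is only a sketch of the reasoning, not l2pop's actual code; the function names, host names and port names are made up for illustration.

```python
DEAD_VLAN_TAG = 4095  # the tag reported in this thread for the affected ports


def port_status(vlan_tag):
    """Toy rule: a port tagged 4095 is reported DOWN, anything else ACTIVE."""
    return "DOWN" if vlan_tag == DEAD_VLAN_TAG else "ACTIVE"


def hosts_keeping_tunnels(ports_by_host):
    """Return the hosts that still have at least one ACTIVE port.

    ports_by_host maps a host name to {port_name: vlan_tag}. In this model a
    host with no ACTIVE port loses its tunnels.
    """
    keep = set()
    for host, ports in ports_by_host.items():
        if any(port_status(tag) == "ACTIVE" for tag in ports.values()):
            keep.add(host)
    return keep
```

In this model a node whose only port is an sg- device tagged 4095 drops out of the returned set, which matches the missing-tunnel symptom.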

So, the next step is to be able to reproduce (Which it sounds like you can), then understand why the OVS agent is putting those ports in the 4095 VLAN. One thing you can try is grepping for the port ID in the neutron-server.log. It often has more information about such events. Failing that, just pdb through the OVS agent (We currently don't log the reason the OVS agent puts a port in the 4095 VLAN, which is a separate issue) to see the stack, which will explain why it's doing that.

Revision history for this message
Adolfo Duarte (adolfo-duarte) wrote :

John, I am trying to reproduce the issue here and I just realized I need a couple more questions answered.
You are saying that the problem is that the vxlan tunnels drop. And this is an unexpected drop, right? What I mean is that vxlan tunnels are only active ("up") when needed, so vxlan tunnels dropping is quite normal in the right situation.
For example, if you clear the gateway and there is no need for a compute node to send traffic to the "network" node, the tunnel will drop. That is expected. It will be recreated when needed.
Again, vxlan tunnels do not stay up all the time; they come down and up as necessary.

For example, if I bring up a VM and set the gateway, a vxlan tunnel will be created from the compute node to the network node which has the "physical" gateway in it. And then if you clear the gateway, and there is no other service (like DHCP) needed by the VM in the compute node on that network node, the vxlan tunnel will drop.

Can you give a detailed list of the steps you are following so I can reproduce the problem? That way I know where I am supposed to look.

Thanks.

Revision history for this message
Adolfo Duarte (adolfo-duarte) wrote :

Repeated this test on the master branch:
Two network nodes, one running q-vpn and the other q-l3, plus a third node as the compute node.
Created an HA router (neutron router-create ha --distributed=False --ha=True)
Created a public network (neutron net-create public --router:external; neutron subnet-create public 123.0.0.0/24)
Created a private network (neutron net-create private; neutron subnet-create private 103.0.0.0/24 --name private)
Spun up a VM: nova boot --etc....--nic net-id=privatenetid
Added an interface to the router: neutron router-interface-add ha private
Then logged into the VM and started pinging 103.0.0.1 (the gateway).
It works fine, but then I ran the following commands a couple of times:

neutron router-gateway-set ha public
neutron router-gateway-clear ha
neutron router-gateway-set ha public
neutron router-gateway-clear ha
neutron router-gateway-set ha public
neutron router-gateway-clear ha

After a few times the VM has problems pinging the 103.0.0.1 gateway. Sometimes it takes up to a minute for the VM to be able to ping 103.0.0.1 (the gateway) after either gateway-set or gateway-clear. Sometimes it never can ping (5 minutes).

It seems that something is coming in and changing the flow rules or vxlan tunnels.
I will repeat with DVR (ha=False) to see what the behavior is there.
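For anyone who wants to repeat the set/clear cycle without typing it by hand, a small driver along these lines should do. It is a sketch, not a tested reproducer: the function names are hypothetical, and it assumes the `neutron` CLI is on the PATH with credentials already loaded.

```python
import subprocess


def gateway_flap_commands(router, ext_net, iterations):
    """Build the alternating gateway set/clear command list from the steps above."""
    commands = []
    for _ in range(iterations):
        commands.append(["neutron", "router-gateway-set", router, ext_net])
        commands.append(["neutron", "router-gateway-clear", router])
    return commands


def run_all(commands, runner=subprocess.check_call):
    """Run each command in order; `runner` can be swapped out for a dry run."""
    for argv in commands:
        runner(argv)


# Example (needs a working neutron client):
# run_all(gateway_flap_commands("ha", "public", 3))
```

Passing `runner=print` gives a dry run that only prints the commands.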

Revision history for this message
John Schwarz (jschwarz) wrote :

@Assaf, looking at the q-svc log and searching for the port id doesn't show the reason that it's not working. I'm working my way through the OVS agent with pdb - it's taking some time though ;-)

@Adolfo,
It's my understanding that in an HA router, the HA ports should always have connectivity between them, so the tunnels should *always* be up. Am I wrong in this assumption? Something that supports this assumption, btw, is that sometimes when the setup is all good and dandy - the vxlan is there. Even before I set the gateway the first time - the tunnels are there.

As for the steps I'm doing to reproduce - they look very much the same as the ones you posted in your second comment. Only, I'm not booting a VM because of my assumption (this helps to 'cut down' the costs of each iteration). My setup consists of only 2 nodes: an all-in-one, and a compute node with q-agt and q-l3.

Waiting for your report on whether this happens on a DVR-non-HA router :)

John.

Revision history for this message
John Schwarz (jschwarz) wrote :

@Assaf, also, the fact that the 'sg-' device has a tag of 4095 doesn't always happen. In fact, I've just encountered a situation where the 'sg-' device is set properly, but the tunnel device itself is missing.

Revision history for this message
Assaf Muller (amuller) wrote :

Consider turning off l2pop just to be able to continue testing.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/218669

Revision history for this message
John Schwarz (jschwarz) wrote : Re: Unable to create a router that's both HA and distributed

The related patch fixes this issue for me (I'm no longer witnessing this on my setup), though I've not checked if the problem exists for DVR routers on their own (without these patches for DVR+HA). Such checks will be done tomorrow morning, but I don't think it's something caused by this patch (hunch says it affects the entirety of DVR).

Adolfo, I'll be happy if you could cherry-pick this on top of your patches and see if it fixes the problems for you as well. Note that you'll probably want to base yourself on https://review.openstack.org/#/c/217927/ if it doesn't merge by tomorrow.

On a different note, I'm noticing a different issue where after a number of clear/set iterations, some rules get deleted/added, causing traffic to not go through. I think that this is a different issue from what we've been investigating so far, though I've not checked it thoroughly enough to justify this statement. I have diffs of states which show the rules that are added/deleted, so it should go smoothly. Hopefully tomorrow will prove to be as fruitful a day as today was :)

John.

Revision history for this message
John Schwarz (jschwarz) wrote :

A summary of issues I'm experiencing (just so that we are on the same page):

1) The problem in the previous comment, where some rules are left over or deleted when they shouldn't be, causing connectivity errors which are fixed by manually deleting all the flows in br-int (this causes a re-creation of all the flows in br-tun).

2) Removing the gateway of a DVR+HA router does not delete the 'sg-' device. This does not reproduce on a normal DVR router, which means it's a regression caused by this patch. I have a fix locally and already posted it on [1].

3) Restarting the ovs agent after everything is set up deletes all the flow rules (as it should - they're "stale") but it does not recreate them. This probably has something to do with the cookies of some of the flows not being set properly (they are 0 before restarting the ovs agent and are deleted after the restart).

[1]: https://review.openstack.org/#/c/196893/19/neutron/agent/l3/dvr_edge_ha_router.py

John.

Revision history for this message
Adolfo Duarte (adolfo-duarte) wrote :

@John, I did notice the same behavior of some rules being deleted and not recreated while I was doing testing. It happened to me on the master branch with HA-only routers. In my case I was doing some testing by setting and unsetting the external gateway, and ran into the previously mentioned issue about the tunnels which carry the HA network dropping and breaking HA.
After that I noticed that some of the rules being created on the br-tun and br-int switches were incorrect or missing. I have not been able to reproduce the exact same behavior again, but it looked to me like for some reason the rules were being written on the incorrect host, as if l2pop was confused as to which tables were on which host.

On another note, John, when you say two comments above:
"The related patch fixes this issue for me"
which issue do you mean? The sg interface not being deleted, or the
"Fix Prefix delegation router deletion key error"

Either way I'll cherry-pick and test. Just wondering if I should also integrate your code from comment [1] above.

Thanks.

Revision history for this message
John Schwarz (jschwarz) wrote :

I meant the sg devices not being cleaned up. I think it should be integrated and reviewed, but a fix for this should go in either way.. :-)

What you wrote about the flows being incorrect in an HA only router sounds like what I saw. It deserves another look IMO.

Revision history for this message
Adolfo Duarte (adolfo-duarte) wrote :

Integrated the fix proposed by John and pushed the code. Testing suggests it is working as expected.
Some of the issues mentioned above (flows being incorrect, and having to be deleted by hand) seem to be present in the master branch, so they might point to a more general problem with Neutron routing (DVR, HA, and DVR/HA).

Will continue testing to try and find a root cause.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

this surely will need more time

Changed in neutron:
milestone: liberty-3 → liberty-rc1
Revision history for this message
Artur Korzeniewski (artur-korzeniewski) wrote :

On a plain HA router, after a sequence of add/clear gateway, I end up with a dead OVS flow that drops packets from the deleted qg port...

In the neutron OVS agent logs, I'm seeing that the problem is with the VLAN assigned to the qg port:
2015-09-03 15:26:51.239 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-207921ff-5b91-48b7-9625-03692368e9ba None None] Assigning 6 as local vlan for net-id=d52d878c-8d29-4146-9309-36c00f688274
2015-09-03 15:26:53.172 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-207921ff-5b91-48b7-9625-03692368e9ba None None] Port 'qg-1d20df8a-f0' has lost its vlan tag '6'!

The second line repeats at each OVS agent loop.
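To see how often each port hits this, the repeated message can be counted per port from the agent log. A rough sketch, assuming the log lines look exactly like the two quoted above; the regular expression and function name are mine, not part of Neutron.

```python
import re
from collections import Counter

# Matches the INFO message quoted above, e.g.
#   Port 'qg-1d20df8a-f0' has lost its vlan tag '6'!
LOST_TAG = re.compile(
    r"Port '(?P<port>[^']+)' has lost its vlan tag '(?P<tag>\d+)'!"
)


def count_lost_tags(log_lines):
    """Count 'has lost its vlan tag' messages per (port name, vlan tag)."""
    counts = Counter()
    for line in log_lines:
        match = LOST_TAG.search(line)
        if match:
            counts[(match.group("port"), int(match.group("tag")))] += 1
    return counts
```

Feeding it the lines of the OVS agent log gives one counter entry per port and tag, so a port stuck in the loop stands out by its count.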

After stopping the OVS agent, the VLAN for the qg port is set to 28... restarting the OVS agent settles the VLAN for the qg- port at 28 and the VLAN is not reassigned to 6...

Any ideas?

Revision history for this message
Assaf Muller (amuller) wrote :

@Artur, can you describe a reproducer? I'm unclear on a few details. If the last step is clearing the gateway, then it looks like the 'qg' device is not being deleted properly? The OVS agent is not really supposed to deal with the 'qg' port at that point.

Revision history for this message
Artur Korzeniewski (artur-korzeniewski) wrote :

The reproduction step is simply to add/remove the gateway for an HA router:
neutron router-gateway-set demo-router ext-net
neutron router-gateway-clear demo-router
neutron router-gateway-set demo-router ext-net
neutron router-gateway-clear demo-router
neutron router-gateway-set demo-router ext-net

The problem of losing the VLAN repeats constantly during the neutron OVS agent rpc loop.
Log:
http://paste.openstack.org/show/444597/

Only after clearing the gateway does the VLAN reclaiming stop. And we have leftovers in the OVS flows:
2 0 0 2 in_port=34 drop
3 0 0 2 in_port=36 drop
Ports 34 and 36 are the IDs of the removed qg- ports.

The log and situation are from the standby L3 agent.
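Leftovers of this kind can be spotted by comparing the drop flows against the ofports that still exist on the bridge. A minimal sketch, assuming flow lines that contain `in_port=N` and end in `drop` like the two shown above; the function name and the idea of passing in the live ofports are illustrative only.

```python
import re

IN_PORT = re.compile(r"\bin_port=(\d+)\b")


def stale_drop_ports(flow_lines, live_ofports):
    """Return ofports that have a drop flow but no longer exist on the bridge.

    flow_lines are text lines in the abbreviated form quoted above;
    live_ofports is the set of ofport numbers currently present.
    """
    stale = []
    for line in flow_lines:
        match = IN_PORT.search(line)
        if match and line.rstrip().endswith("drop"):
            ofport = int(match.group(1))
            if ofport not in live_ofports:
                stale.append(ofport)
    return stale
```

Any port number it returns belongs to a device that is gone while its drop rule is still installed.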

Revision history for this message
Artur Korzeniewski (artur-korzeniewski) wrote :

OK, so what I have found is a situation where the qg- port is added to br-int and assigned a VLAN (configuration with int-br-ex and phy-br-ex veth).

In one iteration of rpc_loop, the ovs_neutron_agent processes the deleted port via the process_deleted_ports() method, marking the qg- port as dead (an OVS flow rule to drop the traffic), and in another iteration it processes the removed port via the treat_devices_removed() method.

In the first iteration, the port deletion is triggered by the port_delete() method:
2015-09-04 14:16:20.337 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-e43234b1-633b-404d-92d0-0f844dadb586 admin 0f6c0469ea6e4d95a27782c46021243a] port_delete message processed for port 1c749258-74fb-498b-9a08-1fec6725a1cf from (pid=136030) port_delete /opt/openstack/neutron/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:410

and in the second iteration, the device removal is triggered by ovsdb:
2015-09-04 14:16:20.848 DEBUG neutron.agent.linux.ovsdb_monitor [-] Output received from ovsdb monitor: {"data":[["bab86f35-d004-4df6-95c2-0f7432338edb","delete","qg-1c749258-74",49,["map",[["attached-mac","fa:16:3e:99:37:68"],["iface-id","1c749258-74fb-498b-9a08-1fec6725a1cf"],["iface-status","active"]]]]],"headings":["row","action","name","ofport","external_ids"]}
 from (pid=136030) _read_stdout /opt/openstack/neutron/neutron/agent/linux/ovsdb_monitor.py:50

Log:
http://paste.openstack.org/show/445479/
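The ovsdb monitor line above is plain JSON, so the deleted devices can be pulled out of it directly when correlating the two iterations. A short sketch, assuming the message always has the "headings" and "data" layout shown in that log line; the function name is hypothetical.

```python
import json


def deleted_ports(monitor_output):
    """Extract (name, ofport, iface-id) for every 'delete' row in the message."""
    message = json.loads(monitor_output)
    headings = message["headings"]
    result = []
    for row in message["data"]:
        record = dict(zip(headings, row))
        if record["action"] != "delete":
            continue
        # external_ids is encoded as ["map", [[key, value], ...]]
        external_ids = dict(record["external_ids"][1])
        result.append(
            (record["name"], record["ofport"], external_ids.get("iface-id"))
        )
    return result
```

The iface-id it returns is the Neutron port ID, which can then be matched against the port_delete message from the first iteration.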

Should this be filed as a bug?

Revision history for this message
Assaf Muller (amuller) wrote :

Yes, this sounds like a bug with HA routers (And probably nothing to do with l2pop or with DVR).

Revision history for this message
Artur Korzeniewski (artur-korzeniewski) wrote :

The flow marked as deleted is a more general problem: it affects legacy, DVR and HA routers, for both the gateway port and interfaces in tenant networks.
I will file a bug with a description of how to reproduce it.

Revision history for this message
Artur Korzeniewski (artur-korzeniewski) wrote :
Revision history for this message
John Schwarz (jschwarz) wrote :

https://bugs.launchpad.net/neutron/+bug/1493788 reported for the "ovs agent restart does not recreate all flows" problem reported in comment #30.

John Schwarz (jschwarz)
description: updated
Revision history for this message
Kyle Mestery (mestery) wrote :

We're getting down to the wire with this bug. I'm leaving it targeted at RC1 now in the hope we can get something merged before then.

Kyle Mestery (mestery)
Changed in neutron:
milestone: liberty-rc1 → none
Revision history for this message
Artur Korzeniewski (artur-korzeniewski) wrote :

What is still missing to merge this in the Liberty cycle?

Revision history for this message
Adolfo Duarte (adolfo-duarte) wrote :

Reviews? Perhaps verification that all issues with the patch have been addressed.
Here is the etherpad following the current status of the patch:
https://etherpad.openstack.org/p/DVR_HA_Routers

Changed in neutron:
assignee: Adolfo Duarte (adolfo-duarte) → Assaf Muller (amuller)
Changed in neutron:
milestone: none → liberty-rc1
Kyle Mestery (mestery)
Changed in neutron:
milestone: liberty-rc1 → mitaka-1
tags: added: liberty-rc-potential
Changed in neutron:
assignee: Assaf Muller (amuller) → Adolfo Duarte (adolfo-duarte)
Changed in neutron:
assignee: Adolfo Duarte (adolfo-duarte) → Assaf Muller (amuller)
Changed in neutron:
assignee: Assaf Muller (amuller) → Artur Korzeniewski (artur-korzeniewski)
tags: removed: liberty-rc-potential
Revision history for this message
Artur Korzeniewski (artur-korzeniewski) wrote :

Is there any chance to merge DVR-HA in Liberty?

Revision history for this message
Assaf Muller (amuller) wrote :

@Artur: I don't see how that's possible.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by John Schwarz (<email address hidden>) on branch: master
Review: https://review.openstack.org/218669
Reason: This was fixed in https://review.openstack.org/#/c/227982/

Revision history for this message
John Schwarz (jschwarz) wrote : Re: Unable to create a router that's both HA and distributed

Changeset that was abandoned was already fixed by https://review.openstack.org/#/c/227982/

Changed in neutron:
assignee: Artur Korzeniewski (artur-korzeniewski) → John Schwarz (jschwarz)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/196893
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f63366e615e2d4e799fe8072b450bce96f179c9e
Submitter: Jenkins
Branch: master

commit f63366e615e2d4e799fe8072b450bce96f179c9e
Author: Michael Smith <email address hidden>
Date: Thu Dec 4 16:15:43 2014 -0800

    L3 Agent support for routers with HA and DVR

    The main difference for DVR HA routers is where
    the VRRP/keepalived logic is run and which ports
    fall in the HA domain for DVR. Instead of running
    in the qrouter namespace, keepalived will run inside
    the snat-namespace. Therefore only snat ports will
    fall under the control of the HA domain.

    Partial-Bug: #1365473

    Change-Id: If2962580397d39f72fd1fbbc1188a6958f00ff0c
    Co-Authored-By: Michael Smith <email address hidden>
    Co-Authored-By: Hardik Italia <email address hidden>
    Co-Authored-By: Adolfo Duarte <email address hidden>
    Co-Authored-By: John Schwarz <email address hidden>

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote : Re: Unable to create a router that's both HA and distributed

RFE-approved: we wanted this badly for a long time.

tags: added: rfe-approved
removed: rfe
Revision history for this message
Miguel Lavalle (minsel) wrote :
Changed in neutron:
assignee: John Schwarz (jschwarz) → Adolfo Duarte (adolfo-duarte)
Changed in neutron:
importance: High → Wishlist
Changed in neutron:
milestone: mitaka-1 → mitaka-2
Changed in neutron:
milestone: mitaka-2 → mitaka-3
Henry Gessau (gessau)
summary: - Unable to create a router that's both HA and distributed
+ [RFE] Unable to create a router that's both HA and distributed
Changed in neutron:
assignee: Adolfo Duarte (adolfo-duarte) → Brian Haley (brian-haley)
Changed in neutron:
assignee: Brian Haley (brian-haley) → Adolfo Duarte (adolfo-duarte)
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :
Changed in neutron:
assignee: Adolfo Duarte (adolfo-duarte) → Swaminathan Vasudevan (swaminathan-vasudevan)
Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Adolfo Duarte (adolfo-duarte)
Changed in neutron:
assignee: Adolfo Duarte (adolfo-duarte) → Oleg Bondarev (obondarev)
Changed in neutron:
assignee: Oleg Bondarev (obondarev) → Carl Baldwin (carl-baldwin)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/143169
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3f0c618cfd2957be9d76d2763903f0023b1ac737
Submitter: Jenkins
Branch: master

commit 3f0c618cfd2957be9d76d2763903f0023b1ac737
Author: rajeev <email address hidden>
Date: Tue Dec 9 11:51:22 2014 -0500

    HA for DVR - Neutron Server side code changes

    This patch adds HA support for DVR centralized default SNAT
    functionality to Neutron Server. For the agent side changes
    another patch has been merged.

    Salient changes here are:

     - Schedule/de-schedule SNAT on multiple agents
     - Enables
        'router-create <router name> --ha True --distributed True'

    Closes-bug: #1365473

    Co-Authored-By: Adolfo Duarte <email address hidden>
    Co-Authored-By: Hardik Italia <email address hidden>
    Co-Authored-By: John Schwarz <email address hidden>
    Co-Authored-By: Oleg Bondarev <email address hidden>
    Change-Id: I6a19481d0e19b8a55f32199a27057bf777548b33

Changed in neutron:
status: In Progress → Fix Released