Comment 11 for bug 1864822

James Denton (james-denton) wrote:

So, I have not seen this issue in production since implementing that small patch in https://bugs.launchpad.net/neutron/+bug/1864822/comments/4.

However, I can roughly simulate what happens when the connection to :6640 is lost, which we did experience in production and which Acoss69 referenced in the opening comment. This may help with developing a patch to the OVS agent so that it can recover from this condition.

Here is what we see. First, a normal set of flows on the provider bridge (br-ex or br-vlan; br-vlan in this example):

Every 1.0s: ovs-ofctl dump-flows br-vlan compute1: Tue Mar 3 07:42:42 2020

NXST_FLOW reply (xid=0x4):
 cookie=0xbe35f1e76f2f0e27, duration=468.374s, table=0, n_packets=0, n_bytes=0, idle_age=532, priority=2,in_port=1 actions=resubmit(,1)
 cookie=0xbe35f1e76f2f0e27, duration=469.071s, table=0, n_packets=0, n_bytes=0, idle_age=532, priority=0 actions=NORMAL
 cookie=0xbe35f1e76f2f0e27, duration=468.373s, table=0, n_packets=2, n_bytes=140, idle_age=184, priority=1 actions=resubmit(,3)
 cookie=0xbe35f1e76f2f0e27, duration=468.371s, table=1, n_packets=0, n_bytes=0, idle_age=532, priority=0 actions=resubmit(,2)
 cookie=0xbe35f1e76f2f0e27, duration=467.008s, table=2, n_packets=0, n_bytes=0, idle_age=532, priority=4,in_port=1,dl_vlan=1 actions=mod_vlan_vid:1111,NORMAL
 cookie=0xbe35f1e76f2f0e27, duration=468.370s, table=2, n_packets=0, n_bytes=0, idle_age=532, priority=2,in_port=1 actions=drop
 cookie=0xbe35f1e76f2f0e27, duration=468.339s, table=3, n_packets=0, n_bytes=0, idle_age=532, priority=2,dl_src=fa:16:3f:01:ad:70 actions=output:1
 cookie=0xbe35f1e76f2f0e27, duration=468.329s, table=3, n_packets=0, n_bytes=0, idle_age=532, priority=2,dl_src=fa:16:3f:15:73:1b actions=output:1
 cookie=0xbe35f1e76f2f0e27, duration=468.322s, table=3, n_packets=0, n_bytes=0, idle_age=532, priority=2,dl_src=fa:16:3f:49:67:3e actions=output:1
 cookie=0xbe35f1e76f2f0e27, duration=468.312s, table=3, n_packets=0, n_bytes=0, idle_age=532, priority=2,dl_src=fa:16:3f:b8:7d:b0 actions=output:1
 cookie=0xbe35f1e76f2f0e27, duration=468.368s, table=3, n_packets=2, n_bytes=140, idle_age=184, priority=1 actions=NORMAL

When we see "tcp:127.0.0.1:6640: send error: Broken pipe" in the neutron-openvswitch-agent.log file, it is followed by something like this:

...
2020-03-03 07:33:50.061 3705 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-51eb4348-565b-492b-8910-30c8bca078c5 - - - - -] Mapping physical network vlan to bridge br-vlan
2020-03-03 07:33:50.065 3705 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-51eb4348-565b-492b-8910-30c8bca078c5 - - - - -] Bridge br-vlan datapath-id = 0x000086ce24d0d14a
2020-03-03 07:33:50.153 3705 INFO neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_bridge [req-51eb4348-565b-492b-8910-30c8bca078c5 - - - - -] Bridge br-vlan has datapath-ID 000086ce24d0d14a
2020-03-03 07:33:50.271 3705 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_dvr_neutron_agent [req-51eb4348-565b-492b-8910-30c8bca078c5 - - - - -] L2 Agent operating in DVR Mode with MAC fa:16:3f:8e:8f:ed
2020-03-03 07:33:50.382 3705 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-51eb4348-565b-492b-8910-30c8bca078c5 - - - - -] Physical bridge br-vlan was just re-created.
2020-03-03 07:33:50.383 3705 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-51eb4348-565b-492b-8910-30c8bca078c5 - - - - -] Mapping physical network vlan to bridge br-vlan
2020-03-03 07:33:50.385 3705 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-51eb4348-565b-492b-8910-30c8bca078c5 - - - - -] Bridge br-vlan datapath-id = 0x000086ce24d0d14a
2020-03-03 07:33:50.463 3705 INFO neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_bridge [req-51eb4348-565b-492b-8910-30c8bca078c5 - - - - -] Bridge br-vlan has datapath-ID 000086ce24d0d14a
2020-03-03 07:33:50.581 3705 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-51eb4348-565b-492b-8910-30c8bca078c5 - - - - -] Agent out of sync with plugin!
...

Most importantly: "Physical bridge br-vlan was just re-created."

You will see the flows change: some have a new cookie, while others retain the old one (see the grep commands after the dump below for a quick way to tell them apart). The drop flow in table 0 causes traffic to be dropped:

Every 1.0s: ovs-ofctl dump-flows br-vlan compute1: Tue Mar 3 07:46:16 2020

NXST_FLOW reply (xid=0x4):
 cookie=0xfc7afcb358f7936e, duration=2.522s, table=0, n_packets=0, n_bytes=0, idle_age=2, priority=2,in_port=1 actions=drop
 cookie=0xbe35f1e76f2f0e27, duration=2.665s, table=0, n_packets=0, n_bytes=0, idle_age=2, priority=1 actions=resubmit(,3)
 cookie=0xfc7afcb358f7936e, duration=2.525s, table=0, n_packets=0, n_bytes=0, idle_age=2, priority=0 actions=NORMAL
 cookie=0xbe35f1e76f2f0e27, duration=2.664s, table=1, n_packets=0, n_bytes=0, idle_age=2, priority=0 actions=resubmit(,2)
 cookie=0xfc7afcb358f7936e, duration=2.451s, table=2, n_packets=0, n_bytes=0, idle_age=2, priority=4,in_port=1,dl_vlan=1 actions=mod_vlan_vid:1111,NORMAL
 cookie=0xbe35f1e76f2f0e27, duration=2.663s, table=2, n_packets=0, n_bytes=0, idle_age=2, priority=2,in_port=1 actions=drop
 cookie=0xbe35f1e76f2f0e27, duration=2.636s, table=3, n_packets=0, n_bytes=0, idle_age=2, priority=2,dl_src=fa:16:3f:01:ad:70 actions=output:1
 cookie=0xbe35f1e76f2f0e27, duration=2.626s, table=3, n_packets=0, n_bytes=0, idle_age=2, priority=2,dl_src=fa:16:3f:15:73:1b actions=output:1
 cookie=0xbe35f1e76f2f0e27, duration=2.621s, table=3, n_packets=0, n_bytes=0, idle_age=2, priority=2,dl_src=fa:16:3f:49:67:3e actions=output:1
 cookie=0xbe35f1e76f2f0e27, duration=2.613s, table=3, n_packets=0, n_bytes=0, idle_age=2, priority=2,dl_src=fa:16:3f:b8:7d:b0 actions=output:1
 cookie=0xbe35f1e76f2f0e27, duration=2.661s, table=3, n_packets=0, n_bytes=0, idle_age=2, priority=1 actions=NORMAL
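
For anyone comparing a similar dump, a quick way to separate the two generations of flows is to filter on the cookie values (the cookies below are the ones from this environment; yours will differ):

 ovs-ofctl dump-flows br-vlan | grep 0xfc7afcb358f7936e   # flows installed after the bridge was re-created
 ovs-ofctl dump-flows br-vlan | grep 0xbe35f1e76f2f0e27   # leftover flows still carrying the old cookie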

The agent does not appear to reinstall the proper flows unless it is restarted.
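
As a stopgap, recovery is exactly that: restart the agent so it resyncs and reprograms the bridge. Roughly (assuming the stock systemd unit name; adjust for your packaging):

 ovs-ofctl dump-flows br-vlan table=0             # confirm the priority=2,in_port=1 drop rule is present
 systemctl restart neutron-openvswitch-agent      # agent resyncs and reinstalls the correct flows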

The only way I have been able to simulate this behavior is by killing ovsdb-server or, better yet, restarting the openvswitch-switch service without subsequently restarting the neutron OVS agent. In the production environment I mentioned, the connection to :6640 was lost a couple of minutes after the neutron agent was restarted, which caused this 'drop' rule to be installed and left in place until the agent was restarted again. This behavior continued ad nauseam on all compute nodes in the environment until I patched the agent.
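
For reference, the reproduction boils down to restarting only the OVS side and leaving the agent running (unit/process names here assume the Ubuntu/Debian packaging):

 systemctl restart openvswitch-switch        # or kill the ovsdb-server process to break the :6640 connection
 # do NOT restart neutron-openvswitch-agent
 watch -n1 'ovs-ofctl dump-flows br-vlan'    # the priority=2,in_port=1 drop rule shows up shortly afterwards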

OVS Version: 2.11.0
Neutron Version: neutron-openvswitch-agent version 14.0.5.dev19