L2 Agent switches to non-DVR mode on first RPC failure

Bug #1364215 reported by Vivekanandan Narasimhan
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Brian Haley

Bug Description

The DVR-enabled L2 OVS Agent switches to operating in non-DVR mode if its first RPC call, get_dvr_mac_address_by_host(), fails during init_(). After that, the L2 Agent remains in non-DVR mode, removing the ability to run DVR on such nodes.

The fix for this bug is to make the DVR RPC calls to the controller on demand, only when the first local port on a DVR-routed subnet is detected as plumbed by the L2 OVS Agent on the node.
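A minimal Python sketch of the failure mode described above (class and helper names here are hypothetical and greatly simplified from the actual agent code): a single RPC attempt in the constructor, where one failure permanently disables DVR until the agent is restarted.

```python
class FakeRpc:
    """Stand-in for the plugin RPC client; fails a fixed number of times."""

    def __init__(self, failures):
        self.failures = failures

    def get_dvr_mac_address_by_host(self, host):
        if self.failures > 0:
            self.failures -= 1
            raise TimeoutError("Timed out waiting for RPC response")
        return "fa:16:3f:00:00:01"


class AgentBeforeFix:
    """Sketch of the pre-fix behavior: one-shot RPC in init."""

    def __init__(self, rpc, host):
        self.enable_distributed_routing = True
        try:
            # Single attempt at startup; there is no later retry path.
            self.dvr_mac = rpc.get_dvr_mac_address_by_host(host)
        except TimeoutError:
            # One failure is enough to stick in non-DVR mode permanently.
            self.enable_distributed_routing = False
```

If the controller happens to be unreachable for even that one call, the agent keeps running but silently loses DVR capability, which is the behavior this bug reports.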

tags: added: l3-dvr-backlog
Changed in neutron:
importance: Undecided → High
Revision history for this message
Vivekanandan Narasimhan (vivekanandan-narasimhan) wrote :

Carl,

Can you please let us know why this is classified as 'High'?

A restart of the L2 Agent will bring it back to DVR mode (after the controller is back up).

Changed in neutron:
assignee: nobody → Vivekanandan Narasimhan (vivekanandan-narasimhan)
Changed in neutron:
status: New → In Progress
Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

I classified it as High because I didn't see a very friendly way to diagnose this problem in order to know to apply the L2 agent restart that fixes it. My impression is that this would manifest itself externally as a general failure of DVR, without much indication of where to start debugging.

If you think this would be obvious to deployers then I could downgrade the importance of this bug.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Vivek, are you actively working on a fix?

Changed in neutron:
status: In Progress → Confirmed
Revision history for this message
Vivekanandan Narasimhan (vivekanandan-narasimhan) wrote :

Yes, I have already made some changes.

I have moved the DVR RPCs so they are now invoked from the rpc_loop of the neutron agent instead of the agent's init_() constructor.

Some test cases are affected, so I am fixing them. I will post a patch for review next week.

Revision history for this message
Vivekanandan Narasimhan (vivekanandan-narasimhan) wrote :

To Carl's question:

The L2 Agent's report-state information carries the mode it operates in via the enable_distributed_routing flag.

When the first RPC failure happens, from that point on every report-state carries enable_distributed_routing=False in the payload. So just by running neutron agent-show <openvswitch-agent-id of the affected node>, the admin can recognize that the agent is operating in non-DVR mode.

So we do not silently move into running in non-DVR mode; it is easily diagnosable.

I therefore request that the severity be reduced.

Revision history for this message
Brian Haley (brian-haley) wrote :

I do see this failure right after an RPC error:

2014-09-23 13:31:58.292 10102 DEBUG oslo.messaging._drivers.impl_rabbit [req-67030e45-807c-46ea-8553-42b8b278d1f5 ] Timed out waiting for RPC response: timed out _error_callback /opt/stack/venvs/openstack/local/lib/python2.7/site-packages/oslo/messaging/_drivers/impl_rabbit.py:721
2014-09-23 13:31:58.293 10102 ERROR neutron.plugins.openvswitch.agent.ovs_dvr_neutron_agent [req-67030e45-807c-46ea-8553-42b8b278d1f5 None] DVR: Failed to obtain local DVR Mac address

But strangely I still see the distributed routing flag as True in the report messages:

2014-09-23 13:34:25.842 10102 DEBUG neutron.common.rpc [-] neutron.agent.rpc.PluginReportStateAPI method cast called with arguments (<neutron.context.ContextBase object at 0x7f7fa0299d50>, {'args': {'agent_state': {'agent_state': {'binary': 'neutron-openvswitch-agent', 'topic': 'N/A', 'host': 'overcloud-controllermgmt0-nszfnfjwm4fy', 'agent_type': 'Open vSwitch agent', 'configurations': {'arp_responder_enabled': False, 'tunneling_ip': '192.0.2.29', 'devices': 0, 'l2_population': True, 'tunnel_types': ['vxlan'], 'enable_distributed_routing': True, 'bridge_mappings': {}}}}, 'time': '2014-09-23T13:34:25.842207'}, 'namespace': None, 'method': 'report_state'}) {} wrapper /opt/stack/venvs/openstack/local/lib/python2.7/site-packages/neutron/common/log.py:35

It is probably still the same error, since a restart of the openvswitch agent fixes the problem.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/123911

Changed in neutron:
assignee: Vivekanandan Narasimhan (vivekanandan-narasimhan) → Brian Haley (brian-haley)
status: Confirmed → In Progress
Changed in neutron:
assignee: Brian Haley (brian-haley) → Vivekanandan Narasimhan (vivekanandan-narasimhan)
Changed in neutron:
assignee: Vivekanandan Narasimhan (vivekanandan-narasimhan) → Brian Haley (brian-haley)
Changed in neutron:
assignee: Brian Haley (brian-haley) → Carl Baldwin (carl-baldwin)
Changed in neutron:
assignee: Carl Baldwin (carl-baldwin) → Brian Haley (brian-haley)
Changed in neutron:
assignee: Brian Haley (brian-haley) → Carl Baldwin (carl-baldwin)
Changed in neutron:
assignee: Carl Baldwin (carl-baldwin) → Brian Haley (brian-haley)
Changed in neutron:
assignee: Brian Haley (brian-haley) → Armando Migliaccio (armando-migliaccio)
Changed in neutron:
assignee: Armando Migliaccio (armando-migliaccio) → Brian Haley (brian-haley)
Revision history for this message
Kyle Mestery (mestery) wrote :

This seems like a reasonable backport candidate for Juno once the fix lands in Kilo.

Changed in neutron:
milestone: none → kilo-1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/123911
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=51303b5fe4785d0cda76f095c95eb4d746d7d783
Submitter: Jenkins
Branch: master

commit 51303b5fe4785d0cda76f095c95eb4d746d7d783
Author: Brian Haley <email address hidden>
Date: Wed Sep 24 21:45:06 2014 -0400

    Make L2 DVR Agent start successfully without an active neutron server

    If the L2 Agent is started before the neutron controller
    is available, it will fail to obtain its unique DVR MAC
    address, and fall-back to operate in non-DVR mode
    permanently.

    This fix does two things:
    1. Makes the L2 Agent attempt to retry obtaining a DVR MAC
    address up to five times on initialization, which should be
    enough time for RPC to be successful. On failure, it will
    fall back to non-DVR mode, ensuring that basic switching
    continues to be functional.

    2. Correctly obtains the current operating mode of the
    L2 Agent in _report_state(), instead of only reporting
    the configured state. This operating mode is carried
    in 'in_distributed_mode' attribute of agent state, and
    is separate from the existing enable_distributed_routing
    static config that is already sent.

    Change-Id: I5fd9bf4163eafa321c5fca7ffb7901ae289f323b
    Closes-bug: #1364215
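The two parts of the fix can be sketched as follows (again with hypothetical, simplified names; the real patch lives in the agent's rpc_loop and _report_state): a bounded retry of the DVR MAC RPC, and a report_state payload that distinguishes the static config from the actual operating mode.

```python
import time

MAX_RETRIES = 5  # matches the "up to five times" in the commit message


class FakeRpc:
    """Stand-in for the plugin RPC client; fails a fixed number of times."""

    def __init__(self, failures):
        self.failures = failures

    def get_dvr_mac_address_by_host(self, host):
        if self.failures > 0:
            self.failures -= 1
            raise TimeoutError("Timed out waiting for RPC response")
        return "fa:16:3f:00:00:01"


class AgentAfterFix:
    """Sketch of the post-fix behavior: retry, then a graceful fallback."""

    def __init__(self, rpc, host, retry_delay=0):
        self.enable_distributed_routing = True  # static config, unchanged
        self.in_distributed_mode = False        # actual operating mode
        for _ in range(MAX_RETRIES):
            try:
                self.dvr_mac = rpc.get_dvr_mac_address_by_host(host)
                self.in_distributed_mode = True
                break
            except TimeoutError:
                time.sleep(retry_delay)
        # If all retries failed, basic (non-DVR) switching still works.

    def report_state(self):
        # Report the operating mode separately from the configured state,
        # so an admin can see a fallback happened without reading logs.
        return {
            "enable_distributed_routing": self.enable_distributed_routing,
            "in_distributed_mode": self.in_distributed_mode,
        }
```

Note how a transient controller outage (fewer failures than MAX_RETRIES) no longer disables DVR, and a permanent one leaves enable_distributed_routing=True in the report while in_distributed_mode=False flags the fallback.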

Changed in neutron:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/163349

Thierry Carrez (ttx)
Changed in neutron:
milestone: kilo-1 → 2015.1.0
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/juno)

Change abandoned by Kyle Mestery (<email address hidden>) on branch: stable/juno
Review: https://review.openstack.org/163349
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.
