agent instances are not rescheduled when one agent goes offline

Bug #1174591 reported by Robert Collins
This bug affects 6 people
Affects   Status         Importance   Assigned to     Milestone
neutron   Fix Released   Medium       Kevin Benton
tripleo   Invalid        Medium       Unassigned

Bug Description

When an agent fails/goes offline for any reason, the services it was providing are unavailable until either the agent comes back or the work gets scheduled onto another agent - which is currently a manual task.

This is related to bug 1174132, which covers the impact on users when an agent fails and there is no agent running for a period of time.

Ideally, Quantum would ensure that rescheduling happens automatically, making the addition and removal of agents nearly configuration-free.

Tags: l3-ipam-dhcp
description: updated
Changed in tripleo:
status: New → Triaged
importance: Undecided → Medium
Gary Kotton (garyk)
Changed in quantum:
status: New → Confirmed
tags: added: l3-ipam-dhcp
Revision history for this message
Mark McClain (markmcclain) wrote :

Automatic scheduling is something that is planned.

Changed in neutron:
assignee: nobody → Kevin Benton (kevinbenton)
status: Confirmed → In Progress
Revision history for this message
Kevin Benton (kevinbenton) wrote :

There is a patch under discussion here that addresses the problem for L3 routers.
https://review.openstack.org/#/c/110893/

Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

I know the approach you're trying to implement, and I know it's used in a few companies via the AT&T scripts.

I once started writing the same implementation you're proposing, but I stopped after talking to our HA experts; they gave me very good reasons. Imagine this scenario/failure mode:

1) An agent crashes and therefore stops answering heartbeats. Imagine that it also has some kind of resource exhaustion that won't let it restart automatically, even if you configured that...

2) The Neutron server (or the AT&T scripts) detects that and moves the vrouters to a different host.

3) The new L3 agent starts the vrouters. Good...

But now we have the same IP addresses and vrouters *working* on both nodes: the old one, which is "headless" but still has all the qrouter-* namespaces, and the new one, which actually has an agent.

This causes IP conflicts in the network, duplicate packets, etc.

We solved this at Red Hat (for the time being) using Pacemaker in an active/passive configuration with two nodes (but you could do several pairs), and adding two things to the setup (rough command sketch below):
 1) neutron-netns-cleanup --force and neutron-ovs-cleanup, when we manually migrate an agent from the active (A) to the passive (P) host

 2) fencing (that means using the IPMI API or LOM to reboot the failed host) via Pacemaker
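
For illustration only, the manual cleanup step in 1) might look roughly like this; the config-file paths are assumptions and vary by distribution:

    # On the node being taken out of service (sketch; paths are assumptions):
    neutron-netns-cleanup --force \
        --config-file /etc/neutron/neutron.conf \
        --config-file /etc/neutron/l3_agent.ini
    neutron-ovs-cleanup --config-file /etc/neutron/neutron.conf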

We use the agent "host=" parameter, which sets a logical identifier towards neutron-server:

node1: we set host=neutron-network-node
node2: we set host=neutron-network-node

Also, you need to set the exact same host= logical ID in the openvswitch plugin.ini file on the host, otherwise they won't play well together.
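
In config-file terms, a minimal sketch (file paths and the [DEFAULT] placement are assumptions; adjust to your deployment):

    # /etc/neutron/neutron.conf on BOTH node1 and node2
    [DEFAULT]
    host = neutron-network-node

    # the openvswitch plugin.ini on BOTH nodes must carry the same identifier
    [DEFAULT]
    host = neutron-network-node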

More details here:
https://github.com/fabbione/rhos-ha-deploy/blob/master/rhos5-rhel7/mrgcloud-setup/RHOS-RHEL-HA-how-to-mrgcloud-rhos5-on-rhel7-neutron-n-latest.txt

The solution is not perfect; for example, if you have lots of resources on the network node, it will take some time to rebuild them on the new host, but that problem also exists with automatic router migration.

Revision history for this message
Sudhakar Gariganti (sudhakar-gariganti) wrote :

I share the views Miguel posted above. I was also trying to address this issue in the same way and was warned of the negative impacts that could arise from L3 rescheduling based on agent status.

L3 VRRP Blueprint ( https://blueprints.launchpad.net/openstack/?searchtext=l3-high-availability ) will help address this a bit for the L3 agent case.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

That failure scenario isn't really a problem in certain deployments. The reason I say that is that, depending on the network backend, it doesn't matter that the old L3 agent is still trying to use the router IP addresses, because the network knows where the addresses actually belong based on the Neutron port bindings. After rescheduling, the network hardware can just block the bad packets from the old L3 agent.

Even if you ignore the solution above because it doesn't apply to the reference implementation, this should be a tradeoff that deployers can decide to make. The Pacemaker solution is very difficult to deploy because, as you mentioned, it requires the STONITH approach via out-of-band management in order to be immune to the same headless problems. So I think at minimum we should offer auto rescheduling as a configurable, naive approach with the well-documented assumption of shared fate between the L3 agent and the underlying hardware.

Out of curiosity, have you run into a scenario where both nodes think the other is dead and simultaneously trigger an IPMI reboot of the other? Out-of-band power management APIs are usually slow enough to leave a narrow window where two nodes could successfully shoot each other in the head, so to speak. :-)

Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

Ahh, understood Kevin, yes, if the network backend has control over that,
then it shouldn't be a problem.

But it is a problem if this feature is used blindly, so I think it would be OK if we provide a setting to enable/disable it, disabled by default, so the deployer can enable it if they know what they're doing in combination with certain backends.

And yes, we have run into that scenario during testing. It happened to us when a network node ran out of memory (lots and lots of networks) and neutron-l3-agent was killed by the OOM killer; all the underlying qrouters kept working, but neutron-l3-agent stopped replying to heartbeats and the routers got moved elsewhere.

With Pacemaker, in that specific situation, neutron-netns-cleanup --force and neutron-ovs-cleanup would be started, but there is a chance they have no memory/resources either.

As a last resort, the IPMI reboot will happen, and yes, you're right that it's slow, but it will only fail for ~1 minute or so.

Revision history for this message
Sudhakar Gariganti (sudhakar-gariganti) wrote :

Folks,

See if this makes sense.
https://docs.google.com/document/d/1SJ-Rq2Q2fV1dWPj4xiO5aO5a-PccyJOSypC80EsmWpw/

We had implemented the above (I can put up the code for review if there is interest), but, realizing the problems that could arise at scale, we did not proceed further to submit the blueprint.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Sudhakar,

That design doc is very close to what my proposed patch does. The main differences are that mine is limited to the L3 scheduler and that it doesn't set the admin state to down, because routers already will not be scheduled to agents with dead heartbeats.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Thanks Miguel, I will adjust the behavior to require a configured boolean that defaults to False.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/110893
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9677cf87cb831cfd344bc63cbee23be52392c300
Submitter: Jenkins
Branch: master

commit 9677cf87cb831cfd344bc63cbee23be52392c300
Author: Kevin Benton <email address hidden>
Date: Wed Jul 30 15:49:59 2014 -0700

    Option to remove routers from dead l3 agents

    Add a configuration-enabled periodic check to examine the
    status of all L3 agents with routers scheduled to them and
    admin_state_up set to True. If the agent is dead, the router
    will be rescheduled to an alive agent.

    Neutron considers an agent 'dead' when the server doesn't
    receive any heartbeat messages from the agent over the
    RPC channel within a given number of seconds (agent_down_time).
    There are various false positive scenarios where the agent may
    fail to report even though the node is still forwarding traffic.

    This is configuration driven because a dead L3 agent with active
    namespaces forwarding traffic and responding to ARP requests may
    cause issues. If the network backend does not block the dead
    agent's node from using the router's IP addresses, there will be
    a conflict between the old and new namespace.

    This conflict should not break east-west traffic because both
    namespaces will be attached to the appropriate networks and
    either can forward the traffic without state. However, traffic
    being overloaded onto the router's external network interface
    IP in north-south traffic will be impacted because the matching
    translation for port address translation will only be present
    on one router. Additionally, floating IPs associated to ports
    after the rescheduling will not work traversing the old
    namespace because the mapping will not be present.

    DocImpact

    Partial-Bug: #1174591
    Change-Id: Id7d487f54ca54fdd46b7616c0969319afc0bb589
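
For reference, the behavior above is driven from neutron.conf; a minimal sketch, assuming the option introduced by this change is allow_automatic_l3agent_failover (default False) and reusing the existing agent_down_time setting:

    [DEFAULT]
    # Reschedule routers away from L3 agents whose heartbeats have stopped.
    # Off by default because of the duplicate-namespace caveat described above.
    allow_automatic_l3agent_failover = True
    # Seconds without a heartbeat before the server considers an agent dead.
    agent_down_time = 75

Stripped to its essence, the periodic check amounts to something like the following Python; this is an illustrative sketch only, and the plugin helper names are hypothetical stand-ins rather than Neutron's actual API:

    def reschedule_routers_from_down_agents(plugin, context):
        # Look at every L3 agent that still has routers scheduled to it.
        for agent in plugin.get_l3_agents_with_routers(context):
            # Skip agents an operator disabled on purpose, and live agents.
            if not agent.admin_state_up or agent.is_alive:
                continue
            for router_id in plugin.list_router_ids_on_agent(context, agent.id):
                # Unbind from the dead agent and let the scheduler pick a live one.
                plugin.remove_router_from_agent(context, agent.id, router_id)
                plugin.schedule_router(context, router_id)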

Revision history for this message
haruka tanizawa (h-tanizawa) wrote :

Is any other patch needed?

Changed in neutron:
importance: Undecided → Medium
Changed in neutron:
assignee: Kevin Benton (kevinbenton) → Eugene Nikanorov (enikanorov)
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

We should probably split this bug into L3 (done way back in Juno) + DHCP (still pending), because this is confusing.

Changed in neutron:
assignee: Eugene Nikanorov (enikanorov) → Kevin Benton (kevinbenton)
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Either way, some feedback would be good.

Revision history for this message
Assaf Muller (amuller) wrote :

DHCP rescheduling is handled by:
https://review.openstack.org/#/c/131150/

Changed in neutron:
status: In Progress → Fix Committed
Changed in neutron:
milestone: none → liberty-2
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: liberty-2 → 7.0.0
Revision history for this message
Brent Eagles (beagles) wrote :

I've marked this as invalid for tripleo, as it isn't clear from the description how it is specific to tripleo, and the neutron aspect has been resolved.

Changed in tripleo:
status: Triaged → Invalid