multinode rabbitmq failing upgrades

Bug #1930293 reported by Radosław Piliszek
Affects        Status         Importance  Assigned to         Milestone
kolla-ansible  Fix Released   High        Radosław Piliszek
Train          Fix Committed  High        Radosław Piliszek
Ussuri         Fix Committed  High        Radosław Piliszek
Victoria       Fix Committed  High        Radosław Piliszek
Wallaby        Fix Committed  High        Radosław Piliszek
Xena           Fix Released   High        Radosław Piliszek

Bug Description

A multinode RabbitMQ upgrade may fail depending on the order in which the nodes are stopped and started.
This order is effectively random, so a run can fail intermittently.

Example failure:

ara summary (note that 'secondary1' is stopped last, yet 'secondary2' is the first to be started):

Stopping all rabbitmq instances but the first node secondary1 kolla_docker 0:02:38 0:00:00 SKIPPED
Stopping all rabbitmq instances but the first node secondary2 kolla_docker 0:02:38 0:00:07 CHANGED
Stopping all rabbitmq instances but the first node primary kolla_docker 0:02:38 0:00:09 CHANGED
Stopping rabbitmq on the first node secondary2 kolla_docker 0:02:48 0:00:00 SKIPPED
Stopping rabbitmq on the first node primary kolla_docker 0:02:48 0:00:00 SKIPPED
Stopping rabbitmq on the first node secondary1 kolla_docker 0:02:48 0:00:17 CHANGED
Restart rabbitmq container secondary2 include_tasks 0:03:06 0:00:00 OK
Restart rabbitmq container secondary1 include_tasks 0:03:06 0:00:00 OK
Restart rabbitmq container primary include_tasks 0:03:06 0:00:00 OK
Restart rabbitmq container secondary2 kolla_docker 0:03:06 0:00:01 CHANGED
Waiting for rabbitmq to start secondary2 command 0:03:07 0:10:06 FAILED
Restart rabbitmq container secondary1 kolla_docker 0:13:14 0:00:01 CHANGED
Waiting for rabbitmq to start secondary1 command 0:13:15 0:00:05 CHANGED
Restart rabbitmq container primary kolla_docker 0:13:21 0:00:01 CHANGED
Waiting for rabbitmq to start primary command 0:13:23 0:00:07 CHANGED

docker logs for the failing rabbitmq (showing that the start order is the actual problem):

2021-05-31T13:48:33.608436819Z BOOT FAILED
2021-05-31T13:48:33.608444389Z ===========
2021-05-31T13:48:33.608571562Z Timeout contacting cluster nodes: [rabbit@primary,rabbit@secondary1].
2021-05-31T13:48:33.608687375Z
2021-05-31T13:48:33.608727786Z BACKGROUND
2021-05-31T13:48:33.608930872Z ==========
2021-05-31T13:48:33.608990003Z
2021-05-31T13:48:33.609201178Z This cluster node was shut down while other nodes were still running.
2021-05-31T13:48:33.609556107Z To avoid losing data, you should start the other nodes first, then
2021-05-31T13:48:33.609564438Z start this one. To force this node to start, first invoke
2021-05-31T13:48:33.609612299Z "rabbitmqctl force_boot". If you do so, any changes made on other
2021-05-31T13:48:33.609766853Z cluster nodes after this one was shut down may be lost.
2021-05-31T13:48:33.609805674Z
2021-05-31T13:48:33.609895306Z DIAGNOSTICS
2021-05-31T13:48:33.609953178Z ===========
2021-05-31T13:48:33.609981468Z
2021-05-31T13:48:33.610106611Z attempted to contact: [rabbit@primary,rabbit@secondary1]
2021-05-31T13:48:33.610173433Z
2021-05-31T13:48:33.610252235Z rabbit@primary:
2021-05-31T13:48:33.610450790Z * unable to connect to epmd (port 4369) on primary: address (cannot connect to host/port)
2021-05-31T13:48:33.610635545Z
2021-05-31T13:48:33.610760428Z rabbit@secondary1:
2021-05-31T13:48:33.610963233Z * unable to connect to epmd (port 4369) on secondary1: address (cannot connect to host/port)
2021-05-31T13:48:33.611150918Z
2021-05-31T13:48:33.611189209Z
2021-05-31T13:48:33.611298392Z Current node details:
2021-05-31T13:48:33.611434945Z * node name: rabbit@secondary2
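
As the boot log notes, a node stranded in this state can be forced to start with "rabbitmqctl force_boot", at the risk of losing changes made on the other cluster nodes after it was shut down. A minimal, hypothetical recovery sketch as an Ansible play (the target host, the container name "rabbitmq", and the assumption that the container is still up are illustrative, based on a typical kolla-ansible deployment; this is not part of the fix):

    # Hypothetical manual recovery, not part of the kolla-ansible fix.
    # WARNING: as the boot log states, changes made on the other cluster
    # nodes after this node was shut down may be lost.
    - name: Force boot a stranded RabbitMQ node
      hosts: secondary2
      gather_facts: false
      become: true
      tasks:
        - name: Mark the node for forced boot inside the rabbitmq container
          command: docker exec rabbitmq rabbitmqctl force_boot

        - name: Restart the container so the node boots without waiting for its peers
          command: docker restart rabbitmq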

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Note that, depending on the order, one or two of the rabbitmq nodes may fail in the most typical (and recommended) 3-node scenario.

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

It seems the issue with upgrades was introduced by https://review.opendev.org/c/openstack/kolla-ansible/+/763137, which fixed https://launchpad.net/bugs/1904702.
I think I mistakenly merged some other, non-upgrade failures into this report. I will narrow this bug down to the upgrade issue.

summary: - multinode rabbitmq unstable kolla ansible actions
+ multinode rabbitmq failing upgrades
description: updated
Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

At first sight, the code looks just fine. We use the order of the same group in both places (stop and start), yet it seems Ansible does not preserve the group order when scheduling the handler tasks.
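
For illustration, a simplified sketch of the kind of handler pattern in question (task and file names are assumptions, not the actual kolla-ansible code). The restart task file is included once per host of the rabbitmq group via an include_tasks loop; the ordering relies on the handler's host order matching the group order, which Ansible does not guarantee:

    # Simplified illustration with assumed names, not the actual kolla-ansible
    # code: the handler includes the restart tasks once per host in the
    # rabbitmq group. The intended order is the group order (first node first),
    # but the host order seen while handlers run may differ, so another node
    # can end up being restarted first.
    - name: Restart rabbitmq container
      include_tasks: restart_services.yml
      when: inventory_hostname == item
      loop: "{{ groups['rabbitmq'] }}"
      listen: Restart rabbitmq container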

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :
Changed in kolla-ansible:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (master)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/795091
Committed: https://opendev.org/openstack/kolla-ansible/commit/0cd5b027c985a8f6d3368ae0dc08b65f67f67fe0
Submitter: "Zuul (22348)"
Branch: master

commit 0cd5b027c985a8f6d3368ae0dc08b65f67f67fe0
Author: Mark Goddard <email address hidden>
Date: Mon Jun 7 12:56:36 2021 +0100

    Fix RabbitMQ restart ordering

    The host list order seen during Ansible handlers may differ to the usual
    play host list order, due to race conditions in notifying handlers. This
    means that restart_services.yml for RabbitMQ may be included in a
    different order than the rabbitmq group, resulting in a node other than
    the 'first' being restarted first. This can cause some nodes to fail to
    join the cluster. The include_tasks loop was introduced in [1].

    This change fixes the issue by splitting the handler into two tasks, and
    restarting the first node before all others.

    [1] https://review.opendev.org/c/openstack/kolla-ansible/+/763137

    Change-Id: I1823301d5889589bfd48326ed7de03c6061ea5ba
    Closes-Bug: #1930293
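
A minimal sketch of the two-handler approach described in the commit message above (handler names, conditions, and the group name are assumptions, not necessarily the exact merged code). Because both handlers listen to the same notification and run one after the other, the first node of the group is restarted, and its wait-for-start check completes, before any remaining node is restarted:

    # Sketch of the fix described above (assumed names, not necessarily the
    # exact merged code): splitting the restart into two handlers serializes
    # the first node's restart ahead of all others.
    - name: Restart first rabbitmq container
      include_tasks: restart_services.yml
      when: inventory_hostname == groups['rabbitmq'] | first
      listen: Restart rabbitmq container

    - name: Restart remaining rabbitmq containers
      include_tasks: restart_services.yml
      when: inventory_hostname != groups['rabbitmq'] | first
      listen: Restart rabbitmq container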

Changed in kolla-ansible:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/795284

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/795285

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/795286

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/795287

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/795286
Committed: https://opendev.org/openstack/kolla-ansible/commit/3ffcf4636f9d9813ca826eea98ae0445fbe7d970
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit 3ffcf4636f9d9813ca826eea98ae0445fbe7d970
Author: Mark Goddard <email address hidden>
Date: Mon Jun 7 12:56:36 2021 +0100

    Fix RabbitMQ restart ordering

    The host list order seen during Ansible handlers may differ to the usual
    play host list order, due to race conditions in notifying handlers. This
    means that restart_services.yml for RabbitMQ may be included in a
    different order than the rabbitmq group, resulting in a node other than
    the 'first' being restarted first. This can cause some nodes to fail to
    join the cluster. The include_tasks loop was introduced in [1].

    This change fixes the issue by splitting the handler into two tasks, and
    restarting the first node before all others.

    [1] https://review.opendev.org/c/openstack/kolla-ansible/+/763137

    Change-Id: I1823301d5889589bfd48326ed7de03c6061ea5ba
    Closes-Bug: #1930293
    (cherry picked from commit 0cd5b027c985a8f6d3368ae0dc08b65f67f67fe0)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/795285
Committed: https://opendev.org/openstack/kolla-ansible/commit/6387e431f20a26ee89bac26b6255925224df9757
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit 6387e431f20a26ee89bac26b6255925224df9757
Author: Mark Goddard <email address hidden>
Date: Mon Jun 7 12:56:36 2021 +0100

    Fix RabbitMQ restart ordering

    The host list order seen during Ansible handlers may differ to the usual
    play host list order, due to race conditions in notifying handlers. This
    means that restart_services.yml for RabbitMQ may be included in a
    different order than the rabbitmq group, resulting in a node other than
    the 'first' being restarted first. This can cause some nodes to fail to
    join the cluster. The include_tasks loop was introduced in [1].

    This change fixes the issue by splitting the handler into two tasks, and
    restarting the first node before all others.

    [1] https://review.opendev.org/c/openstack/kolla-ansible/+/763137

    Change-Id: I1823301d5889589bfd48326ed7de03c6061ea5ba
    Closes-Bug: #1930293
    (cherry picked from commit 0cd5b027c985a8f6d3368ae0dc08b65f67f67fe0)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/795284
Committed: https://opendev.org/openstack/kolla-ansible/commit/f11af96cc62106cb9ab95e1687a693fa2a6a6df7
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit f11af96cc62106cb9ab95e1687a693fa2a6a6df7
Author: Mark Goddard <email address hidden>
Date: Mon Jun 7 12:56:36 2021 +0100

    Fix RabbitMQ restart ordering

    The host list order seen during Ansible handlers may differ to the usual
    play host list order, due to race conditions in notifying handlers. This
    means that restart_services.yml for RabbitMQ may be included in a
    different order than the rabbitmq group, resulting in a node other than
    the 'first' being restarted first. This can cause some nodes to fail to
    join the cluster. The include_tasks loop was introduced in [1].

    This change fixes the issue by splitting the handler into two tasks, and
    restarting the first node before all others.

    [1] https://review.opendev.org/c/openstack/kolla-ansible/+/763137

    Change-Id: I1823301d5889589bfd48326ed7de03c6061ea5ba
    Closes-Bug: #1930293
    (cherry picked from commit 0cd5b027c985a8f6d3368ae0dc08b65f67f67fe0)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/train)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/795287
Committed: https://opendev.org/openstack/kolla-ansible/commit/ebaa5bb8a16249b487448be1b7712a47d11396c3
Submitter: "Zuul (22348)"
Branch: stable/train

commit ebaa5bb8a16249b487448be1b7712a47d11396c3
Author: Mark Goddard <email address hidden>
Date: Mon Jun 7 12:56:36 2021 +0100

    Fix RabbitMQ restart ordering

    The host list order seen during Ansible handlers may differ to the usual
    play host list order, due to race conditions in notifying handlers. This
    means that restart_services.yml for RabbitMQ may be included in a
    different order than the rabbitmq group, resulting in a node other than
    the 'first' being restarted first. This can cause some nodes to fail to
    join the cluster. The include_tasks loop was introduced in [1].

    This change fixes the issue by splitting the handler into two tasks, and
    restarting the first node before all others.

    [1] https://review.opendev.org/c/openstack/kolla-ansible/+/763137

    Change-Id: I1823301d5889589bfd48326ed7de03c6061ea5ba
    Closes-Bug: #1930293
    (cherry picked from commit 0cd5b027c985a8f6d3368ae0dc08b65f67f67fe0)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 9.3.2

This issue was fixed in the openstack/kolla-ansible 9.3.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 12.0.0.0rc2

This issue was fixed in the openstack/kolla-ansible 12.0.0.0rc2 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 11.1.0

This issue was fixed in the openstack/kolla-ansible 11.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 10.3.0

This issue was fixed in the openstack/kolla-ansible 10.3.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 13.0.0.0rc1

This issue was fixed in the openstack/kolla-ansible 13.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on kolla-ansible (master)

Change abandoned by "Radosław Piliszek <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/794026
Reason: not pursuing
