[train][c8] RabbitMQ loses all queues after network disruption

Bug #1884034 reported by Stig Telfer
This bug affects 3 people
Affects                Status        Importance  Assigned to        Milestone
kolla                  Invalid       Undecided   Unassigned
kolla (Train)          Fix Released  High        Radosław Piliszek
kolla-ansible          Invalid       Undecided   Unassigned
kolla-ansible (Train)  Fix Released  High        Radosław Piliszek

Bug Description

After a control plane reconfiguration I started to see errors in OpenStack service logs, for example from cinder-volume:

2020-06-17 11:19:39.766 37 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to consume message from queue: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'cinder-volume.rbd:volumes@ceph-rbd' in vhost '/' due to timeout: amqp.exceptions.NotFound: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'cinder-volume.rbd:volumes@ceph-rbd' in vhost '/' due to timeout

On inspection of the RabbitMQ cluster state, the cluster appeared healthy but had no queues registered.

RabbitMQ cluster_status:

Cluster status of node rabbit@controller-3 ...
[{nodes,[{disc,['rabbit@controller-1','rabbit@controller-2',
                'rabbit@controller-3']}]},
 {running_nodes,['rabbit@controller-2','rabbit@controller-1',
                 'rabbit@controller-3']},
 {cluster_name,<<"rabbit@controller-2">>},
 {partitions,[]},
 {alarms,[{'rabbit@controller-2',[]},
          {'rabbit@controller-1',[]},
          {'rabbit@controller-3',[]}]}]

RabbitMQ list_queues:

Timeout: 60.0 seconds ...
Listing queues for vhost / ...
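
For reference, both outputs above can be gathered from inside the kolla RabbitMQ container; a sketch of the commands, assuming the kolla default container name 'rabbitmq':

# Cluster membership, running nodes, partitions and alarms
docker exec rabbitmq rabbitmqctl cluster_status
# Queue names and message counts in the default vhost
docker exec rabbitmq rabbitmqctl list_queues -p / name messages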

The likely trigger is a disruptive network reconfiguration that caused short periods of disruption to all controllers simultaneously.

On network disruption the pause-minority fallback is invoked:

2020-06-16 12:18:17.171 [warning] <0.31346.26> Cluster minority/secondary status detected - awaiting recovery
2020-06-16 12:18:17.171 [info] <0.7102.27> RabbitMQ is asked to stop...
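
The "Cluster minority/secondary status" message corresponds to RabbitMQ's pause_minority partition handling; a minimal sketch of the relevant option in new-style rabbitmq.conf (the exact file and format used by kolla-ansible may differ):

# cluster_partition_handling decides what a node does when it finds itself
# on the minority side of a partition; pause_minority stops it, which matches
# the "RabbitMQ is asked to stop" line above.
cluster_partition_handling = pause_minority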

Shutdown results in error messages of this form:

2020-06-16 12:18:17.279 [info] <0.31433.26> stopped TCP listener on 192.168.4.104:5672
2020-06-16 12:18:17.280 [error] <0.2722.27> Error on AMQP connection <0.2722.27> (192.168.4.102:44172 -> 192.168.4.104:5672 - neutron-metadata-agent:7:e689633c-1b01-4b6f-9869-d0e3fbd856bc, vhost: '/', user: 'openstack', state: running), channel 0:
 operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"

After a stream of errors of that form, it appears things escalate:

2020-06-16 12:18:18.135 [warning] lager_file_backend dropped 861 messages in the last second that exceeded the limit of 50 messages/sec
2020-06-16 12:18:20.286 [error] <0.1125.27> CRASH REPORT Process <0.1125.27> with 0 neighbours exited with reason: channel_termination_timeout in rabbit_reader:wait_for_channel_termination/3 line 767
2020-06-16 12:18:20.286 [error] <0.333.27> CRASH REPORT Process <0.333.27> with 0 neighbours exited with reason: channel_termination_timeout in rabbit_reader:wait_for_channel_termination/3 line 767
2020-06-16 12:18:20.286 [error] <0.1123.27> Supervisor {<0.1123.27>,rabbit_connection_sup} had child reader started with rabbit_reader:start_link(<0.1124.27>, {acceptor,{192,168,4,104},5672}) at <0.1125.27> exit with reason channel_termination_timeout in context shutdown_error

Further references to queues on shutdown:

2020-06-16 12:18:22.283 [info] <0.31370.26> Closing all connections in vhost '/' on node 'rabbit@controller-3' because the vhost is stopping
2020-06-16 12:18:22.293 [warning] <0.2477.27> Mirrored queue 'neutron-vo-SubPort-1.0_fanout_37b11779a3b64893a0be1aa34aaf2fdd' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
2020-06-16 12:18:22.293 [warning] <0.5179.27> Mirrored queue 'neutron-vo-Trunk-1.1.nova-03' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

RabbitMQ terminates and is immediately restarted (but with errors reported from Mnesia):

2020-06-16 12:18:24.219 [info] <0.7102.27> Log file opened with Lager
2020-06-16 12:18:24.439 [info] <0.43.0> Application mnesia exited with reason: stopped
2020-06-16 12:18:24.471 [error] <0.10513.27> Mnesia('rabbit@controller-3'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit@controller-1'}
2020-06-16 12:18:24.471 [error] <0.10513.27> Mnesia('rabbit@controller-3'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit@controller-2'}
2020-06-16 12:18:24.513 [info] <0.10559.27>
 Starting RabbitMQ 3.7.26 on Erlang 22.2.8
 Copyright (c) 2007-2020 Pivotal Software, Inc.
 Licensed under the MPL. See https://www.rabbitmq.com/

It appears that the message queues are regenerated:

2020-06-16 12:18:24.748 [info] <0.10754.27> Started message store of type persistent for vhost '/'
2020-06-16 12:18:24.753 [info] <0.10754.27> Mirrored queue 'neutron-vo-Subnet-1.0.nova-05' in vhost '/': Adding mirror on node 'rabbit@controller-3': <0.10780.27>
2020-06-16 12:18:24.753 [info] <0.10754.27> Mirrored queue 'conductor_fanout_29f623d8f3a94284bbd1141e1ac8a48a' in vhost '/': Adding mirror on node 'rabbit@controller-3': <0.10785.27>
2020-06-16 12:18:24.754 [info] <0.10754.27> Mirrored queue 'neutron-vo-Subnet-1.0' in vhost '/': Adding mirror on node 'rabbit@controller-3': <0.10790.27>

First client connections accepted:

2020-06-16 12:18:25.167 [info] <0.13886.27> accepting AMQP connection <0.13886.27> (192.168.4.105:55636 -> 192.168.4.104:5672)
2020-06-16 12:18:25.168 [info] <0.13889.27> accepting AMQP connection <0.13889.27> (192.168.4.105:55638 -> 192.168.4.104:5672)
2020-06-16 12:18:25.169 [info] <0.13892.27> accepting AMQP connection <0.13892.27> (192.168.4.102:49920 -> 192.168.4.104:5672)
2020-06-16 12:18:25.169 [info] <0.13886.27> Connection <0.13886.27> (192.168.4.105:55636 -> 192.168.4.104:5672) has a client-provided name: neutron-openvswitch-agent:6:5ef3f9bc-d535-4a71-9867-baa93a3d2455
2020-06-16 12:18:25.169 [info] <0.13889.27> Connection <0.13889.27> (192.168.4.105:55638 -> 192.168.4.104:5672) has a client-provided name: neutron-openvswitch-agent:6:9edafa93-db4e-49cb-9216-adc7e46f666a
2020-06-16 12:18:25.170 [info] <0.13892.27> Connection <0.13892.27> (192.168.4.102:49920 -> 192.168.4.104:5672) has a client-provided name: neutron-server:26:1ddbdeeb-2879-446d-96fa-8e9b37459eff

Other Rabbit servers detected:

2020-06-16 12:18:25.279 [info] <0.10721.27> rabbit on node 'rabbit@controller-2' up
2020-06-16 12:18:25.297 [info] <0.10721.27> rabbit on node 'rabbit@controller-1' up

First signs of trouble a minute later:

2020-06-16 12:19:28.393 [error] <0.13873.27> Channel error on connection <0.13865.27> (192.168.4.103:40724 -> 192.168.4.104:5672, vhost: '/', user: 'openstack'), channel 1:
operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'engine' in vhost '/' due to timeout
2020-06-16 12:19:28.429 [error] <0.14497.27> Channel error on connection <0.14481.27> (192.168.4.104:33390 -> 192.168.4.104:5672, vhost: '/', user: 'openstack'), channel 1:
operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'heat-engine-listener' in vhost '/' due to timeout
2020-06-16 12:19:28.463 [error] <0.13916.27> Channel error on connection <0.13889.27> (192.168.4.105:55638 -> 192.168.4.104:5672, vhost: '/', user: 'openstack'), channel 1:
operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'neutron-vo-Network-1.0' in vhost '/' due to timeout

It doesn't appear to recover from that point on.

Environment:

This is a new Train/CentOS 8 deployment. There are 3 controllers and no non-default configuration for RabbitMQ.

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Did you manage to recover it at least via some manual means?

Looking at https://ethercalc.openstack.org/kolla-infra-service-matrix, Train CentOS 8 seems to be using RMQ and Erlang from two different sources.

Revision history for this message
Stig Telfer (stigtelfer) wrote :

Hi Radoslaw, it was recovered by deleting the RabbitMQ containers and volumes and redeploying RabbitMQ. We can recreate the circumstances of the problem but I don't know what more data should be gathered. What other information would be useful?
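
Concretely, the recovery was along these lines on each controller (container and volume names are the kolla defaults; yours may differ):

# Remove the broken RabbitMQ container and its persisted state
docker stop rabbitmq && docker rm rabbitmq
docker volume rm rabbitmq
# Redeploy only the rabbitmq service
kolla-ansible -i <inventory> deploy --tags rabbitmq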

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Unless we got some wrong Erlang OTP + RabbitMQ combination in the Kolla images, I don't think there is much we can do about it. I guess it deserves reporting upstream as well. The Erlang processes seem to have crashed, likely due to some unhandled situation. All the clustering traffic is Erlang-native.

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

We get:

rabbitmq-server.noarch 3.7.26-1.el8 @rabbitmq_rabbitmq-server

but also:

erlang.x86_64 22.2.8-1.el8 @centos-rabbitmq-38

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

In ussuri and master we have:

rabbitmq-server.noarch 3.8.5-1.el8 @rabbitmq_rabbitmq-server

and also:

erlang.x86_64 22.3.4.1-1.el8 @rabbitmq_rabbitmq-erlang
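
Both pairs can be verified inside the running image with something like the following (assuming the kolla default container name 'rabbitmq'):

# Shows the installed version and the repository each package came from
docker exec rabbitmq dnf list installed rabbitmq-server erlang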

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

It seems
centos-release-openstack-train
depends on
centos-release-rabbitmq-38

so this late change broke our plan to stay on 3.7 and introduced an incompatible Erlang platform, oh my.

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Actually, it is deemed compatible: https://www.rabbitmq.com/which-erlang.html
I just wonder whether CentOS compiled Erlang the way upstream expects.

Revision history for this message
Stig Telfer (stigtelfer) wrote :

Small update - I have the same package versions you mentioned.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/737478

Revision history for this message
Radosław Piliszek (yoctozepto) wrote : Re: RabbitMQ loses all queues after network disruption

Thanks, Stig. The provided logs also confirm the versions:

 Starting RabbitMQ 3.7.26 on Erlang 22.2.8

I proposed unifying the source of both: https://review.opendev.org/737478

Changed in kolla:
status: New → In Progress
summary: - RabbitMQ loses all queues after network disruption
+ [train][c8] RabbitMQ loses all queues after network disruption
Changed in kolla:
status: In Progress → Invalid
Changed in kolla-ansible:
status: New → Invalid
tags: added: centos8
Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

I would like to ask you to test whether the new images really fix your problem.
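
Roughly, that means pulling the rebuilt Train images and redeploying RabbitMQ, for example (inventory and registry details are deployment-specific):

# Pull the updated images and redeploy only RabbitMQ
kolla-ansible -i <inventory> pull --tags rabbitmq
kolla-ansible -i <inventory> deploy --tags rabbitmq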

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Side note: upstream Erlang 22 dropped HiPE support.
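
A quick way to check whether the packaged Erlang was built with HiPE (assumes erl is on PATH inside the image):

# The printed version banner includes [hipe] when the runtime was built with HiPE support
docker exec rabbitmq erl -noshell -eval 'io:format("~s~n", [erlang:system_info(system_version)]), halt().'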

Revision history for this message
Stig Telfer (stigtelfer) wrote :

We haven't had the chance to try the updated containers yet.

We have tested a few more times with network disruption but the failure condition has not reproduced.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla (stable/train)

Reviewed: https://review.opendev.org/737478
Committed: https://git.openstack.org/cgit/openstack/kolla/commit/?id=fec2c701dc36ea8d246b69290f5546bf37d414b3
Submitter: Zuul
Branch: stable/train

commit fec2c701dc36ea8d246b69290f5546bf37d414b3
Author: Radosław Piliszek <email address hidden>
Date: Tue Jun 23 11:34:09 2020 +0200

    Use upstream Erlang on CentOS 8

    Train only.

    In CentOS 8 Train we had a mix of rmq upstream and rdo packages,
    erlang was from RDO and rmq (3.7) from rmq.
    As this is not really tested by any other party, I propose to
    unify the sources of both.

    Change-Id: I52094ce9e3d98cbdedfdc23c8a43dd3093ce1771
    Closes-bug: #1884034

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla 9.3.0

This issue was fixed in the openstack/kolla 9.3.0 release.
