[train][c8] RabbitMQ loses all queues after network disruption
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
kolla | Invalid | Undecided | Unassigned |
kolla (Train) | Fix Released | High | Radosław Piliszek |
kolla-ansible | Invalid | Undecided | Unassigned |
kolla-ansible (Train) | Fix Released | High | Radosław Piliszek |
Bug Description
After a control plane reconfiguration I started to see errors in OpenStack service logs, for example from cinder-volume:
2020-06-17 11:19:39.766 37 ERROR oslo.messaging.
On inspection of the RabbitMQ cluster state, the cluster appeared healthy but had no queues registered.
RabbitMQ cluster_status:
Cluster status of node rabbit@controller-3 ...
[{nodes,[{disc,[...]}]},
 {running_nodes,[...]},
 {cluster_name,<<"...">>},
 {partitions,[]},
 {alarms,[...]}]
RabbitMQ list_queues:
Timeout: 60.0 seconds ...
Listing queues for vhost / ...
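For reference, the state above can be re-checked with the standard rabbitmqctl commands; a minimal sketch, assuming they are run inside the rabbitmq container (e.g. via docker exec):
# Show cluster membership, partitions and alarms
rabbitmqctl cluster_status
# List queues in the default vhost; after the incident this returns no rows
rabbitmqctl list_queues -p / name messages consumers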
The likely trigger is a disruptive network reconfiguration that caused short periods of connectivity loss on all controllers simultaneously.
On network disruption, the pause_minority partition-handling strategy is invoked:
2020-06-16 12:18:17.171 [warning] <0.31346.26> Cluster minority/secondary status detected - awaiting recovery
2020-06-16 12:18:17.171 [info] <0.7102.27> RabbitMQ is asked to stop...
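This behaviour comes from RabbitMQ's cluster_partition_handling setting, which Kolla Ansible sets to pause_minority by default (worth verifying against the deployed config; the path below is an assumption based on the usual Kolla layout):
# Confirm the partition-handling strategy in the deployed config
grep cluster_partition_handling /etc/kolla/rabbitmq/rabbitmq.conf
# expected output: cluster_partition_handling = pause_minority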
Shutdown results in error messages of this form:
2020-06-16 12:18:17.279 [info] <0.31433.26> stopped TCP listener on 192.168.4.104:5672
2020-06-16 12:18:17.280 [error] <0.2722.27> Error on AMQP connection <0.2722.27> (192.168.
operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"
After a stream of errors of that form, it appears things escalate:
2020-06-16 12:18:18.135 [warning] lager_file_backend dropped 861 messages in the last second that exceeded the limit of 50 messages/sec
2020-06-16 12:18:20.286 [error] <0.1125.27> CRASH REPORT Process <0.1125.27> with 0 neighbours exited with reason: channel_
2020-06-16 12:18:20.286 [error] <0.333.27> CRASH REPORT Process <0.333.27> with 0 neighbours exited with reason: channel_
2020-06-16 12:18:20.286 [error] <0.1123.27> Supervisor {<0.1123.
Further references to queues on shutdown:
2020-06-16 12:18:22.283 [info] <0.31370.26> Closing all connections in vhost '/' on node 'rabbit@
2020-06-16 12:18:22.293 [warning] <0.2477.27> Mirrored queue 'neutron-
2020-06-16 12:18:22.293 [warning] <0.5179.27> Mirrored queue 'neutron-
RabbitMQ terminates and is immediately restarted (but with errors reported from Mnesia):
2020-06-16 12:18:24.219 [info] <0.7102.27> Log file opened with Lager
2020-06-16 12:18:24.439 [info] <0.43.0> Application mnesia exited with reason: stopped
2020-06-16 12:18:24.471 [error] <0.10513.27> Mnesia(
2020-06-16 12:18:24.471 [error] <0.10513.27> Mnesia(
2020-06-16 12:18:24.513 [info] <0.10559.27>
Starting RabbitMQ 3.7.26 on Erlang 22.2.8
Copyright (c) 2007-2020 Pivotal Software, Inc.
Licensed under the MPL. See https:/
It appears that the message queues are regenerated:
2020-06-16 12:18:24.748 [info] <0.10754.27> Started message store of type persistent for vhost '/'
2020-06-16 12:18:24.753 [info] <0.10754.27> Mirrored queue 'neutron-
2020-06-16 12:18:24.753 [info] <0.10754.27> Mirrored queue 'conductor_
2020-06-16 12:18:24.754 [info] <0.10754.27> Mirrored queue 'neutron-
First client connections accepted:
2020-06-16 12:18:25.167 [info] <0.13886.27> accepting AMQP connection <0.13886.27> (192.168.
2020-06-16 12:18:25.168 [info] <0.13889.27> accepting AMQP connection <0.13889.27> (192.168.
2020-06-16 12:18:25.169 [info] <0.13892.27> accepting AMQP connection <0.13892.27> (192.168.
2020-06-16 12:18:25.169 [info] <0.13886.27> Connection <0.13886.27> (192.168.
2020-06-16 12:18:25.169 [info] <0.13889.27> Connection <0.13889.27> (192.168.
2020-06-16 12:18:25.170 [info] <0.13892.27> Connection <0.13892.27> (192.168.
Other Rabbit servers detected:
2020-06-16 12:18:25.279 [info] <0.10721.27> rabbit on node 'rabbit@
2020-06-16 12:18:25.297 [info] <0.10721.27> rabbit on node 'rabbit@
First signs of trouble a minute later:
2020-06-16 12:19:28.393 [error] <0.13873.27> Channel error on connection <0.13865.27> (192.168.
operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'engine' in vhost '/' due to timeout
2020-06-16 12:19:28.429 [error] <0.14497.27> Channel error on connection <0.14481.27> (192.168.
operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'heat-engine-
2020-06-16 12:19:28.463 [error] <0.13916.27> Channel error on connection <0.13889.27> (192.168.
operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'neutron-
It doesn't appear to recover from that point on.
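One way to confirm what the clients are hitting is to check whether the queues they keep trying to declare still exist on the broker; a quick sketch (the grep pattern only covers the queue-name prefixes visible in the truncated log lines above):
# No output here means the queues the services expect are gone
rabbitmqctl list_queues -p / name | grep -E '^(engine|heat-engine|neutron-)'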
Environment:
This is a new Train/CentOS 8 deployment. There are 3 controllers and no non-default configuration for RabbitMQ.
Did you manage to recover it at least via some manual means?
Looking at https://ethercalc.openstack.org/kolla-infra-service-matrix, Train CentOS 8 seems to be using RMQ and Erlang from two different sources.
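A minimal way to verify the mixed-source theory on a CentOS 8 controller is to compare the package metadata of the two components (package names are assumptions; in a Kolla deployment these queries belong inside the rabbitmq container image):
# Show version, release and build origin for RabbitMQ and Erlang
rpm -qi rabbitmq-server | grep -E '^(Version|Release|Vendor)'
rpm -qi erlang | grep -E '^(Version|Release|Vendor)'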