[train][c8] RabbitMQ loses all queues after network disruption

Bug #1884034 reported by Stig Telfer
This bug affects 3 people
Affects                Status        Importance  Assigned to        Milestone
kolla                  Invalid       Undecided   Unassigned
kolla (Train)          Fix Released  High        Radosław Piliszek
kolla-ansible          Invalid       Undecided   Unassigned
kolla-ansible (Train)  Fix Released  High        Radosław Piliszek

Bug Description

After a control plane reconfiguration I started to see errors in OpenStack service logs, for example from cinder-volume:

2020-06-17 11:19:39.766 37 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to consume message from queue: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'cinder-volume.rbd:volumes@ceph-rbd' in vhost '/' due to timeout: amqp.exceptions.NotFound: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'cinder-volume.rbd:volumes@ceph-rbd' in vhost '/' due to timeout

On inspection of the RabbitMQ cluster state, the cluster appeared healthy but had no queues registered.

RabbitMQ cluster_status:

Cluster status of node rabbit@controller-3 ...
[{nodes,[{disc,['rabbit@controller-1','rabbit@controller-2',
                'rabbit@controller-3']}]},
 {running_nodes,['rabbit@controller-2','rabbit@controller-1',
                 'rabbit@controller-3']},
 {cluster_name,<<"rabbit@controller-2">>},
 {partitions,[]},
 {alarms,[{'rabbit@controller-2',[]},
          {'rabbit@controller-1',[]},
          {'rabbit@controller-3',[]}]}]

RabbitMQ list_queues:

Timeout: 60.0 seconds ...
Listing queues for vhost / ...
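
For reference, both outputs above can be gathered from inside the kolla RabbitMQ container; a sketch of the commands, assuming the kolla default container name 'rabbitmq':

# Cluster membership, running nodes, partitions and alarms
docker exec rabbitmq rabbitmqctl cluster_status
# Queue names and message counts in the default vhost
docker exec rabbitmq rabbitmqctl list_queues -p / name messages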

The likely trigger is a disruptive network reconfiguration that caused short periods of disruption to all controllers simultaneously.

On network disruption the pause-minority fallback is invoked:

2020-06-16 12:18:17.171 [warning] <0.31346.26> Cluster minority/secondary status detected - awaiting recovery
2020-06-16 12:18:17.171 [info] <0.7102.27> RabbitMQ is asked to stop...
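
The "Cluster minority/secondary status" message corresponds to RabbitMQ's pause_minority partition handling; a minimal sketch of the relevant option in new-style rabbitmq.conf (the exact file and format used by kolla-ansible may differ):

# cluster_partition_handling decides what a node does when it finds itself
# on the minority side of a partition; pause_minority stops it, which matches
# the "RabbitMQ is asked to stop" line above.
cluster_partition_handling = pause_minority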

Shutdown results in error messages of this form:

2020-06-16 12:18:17.279 [info] <0.31433.26> stopped TCP listener on 192.168.4.104:5672
2020-06-16 12:18:17.280 [error] <0.2722.27> Error on AMQP connection <0.2722.27> (192.168.4.102:44172 -> 192.168.4.104:5672 - neutron-metadata-agent:7:e689633c-1b01-4b6f-9869-d0e3fbd856bc, vhost: '/', user: 'openstack', state: running), channel 0:
 operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"

After a stream of errors of that form, it appears things escalate:

2020-06-16 12:18:18.135 [warning] lager_file_backend dropped 861 messages in the last second that exceeded the limit of 50 messages/sec
2020-06-16 12:18:20.286 [error] <0.1125.27> CRASH REPORT Process <0.1125.27> with 0 neighbours exited with reason: channel_termination_timeout in rabbit_reader:wait_for_channel_termination/3 line 767
2020-06-16 12:18:20.286 [error] <0.333.27> CRASH REPORT Process <0.333.27> with 0 neighbours exited with reason: channel_termination_timeout in rabbit_reader:wait_for_channel_termination/3 line 767
2020-06-16 12:18:20.286 [error] <0.1123.27> Supervisor {<0.1123.27>,rabbit_connection_sup} had child reader started with rabbit_reader:start_link(<0.1124.27>, {acceptor,{192,168,4,104},5672}) at <0.1125.27> exit with reason channel_termination_timeout in context shutdown_error

Further references to queues on shutdown:

2020-06-16 12:18:22.283 [info] <0.31370.26> Closing all connections in vhost '/' on node 'rabbit@controller-3' because the vhost is stopping
2020-06-16 12:18:22.293 [warning] <0.2477.27> Mirrored queue 'neutron-vo-SubPort-1.0_fanout_37b11779a3b64893a0be1aa34aaf2fdd' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
2020-06-16 12:18:22.293 [warning] <0.5179.27> Mirrored queue 'neutron-vo-Trunk-1.1.nova-03' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

RabbitMQ terminates and is immediately restarted (but with errors reported from Mnesia):

2020-06-16 12:18:24.219 [info] <0.7102.27> Log file opened with Lager
2020-06-16 12:18:24.439 [info] <0.43.0> Application mnesia exited with reason: stopped
2020-06-16 12:18:24.471 [error] <0.10513.27> Mnesia('rabbit@controller-3'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit@controller-1'}
2020-06-16 12:18:24.471 [error] <0.10513.27> Mnesia('rabbit@controller-3'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit@controller-2'}
2020-06-16 12:18:24.513 [info] <0.10559.27>
 Starting RabbitMQ 3.7.26 on Erlang 22.2.8
 Copyright (c) 2007-2020 Pivotal Software, Inc.
 Licensed under the MPL. See https://www.rabbitmq.com/

It appears that the message queues are regenerated:

2020-06-16 12:18:24.748 [info] <0.10754.27> Started message store of type persistent for vhost '/'
2020-06-16 12:18:24.753 [info] <0.10754.27> Mirrored queue 'neutron-vo-Subnet-1.0.nova-05' in vhost '/': Adding mirror on node 'rabbit@controller-3': <0.10780.27>
2020-06-16 12:18:24.753 [info] <0.10754.27> Mirrored queue 'conductor_fanout_29f623d8f3a94284bbd1141e1ac8a48a' in vhost '/': Adding mirror on node 'rabbit@controller-3': <0.10785.27>
2020-06-16 12:18:24.754 [info] <0.10754.27> Mirrored queue 'neutron-vo-Subnet-1.0' in vhost '/': Adding mirror on node 'rabbit@controller-3': <0.10790.27>

First client connections accepted:

2020-06-16 12:18:25.167 [info] <0.13886.27> accepting AMQP connection <0.13886.27> (192.168.4.105:55636 -> 192.168.4.104:5672)
2020-06-16 12:18:25.168 [info] <0.13889.27> accepting AMQP connection <0.13889.27> (192.168.4.105:55638 -> 192.168.4.104:5672)
2020-06-16 12:18:25.169 [info] <0.13892.27> accepting AMQP connection <0.13892.27> (192.168.4.102:49920 -> 192.168.4.104:5672)
2020-06-16 12:18:25.169 [info] <0.13886.27> Connection <0.13886.27> (192.168.4.105:55636 -> 192.168.4.104:5672) has a client-provided name: neutron-openvswitch-agent:6:5ef3f9bc-d535-4a71-9867-baa93a3d2455
2020-06-16 12:18:25.169 [info] <0.13889.27> Connection <0.13889.27> (192.168.4.105:55638 -> 192.168.4.104:5672) has a client-provided name: neutron-openvswitch-agent:6:9edafa93-db4e-49cb-9216-adc7e46f666a
2020-06-16 12:18:25.170 [info] <0.13892.27> Connection <0.13892.27> (192.168.4.102:49920 -> 192.168.4.104:5672) has a client-provided name: neutron-server:26:1ddbdeeb-2879-446d-96fa-8e9b37459eff

Other Rabbit servers detected:

2020-06-16 12:18:25.279 [info] <0.10721.27> rabbit on node 'rabbit@controller-2' up
2020-06-16 12:18:25.297 [info] <0.10721.27> rabbit on node 'rabbit@controller-1' up

First signs of trouble a minute later:

2020-06-16 12:19:28.393 [error] <0.13873.27> Channel error on connection <0.13865.27> (192.168.4.103:40724 -> 192.168.4.104:5672, vhost: '/', user: 'openstack'), channel 1:
operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'engine' in vhost '/' due to timeout
2020-06-16 12:19:28.429 [error] <0.14497.27> Channel error on connection <0.14481.27> (192.168.4.104:33390 -> 192.168.4.104:5672, vhost: '/', user: 'openstack'), channel 1:
operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'heat-engine-listener' in vhost '/' due to timeout
2020-06-16 12:19:28.463 [error] <0.13916.27> Channel error on connection <0.13889.27> (192.168.4.105:55638 -> 192.168.4.104:5672, vhost: '/', user: 'openstack'), channel 1:
operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'neutron-vo-Network-1.0' in vhost '/' due to timeout

It doesn't appear to recover from that point on.

Environment:

This is a new Train/CentOS 8 deployment. There are 3 controllers and no non-default configuration for RabbitMQ.

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Did you manage to recover it at least via some manual means?

Looking at https://ethercalc.openstack.org/kolla-infra-service-matrix, Train CentOS 8 seems to be using RMQ and Erlang from two different sources.

Revision history for this message
Stig Telfer (stigtelfer) wrote :

Hi Radoslaw, it was recovered by deleting the RabbitMQ containers and volumes and redeploying RabbitMQ. We can recreate the circumstances of the problem but I don't know what more data should be gathered. What other information would be useful?
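
Concretely, the recovery was along these lines on each controller (container and volume names are the kolla defaults; yours may differ):

# Remove the broken RabbitMQ container and its persisted state
docker stop rabbitmq && docker rm rabbitmq
docker volume rm rabbitmq
# Redeploy only the rabbitmq service
kolla-ansible -i <inventory> deploy --tags rabbitmq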

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Unless we got some wrong Erlang OTP + RabbitMQ combination in the Kolla images, I don't think there is much we can do about it. I guess it deserves reporting upstream as well. The Erlang processes seem to have crashed, likely due to some unhandled situation. All the clustering traffic is Erlang-native.

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

We get:

rabbitmq-server.noarch 3.7.26-1.el8 @rabbitmq_rabbitmq-server

but also:

erlang.x86_64 22.2.8-1.el8 @centos-rabbitmq-38

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

In ussuri and master we have:

rabbitmq-server.noarch 3.8.5-1.el8 @rabbitmq_rabbitmq-server

and also:

erlang.x86_64 22.3.4.1-1.el8 @rabbitmq_rabbitmq-erlang
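
Both pairs can be verified inside the running image with something like the following (assuming the kolla default container name 'rabbitmq'):

# Shows the installed version and the repository each package came from
docker exec rabbitmq dnf list installed rabbitmq-server erlang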

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

It seems
centos-release-openstack-train
depends on
centos-release-rabbitmq-38

so this late change broke our plan to stay on 3.7 and introduced an incompatible Erlang platform, oh my.

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Actually, it is deemed compatible: https://www.rabbitmq.com/which-erlang.html
I just wonder whether CentOS compiled Erlang the way upstream expects.

Revision history for this message
Stig Telfer (stigtelfer) wrote :

Small update - I have the same package versions you mentioned.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/737478

Revision history for this message
Radosław Piliszek (yoctozepto) wrote : Re: RabbitMQ loses all queues after network disruption

Thanks, Stig. The provided logs also confirm the versions:

 Starting RabbitMQ 3.7.26 on Erlang 22.2.8

I proposed unifying the source of both: https://review.opendev.org/737478

Changed in kolla:
status: New → In Progress
summary: - RabbitMQ loses all queues after network disruption
+ [train][c8] RabbitMQ loses all queues after network disruption
Changed in kolla:
status: In Progress → Invalid
Changed in kolla-ansible:
status: New → Invalid
tags: added: centos8
Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

I would like to ask you to test whether the new images really fix your problem.
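
Roughly, that means pulling the rebuilt Train images and redeploying RabbitMQ, for example (inventory and registry details are deployment-specific):

# Pull the updated images and redeploy only RabbitMQ
kolla-ansible -i <inventory> pull --tags rabbitmq
kolla-ansible -i <inventory> deploy --tags rabbitmq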

Revision history for this message
Radosław Piliszek (yoctozepto) wrote :

Side note: upstream Erlang 22 dropped HiPE support.
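
A quick way to check whether the packaged Erlang was built with HiPE (assumes erl is on PATH inside the image):

# The printed version banner includes [hipe] when the runtime was built with HiPE support
docker exec rabbitmq erl -noshell -eval 'io:format("~s~n", [erlang:system_info(system_version)]), halt().'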

Revision history for this message
Stig Telfer (stigtelfer) wrote :

We haven't had the chance to try the updated containers yet.

We have tested a few more times with network disruption but the failure condition has not reproduced.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla (stable/train)

Reviewed: https://review.opendev.org/737478
Committed: https://git.openstack.org/cgit/openstack/kolla/commit/?id=fec2c701dc36ea8d246b69290f5546bf37d414b3
Submitter: Zuul
Branch: stable/train

commit fec2c701dc36ea8d246b69290f5546bf37d414b3
Author: Radosław Piliszek <email address hidden>
Date: Tue Jun 23 11:34:09 2020 +0200

    Use upstream Erlang on CentOS 8

    Train only.

    In CentOS 8 Train we had a mix of rmq upstream and rdo packages,
    erlang was from RDO and rmq (3.7) from rmq.
    As this is not really tested by any other party, I propose to
    unify the sources of both.

    Change-Id: I52094ce9e3d98cbdedfdc23c8a43dd3093ce1771
    Closes-bug: #1884034

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla 9.3.0

This issue was fixed in the openstack/kolla 9.3.0 release.
