Comment 14 for bug 1993149

DUFOUR Olivier (odufourc) wrote (last edit):

I've run many tests in my lab. The following can be noted.

This is not reproducible on:
* Focal Ussuri
* Focal Wallaby

However, this has been reproduced on:
* Focal Yoga without CIS
* Focal Yoga with CIS

I think it is safe to say that the RabbitMQ cluster is not the root cause.
TL;DR: it seems to be an issue with python3-oslo.messaging on the control plane units.

One noticeable behavior shows up on units that make heavy use of RabbitMQ, such as nova-cloud-controller, for example in /var/log/nova/nova-conductor.log:

On Yoga, these lines repeat indefinitely:
2022-10-19 11:12:02.457 203949 ERROR oslo.messaging._drivers.impl_rabbit [-] [c152039d-ea4a-4d69-a8ed-30ba5cc621ed] AMQP server on 192.168.24.87:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
2022-10-19 11:12:05.529 203949 ERROR oslo.messaging._drivers.impl_rabbit [-] [c152039d-ea4a-4d69-a8ed-30ba5cc621ed] AMQP server on 192.168.24.87:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
2022-10-19 11:12:08.600 203949 ERROR oslo.messaging._drivers.impl_rabbit [-] [c152039d-ea4a-4d69-a8ed-30ba5cc621ed] AMQP server on 192.168.24.87:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
...
2022-10-19 11:12:23.956 203949 ERROR oslo.messaging._drivers.impl_rabbit [-] [c152039d-ea4a-4d69-a8ed-30ba5cc621ed] AMQP server on 192.168.24.87:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
2022-10-19 11:12:27.029 203949 ERROR oslo.messaging._drivers.impl_rabbit [-] [c152039d-ea4a-4d69-a8ed-30ba5cc621ed] AMQP server on 192.168.24.87:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
2022-10-19 11:12:30.105 203949 ERROR oslo.messaging._drivers.impl_rabbit [-] [c152039d-ea4a-4d69-a8ed-30ba5cc621ed] AMQP server on 192.168.24.87:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>

Whereas on Ussuri or Wallaby, the workers quickly move to another RabbitMQ server in the cluster:
2022-10-20 07:50:13.142 73006 ERROR oslo.messaging._drivers.impl_rabbit [-] [9b752ed1-7e84-4618-bfd6-847867ff6fc4] AMQP server on 192.168.24.252:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer
2022-10-20 07:50:13.370 73005 ERROR oslo.messaging._drivers.impl_rabbit [req-7648decf-3e3d-4dcf-946d-f9e75b0de335 - - - - -] [a2a5d71e-711f-4221-a50c-87d1e72e4481] AMQP server on 192.168.24.252:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer
2022-10-20 07:50:14.156 73006 ERROR oslo.messaging._drivers.impl_rabbit [-] [9b752ed1-7e84-4618-bfd6-847867ff6fc4] AMQP server on 192.168.24.252:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.: ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-10-20 07:50:14.383 73005 ERROR oslo.messaging._drivers.impl_rabbit [req-7648decf-3e3d-4dcf-946d-f9e75b0de335 - - - - -] [a2a5d71e-711f-4221-a50c-87d1e72e4481] AMQP server on 192.168.24.252:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.: ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-10-20 07:50:14.515 73004 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer
2022-10-20 07:50:15.197 73006 INFO oslo.messaging._drivers.impl_rabbit [-] [9b752ed1-7e84-4618-bfd6-847867ff6fc4] Reconnected to AMQP server on 192.168.24.69:5672 via [amqp] client with port 48002.
2022-10-20 07:50:15.403 73005 INFO oslo.messaging._drivers.impl_rabbit [req-7648decf-3e3d-4dcf-946d-f9e75b0de335 - - - - -] [a2a5d71e-711f-4221-a50c-87d1e72e4481] Reconnected to AMQP server on 192.168.24.249:5672 via [amqp] client with port 33050.
2022-10-20 07:50:21.908 73005 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer
2022-10-20 07:50:26.926 73005 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: timed out (retrying in 0 seconds): socket.timeout: timed out
2022-10-20 07:50:56.738 73006 INFO nova.compute.rpcapi [req-85dfba48-86f5-4071-b599-9fb8d1538349 - - - - -] Automatically selected compute RPC version 6.0 from minimum service version 56
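
One simple way to reproduce the observation above is to stop RabbitMQ on one unit and watch the conductor log on a control plane unit (unit numbers are illustrative):

juju run --unit rabbitmq-server/0 -- sudo systemctl stop rabbitmq-server
juju ssh nova-cloud-controller/0 "sudo tail -n 50 /var/log/nova/nova-conductor.log | grep impl_rabbit"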

Since the RabbitMQ cluster is mostly identical between Focal Ussuri/Wallaby/Yoga, that leaves another possibility: the Python messaging library used to talk to RabbitMQ (python3-oslo.messaging).
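
For reference, the version actually installed on a unit can be checked with something like (the application name is just an example):

juju run -a nova-cloud-controller "dpkg -s python3-oslo.messaging | grep '^Version'"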

This is an ugly test/workaround on a Focal Yoga deployment, but by:
* adding focal-wallaby cloud-archive repository
* downgrading the package python3-oslo.messaging to 12.7.1 (versus 12.13.1 on Focal Yoga)
* restarting the services using the library

the issue of the control plane getting stuck reconnecting to a dead RabbitMQ unit disappears.

Current workaround:
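# Add the focal-wallaby cloud archive pocket and downgrade python3-oslo.messaging on each affected application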
for i in nova-cloud-controller nova-compute neutron-api glance cinder; do
  juju run -a $i -- sudo bash -c "echo 'deb http://ubuntu-cloud.archive.canonical.com/ubuntu focal-updates/wallaby main' >> /etc/apt/sources.list.d/cloud-archive.list"
  juju run -a $i "sudo apt update; sudo DEBIAN_FRONTEND=noninteractive apt install python3-oslo.messaging=12.7.1-0ubuntu1~cloud0 -y --allow-downgrades"
done

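# Restart the services that use oslo.messaging so they pick up the downgraded library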
juju run -a nova-compute sudo systemctl restart nova-compute nova-api-metadata
juju run -a nova-cloud-controller sudo systemctl restart nova-scheduler nova-conductor
juju run -a neutron-api sudo systemctl restart neutron-server
juju run -a glance sudo systemctl restart glance-api
juju run -a cinder sudo systemctl restart cinder-volume cinder-scheduler
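
As a quick sanity check after the downgrade (same application names as above; the grep pattern matches the reconnect lines shown earlier, so it is only meaningful after a rabbitmq-server unit has been stopped again):

juju run -a nova-cloud-controller "dpkg -s python3-oslo.messaging | grep '^Version'"
juju run -a nova-cloud-controller "sudo grep 'Reconnected to AMQP server' /var/log/nova/nova-conductor.log | tail -n 5"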