I've run many tests in my lab; the following can be noted.
This is not reproducible on:
* Focal Ussuri
* Focal Wallaby
However, it has been reproduced on:
* Focal Yoga without CIS
* Focal Yoga with CIS
I think it is safe to say that the RabbitMQ cluster is not the root cause.
TL;DR: it seems to be an issue with python3-oslo.messaging on the control-plane units.
One noticeable behavior appears when looking at a unit that makes heavy use of RabbitMQ, such as nova-cloud-controller, e.g. in /var/log/nova/nova-conductor.log:
On Yoga, these lines repeat indefinitely:
2022-10-19 11:12:02.457 203949 ERROR oslo.messaging._drivers.impl_rabbit [-] [c152039d-ea4a-4d69-a8ed-30ba5cc621ed] AMQP server on 192.168.24.87:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
2022-10-19 11:12:05.529 203949 ERROR oslo.messaging._drivers.impl_rabbit [-] [c152039d-ea4a-4d69-a8ed-30ba5cc621ed] AMQP server on 192.168.24.87:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
2022-10-19 11:12:08.600 203949 ERROR oslo.messaging._drivers.impl_rabbit [-] [c152039d-ea4a-4d69-a8ed-30ba5cc621ed] AMQP server on 192.168.24.87:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
...
2022-10-19 11:12:23.956 203949 ERROR oslo.messaging._drivers.impl_rabbit [-] [c152039d-ea4a-4d69-a8ed-30ba5cc621ed] AMQP server on 192.168.24.87:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
2022-10-19 11:12:27.029 203949 ERROR oslo.messaging._drivers.impl_rabbit [-] [c152039d-ea4a-4d69-a8ed-30ba5cc621ed] AMQP server on 192.168.24.87:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
2022-10-19 11:12:30.105 203949 ERROR oslo.messaging._drivers.impl_rabbit [-] [c152039d-ea4a-4d69-a8ed-30ba5cc621ed] AMQP server on 192.168.24.87:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
Whereas on Ussuri or Wallaby OpenStack, the workers quickly move to another RabbitMQ server in the cluster:
2022-10-20 07:50:13.142 73006 ERROR oslo.messaging._drivers.impl_rabbit [-] [9b752ed1-7e84-4618-bfd6-847867ff6fc4] AMQP server on 192.168.24.252:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer
2022-10-20 07:50:13.370 73005 ERROR oslo.messaging._drivers.impl_rabbit [req-7648decf-3e3d-4dcf-946d-f9e75b0de335 - - - - -] [a2a5d71e-711f-4221-a50c-87d1e72e4481] AMQP server on 192.168.24.252:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer
2022-10-20 07:50:14.156 73006 ERROR oslo.messaging._drivers.impl_rabbit [-] [9b752ed1-7e84-4618-bfd6-847867ff6fc4] AMQP server on 192.168.24.252:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.: ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-10-20 07:50:14.383 73005 ERROR oslo.messaging._drivers.impl_rabbit [req-7648decf-3e3d-4dcf-946d-f9e75b0de335 - - - - -] [a2a5d71e-711f-4221-a50c-87d1e72e4481] AMQP server on 192.168.24.252:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.: ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-10-20 07:50:14.515 73004 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer
2022-10-20 07:50:15.197 73006 INFO oslo.messaging._drivers.impl_rabbit [-] [9b752ed1-7e84-4618-bfd6-847867ff6fc4] Reconnected to AMQP server on 192.168.24.69:5672 via [amqp] client with port 48002.
2022-10-20 07:50:15.403 73005 INFO oslo.messaging._drivers.impl_rabbit [req-7648decf-3e3d-4dcf-946d-f9e75b0de335 - - - - -] [a2a5d71e-711f-4221-a50c-87d1e72e4481] Reconnected to AMQP server on 192.168.24.249:5672 via [amqp] client with port 33050.
2022-10-20 07:50:21.908 73005 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer
2022-10-20 07:50:26.926 73005 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: timed out (retrying in 0 seconds): socket.timeout: timed out
2022-10-20 07:50:56.738 73006 INFO nova.compute.rpcapi [req-85dfba48-86f5-4071-b599-9fb8d1538349 - - - - -] Automatically selected compute RPC version 6.0 from minimum service version 56
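The failover contrast above can be exercised by stopping RabbitMQ on one cluster member and tailing nova-conductor.log on a control-plane unit. A sketch, with assumed unit names (rabbitmq-server/0, nova-cloud-controller/0; adjust to the model) — the function only prints the commands so they can be reviewed, or piped to sh, on the juju client:

```shell
# Sketch only: unit names are assumptions, adjust to your model.
# repro_cmds prints the two commands rather than running them.
repro_cmds() {
  echo 'juju run --unit rabbitmq-server/0 "sudo systemctl stop rabbitmq-server"'
  echo 'juju ssh nova-cloud-controller/0 "sudo tail -f /var/log/nova/nova-conductor.log"'
}
repro_cmds
```

On Yoga the tail should show the endless single-server retry loop above; on Ussuri/Wallaby, a reconnect to another cluster member within seconds.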
Since the RabbitMQ cluster is mostly identical between Focal Ussuri/Wallaby/Yoga, that leaves another possibility: the Python 3 RabbitMQ messaging library (python3-oslo.messaging).
This is an ugly test/workaround, but on a Focal Yoga deployment, by:
* adding the focal-wallaby cloud-archive repository
* downgrading the package python3-oslo.messaging to 12.7.1 (versus 12.13.1 on Focal Yoga)
* restarting the services that use the library
the issue of the control plane being stuck reconnecting to a dead RabbitMQ unit disappears.
Current workaround:
for i in nova-cloud-controller nova-compute neutron-api glance cinder; do
  juju run -a $i -- sudo bash -c "echo 'deb http://ubuntu-cloud.archive.canonical.com/ubuntu focal-updates/wallaby main' >> /etc/apt/sources.list.d/cloud-archive.list"
  juju run -a $i "sudo apt update; sudo DEBIAN_FRONTEND=noninteractive apt install python3-oslo.messaging=12.7.1-0ubuntu1~cloud0 -y --allow-downgrades"
done
juju run -a nova-compute sudo systemctl restart nova-compute nova-api-metadata
juju run -a nova-cloud-controller sudo systemctl restart nova-scheduler nova-conductor
juju run -a neutron-api sudo systemctl restart neutron-server
juju run -a glance sudo systemctl restart glance-api
juju run -a cinder sudo systemctl restart cinder-volume cinder-scheduler
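After the downgrade, the installed version can be confirmed on every affected application. A sketch using the same application list as the workaround above — the function prints one check command per application (pipe to sh on the juju client to actually run them):

```shell
# Print a dpkg version-check command for each application in the
# workaround above; each command reports the installed
# python3-oslo.messaging version on that application's units.
verify_cmds() {
  for app in nova-cloud-controller nova-compute neutron-api glance cinder; do
    echo "juju run -a $app \"dpkg-query -W python3-oslo.messaging\""
  done
}
verify_cmds
```

Every unit should report 12.7.1-0ubuntu1~cloud0 once the downgrade has taken effect.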