Comment 0 for bug 1789177

Oleg Bondarev (obondarev) wrote : RabbitMQ fails to synchronize exchanges under high load

Input:
 - OpenStack Pike cluster with ~500 nodes
 - DVR enabled in neutron
 - Lots of messages

Scenario: failover of one rabbit node in a cluster

Issue: after the failed rabbit node comes back online, some RPC communication appears to be broken.
Logs from rabbit:

=ERROR REPORT==== 10-Aug-2018::17:24:37 ===
Channel error on connection <0.14839.1> (10.200.0.24:55834 -> 10.200.0.31:5672, vhost: '/openstack', user: 'openstack'), channel 1:
operation basic.publish caused a channel exception not_found: no exchange 'reply_5675d7991b4a4fb7af5d239f4decb19f' in vhost '/openstack'

Investigation:
After the rabbit node comes back online it immediately receives many new connections and, for some reason, fails to synchronize exchanges. The cluster had ~1600 exchanges, but the number of exchanges on the recovered node stays low and does not increase.
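
A rough way to spot the lagging node is to compare per-node exchange counts. The sketch below is only an illustration: it assumes the listing text was obtained from something like `rabbitmqctl -n <node> list_exchanges -p /openstack name type` run against each node, and the helper names (`count_reply_exchanges`, `lagging_nodes`) and the node names in the usage note are hypothetical.

```python
from typing import Dict, List


def count_reply_exchanges(listing: str) -> int:
    """Count exchange names starting with 'reply_' in a rabbitmqctl listing.

    Header lines and other exchanges (amq.*, the unnamed default) are skipped.
    """
    return sum(
        1 for line in listing.splitlines()
        if line.strip().startswith("reply_")
    )


def lagging_nodes(counts: Dict[str, int], tolerance: int = 0) -> List[str]:
    """Return nodes whose count trails the highest count by more than tolerance."""
    if not counts:
        return []
    highest = max(counts.values())
    return sorted(node for node, n in counts.items() if highest - n > tolerance)
```

For example, `lagging_nodes({"rabbit@a": 1600, "rabbit@b": 1600, "rabbit@c": 40}, tolerance=10)` singles out `rabbit@c` as the node that has not caught up.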

Workaround: give the recovered node time to synchronize all exchanges by blocking new connections with iptables rules for some time (30 sec) after the failed node comes back online.
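
The workaround can be scripted roughly as follows. This is a sketch under assumptions, not the exact rules used on the cluster: it assumes clients connect on port 5672 (as in the log above), that rejecting only new TCP connections (SYN packets) is enough, and that the script runs as root on the recovering node. The function names are hypothetical.

```python
import subprocess
import time
from typing import List

AMQP_PORT = 5672  # client port from the log above; adjust if yours differs


def block_rule(action: str, port: int = AMQP_PORT) -> List[str]:
    """Build the iptables command that inserts ('-I') or deletes ('-D') the rule.

    The rule matches only SYN packets to the AMQP port, so connections that
    are already established are not affected.
    """
    if action not in ("-I", "-D"):
        raise ValueError("action must be '-I' or '-D'")
    return ["iptables", action, "INPUT", "-p", "tcp", "--dport", str(port),
            "--syn", "-j", "REJECT", "--reject-with", "tcp-reset"]


def hold_off_clients(seconds: float = 30.0,
                     run=subprocess.check_call,
                     sleep=time.sleep) -> None:
    """Reject new client connections for `seconds`, then remove the rule."""
    run(block_rule("-I"))
    try:
        sleep(seconds)
    finally:
        # Always remove the rule, even if the wait is interrupted.
        run(block_rule("-D"))
```

The rule would need to be in place before the rabbit node starts accepting clients, e.g. by calling `hold_off_clients()` right before the node is started again. Only the client port is blocked, so traffic between cluster nodes on other ports is left alone.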

Proposal: do not create new exchanges for direct messages; use the default exchange for all of them instead. This also fixes the issue.
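
To illustrate why the proposal sidesteps the error: in AMQP 0-9-1 the default exchange (empty name) always exists and routes a message to the queue whose name equals the routing key, so no per-reply exchange has to be present on the node. The toy model below is not oslo.messaging or RabbitMQ code; `ToyBroker` and `NotFound` are hypothetical names, and it assumes the reply queue is visible on the node while the named exchange is not.

```python
from typing import Dict, List


class NotFound(Exception):
    """Stands in for the broker's not_found channel exception."""


class ToyBroker:
    """In-memory model of one node's view; not a real AMQP client or server."""

    def __init__(self) -> None:
        self.queues: Dict[str, List[str]] = {}
        # exchange name -> routing key -> queue name
        self.bindings: Dict[str, Dict[str, str]] = {}

    def declare_queue(self, name: str) -> None:
        self.queues.setdefault(name, [])

    def declare_exchange(self, name: str) -> None:
        self.bindings.setdefault(name, {})

    def bind(self, exchange: str, routing_key: str, queue: str) -> None:
        self.bindings[exchange][routing_key] = queue

    def publish(self, exchange: str, routing_key: str, body: str) -> bool:
        """Return True if the message was routed to a queue."""
        if exchange == "":
            # Default exchange: always present, routes by queue name.
            queue = routing_key if routing_key in self.queues else None
        elif exchange not in self.bindings:
            raise NotFound(f"no exchange '{exchange}'")
        else:
            queue = self.bindings[exchange].get(routing_key)
        if queue is None:
            return False  # unroutable message
        self.queues[queue].append(body)
        return True
```

With only the queue `reply_x` declared, `publish("reply_x", "reply_x", ...)` raises `NotFound`, mirroring the log above, while `publish("", "reply_x", ...)` delivers the message.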

Is there a good reason for creating new exchanges for direct messages?