rmq + nrpe on >= Vivid No PID file found

Bug #1485722 reported by Ryan Beisner
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
nrpe (Juju Charms Collection) | Invalid | Undecided | Unassigned |
rabbitmq-server (Juju Charms Collection) | Fix Released | Undecided | Ryan Beisner |

Bug Description

For Vivid-Kilo (and presumably later), the rabbitmq pid file is in a different location than in earlier versions. The script in the cron job errors out, but that is not evident unless the cron failure mail is inspected:

Return-Path: <email address hidden>
X-Original-To: root
Delivered-To: <email address hidden>
Received: by juju-beis0-machine-2.openstacklocal (Postfix, from userid 0)
        id 41CF73E528; Wed, 26 Aug 2015 01:38:01 +0000 (UTC)
From: <email address hidden> (Cron Daemon)
To: <email address hidden>
Subject: Cron <root@juju-beis0-machine-2> /usr/local/bin/collect_rabbitmq_stats.sh
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 8bit
X-Cron-Env: <SHELL=/bin/sh>
X-Cron-Env: <HOME=/root>
X-Cron-Env: <PATH=/usr/bin:/bin>
X-Cron-Env: <LOGNAME=root>
Message-Id: <email address hidden>
Date: Wed, 26 Aug 2015 01:38:01 +0000 (UTC)

No PID file found

This bubbles up to the user as an error that the /var/lib/rabbitmq/data/ dir does not exist. This definitely impacts nrpe checks, and potentially other things.

It affects next and stable and can be considered a high-priority deployment blocker for Vivid (and possibly Wily).

Test scenario: a basic 3-node rabbitmq-server native cluster, with nrpe as a subordinate to exercise nrpe-external-master functionality, and with cinder to exercise and inspect amqp relation data. After basic rmq cluster, config and relations are validated, amqp messaging and queue replication are functionally tested with and without ssl, and nrpe checks are fired then checked.

IOError: [Errno 2] No such file or directory: '/var/lib/rabbitmq/data/juju-beis0-machine-2_queue_stats.dat'
ERROR subprocess encountered error code 1

Details @ http://paste.ubuntu.com/12110980/
And @ http://paste.ubuntu.com/12196571/


Ryan Beisner (1chb1n)
summary: - rmq on > vivid has mnesia
+ rmq on >= vivid has mnesia
summary: - rmq on >= vivid has mnesia
+ rmq on >= vivid has mnesia (no data dir)
Ryan Beisner (1chb1n)
description: updated
David Ames (thedac) wrote : Re: rmq on >= vivid has mnesia (no data dir)

This could be considered a testing race condition. collect_rabbitmq_stats.sh is run by cron and will create the /var/lib/rabbitmq/data directory.

This can be tested by running
juju run --service rabbitmq-server /usr/local/bin/collect_rabbitmq_stats.sh
before running
egrep -oh /usr/local.* /etc/nagios/nrpe.d/check_rabbitmq_queue.cfg

The question is: should the charm set up this directory? If it should, this is the fix: https://pastebin.canonical.com/138221/

David Ames (thedac) wrote :

The more I think about this, the more I know this is a testing race. Even if the charm creates the data directory, the .dat data file will not exist until the cron job runs, so the same error will occur.

Changed in nrpe (Juju Charms Collection):
status: New → Invalid
Changed in rabbitmq-server (Juju Charms Collection):
status: New → Invalid
Ryan Beisner (1chb1n) wrote :

Why would the same test consistently pass Precise-Icehouse, Trusty-Icehouse, Trusty-Kilo, but fail on Vivid-Kilo? Are we just getting lucky?

Ryan Beisner (1chb1n) wrote :

I must respectfully disagree. I maintain that this is a rabbitmq-server charm bug, so I'm resetting the status to new. Flip it back to invalid if the following is determined to be crack.
...

On Vivid, the /var/lib/rabbitmq/data dir doesn't exist even after the collect_rabbitmq_stats cron job has run.

The cron job is trying to retrieve rabbitmq PIDs from files which are not in the expected location.

Consequently, the mkdir and data collection routines are never fired.

Vivid's rabbitmq version 3.4.3 has different pidfile behavior than the prior rabbitmq versions (3.2.4 et al).
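
As a rough illustration of the failure mode described above (this is not the charm's actual script nor the committed fix), a stats script could probe several candidate pid file locations instead of assuming a single one. The helper name find_pid_file and both example paths in the usage comment are assumptions made for illustration; the real locations depend on the installed rabbitmq-server version.

```shell
#!/bin/sh
# Hypothetical helper: print the first candidate path that exists as a
# regular file. Returns non-zero if none of the candidates exist.
find_pid_file() {
    for candidate in "$@"; do
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Example use (both paths are assumptions; adjust for the installed version):
# pid_file=$(find_pid_file /var/run/rabbitmq/pid /var/lib/rabbitmq/pids) || {
#     echo "No PID file found" >&2
#     exit 1
# }
```

With a lookup like this, the "No PID file found" branch is only reached when every known location has been tried, so the mkdir and data collection steps that follow are not skipped just because one release moved the file.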

The proposed amulet test was written to wait on the cron job, then check for data. It waited for the cron job, then didn't find data, and raised this red flag as a functional test failure.
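
The "wait on the cron job, then check for data" approach can be sketched as a bounded poll. This is a minimal shell sketch under stated assumptions, not the amulet test itself: the helper name wait_for_file and the timeout value in the usage note are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: poll once per second until a file exists or the
# timeout (in seconds) expires.
# Usage: wait_for_file PATH TIMEOUT_SECONDS
wait_for_file() {
    path=$1
    remaining=$2
    while [ "$remaining" -gt 0 ]; do
        [ -f "$path" ] && return 0
        sleep 1
        remaining=$((remaining - 1))
    done
    # One last check after the final sleep; its status is the return value.
    [ -f "$path" ]
}
```

For example, wait_for_file /var/lib/rabbitmq/data/juju-beis0-machine-2_queue_stats.dat 120 would wait up to two minutes for the data file named in the traceback. A bounded wait distinguishes "cron has not run yet" from "cron ran and never wrote the file", which is the distinction this bug turns on.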

To reinforce this, the same test routine passes on Precise-Icehouse, Trusty-Icehouse, Trusty-Juno and Trusty-Kilo -- while it fails on Vivid-Kilo.

Here's how I arrived at all this:

http://paste.ubuntu.com/12196571/

Changed in rabbitmq-server (Juju Charms Collection):
status: Invalid → New
Ryan Beisner (1chb1n) wrote :

FYI, I'm testing with patch: http://paste.ubuntu.com/12199499/

Which I've also pushed into WIP @ lp:~1chb1n/charms/trusty/rabbitmq-server/amulet-refactor-1508

Ryan Beisner (1chb1n)
summary: - rmq on >= vivid has mnesia (no data dir)
+ rmq + nrpe on >= Vivid pid file location changed
Ryan Beisner (1chb1n)
summary: - rmq + nrpe on >= Vivid pid file location changed
+ rmq + nrpe on >= Vivid No PID file found
description: updated
Ryan Beisner (1chb1n) wrote :

Testing the proposed fix passes P-I, T-I, T-J, T-K and V-K. I think this is a wrap.

Changed in rabbitmq-server (Juju Charms Collection):
assignee: nobody → Ryan Beisner (1chb1n)
Ryan Beisner (1chb1n)
Changed in rabbitmq-server (Juju Charms Collection):
status: New → Fix Committed
Changed in rabbitmq-server (Juju Charms Collection):
status: Fix Committed → Fix Released
milestone: none → 15.10