Comment 18 for bug 325962

Barry Warsaw (barry) wrote : Re: lp-mailman startup is blocking on a pid file in the wrong directory

I believe I have figured this out, and it's not a problem with Mailman. spm
helped me greatly with examining the running production system on forster, and
here is what I believe is happening.

Mailman is actually working exactly as expected. The fact that you get a
"Starting Mailman" and "Shutting down Mailman" message with a traceback in
between is a red herring, as is the fact that you see "mailman" in the pid
file name in the traceback.

What is happening is this. The initscript calls "make start" which calls
bin/run, which calls runlaunchpad.py, which examines the config files to
determine which services to start. In the
production-mailman/launchpad-lazr.conf file, we tell it to start Mailman.
Which it does perfectly.

Then something goes wrong[1] and the normal Launchpad shutdown procedure takes
over, which correctly shuts down Mailman. Mailman does exactly the right
thing here. Let's look at that traceback again:

Traceback (most recent call last):
  File "runlaunchpad.py", line 60, in ?
    run()
  File "runlaunchpad.py", line 56, in run
    start_launchpad(argv)
  File "/srv/lists.launchpad.net/production/launchpad-rev-7667/lib/canonical/launchpad/scripts/runlaunchpad.py", line 237, in start_launchpad
    make_pidfile('launchpad')
  File "/srv/lists.launchpad.net/production/launchpad-rev-7667/utilities/../lib/canonical/lazr/pidfile.py", line 34, in make_pidfile
    raise RuntimeError("PID file %s already exists. Already running?" %
RuntimeError: PID file /srv/launchpad.net/var/production-mailman-launchpad.pid already exists. Already running?

You think Mailman's involved because you see
'production-mailman-launchpad.pid' there, but it's not! That pid file is
named after the LPCONFIG variable and config directory that's being used,
which for forster is... production-mailman. In fact, Mailman's pid file is
managed by mailmanctl, not by lazr/pidfile.py, so this cannot be referring to
Mailman's pid file. It's referring to a Launchpad instance.
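
To make the naming concrete, here is a minimal sketch of how a pid file name like the one in the traceback could be built from the config instance name plus a service name. The helper names, the three-argument signatures, and the var directory parameter are assumptions made for illustration; the real lazr/pidfile.py takes only the service name and may compute the path differently. It is written for modern Python rather than the Python running on forster.

```python
import os


def pidfile_path(var_dir, instance_name, service_name):
    # Hypothetical helper: builds "<var_dir>/<instance_name>-<service_name>.pid".
    # instance_name stands in for the LPCONFIG value, e.g. "production-mailman".
    return os.path.join(var_dir, "%s-%s.pid" % (instance_name, service_name))


def make_pidfile_sketch(var_dir, instance_name, service_name):
    # Refuse to start if a pid file is already present, mirroring the
    # RuntimeError message seen in the traceback.
    path = pidfile_path(var_dir, instance_name, service_name)
    if os.path.exists(path):
        raise RuntimeError(
            "PID file %s already exists. Already running?" % path)
    with open(path, "w") as pidfile:
        pidfile.write(str(os.getpid()))
    return path
```

With an instance name of 'production-mailman' and a service name of 'launchpad', this yields a file called production-mailman-launchpad.pid, which is why "mailman" shows up in the name even though Mailman itself is not involved.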

spm confirmed this by cat'ing two pid files on forster.

/srv/lists.launchpad.net/var/mailman/data/master-qrunner.pid pointed to the
mailmanctl master qrunner, humming along perfectly.

/srv/launchpad.net/var/production-mailman-launchpad.pid pointed to a running
bin/run -i process, in other words, a running zope instance. So 'make start'
appears to start both Mailman and an appserver, and it's this latter that
fails because of the pre-existing pid file.

A little spelunking in runlaunchpad.py and Zope seems to indicate that an
appserver is unconditionally started by 'make start'. There appears to be no
way to prevent that, so if a previous crash left a trash Zope pidfile, you
would see exactly the error we're seeing. Mailman is nicely cleaning up its
trash, but Launchpad/Zope isn't :)
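
For reference, one common way to tell a leftover pid file from a live one is to read the pid and probe it with signal 0. This is only a sketch of that general technique, not what Launchpad or Zope does; the function names are made up for illustration and it is POSIX-only.

```python
import os


def pid_is_running(pid):
    # Signal 0 performs the existence/permission check without
    # actually delivering a signal to the process.
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False   # no such process
    except PermissionError:
        return True    # process exists but belongs to another user
    return True


def pidfile_is_stale(path):
    # Hypothetical helper: True when the pid file exists but does not
    # name a live process (or does not contain a usable pid at all).
    try:
        with open(path) as pidfile:
            pid = int(pidfile.read().strip())
    except FileNotFoundError:
        return False   # nothing left behind, so nothing is stale
    except ValueError:
        return True    # garbage contents
    if pid <= 0:
        return True    # not a valid pid for a single process
    return not pid_is_running(pid)
```

Note that on forster the Launchpad pid file pointed at a process that really was running, so a check like this would report it as live rather than stale.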

To fix this, I think we need to modify runlaunchpad.py, inside
start_launchpad() so that Zope's main() isn't called unconditionally. It
should probably consult a config file option before it starts Zope. Then, we
would update the production-mailman/launchpad-lazr.conf file to disable the
appserver.
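
Roughly, the change could look like the sketch below. Everything here is an assumption for illustration: the [appserver] section and launch key are invented names, the stdlib configparser stands in for the real lazr config machinery, and the two callables stand in for whatever runlaunchpad.py actually does to start services and run Zope's main().

```python
from configparser import ConfigParser


def should_launch_appserver(config_text):
    # Hypothetical option: "[appserver] launch". Defaults to True so
    # existing configs keep starting the appserver as they do today.
    parser = ConfigParser()
    parser.read_string(config_text)
    return parser.getboolean("appserver", "launch", fallback=True)


def start_launchpad_sketch(config_text, start_services, run_appserver):
    # start_services and run_appserver are placeholder callables.
    start_services()
    if should_launch_appserver(config_text):
        run_appserver()
        return True
    return False
```

The production-mailman config would then carry something like a launch: False line in that section, while every other config stays untouched and keeps the current behaviour.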

I'm kicking this over to Francis since it seems like more of a Foundations
issue. Francis, if you just want to verify the analysis and have me do the
work to fix it, kick it back my way. I don't think it's a lot of work to fix.