'make start' starts an appserver unconditionally

Bug #325962 reported by Steve McInerney
Affects: Launchpad itself
Status: Fix Released
Importance: High
Assigned to: Barry Warsaw

Bug Description

mailman is run from /srv/lists.launchpad.net

The startup was failing (from nohup.out) due to a pid file in:
/srv/launchpad.net/var/production-mailman-launchpad.pid

Starting Mailman's master qrunner.
Traceback (most recent call last):
  File "runlaunchpad.py", line 60, in ?
    run()
  File "runlaunchpad.py", line 56, in run
    start_launchpad(argv)
  File "/srv/lists.launchpad.net/production/launchpad-rev-7667/lib/canonical/launchpad/scripts/runlaunchpad.py", line 237, in start_launchpad
    make_pidfile('launchpad')
  File "/srv/lists.launchpad.net/production/launchpad-rev-7667/utilities/../lib/canonical/lazr/pidfile.py", line 34, in make_pidfile
    raise RuntimeError("PID file %s already exists. Already running?" %
RuntimeError: PID file /srv/launchpad.net/var/production-mailman-launchpad.pid already exists. Already running?
Shutting down Mailman's master qrunner

A nicer "why" error would also be appreciated. :-)
vs https://pastebin.canonical.com/13540/

Revision history for this message
Curtis Hovey (sinzui) wrote :

Can we get this fixed or documented immediately so that we can keep the service available at all times?

Changed in launchpad-registry:
assignee: nobody → barry
importance: Undecided → High
milestone: none → 2.2.2
status: New → Triaged
Revision history for this message
Barry Warsaw (barry) wrote :

This smells very familiar. Something like 6 or 9 months ago we had a similar problem which was caused by a configuration issue. IIRC, two different processes were using the same pid file and colliding. This may or may not be the same problem, but has any of the relevant configs been changed recently?

Revision history for this message
Curtis Hovey (sinzui) wrote :

I guess this bug is not high since no action has been taken on this. Can an interested party assess the seriousness of the bug and update the importance?

Changed in launchpad-registry:
milestone: 2.2.2 → 2.2.3
Revision history for this message
Curtis Hovey (sinzui) wrote :

If this is not addressed in 2.2.4, this will be marked low importance.

Changed in launchpad-registry:
milestone: 2.2.3 → 2.2.4
Revision history for this message
Tom Haddon (mthaddon) wrote : Re: [Bug 325962] Re: lp-mailman startup is blocking on a pid file in the wrong directory

On Tue, 2009-03-31 at 16:37 +0000, Curtis Hovey wrote:
> If this is not addressed in 2.2.4, this will be marked low importance.

I'm a little concerned at the implication that the longer a bug has been
outstanding, the less relevant it is. This bug is still an issue, and
the importance still stands.

Revision history for this message
Curtis Hovey (sinzui) wrote : Re: lp-mailman startup is blocking on a pid file in the wrong directory

I think Tom and Barry need to discuss this. I am threatening to drop this because this really cannot be a High bug if it is not important enough for the two interested parties to close.

I'm certain that any bug that has been high for more than three months is not really high. I suspect that the reason is because bugs that were high took all the attention from the interested parties.

Revision history for this message
Tom Haddon (mthaddon) wrote : Re: [Bug 325962] Re: lp-mailman startup is blocking on a pid file in the wrong directory

On Wed, 2009-04-01 at 19:29 +0000, Curtis Hovey wrote:
> I think Tom and Barry need to discuss this.

Sounds like a good idea. After the rollout I'll try and make a point to
follow up.

> I am threatening to drop
> this because this really cannot be a High bug if it is not important
> enough for the two interested parties to close.
>
> I'm certain that any bug that has been high for more than three months
> is not really high. I suspect that the reason is because bugs that were
> high took all the attention from the interested parties.

I suspect it's because this bug doesn't affect us very often (i.e. only
when mailman is restarted under certain conditions), but is very painful
when it does.

Revision history for this message
Curtis Hovey (sinzui) wrote : Re: lp-mailman startup is blocking on a pid file in the wrong directory

I think the issue we are concerned about here is risk. If there is an outage, someone will have to write an outage report. I don't want to write that, and I certainly do not want to write it and have to point to this bug reporting the problem.

We really need to understand the nature of this problem to get it fixed. I think Barry + 1 LOSA can do this.

Barry Warsaw (barry)
Changed in launchpad-registry:
milestone: 2.2.4 → 2.2.5
Revision history for this message
Tom Haddon (mthaddon) wrote :

I can confirm this is no longer an issue:

https://pastebin.canonical.com/17392/

Basically stopped, created a pidfile and restarted. Seemed to DTRT.

Changed in launchpad-registry:
status: Triaged → Invalid
Revision history for this message
Barry Warsaw (barry) wrote :

<mthaddon> barry: I think that must now be fixed too, as that file hasn't changed since Feb 15th
<mthaddon> barry: I've just removed that file - I think we can consider this one done and dusted

Curtis Hovey (sinzui)
Changed in launchpad-registry:
milestone: 2.2.5 → none
Revision history for this message
Tom Haddon (mthaddon) wrote :

Just got bitten by this again in 8193/2.2.6. A pidfile at /srv/lists.launchpad.net/var/production-mailman-launchpad.pid prevented mailman from starting up after forster hard crashed.

Changed in launchpad-registry:
status: Invalid → Confirmed
Revision history for this message
Barry Warsaw (barry) wrote :

It makes sense that if forster crashes, it would leave a stale Mailman pidfile lying around. I'll look into this, but the only thing I can think of is that the -s flag isn't getting passed to mailmanctl properly.
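
For illustration only, a stale pidfile left behind by a crash can be told apart from a live one by checking whether the recorded PID still names a running process. The sketch below is not the lazr/pidfile.py or mailmanctl implementation; the function names are hypothetical and it assumes a POSIX system.

```python
import os


def pid_is_running(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only performs existence/permission checks
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def pidfile_is_stale(path):
    """Return True if the pidfile exists but its process is gone."""
    try:
        with open(path) as f:
            pid = int(f.read().strip())
    except FileNotFoundError:
        return False  # no pidfile at all, so nothing is stale
    except ValueError:
        return True  # unparseable content; treat it as stale
    if pid <= 0:
        return True  # not a valid PID for a single process
    return not pid_is_running(pid)
```

A startup script could call pidfile_is_stale() before refusing to start, and report which PID it found, which would also give the nicer "why" error asked for in the description.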

Revision history for this message
Tom Haddon (mthaddon) wrote : Re: [Bug 325962] Re: lp-mailman startup is blocking on a pid file in the wrong directory

On Thu, 2009-06-25 at 09:45 +0000, Barry Warsaw wrote:
> It makes sense that if forster crashes, it would leave a stale Mailman
> pidfile lying around. I'll look into this, but the only thing I can
> think of is that the -s flag isn't getting passed to mailmanctl
> properly.

The initscript currently uses "make start" run from the root of the LP
tree.

Also, there are kind of two separate issues here - one is not using
the -s flag as above. The other is that the pidfile is in the wrong
directory (/srv/launchpad.net/var rather
than /srv/lists.launchpad.net/var) - this means it takes longer for the
LOSA to find it and remove it as it's not in an expected location.

Curtis Hovey (sinzui)
Changed in launchpad-registry:
milestone: none → 2.2.7
status: Confirmed → In Progress
Barry Warsaw (barry)
Changed in launchpad-registry:
status: In Progress → Triaged
Curtis Hovey (sinzui)
Changed in launchpad-registry:
importance: High → Low
Curtis Hovey (sinzui)
Changed in launchpad-registry:
milestone: 2.2.7 → 2.2.9
Revision history for this message
Tom Haddon (mthaddon) wrote :

Any thoughts on why this was changed from "High" to "Low" priority? It's still a high priority for us, and the SAs were recently bitten by this too.

Revision history for this message
Curtis Hovey (sinzui) wrote :

Tom, you need to beat Barry into fixing this. If this was a High priority, you would have insisted that we stop working on features months ago to get this fixed. That is the key point here. To work on this is to stop working on planned features. If this is that important, we will. To be honest, I am not sure why you have not escalated this.

My rules for High are pretty simple. If this is something we want to commit to fixing in 3 months, it is high. If it was not fixed, we lied to ourselves about its importance. I have tried scheduling this to be fixed. It did not work. So I think the stakeholders need to play a larger role in helping us fix this issue.

Revision history for this message
Tom Haddon (mthaddon) wrote :

Curtis, I'm not really in the business of beating people that don't even work in my department, let alone work for me, into doing things. And I disagree that a low priority bug is a high priority bug that hasn't been done for a certain amount of time. We've already had a similar discussion about this earlier in this same bug report. As for bumping features to get it done, it's not really up to me to determine what gets done - it is up to me to let you know what are high priority bugs for us, and that's what I thought I was doing with this bug report.

Changed in launchpad-registry:
importance: Low → High
Revision history for this message
Curtis Hovey (sinzui) wrote :

Barry will meet with a LOSA on 2009-08-11 to discuss how to debug and fix the problem.
If there is no confidence the bug can be fixed by 2009-08-14, the issue will be escalated to other developers to fix the process.

Barry Warsaw (barry)
tags: added: mailing-lists
Revision history for this message
Barry Warsaw (barry) wrote : Re: lp-mailman startup is blocking on a pid file in the wrong directory

I believe I have figured this out, and it's not a problem with Mailman. spm
helped me greatly with examining the running production system on forster, and
here is what I believe is happening.

Mailman is actually working exactly as expected. The fact that you get a
"Starting Mailman" and "Shutting down Mailman" message with a traceback in
between is a red herring, as is the fact that you see "mailman" in the pid
file name in the traceback.

What is happening is this. The initscript calls "make start" which calls
bin/run, which calls runlaunchpad.py, which examines the config files to
determine which services to start. In the
production-mailman/launchpad-lazr.conf file, we tell it to start Mailman.
Which it does perfectly.

Then something goes wrong[1] and the normal Launchpad shutdown procedure takes
over, which correctly shuts down Mailman. Mailman does exactly the right
thing here. Let's look at that traceback again:

Traceback (most recent call last):
  File "runlaunchpad.py", line 60, in ?
    run()
  File "runlaunchpad.py", line 56, in run
    start_launchpad(argv)
  File "/srv/lists.launchpad.net/production/launchpad-rev-7667/lib/canonical/launchpad/scripts/runlaunchpad.py", line 237, in start_launchpad
    make_pidfile('launchpad')
  File "/srv/lists.launchpad.net/production/launchpad-rev-7667/utilities/../lib/canonical/lazr/pidfile.py", line 34, in make_pidfile
    raise RuntimeError("PID file %s already exists. Already running?" %
RuntimeError: PID file /srv/launchpad.net/var/production-mailman-launchpad.pid already exists. Already running?

You think Mailman's involved because you see
'production-mailman-launchpad.pid' there, but it's not! That pid file is
named after the LPCONFIG variable and config directory that's being used,
which for forster is... production-mailman. In fact, Mailman's pid file is
managed by mailmanctl, not by lazr/pidfile.py, so this cannot be referring to
Mailman's pid file. It's referring to a Launchpad instance.
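
As a rough sketch of why the file name is misleading, the path in the traceback is consistent with a name built from the config instance name plus the service name passed to make_pidfile(). The helper below is hypothetical and only mirrors the path seen in the traceback; it is not the actual lazr/pidfile.py code, and the "%s-%s.pid" format is an assumption inferred from that one path.

```python
import os


def pidfile_path(pid_dir, instance_name, service_name):
    """Build a pidfile path as <pid_dir>/<instance>-<service>.pid.

    Assumed layout, inferred from the path in the traceback above.
    """
    return os.path.join(pid_dir, "%s-%s.pid" % (instance_name, service_name))


# With LPCONFIG set to production-mailman and make_pidfile('launchpad'),
# the "mailman" in the name comes from the config instance, not from Mailman.
print(pidfile_path("/srv/launchpad.net/var", "production-mailman", "launchpad"))
```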

spm confirmed this by cat'ing two pid files on forster.

/srv/lists.launchpad.net/var/mailman/data/master-qrunner.pid pointed to the
mailmanctl master qrunner, humming along perfectly.

/srv/launchpad.net/var/production-mailman-launchpad.pid pointed to a running
bin/run -i process, in other words, a running zope instance. So 'make start'
appears to start both Mailman and an appserver, and it's this latter that
fails because of the pre-existing pid file.

A little spelunking in runlaunchpad.py and Zope seems to indicate that an
appserver is unconditionally started by 'make start'. There appears to be no
way to prevent that, so if a previous crash left a trash Zope pidfile, you
would see exactly the error we're seeing. Mailman is nicely cleaning up its
trash, but Launchpad/Zope isn't :)

To fix this, I think we need to modify runlaunchpad.py, inside
start_launchpad() so that Zope's main() isn't called unconditionally. It
should probably consult a config file option before it starts Zope. Then, we
would update the production-mailman/launchpad-lazr.conf file to disable the
appserver.
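
A minimal sketch of that idea, assuming a boolean option that defaults to starting the appserver when it is absent. The section and option names match the [launchpad] launch setting quoted later in this report; the use of the standard library configparser is a stand-in for Launchpad's own config machinery, and the function name is hypothetical.

```python
import configparser


def should_launch_appserver(conf_text):
    """Return False only if the config explicitly disables the appserver."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # Fall back to True so existing configs keep their current behaviour.
    return parser.getboolean("launchpad", "launch", fallback=True)


mailman_conf = "[launchpad]\nlaunch: False\n"
if should_launch_appserver(mailman_conf):
    print("would call Zope's main() here")
else:
    print("appserver disabled by config; skipping Zope")
```

Defaulting to True keeps every other config directory unchanged, so only production-mailman needs the new setting.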

I'm kicking this over to Francis since it seems like more of a Foundations
issue. Francis, if you just wa...

Changed in launchpad-registry:
assignee: Barry Warsaw (barry) → Francis J. Lacoste (flacoste)
summary: - lp-mailman startup is blocking on a pid file in the wrong directory
+ 'make start' starts an appserver unconditionally
Revision history for this message
Barry Warsaw (barry) wrote :

I will work up a fix for this.

Changed in launchpad-registry:
assignee: Francis J. Lacoste (flacoste) → Barry Warsaw (barry)
Barry Warsaw (barry)
Changed in launchpad-registry:
status: Triaged → In Progress
Barry Warsaw (barry)
Changed in launchpad-registry:
status: In Progress → Fix Committed
milestone: 3.0 → 2.2.8
Barry Warsaw (barry)
Changed in launchpad-registry:
status: Fix Committed → Fix Released
Revision history for this message
Tom Haddon (mthaddon) wrote :

I'm not sure this is working as expected in production. See https://pastebin.canonical.com/22549/ - we don't have any other services running on this server, and the last entry appears to be a launchpad app server process.

Changed in launchpad-registry:
status: Fix Released → Confirmed
Curtis Hovey (sinzui)
Changed in launchpad-registry:
status: Confirmed → Triaged
milestone: 2.2.8 → 3.1.10
Revision history for this message
Barry Warsaw (barry) wrote :

Tom, I think this is working as expected, but here are some things to check.

Are you running revno 26 of ~launchpad-pqm/lp-production-configs/trunk ?

If you notice in production-mailman/launchpad-lazr.conf you'll see a section

[launchpad]
launch: False

that disables the actual starting of the appserver. You still see the run -i line in the ps output because there's no way to disable run from, um, running. But the process itself does not start an appserver and it just basically sleeps forever. The other python processes are Mailman running as expected.

I tested this in launchpad.dev by adding the above config to development/launchpad-lazr.conf and it did the right thing.

Please make sure you cannot get to the appserver on this machine.
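
One quick way to do that check is to try a TCP connection to the port the appserver would listen on. This is only a sketch: the function name is hypothetical, and the host and port are placeholders that must be replaced with whatever the local configuration actually uses.

```python
import socket


def appserver_reachable(host, port, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False
```

If appserver_reachable() returns False for the appserver's configured port on this machine, nothing is accepting connections there, which is what the launch: False setting is expected to produce.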

Changed in launchpad-registry:
status: Triaged → Incomplete
Revision history for this message
Tom Haddon (mthaddon) wrote :

This seems very strange. I guess we can live with the process running even though it's doing nothing (although I think this is still a bug that should be addressed by launchpad-foundations), but there is still something worrying: that process's pidfile is in /srv/launchpad.net/var. Can we add a:

[canonical]
pid_dir: /srv/lists.launchpad.net/var

to the production-mailman/launchpad-lazr.conf so we're not using a directory outside of the codetree?

It also looks like we're getting OOPSes written to /srv/launchpad.net/production-logs/mailman-xmlrpc - can this be changed so we can remove the /srv/launchpad.net tree from the mailman server altogether?

Curtis Hovey (sinzui)
Changed in launchpad-registry:
status: Incomplete → Fix Released
milestone: 3.1.10 → 3.0