timing out a branch puller worker leaves a locked branch

Bug #172286 reported by Michael Hudson-Doyle
2
Affects Status Importance Assigned to Milestone
Launchpad itself
Fix Released
High
Michael Hudson-Doyle

Bug Description

Because when a branch puller worker times out, it is SIGKILLed it doesn't get the opportunity to unlock the branch and so the next update to the branch will fail because it can't acquire the lock.

Tags: lp-code
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Here's what ddaa had to say about this:

<mwhudson> so from the "oh crap" department:
<mwhudson> when a branch puller worker times out and is killed, it leaves the branch locked
<mwhudson> jml: ^^
<ddaa> I thought that was "told you so" dept.
<ddaa> need to kill with SIGINT
<mwhudson> well, you certainly never told _me_ that
<ddaa> we occasionally need to poke published data directly, so quietly breaking locks is not a great idea
<mwhudson> though it's kind of obvious in retrospect
<ddaa> I'm pretty sure I told someone :)
<mwhudson> the problem with sigint, of course, is that it might not always work
<ddaa> sure
<ddaa> and there's always the "elmo tripped on the power cable" scenario
<mwhudson> anyway, i need coffee before thinking about this
<ddaa> one nice way
<ddaa> would be to put some distinctive UID in the lock
<ddaa> different from what bzr would naturally put there
<ddaa> and have this UID be passed from high up the call chain
<ddaa> so when we find this in the lock we know we should be able to break it
<ddaa> this way, we still have the option to safely poke data directly, e.g. for mass reconciling of branches in a concurrent script
<ddaa> but in the normal case, the way to kill should be something like "SIGINT, wait some time for child to terminate, if not terminated yet then SIGKILL".

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

More discussion:

<jml> mwhudson: I think that what ddaa says sounds right (although kind of crappy)
<thumper> mwhudson: perhaps we up the timeout to 30 minutes? :-|
<mwhudson> thumper: worth a try, i guess
<mwhudson> jml: ddaa said several things, which did you mean?
<mwhudson> SIGINT, wait, then KILL?
<jml> yeah.
<jml> (sorry, am doing a couple of other things atm)
<thumper> followed by explicit unlock
<thumper> does suck somewhat though
<mwhudson> if the process is siginted it should unlock the branch
<mwhudson> (i think?)
<thumper> that's my understanding too
<thumper> however I think it'd still be worth testing in the code
<thumper> basicly check if locked, and if so, unlock
<mwhudson> makes sense i guess
<mwhudson> do either of you desperately want to fix it or shall i assign it to me?
<thumper> I didn't even get to look at code today :(
<jml> mwhudson: I'm full.
<mwhudson> i'll take it then

Changed in launchpad-bazaar:
assignee: nobody → mwhudson
importance: Undecided → High
milestone: none → 1.1.12
status: New → Confirmed
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Progress in the branch at bzr+ssh://devpad.canonical.com/code/mwh/launchpad/more-gentle-killing-and-unlocking-bug-172286.

Testing this stuff is "tricky" :/

Changed in launchpad-bazaar:
status: Confirmed → In Progress
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

RF launchpad/devel revision 5393

I managed to implement and test the "full-fat" solution of recognizable locks and only breaking recognizable locks.

Changed in launchpad-bazaar:
status: In Progress → Fix Committed
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Rolled out with 1.1.12.

Changed in launchpad-bazaar:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.