closing terminal causes stale lock

Bug #257217 reported by Robert Collins
Affects: Bazaar
Status: Won't Fix
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

18:55 < lifeless> now you said you'd had a locking problem
18:55 < lifeless> has that happened more than once?
18:57 < robsta> no, and it was my fault, closed the terminal while bzr was prompting for commit msg
18:57 < lifeless> heh
18:57 < robsta> that works in svn
18:57 * robsta ducks
18:57 < lifeless> mmm, wonder what signal bzr got, perhaps it could handle it better
18:57 < lifeless> feel free to file a bug

 affects bzr

Revision history for this message
Martin Pool (mbp) wrote :

It's probably SIGHUP. We could catch it and turn it into a KeyboardInterrupt or something similar, and use that to unwind in the usual way.
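
A minimal sketch of that idea (illustrative only, not the code that eventually landed; TerminalGone is a hypothetical name):

    import signal

    class TerminalGone(KeyboardInterrupt):
        # Subclassing KeyboardInterrupt means existing cleanup paths
        # (finally blocks, lock releases) unwind exactly as for C-c.
        pass

    def _hangup(signum, frame):
        raise TerminalGone()

    # SIGHUP does not exist on Windows, so guard the registration.
    if getattr(signal, 'SIGHUP', None) is not None:
        signal.signal(signal.SIGHUP, _hangup)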

Changed in bzr:
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
Stefan Monnier (monnier) wrote :

I see the same problem when I "kill <bzrpid>". It's really annoying since I end up having to break three locks with 2 commands (the local branch lock, the remote branch lock (the branch is bound) and the checkout lock), and if I forget to do that, other people will have to guess whether they can break my lock, ...

Revision history for this message
Vincent Ladeuil (vila) wrote : Re: [Bug 257217] Re: closing terminal causes stale lock

>>>>> Stefan Monnier <email address hidden> writes:

    > I see the same problem when I "kill <bzrpid>". It's really annoying
    > since I end up having to break three locks with 2 commands (the local
    > branch lock, the remote branch lock (the branch is bound) and the
    > checkout lock)

A bare 'bzr break-lock' in the right directory should query you for the
three locks at once.

And since you *killed* the process rather than interrupting it, I'm not
sure we should even try to unwind in the usual way as proposed above.
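
For illustration, the recovery flow looks like this (directory name hypothetical; break-lock prompts once per held lock):

    $ cd my-checkout          # the working tree whose commit was interrupted
    $ bzr break-lock          # prompts about each held lock in turn:
                              # checkout, local branch, and bound remote branch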

Martin Pool (mbp)
Changed in bzr:
importance: Medium → High
tags: added: affects-emacs locking
Revision history for this message
Andrew Bennetts (spiv) wrote :

I agree with Martin in comment 1: turning SIGHUP into KeyboardInterrupt sounds good to me. I don't see any reason for bzr to treat "controlling terminal has gone" differently to "user hit Ctrl-C".

(A quick experiment suggests this scenario will indeed trigger SIGHUP.)

bzr might still fail to unlock in some circumstances (e.g. if the transport is already busy with another request?), but I can't see SIGHUP-raises-KeyboardInterrupt doing worse than the status quo, and it will surely do better in at least some cases, such as interrupting a commit to a local branch.

Revision history for this message
Martin Pool (mbp) wrote :

We should probably also unwind on SIGTERM. If the user really wants
"stop now without cleaning up" there's always SIGKILL.

I suspect another contributing problem is that users press C-c, see
nothing apparently happen, and press it again, giving us no chance to
clean up. If we print "interrupted, cleaning up..." then we might get
a bit more patience. Some cleanup operations, like closing a socket or
deleting limbo files, may take a noticeable time.
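
A sketch of that mitigation (illustrative only, not landed code): after the first C-c, later presses just print a notice instead of aborting the cleanup already in progress.

    import signal
    import sys

    def _while_cleaning(signum, frame):
        sys.stderr.write('bzr: still cleaning up, please wait...\n')

    def _first_interrupt(signum, frame):
        # Swap in a quieter handler so a second C-c cannot cut the
        # unwind short, then raise the usual KeyboardInterrupt.
        signal.signal(signal.SIGINT, _while_cleaning)
        sys.stderr.write('bzr: interrupted, cleaning up...\n')
        raise KeyboardInterrupt()

    signal.signal(signal.SIGINT, _first_interrupt)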

Revision history for this message
Martin Pool (mbp) wrote :

See also bug 588431 (pointing out that we sometimes refuse to break a lock even when it was held by a process on the same machine and we ought to be able to tell that it's no longer running), as well as bug 397315, bug 336481, and bug 217134.

tags: added: lockdir
Revision history for this message
Martin Pool (mbp) wrote :

I'll try turning SIGHUP into a KeyboardInterrupt subclass, which will give us a chance to unwind.

I suspect that if the terminal has gone away altogether we may hit other problems, such as I/O failing with EPIPE, and so we'll also need to fix some other locking-related bugs, which would be good anyhow.

Changed in bzr:
assignee: nobody → Martin Pool (mbp)
status: Confirmed → In Progress
Revision history for this message
Martin Pool (mbp) wrote :

OK, so lp:~mbp/bzr/257217-sighup catches SIGHUP and turns it into an exception, and in toy tests this does indeed mean we get a chance to unwind. However, in a more realistic test of closing the window while bzr is pushing to an sftp server, we get ERESTARTSYS from the send() syscall talking to the ssh child. We could of course fix that, though I'm concerned it would send us back into the mess of handling that error that we recently had with SIGWINCH. (Specifically, IIRC there are some bugs in Python, and in libraries we use, to do with restarting system calls.)
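
For reference, the usual defensive pattern around interrupted system calls is a retry loop along these lines (a generic sketch, not bzr's code):

    import errno
    import socket

    def send_all(sock, data):
        # Retry when a signal interrupts the syscall and the kernel
        # reports EINTR instead of transparently restarting it.
        while data:
            try:
                sent = sock.send(data)
            except socket.error as e:
                if e.errno == errno.EINTR:
                    continue
                raise
            data = data[sent:]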

Revision history for this message
John A Meinel (jameinel) wrote :

How about we land SIGHUP changes, and just go with it for now. We can always open another bug if we need to.

Revision history for this message
Martin Pool (mbp) wrote :

On 20 April 2011 19:15, John A Meinel <email address hidden> wrote:
> How about we land SIGHUP changes, and just go with it for now. We can
> always open another bug if we need to.

I think you're right: catching SIGHUP is a step forward and will fix
some cases. What I saw before led me to believe that when the
window is killed, ssh may also be killed without us getting a chance to
do anything about it. (We could try to prevent that by e.g. changing
process group etc, but that may lead to knock-on problems.)

Probably we should just
 * merge that branch and close this bug
 * reliably release locks server-side over ssh (maybe we do?)
 * just cope with stale locks over sftp

I'll have a look at this tomorrow.

Revision history for this message
John A Meinel (jameinel) wrote :

On 04/20/2011 11:44 AM, Martin Pool wrote:
> On 20 April 2011 19:15, John A Meinel <email address hidden> wrote:
>> How about we land SIGHUP changes, and just go with it for now. We can
>> always open another bug if we need to.
>
> I think you're right: catching SIGHUP is a step forward and will fix
> some cases. What I saw before led me to believe that when the
> window is killed, ssh may also be killed without us getting a chance to
> do anything about it. (We could try to prevent that by e.g. changing
> process group etc, but that may lead to knock-on problems.)
>
> Probably we should just
> * merge that branch and close this bug
> * reliably release locks server-side over ssh (maybe we do?)

For commands where we can, we do. However, we still have a "write-lock
the remote object because I'm going to do multiple operations" RPC,
where we hand the client a lock token, which it is required to send to
the server for each operation. So the server doesn't know when the last
write action is going to be received. (If ever.)
Again, we try to be stateless, so just having the connection drop
"doesn't count". (Since HTTP wouldn't be able to tell you that.)

> * just cope with stale locks over sftp
>
> I'll have a look at this tomorrow.
>

John
=:->
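
The shape of the protocol John describes is roughly as follows (a hypothetical sketch; the class and verb names are illustrative, not the real smart-server ones):

    class RemoteBranchClient(object):
        # Hypothetical sketch of the token-based write-lock RPC flow.
        def __init__(self, call_rpc):
            self._call = call_rpc  # sends one stateless request
            self._token = None

        def lock_write(self):
            # The server records the lock and returns an opaque token.
            self._token = self._call('lock_write')

        def set_tip(self, revid):
            # Every mutating request must carry the token; the server
            # never knows which request will be the last.
            self._call('set_tip', self._token, revid)

        def unlock(self):
            # Only this explicit request releases the lock; a dropped
            # connection "doesn't count", by the stateless design.
            self._call('unlock', self._token)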

Revision history for this message
Martin Pool (mbp) wrote :

On 20 April 2011 22:49, John Arbash Meinel <email address hidden> wrote:
> For commands where we can, we do. However, we still have a "write-lock
> the remote object because I'm going to do multiple operations" RPC,
> where we hand the client a lock token, which it is required to send to
> the server for each operation. So the server doesn't know when the last
> write action is going to be received. (If ever.)
> Again, we try to be stateless, so just having the connection drop
> "doesn't count". (Since HTTP wouldn't be able to tell you that.)

We shouldn't be doctrinaire about this. If the ssh connection drops,
it is very likely the whole client has terminated.

Perhaps in the future we should teach the client to start a new ssh
connection and pick up over that, but I'm not sure that's a really
important case. ssh connections fairly rarely drop by themselves and
then become immediately restartable, though this can happen when, for
example, switching from a wireless to a wired network. A more likely
reason is that the whole client died or was interrupted, or that there
was an extended network outage, in which case it's not very safe to
assume that no one will have broken the lock.

For SSH dropouts I think the best thing is probably just to make it
easy, safe and fast to resume an interrupted transfer in a second run
of the program. Secondarily we could make it try to restart the SSH
connection, but in that case it should probably be prepared for the
objects no longer to be locked.

Over HTTP it's true we can't tell when the client has disappeared, but
that is a much less common case.

>>  * just cope with stale locks over sftp
>>
>> I'll have a look at this tomorrow.

That's bug https://bugs.launchpad.net/bugs/220464, which I'm actually
working on.

Martin

Revision history for this message
Andrew Bennetts (spiv) wrote :

My only concern is that by registering a signal handler we may be inviting the return of some EINTR bugs. Hopefully, if we're just going to clean up and exit the process anyway, that's not a big deal, but we should at least keep an eye out for that sort of bug report after this lands.

Revision history for this message
Martin Pool (mbp) wrote :

On 21 April 2011 10:23, Andrew Bennetts <email address hidden> wrote:
> My only concern is that by registering a signal handler we may be
> inviting the return of some EINTR bugs.  Hopefully, if we're just going
> to clean up and exit the process anyway, that's not a big deal, but we
> should at least keep an eye out for that sort of bug report after this
> lands.

That is a concern. More generally, I think it's a reason not to sink
too much effort into sophisticated cleanup from signals: first, the
ssh connection may already be dead, and second, handling signals
safely and reliably in Python is difficult. It's not only EINTR
issues, but also cases like TooManyConcurrentRequests from trying to
clean up objects that are already in use.
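
One partial mitigation worth noting (a sketch, Unix-only, with its own trade-off): signal.siginterrupt can request SA_RESTART so that system calls interrupted by the handled signal are restarted instead of failing with EINTR, at the cost that a blocked call may run to completion before the Python-level handler fires.

    import signal

    def _note_hangup(signum, frame):
        pass  # record the hangup; act on it at a safe point

    if getattr(signal, 'SIGHUP', None) is not None:
        signal.signal(signal.SIGHUP, _note_hangup)
        # Ask the kernel to restart syscalls interrupted by SIGHUP
        # (SA_RESTART) rather than failing them with EINTR.
        signal.siginterrupt(signal.SIGHUP, False)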

Revision history for this message
Martin Pool (mbp) wrote :

<https://code.launchpad.net/~mbp/bzr/220464-stale-locks/+merge/62582> should soon be ready to merge, and should handle the specific reported case (closing a window or killing a local process) well, by removing the stale lock the next time bzr runs.

Based on our experience with similar situations, I think trying to catch SIGTERM or SIGHUP and do nontrivial processing from there is likely either not to work or to cause knock-on bugs. A few reasons: one is that signals tend to also kill the ssh process we're using to talk to the server; another is that Python error unwinding and signal handling is not all one might want.

Locally we can easily detect that a lock is stale; remotely, our connection is likely to have been killed or to be problematic to reopen. Over hpss we should concentrate on having the server process release the lock if the client goes away. Over sftp it's inherently hard, but there I suppose we can work on just holding locks for the minimum time.
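
For the local case, the staleness check amounts to something like this (a simplified sketch; hypothetical field names, and the real check has to worry about pid reuse):

    import errno
    import os
    import socket

    def lock_is_stale(info):
        # `info` is parsed from the lock's info file; assume it carries
        # at least 'hostname' and 'pid' (hypothetical field names).
        if info['hostname'] != socket.gethostname():
            return False  # can't tell anything about another host's process
        try:
            os.kill(int(info['pid']), 0)  # signal 0: existence check only
        except OSError as e:
            if e.errno == errno.ESRCH:
                return True  # holder process is gone, lock is stale
        return False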

Changed in bzr:
assignee: Martin Pool (mbp) → nobody
status: In Progress → Won't Fix