sshfs mounts occasionally lock up

Bug #554749 reported by Bogdan Butnaru
34
This bug affects 6 people
Affects Status Importance Assigned to Milestone
sshfs-fuse (Ubuntu)
Triaged
Medium
Unassigned

Bug Description

This is similar to bug #137514. However, that bug has stopped manifesting for me a while time ago. (But see comment #15 there.) Ever since I upgraded to Lucid, the bug (or something similar) started manifesting, even worse than before.

Description:

I have several disks on a headless server, and I mount them on my desktop using SSHfs. The server runs Karmic, the desktop Lucid.

The problem is that every now and then one (or more) of the sshfs mounts lock up. What this means depends on the program using the mountpoint. If Amarok is playing a song, it stops generating sound, but I can press “stop” and play some local files. If it’s Nautilus or even a “ls” in a terminal, it freezes until killed. I’ve had a few cases of programs turning zombie, too (I had to reboot to release the mountpoints).

It’s not deterministic, but it seems to happen more often (but not exclusively) when the mount is used more intensely, ie when several programs access it at the same time.

Most frequently it’s the mountpoint with music that does this (simply because I have music playing almost all the time); in those cases I can stop Amarok and try to unmount everything. This tends not to work:

$ unmount-all [a script that unmounts every SSHfs mount]
umount: /media/corum: device is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))

Invariably both lsof and fuser lock up if I try to use them. Usually I can just kill the SSHfs and ssh processes, and _if_ nothing else is accessing the mountpoint (e.g., if I stop Amarok), I can then unmount.

However, if something still holds an open file on those filesystems, bad things (TM) happen. For instance, MusicBrainz Picard enters a zombie state:

bogdanb@mabelode:~$ ps aux|grep picard
bogdanb 2107 2.5 0.0 0 0 ? Zl 20:19 0:43 [picard] <defunct>

AFAIK, there’s nothing I can do to get rid of it, except rebooting. I can of course mount things in a different place, but lots of programs expect things to be in a specific place (eg, my music library).

This report was started with ubuntu-bug, giving it the pid of a sshfs process that locked up. Below are strace calls on the sshfs process (pid 2000) and ssh process (pid 1996) involved; they both stop after the NULL, and I pressed Ctrl-C:

bogdanb@mabelode:~$ sudo strace -p 2000
Process 2000 attached - interrupt to quit
futex(0x7fff3e31cd90, FUTEX_WAIT_PRIVATE, 0, NULL^C <unfinished ...>
Process 2000 detached
bogdanb@mabelode:~$ sudo strace -p 1996
Process 1996 attached - interrupt to quit
select(7, [3], [], NULL, NULL^C <unfinished ...>
Process 1996 detached

ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: sshfs 2.2-1build1
ProcVersionSignature: Ubuntu 2.6.32-19.28-generic 2.6.32.10+drm33.1
Uname: Linux 2.6.32-19-generic x86_64
NonfreeKernelModules: nvidia
Architecture: amd64
Date: Sat Apr 3 20:27:57 2010
EcryptfsInUse: Yes
ExecutablePath: /usr/bin/sshfs
ProcEnviron:
 LANGUAGE=en_US:en
 PATH=(custom, user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: sshfs-fuse

Revision history for this message
Bogdan Butnaru (bogdanb) wrote :
Revision history for this message
Bogdan Butnaru (bogdanb) wrote :
Revision history for this message
Bogdan Butnaru (bogdanb) wrote :
Revision history for this message
Charlie Kravetz (cjkgeek) wrote :

Thanks for reporting this bug and any supporting documentation. Since this bug has enough information provided for a developer to begin work, I'm going to mark it as confirmed and let them handle it from here. Thanks for taking the time to make Ubuntu better!

I have sent this upstream by message to 'fuse-sshfs <fuse-
<email address hidden>>' and we will now await a response.

Thanks for taking the time to make Ubuntu better!

Changed in sshfs-fuse (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Bogdan Butnaru (bogdanb) wrote :

The files I just attached are GDB backtraces of the SSH and SSHfs processes after this bug manifests itself.

According to “ps aux”, they were run as:

ssh -x -a -oClearAllForwardings=yes -2 tanelorn -s sftp
sshfs -s -o reconnect,transform_symlinks,allow_other,nonempty,large_read,hard_remove,intr,fsname=tanelorn:/mnt/corum tanelorn:/mnt/corum /media/corum

Everything that touched the mountpount froze, including Nautilus, and, strangely enough, the panel. Some applets on the panel *do* work, but some (and the panel itself) don’t. By the way, I have the system monitor in the panel, and it displays something like a quarter of its width filled with the IOWait color. I assume this means one of the four cores is blocked waiting for something.

Revision history for this message
Bogdan Butnaru (bogdanb) wrote :

By the way, the back-traces were recorded using the instructions for already-running programs at https://wiki.ubuntu.com/Backtrace

However, I skipped step 4; if I ran “continue”, gdb would hang too, and Ctrl-C would only display a “^C”. I hope this is useful.

Revision history for this message
Bogdan Butnaru (bogdanb) wrote :

I’ve done another few tests, this time with « sshfs »’s debug enabled. This didn’t work for the similar bug I experienced last year.

I mounted the fs with:
bogdanb@mabelode:~$ sshfs -d tanelorn:/mnt/corum /media/corum 2>&1 | tee sshfs-log.txt

Then I did things on the FS until I encountered the bug. Note that, as I said, this happens randomly, and usually when there’s a lot of activity. This time it took even longer than usual, I suspect that debugging conflicts with whatever is causing the lock-up.

(In this case, I was tagging music files with Picard; Picard reads the files to generate an acoustic fingerprint, and later writes the file with modified tags somewhere else. At the same time I was looking at images on the same mount-point, to help trigger the bug.)

As a result, the log is 16 MB long uncompressed. I also made backtraces for the SSHFS and SSH processes while locked up. I did several to see if it was completely frozen or just very slow; one of them seems to have changed between runs.

I noticed at one point a “Corrupted MAC on input” message. I had noticed the same message before, in a test run without “-d”. The earlier test run locked up very fast after that message, but this one kept going for quite a while, so I think it may be just a coincidence.

Revision history for this message
Bogdan Butnaru (bogdanb) wrote :

Some comment on the net suggested I look at dmesg, and I noticed the messages below there. (There were several identical copies, with ~120 delay in their timestamps.) It might help to know what Picard was doing when SSHFS locked up.

[ 1440.800052] INFO: task picard:2290 blocked for more than 120 seconds.
[ 1440.800056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1440.800059] picard D 00000000ffffffff 0 2290 1 0x00000000
[ 1440.800065] ffff8800cfa41c78 0000000000000086 0000000000015b80 0000000000015b80
[ 1440.800071] ffff88010fe6df80 ffff8800cfa41fd8 0000000000015b80 ffff88010fe6dbc0
[ 1440.800076] 0000000000015b80 ffff8800cfa41fd8 0000000000015b80 ffff88010fe6df80
[ 1440.800081] Call Trace:
[ 1440.800091] [<ffffffff810f38d0>] ? sync_page+0x0/0x50
[ 1440.800096] [<ffffffff8153e697>] io_schedule+0x47/0x70
[ 1440.800100] [<ffffffff810f390d>] sync_page+0x3d/0x50
[ 1440.800104] [<ffffffff8153ed9a>] __wait_on_bit_lock+0x5a/0xc0
[ 1440.800108] [<ffffffff810f38a7>] __lock_page+0x67/0x70
[ 1440.800114] [<ffffffff81085340>] ? wake_bit_function+0x0/0x40
[ 1440.800118] [<ffffffff810f36de>] ? find_get_page+0x1e/0xa0
[ 1440.800121] [<ffffffff810f4f33>] find_lock_page+0x63/0x80
[ 1440.800125] [<ffffffff810f5192>] filemap_fault+0x182/0x460
[ 1440.800130] [<ffffffff811118b4>] __do_fault+0x54/0x500
[ 1440.800135] [<ffffffff81114df8>] handle_mm_fault+0x1a8/0x3c0
[ 1440.800140] [<ffffffff810397a9>] ? default_spin_lock_flags+0x9/0x10
[ 1440.800145] [<ffffffff8154334a>] do_page_fault+0x12a/0x3b0
[ 1440.800149] [<ffffffff81540ca5>] page_fault+0x25/0x30

Revision history for this message
Ketil Malde (ketil-ii) wrote :

I don't know if this is the problem here, but I often get zombie processes when I forget to unmount sshfs shares before suspending my laptop (thus breaking the network connection). I can 'fusermount -u -z' the share, but I need to kill the sshfs process before the zombies can go away. I don't think I've had to reboot.

Couldn't the sshfs process be terminated when the network connection breaks, or at least on a lazy (-z) unmount?

Revision history for this message
Simon Kollberg (simonklb) wrote :

I'm not sure that this issue is exactly the problem I'm having but it seems that even the sshd locks up after sshfs has crashed. Since this happens on my headless server it's really tiresome. It started to happen after I upgraded to Lucid.

Would really appreciate if you looked into this asap.

Thanks!

Revision history for this message
D (dj-lp) wrote :

I think it should generally not be possible for FUSE to lock processes like this.

One thing I did to improve the situation was to reduce the ssh timeout:
echo ServerAliveInterval 30 >> ~/.ssh/config

But I think in my case usually the ssh connection is really hung (though the network connection stands without a problem). Maybe this can help for you, too.

Revision history for this message
kripton (kripton-r) wrote :
Revision history for this message
Bogdan Butnaru (bogdanb) wrote :

I have that fix activated since a long time ago. I still get issues; they seem somewhat rarer, but it’s hard to tell with things this intermittent.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.