NFS mounts at boot time prevent boot or print spurious errors

Bug #504224 reported by Alvin
This bug affects 22 people
Affects            Status        Importance  Assigned to  Milestone
mountall (Ubuntu)  Fix Released  Medium      Unassigned
Lucid              Fix Released  Medium      Unassigned

Bug Description

Binary package hint: mountall

karmic: mountall 1.0

When mounting NFS shares at boot, the first mount attempt usually fails because rpc.statd or portmap is not yet running, or because the network itself is not up yet.

These errors appear even though the filesystem will be mounted successfully later on, so the user cannot distinguish them from "real" mount failures that need intervention by the admin.

Even worse, when an NFS mount is for a mountpoint that mountall considers essential for boot (like /home), the user is dropped to an emergency shell and the system refuses to boot.

A temporary workaround is to add "nobootwait" to the affected mountpoints, but this causes the system to continue booting even when the mountpoint is still unavailable after the network has come up correctly, i.e. when there is a real problem with the NFS server.
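
For illustration, an fstab entry using this workaround might look like the following (the server name and export path are hypothetical placeholders):

 # hypothetical example; "nfsserver" and the export path are placeholders
 nfsserver:/export/home /home nfs rw,hard,intr,nobootwait 0 0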

Proposed fix:

mountall should not try to mount any network file systems when called the first time. Only when the network is up and mountall receives SIGUSR1 should it try to do so. If that does not work at *that* point, mountall should print an error (and stop the boot for essential mount points) as before.
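
A minimal C sketch of the proposed logic (an illustration only, not mountall's actual code; all names here are invented):

  /* Sketch: defer remote filesystems until SIGUSR1 says the network is up. */
  #include <signal.h>
  #include <stdbool.h>
  #include <string.h>

  static volatile sig_atomic_t network_up = 0;

  /* The network job would send SIGUSR1 once interfaces are configured. */
  static void on_sigusr1(int signum) { (void)signum; network_up = 1; }

  static bool is_remote(const char *fstype)
  {
      return !strcmp(fstype, "nfs") || !strcmp(fstype, "nfs4")
          || !strcmp(fstype, "cifs");
  }

  /* Called on every mount pass: remote filesystems are skipped silently
   * until the network is up; failures after that point are real errors. */
  static bool should_try_mount(const char *fstype)
  {
      return !is_remote(fstype) || network_up;
  }

  int main(void)
  {
      signal(SIGUSR1, on_sigusr1);
      /* ... event loop walking fstab and calling should_try_mount() ... */
      return 0;
  }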

Revision history for this message
Johan Walles (walles) wrote :

For clarification, if this is the same thing I'm having it's preventing my machine from booting.

I have about 10 NFS mounts, and my machine hangs on boot with the above printouts.

To work around this, and boot into X so I can log in, I can enter the recovery shell, do "mount -a", and exit the recovery shell.

This enables me to boot into X. I still cannot access my virtual text consoles after this; I assume that the boot process is still waiting for something else...

Revision history for this message
Alvin (alvind) wrote : Re: [Bug 504224] Re: Inconsistent error message: Filesystem could not be mounted

On Thursday 07 January 2010 14:47:33 Johan Walles wrote:
> For clarification, if this is the same thing I'm having it's preventing
> my machine from booting.

This bug is only about the error messages: they are the same for spurious and
real failures, which makes it difficult to know what is going on. (However, I
do have a lot of machines that suffer from the same problem as yours and can't
boot.)

> I have about 10 NFS mounts, and my machine hangs on boot with the above
> printouts.
>
> To work around this, and boot into X so I can log in, I can enter the
> recovery shell, do "mount -a", and exit the recovery shell.

The same is the case on my servers. I think the problem you describe is caused
by bug #470776. The situation is not very consistent. Some machines simply
can't boot and others boot, but don't mount NFS. Still others (like the
example) do boot and mount the shares, but tell me it didn't work. (That's
what this bug is about.) There is also bug #431248 with some discussion, but
that one is considered fixed.

> This enables me to boot into X. I still cannot access my virtual text
> consoles after this; I assume that the boot process is still waiting for
> something else...

That might be something else, like a problem with the video driver.

Revision history for this message
Johan Walles (walles) wrote : Re: Inconsistent error message: Filesystem could not be mounted

I'm probably suffering from bug 470776, thanks for the reference! It has been fixed for Lucid, but I filed bug 504271 about getting it fixed for Karmic as well. Unattended boots would be nice.

My text consoles work "fine"; they just don't have any login prompts. So it's probably not video-driver related. Since the whole boot process has more or less broken down for me with Karmic, I tend to blame it for everything...

OT: I have also revived bug 466693 about the really slow boot process in Karmic.

Alvin (alvind)
tags: added: ubuntu-boot-experience
tags: added: boot-experience
removed: ubuntu-boot-experience
tags: added: ubuntu-boot-experience
removed: boot-experience
Revision history for this message
Nikolaus Rath (nikratio) wrote :

Note that you can get around the recovery shell by using the "nobootwait" option in /etc/fstab for /home.

Changed in mountall (Ubuntu):
status: New → Confirmed
Nikolaus Rath (nikratio)
description: updated
summary: - Inconsistent error message: Filesystem could not be mounted
+ NFS mounts at boot time prevent boot or print spurious errors
Revision history for this message
Alvin (alvind) wrote :

The undocumented 'nobootwait' option doesn't change anything here. It should be _netdev (see mount(8)), because I'd like to wait until the network is up.
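
For reference, an entry using _netdev would look something like this (server name and paths are hypothetical):

 # hypothetical example; _netdev marks the filesystem as needing the network
 nfsserver:/export/data /data nfs rw,hard,intr,_netdev 0 0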

Revision history for this message
Paul Elliott (omahn) wrote :

I also have this issue on a test box upgraded from 8.04 LTS to 10.04 LTS alpha. Boot hangs completely, no X (not installed on our servers) and no login shell. VT1 shows the following:

mount.nfs: rpc.statd is not running but is required for remote locking.
mount.nfs: Either use '-o nolock' to keep locks local, or start statd.

Repeated several times followed by the following for each NFS entry (we have many) in our fstab:

mountall: mount /usr/systems [849] terminated with status 32
mountall: Filesystem could not be mounted: /usr/systems

Booting with the rescue option from the boot menu makes no difference. If I boot from a live CD and disable the NFS mounts in /etc/fstab then the server boots successfully.

Timo Aaltonen (tjaalton)
Changed in mountall (Ubuntu):
importance: Undecided → Medium
milestone: none → ubuntu-10.04
Changed in mountall (Ubuntu Lucid):
status: Confirmed → Fix Committed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package mountall - 2.10

---------------
mountall (2.10) lucid; urgency=low

  * Rework the Plymouth connection logic; one needs to attach the client to
    the event loop *after* connection otherwise you don't get disconnection
    notification, and one needs to actually actively disconnect in the
    disconnection handler.
  * For safety and sanity reasons it becomes much simpler to create the
    ply_boot_client when we connect, and free it on disconnection. Thus the
    presence or not of this struct tells us whether we're connected or not.
    LP: #524708.
  * Flush the plymouth connection before closing it and exiting, otherwise
    updates may be pending and the screen have messages that confuse people
    while X is starting (like fsck at 90%). LP: #487744.

  * Replace the modal plymouth prompt for error conditions with code that
    continues working in the background while prompting. This most benefits
    the old "Waiting for" message, which can now allow you to continue to
    wait and it can solve itself. LP: #527666, #545435.
  * Integrate fsck progress updates into the same mechanism.
  * Allow fsck messages to be translated. LP: #390740.
  * Change fsck message to be a little less alarming. LP: #545267.
  * Add hard dependency on Plymouth; without it running, mountall will
    ignore any filesystem which doesn't show up within a few seconds or that
    fails to fsck or mount. If you don't want graphical splash, you simply
    need not install themes.

  * Improve set of messages seen with --verbose, and ensure all visible
    messages are marked for translation. LP: #446592.
  * Reduce priority of failed to mount error for remote filesystems since
    we try again, and this just spams the console. LP: #504224.

  * Keep hold of the dev_t when parsing /proc/self/mountinfo, then after
    mounting /dev (or seeing that it's mounted) create a quick udev rules
    file that adds the /dev/root symlink to this device. LP: #527216.
  * Do not try and update /etc/mtab when it's a symbolic link. LP: #529993.
  * Remove odd -a option from mount calls, probably a C&P error from the
    fsck code long ago. LP: #537135.
  * Wait for Upstart to acknowledge receipt of events, even if we don't
    hang around for them to be handled.
  * Always run through try_mounts() at least once. LP: #537136.
  * Don't keep mountall running if the only remaining unmounted filesystems
 -- Scott James Remnant <email address hidden> Wed, 31 Mar 2010 19:37:31 +0100

Changed in mountall (Ubuntu Lucid):
status: Fix Committed → Fix Released
Revision history for this message
Paul McEnery (pmcenery) wrote :

I've built and installed version 2.10 and have been through a couple of reboot cycles and everything on the NFS front appears to be working correctly now.

On the first boot however, I noticed that the boot stalled while it was supposedly doing an fsck. To my surprise, I was able to ssh to the system while it was in this state, and found that all filesystems including nfs ones were in fact mounted. I ran a reboot, and it has since booted up correctly. I've attached a picture of where it got stuck.

I think there may be an issue with how routine disk checks are handled.

Regards,
Paul.

Revision history for this message
Brett Gardner (brett-gardner) wrote :

I am seeing this bug in Lucid as well.

Revision history for this message
Ethan Baldridge (ethan-superiordocumentservices) wrote :

Yes, it's still happening in Lucid; also with CIFS volumes mounted in /etc/fstab. All pertinent mounts use the _netdev option, which should theoretically signal mountall not to do these until the network is available, but...

Most annoying is that about every other boot or so (probably doesn't happen every time due to timing issues in the parallel startup) it can't mount my CIFS volumes and gives me a recovery console.

My guess is that mountall isn't honoring _netdev.

(also mount.cifs throws an annoying warning message about not understanding _netdev, but that's just a papercut)

Revision history for this message
Steve Langasek (vorlon) wrote :

_netdev is irrelevant on cifs and nfs shares. And mountall understands _netdev.

What, in your own words, is the problem you're seeing? I.e., don't say "it's still happening" - that doesn't tell us what you think "it" is, since in all our tests, there are no problems with CIFS/NFS mounts at boot time due to this bug.

Revision history for this message
paolo (paolo-faverio) wrote :

Hi all,
I'm trying to get autofs running.
It seems to suffer from the same "rpc.statd is not running but is required for remote locking" problem.
Forcing statd to run manually solves the issue.
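
(For reference, under Upstart that manual start would be something like

 sudo start statd

assuming statd is managed by an /etc/init/statd.conf job, as the "statd pre-start process" messages elsewhere in this report suggest.)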

The same autofs configuration was working fine in 9.10.

Revision history for this message
Chris (bridgeriver) wrote :

I also have these problems with nfs-mounted /home on Lucid. Sometimes GDM starts and lets me log in before /home is mounted, which leads to what amounts to a crash.

I made a partial workaround by hacking /etc/init/gdm.conf to make GDM wait on /home being mounted. This is an improvement but not a fix. Sometimes I'm left looking at a text screen for a minute or two before /home mounts and GDM starts; other times the wait is long and I have to log in, gain root, manually issue 'mount /home', and manually restart GDM.
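
The kind of stanza involved might look like this (a sketch of the described hack, not the stock gdm.conf, whose start condition is more elaborate):

 # sketch: hold gdm until mountall emits the mounted event for /home
 start on (filesystem and mounted MOUNTPOINT=/home)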

All this works perfectly in Karmic; the problem is new to Lucid.

So can I just copy over Karmic's /etc/init directory to the Lucid install and have a working system?

Revision history for this message
Steve Langasek (vorlon) wrote : Re: [Bug 504224] Re: NFS mounts at boot time prevent boot or print spurious errors

On Sat, Jun 12, 2010 at 07:32:23PM -0000, Chris wrote:
> I also have these problems with nfs-mounted /home on Lucid. Sometimes
> GDM starts and lets me log in before /home is mounted, which leads to
> what amounts to a crash.

That is unrelated to this bug, which has been fixed. If gdm is starting
before your /home is mounted, either you have a modified /etc/init/gdm.conf,
or you have something wrong in your /etc/fstab.

--
Steve Langasek                   Give me a lever long enough and a Free OS
Debian Developer                 to set it on, and I can move the world.
Ubuntu Developer                 http://www.debian.org/
<email address hidden> <email address hidden>

Revision history for this message
Chris (bridgeriver) wrote :

@Steve Langasek:

My gdm.conf was not hacked until after the problem occurred, so that's presumably not it.

The relevant line in /etc/fstab is:

192.168.0.1:/home /home nfs rw,auto,intr,hard,bg,exec 0 2

This works fine under Karmic, but I don't have a huge amount of experience with NFS and it's possible there is some subtle error.

Revision history for this message
Chris (bridgeriver) wrote :

@Steve Langasek:

Just to update: I tried changing the fstab lines to defaults:

192.168.0.1:/home /home nfs defaults 0 1

reverting the gdm.conf to remove the check for /home being mounted, and rebooting. This time the machine came up fine. Not sure if it'll do that reliably, but it looks promising.

Thanks!

Revision history for this message
graemev (graeme-launchpad) wrote :

I'm a little concerned about the phrase "Only when the network is up"

I'm experiencing these hangs on my laptop (Acer Aspire One) when using wireless, but not when wired. I assume this is because the network is 'up' but the wireless does not come up unless and until I start a GUI? (Actually, I've not tracked down where the WiFi is started, but it seems to be user-specific; that's why I'm assuming a user needs to be logged on.)

Revision history for this message
ubuntuforum-bisi (ubuntuforum-bisi) wrote :

This appears to be happening in Lucid 10.04.1.

 cat boot.log
 Begin: Loading essential drivers... ...
 Done.
 Begin: Running /scripts/init-premount ...
 Done.
 Begin: Mounting root file system... ...
 Begin: Running /scripts/local-top ...
 Done.
 Begin: Running /scripts/local-premount ...
 Done.
 Begin: Running /scripts/local-bottom ...
 Done.
 Done.
 Begin: Running /scripts/init-bottom ...
 Done.
 fsck from util-linux-ng 2.17.2
 fsck from util-linux-ng 2.17.2
 init: statd pre-start process (800) terminated with status 1
 /dev/sda1: clean, 244437/14647296 files, 14616186/58564949 blocks
 mount.nfs: DNS resolution failed for 192.168.xxx.3: Name or service not known
 mount.nfs: DNS resolution failed for 192.168.xxx.8: Name or service not known
 mountall: mount /1backup [866] terminated with status 32
 mountall: mount /0data/elf [864] terminated with status 32
 init: ureadahead-other main process (872) terminated with status 4
 mount.nfs: DNS resolution failed for 192.168.xxx.8: Name or service not known
 mountall: mount /0data/elf [890] terminated with status 32
 mount.nfs: DNS resolution failed for 192.168.xxx.3: Name or service not known
 mountall: mount /1backup [893] terminated with status 32
  * Starting AppArmor profiles
 Skipping profile in /etc/apparmor.d/disable: usr.bin.firefox

Running "mount -a" causes the NFS mounts to occur as desired/expected.

Relevant contents of /etc/fstab:
 # nfs mount of elf's data
 192.168.xxx.8:/0data /0data/elf nfs nfsvers=3,rw 0 0
 # nfs mount of public folder on qnap
 192.168.xxx.3:/Public /1backup nfs nfsvers=3,rw 0 0

Revision history for this message
astrostl (astrostl) wrote :

I have also seen this on recent 10.04 LTS *VMs* (apparently tried to mount NFS prior to portmap running).

Revision history for this message
Sebastiaan Breedveld (s-breedveld) wrote :

Can confirm this, 10.04 LTS suffers from this bug.

Revision history for this message
Damiön la Bagh (kat-amsterdam) wrote :

Proof of this bug in a screenshot. I have had my Ubuntu customer in a panic twice now due to this issue. It happens when the machine has been turned off for the night. The next day he starts the machine (the NAS runs 24/7 but goes into sleep mode), but the NAS is still sleeping, so it doesn't react right away to the mount request. The system hangs with the message in the screenshot.

Revision history for this message
Steve Langasek (vorlon) wrote :

On Sat, Jul 30, 2011 at 03:21:33PM -0000, Kat Amsterdam wrote:
> Proof of this bug in a screenshot. I have had my ubuntu customer in a
> panic now twice due to this matter. It happens when the machine has been
> turned off for the night. The next day he starts the machine (the nas
> runs 24/7 but goes into sleep mode), but the nas is still sleeping, so
> it doesn't react right away to the request to mount. The system hangs
> with the message in the screenshot.

This is not the same bug as originally described, despite having
superficially similar symptoms. Please file a separate bug report for the
problem you're experiencing.

For me, the obvious first question is: why is plymouth exiting before
mountall has finished? Nothing in the /etc/init/plymouth-stop.conf job
should stop plymouth until the filesystem mounting is finished. So even if
the mount triggers too early the first time around, plymouth should still be
waiting for mountall-net to trigger and retry the network mount (and you
should never see the error message in normal operation).
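
(The mountall-net job referred to here is, in rough sketch form, an Upstart task along these lines; this is an illustration of the mechanism, not the shipped file:

 # sketch: nudge mountall to retry network mounts when an interface comes up
 start on net-device-up
 task
 exec kill -USR1 $(pidof mountall) || true
)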

The second question is why dbus is being killed with SIGTERM.

Is this screenshot taken after pressing Ctrl+Alt+Delete on the console?

--
Steve Langasek                   Give me a lever long enough and a Free OS
Debian Developer                 to set it on, and I can move the world.
Ubuntu Developer                 http://www.debian.org/
<email address hidden> <email address hidden>

Changed in mountall (Ubuntu):
assignee: nobody → susmita ghosh (surja-bi-das)
Steve Langasek (vorlon)
Changed in mountall (Ubuntu):
assignee: susmita ghosh (surja-bi-das) → nobody