wait-for-root fails to detect nbd root

Bug #696435 reported by Alkis Georgopoulos
This bug affects 3 people
Affects            Status         Importance   Assigned to        Milestone
linux (Ubuntu)     Fix Released   Undecided    Unassigned
  Xenial           Fix Released   Medium       Joseph Salisbury
nbd (Ubuntu)       Invalid        Undecided    Unassigned
systemd (Ubuntu)   Fix Released   Undecided    Unassigned
  Xenial           Fix Released   Undecided    Unassigned

Bug Description

[Impact]
The kernel does not generate any uevents when nbd-client connects a /dev/nbd0 device, so it is impossible to monitor or react to the state of /dev/nbd0.

[Fix]
Generate a change uevent when the size of /dev/nbd0 changes.

[Testcase]
* Start udevadm monitor
* modprobe nbd
* use nbd-client to connect something to /dev/nbd0
* observe that change udev events are generated on /dev/nbd0 itself
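
The last step of the testcase can be checked mechanically. The following is a minimal sketch, assuming the usual `udevadm monitor` line layout (source tag, timestamp, action, devpath, subsystem); the function name count_change_events is hypothetical and only kernel-side events are counted.

```shell
# Count kernel "change" uevents for one block device in captured
# `udevadm monitor` output read from stdin.
# Assumed line layout:
#   KERNEL[12.000001] change   /devices/virtual/block/nbd0 (block)
count_change_events() {
    dev=$1
    awk -v dev="$dev" '
        # Only KERNEL lines, action "change", devpath ending in /<dev>.
        /^KERNEL\[/ && $2 == "change" && $3 ~ ("/" dev "$") { n++ }
        END { print n + 0 }
    '
}

# Possible usage (needs root and a reachable nbd export):
#   udevadm monitor --kernel --subsystem-match=block > /tmp/monitor.log &
#   ... connect the export to /dev/nbd0, then stop the monitor ...
#   count_change_events nbd0 < /tmp/monitor.log
```

A count of 0 after connecting would correspond to the unpatched behaviour described under [Impact]; events on partitions such as nbd0p1 are deliberately not counted.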

[Regression Potential]
There is no change to existing uevents, or their ordering.
There is now an additional change event, which will cause systemd to mark nbd devices as ready and trigger the appropriate actions.

[Original Bug Report]

When using an nbd root, wait-for-root blocks for 30 seconds before booting continues successfully.

Using Ubuntu Natty, related packages versions:
    nbd-client 1:2.9.16-6ubuntu1
    initramfs-tools 0.98.1ubuntu9

The wait-for-root call from /usr/share/initramfs-tools/scripts/local:
 while [ -z "${FSTYPE}" ]; do
  FSTYPE=$(wait-for-root "${ROOT}" ${ROOTDELAY:-30})

  # Run failure hooks, hoping one of them can fix up the system
  # and we can restart the wait loop. If they all fail, abort
  # and move on to the panic handler and shell.
  if [ -z "${FSTYPE}" ] && ! try_failure_hooks; then
   break
  fi
 done

I replaced wait-for-root with a sh script that did `set >&2`, here are the relevant environment variables at the time wait-for-root was called:
ROOT='/dev/nbd0'
ROOTDELAY=''
ROOTFLAGS=''
ROOTFSTYPE=''
nbdroot='192.168.0.1,2011'

It's probably worth noting that "nbd0: unknown partition table" was displayed asynchronously 1-2 seconds after wait-for-root was invoked, while it was still waiting. But I tried adding a "sleep 5" as the last line of local-top/nbd, so that the nbd message was displayed well before wait-for-root was called, and it didn't make a difference. So I don't think a race condition is involved in this problem.

Temporarily I'm passing rootdelay=1 in the kernel command line to work around the problem.
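
The 30-second stall matches the `${ROOTDELAY:-30}` expansion in the loop quoted above: ROOTDELAY was empty in the environment dump, so the default applies. A small sketch of that expansion follows; the function name is hypothetical, and it is an assumption here that rootdelay=1 on the kernel command line ends up in ROOTDELAY, which would shorten the wait rather than fix the detection.

```shell
# Mirror the timeout expression from scripts/local: an unset or empty
# ROOTDELAY falls back to 30 seconds.
effective_root_timeout() {
    echo "${ROOTDELAY:-30}"
}

# ROOTDELAY=''  -> effective_root_timeout prints 30
# ROOTDELAY=1   -> effective_root_timeout prints 1
```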

Nuno Sucena Almeida (slug-debian) wrote :

I ran into "nbd0: unknown partition table" with maverick 64bit client and lucid server. The client doesn't boot:
Negotiation: ..size = 3397924KB
nbd0: unknown partition table
bs=1024, sz=3397924

and then it stops at
exec run-init /root /sbin/init ro

The client machine has two network cards, if that's relevant.

Alkis Georgopoulos (alkisg) wrote :

Nuno: your problem seems unrelated, please file another bug report if you didn't solve it yet.

I found a much better workaround for the problem. In an initramfs hook, I put the following code:

# Work around LP bug #696435
mkdir -p ${DESTDIR}/lib/udev/rules.d
cat > ${DESTDIR}/lib/udev/rules.d/60-squashfs.rules <<EOF
KERNEL=="nbd0", ENV{ID_FS_TYPE}="squashfs"
EOF

This makes wait-for-root happy because it finds the ID_FS_TYPE of our root /dev/nbd0 device.
No delays and no try_failure_hooks() anymore! :)
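
To see whether the rule took effect, one could inspect the udev properties of the device. This is a minimal sketch: the helper name id_fs_type is hypothetical, and it only parses KEY=VALUE lines such as those printed by `udevadm info --query=property`.

```shell
# Print the value of ID_FS_TYPE from KEY=VALUE lines on stdin.
# Returns non-zero when the property is missing or empty.
id_fs_type() {
    value=$(sed -n 's/^ID_FS_TYPE=//p' | head -n 1)
    [ -n "$value" ] && printf '%s\n' "$value"
}

# Possible usage on a running system or in the initramfs shell:
#   udevadm info --query=property --name=/dev/nbd0 | id_fs_type
```

With the workaround rule installed, the expectation would be that this prints squashfs for /dev/nbd0; without it, the helper fails because the property is absent.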

Nuno Sucena Almeida (slug-debian) wrote :

Alkis, thank you for your tip. I ended up moving all the infrastructure to an NFS-root-based configuration, as I had so much trouble with nbd on our computing cluster. Since then no more issues :)

Alkis Georgopoulos (alkisg) wrote :

...and yet another workaround, which doesn't hardcode "squashfs", is to put a local-top/nbd_ltsp script with the following contents:

#!/bin/sh

# Work around LP bug #696435
if [ "$ROOT" = /dev/nbd0 ] && [ -z "$FSTYPE" ]; then
    FSTYPE=$(blkid -s TYPE -o value "${ROOT}")
    if [ -n "$FSTYPE" ]; then
        echo "FSTYPE='$FSTYPE'" > /conf/param.conf
    fi
fi

Wouter, would you consider adding that to the local-top/nbd script instead?

Serge Hallyn (serge-hallyn) wrote :

Please reply if this is still an issue on a supported release.

Changed in initramfs-tools (Ubuntu):
status: New → Invalid
Changed in nbd (Ubuntu):
status: New → Invalid
Alkis Georgopoulos (alkisg) wrote :

Yes, it's still an issue in Trusty.
Also please use "Incomplete", not "Invalid" when you need feedback from a bug reporter.

root@ltsp241:~# blkid
/dev/nbd0: TYPE="squashfs"
/dev/nbd1: UUID="d7bfcbc8-9718-46f9-b9e3-daf9e46f596a" TYPE="swap"
/dev/sr0: LABEL="Ubuntu 10.04.3 LTS i386" TYPE="iso9660"
root@ltsp241:~# /usr/lib/initramfs-tools/bin/wait-for-root /dev/nbd0 1 || echo failed
failed
root@ltsp241:~# /usr/lib/initramfs-tools/bin/wait-for-root /dev/nbd1 1 || echo failed
failed
root@ltsp241:~# /usr/lib/initramfs-tools/bin/wait-for-root /dev/sr0 1 || echo failed
iso9660
root@ltsp241:~# lsb_release -sc
trusty

Changed in nbd (Ubuntu):
status: Invalid → New
Changed in initramfs-tools (Ubuntu):
status: Invalid → New
Alkis Georgopoulos (alkisg) wrote :

Hmmm, maybe this is an easier way to reproduce something similar without using NBD at all:

wait-for-root /dev/sr0 1

succeeds in a booted system,
but fails from the initramfs if one adds "break=bottom" in the kernel command line.

It succeeds in both cases for e.g. /dev/sda1.

Alkis Georgopoulos (alkisg) wrote :

I think the problem is the missing ID_FS_TYPE in udev for nbd devices,
and that it's also reported more properly there:
https://bugs.freedesktop.org/show_bug.cgi?id=62565

Maybe wait-for-root could find some better workaround when ID_FS_TYPE is unset though, e.g. checking the output of `blkid`...
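
The blkid fallback suggested here could look roughly like the polling loop below. It is only a sketch of the idea, not how wait-for-root is implemented: wait_for_fstype and the PROBE_CMD override are hypothetical names, and the default probe is the same `blkid -s TYPE -o value` call used in the local-top workaround earlier in this report.

```shell
# Poll a device until a filesystem type can be probed or the timeout
# (in seconds) expires. Prints the type and returns 0 on success.
# PROBE_CMD is a hypothetical override that makes the loop testable
# without a real block device.
wait_for_fstype() {
    dev=$1
    timeout=${2:-30}
    elapsed=0
    while :; do
        fstype=$(${PROBE_CMD:-blkid -s TYPE -o value} "$dev" 2>/dev/null) || fstype=""
        if [ -n "$fstype" ]; then
            printf '%s\n' "$fstype"
            return 0
        fi
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
}

# Possible usage:
#   FSTYPE=$(wait_for_fstype /dev/nbd0 10) || echo "no filesystem found"
```

Unlike a udev-based wait, this does not depend on ID_FS_TYPE being set, at the cost of probing the device once per second.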

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in initramfs-tools (Ubuntu):
status: New → Confirmed
Changed in nbd (Ubuntu):
status: New → Confirmed
Alkis Georgopoulos (alkisg) wrote :

This is still an issue in Ubuntu 16.04; initramfs-tools now unconditionally calls `wait-for-root /dev/nbd0 10` without even using ROOTDELAY.

I also reported this bug to https://github.com/yoe/nbd/issues/36.

tags: added: bot-stop-nagging
Christian Ehrhardt (paelzer) wrote :

Checking the referenced issue shows it was handled and fixed in:
1. Kernel: https://github.com/torvalds/linux/commit/37091fdd831f28a6509008542174ed324dd645bc
which is in 4.6 and thereby fixed in >=Yakkety.
2. Systemd: https://github.com/systemd/systemd/pull/2422, which went into systemd 230 and is therefore also part of >=Yakkety.

I'm filing bug tasks for systemd and the kernel to check and maybe consider for Xenial.

affects: initramfs-tools (Ubuntu) → linux (Ubuntu)
Changed in linux (Ubuntu):
status: Confirmed → New
Changed in nbd (Ubuntu):
status: Confirmed → Invalid
no longer affects: nbd (Ubuntu Xenial)
Changed in systemd (Ubuntu):
status: New → Fix Released
Changed in linux (Ubuntu):
status: New → Fix Released
Dimitri John Ledkov (xnox) wrote :

I do not believe I can land the systemd change unless the 4.4 kernel picks up the above-mentioned fix; otherwise HWE kernels will work and the GA kernel will not.

Changed in systemd (Ubuntu Xenial):
status: New → Incomplete
Changed in linux (Ubuntu Xenial):
assignee: nobody → Canonical Kernel (canonical-kernel)
tags: added: id-594d1970df3ec53730c0d28c
Changed in linux (Ubuntu Xenial):
status: New → In Progress
importance: Undecided → Medium
assignee: Canonical Kernel (canonical-kernel) → Joseph Salisbury (jsalisbury)
Joseph Salisbury (jsalisbury) wrote :

I built a Xenial test kernel with a pick of commit 37091fd. The test kernel can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp696435/

To test the kernel, be sure to install both the linux-image and linux-image-extra .deb packages.

tags: added: kernel-da-key
Dimitri John Ledkov (xnox) wrote :

@jsalisbury that kernel is very nice, could you please cherrypick that patch and include it in the next available/convenient src:linux SRU?

I have added the SRU template to the bug report.

description: updated
Joseph Salisbury (jsalisbury) wrote :

@Dimitri, yes I'll submit an SRU request.

Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Dimitri John Ledkov (xnox) wrote :
Changed in systemd (Ubuntu Xenial):
status: Incomplete → Confirmed
Kleber Sacilotto de Souza (kleber-souza) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Alkis Georgopoulos (alkisg) wrote :

I'm the bug reporter, and I have a problem in doing the verification.

a) The initial testcase that I reported happens in both 4.4 unpatched and 4.10. I.e. the bug that I reported is not yet fixed.

b) Some Ubuntu developer wrote a new testcase as part of doing the SRU. I cannot reproduce that testcase in either unpatched 4.4 or 4.10, i.e. `udevadm monitor` does show add/change events for nbd devices for me in both kernels. That sounds like a different bug that I never saw.

So I'm not sure how I can help here...

Dimitri John Ledkov (xnox) wrote :

Hello, I will retest this with the sru kernel.

udevadm monitor appears to monitor things that it knows about; thus one first needs to load the nbd module, start the monitor, unload the nbd module, and then proceed with the test case.

I started validating this on Wednesday, but had to travel away from a computer for an emergency =/ I will test this first thing on Monday.

Dimitri John Ledkov (xnox) wrote :

With uname -a listing 4.4.0-100-generic #123-Ubuntu I get two change events on the parent nbd3 device when connecting an nbd export with two partitions (nbd3p1 and nbd3p2) to /dev/nbd3.

With uname -a listing 4.4.0-98 I do not get such events.

This bug is verified on xenial; I will prepare a matching systemd udev rules change SRU.
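
The two data points above (4.4.0-100 emits the events, 4.4.0-98 does not) can be turned into a quick check of a Xenial GA kernel release string. This is a sketch under stated assumptions: the function name is hypothetical, treating ABI 100 as the threshold is an inference from this comment (nothing here says anything about 4.4.0-99), and other kernel series are reported as unknown rather than guessed.

```shell
# Return 0 if a 4.4.0-N release string has N >= 100 (verified fixed in
# this report), 1 if N is lower, and 2 for anything that is not a
# 4.4.0-N kernel. With no argument, the running kernel is checked.
xenial_kernel_has_nbd_uevents() {
    release=${1:-$(uname -r)}
    case $release in
        4.4.0-*) ;;
        *) return 2 ;;          # not a Xenial GA kernel: no claim made
    esac
    abi=${release#4.4.0-}       # e.g. "100-generic"
    abi=${abi%%-*}              # e.g. "100"
    case $abi in
        ''|*[!0-9]*) return 2 ;;
    esac
    [ "$abi" -ge 100 ]
}

# Possible usage:
#   xenial_kernel_has_nbd_uevents && echo "change uevents expected"
```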

tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-101.124

---------------
linux (4.4.0-101.124) xenial; urgency=low

  * linux: 4.4.0-101.124 -proposed tracker (LP: #1731264)

  * s390/mm: fix write access check in gup_huge_pmd() (LP: #1730596)
    - s390/mm: fix write access check in gup_huge_pmd()

linux (4.4.0-100.123) xenial; urgency=low

  * linux: 4.4.0-100.123 -proposed tracker (LP: #1729273)

  * Xenial update to 4.4.95 stable release (LP: #1729107)
    - USB: devio: Revert "USB: devio: Don't corrupt user memory"
    - USB: core: fix out-of-bounds access bug in usb_get_bos_descriptor()
    - USB: serial: metro-usb: add MS7820 device id
    - usb: cdc_acm: Add quirk for Elatec TWN3
    - usb: quirks: add quirk for WORLDE MINI MIDI keyboard
    - usb: hub: Allow reset retry for USB2 devices on connect bounce
    - ALSA: usb-audio: Add native DSD support for Pro-Ject Pre Box S2 Digital
    - can: gs_usb: fix busy loop if no more TX context is available
    - usb: musb: sunxi: Explicitly release USB PHY on exit
    - usb: musb: Check for host-mode using is_host_active() on reset interrupt
    - can: esd_usb2: Fix can_dlc value for received RTR, frames
    - drm/nouveau/bsp/g92: disable by default
    - drm/nouveau/mmu: flush tlbs before deleting page tables
    - ALSA: seq: Enable 'use' locking in all configurations
    - ALSA: hda: Remove superfluous '-' added by printk conversion
    - i2c: ismt: Separate I2C block read from SMBus block read
    - brcmsmac: make some local variables 'static const' to reduce stack size
    - bus: mbus: fix window size calculation for 4GB windows
    - clockevents/drivers/cs5535: Improve resilience to spurious interrupts
    - rtlwifi: rtl8821ae: Fix connection lost problem
    - KEYS: encrypted: fix dereference of NULL user_key_payload
    - lib/digsig: fix dereference of NULL user_key_payload
    - KEYS: don't let add_key() update an uninstantiated key
    - pkcs7: Prevent NULL pointer dereference, since sinfo is not always set.
    - parisc: Avoid trashing sr2 and sr3 in LWS code
    - parisc: Fix double-word compare and exchange in LWS code on 32-bit kernels
    - sched/autogroup: Fix autogroup_move_group() to never skip sched_move_task()
    - f2fs crypto: replace some BUG_ON()'s with error checks
    - f2fs crypto: add missing locking for keyring_key access
    - fscrypt: fix dereference of NULL user_key_payload
    - KEYS: Fix race between updating and finding a negative key
    - fscrypto: require write access to mount to set encryption policy
    - FS-Cache: fix dereference of NULL user_key_payload
    - Linux 4.4.95

  * Xenial update to 4.4.94 stable release (LP: #1729105)
    - percpu: make this_cpu_generic_read() atomic w.r.t. interrupts
    - drm/dp/mst: save vcpi with payloads
    - MIPS: Fix minimum alignment requirement of IRQ stack
    - sctp: potential read out of bounds in sctp_ulpevent_type_enabled()
    - bpf/verifier: reject BPF_ALU64|BPF_END
    - udpv6: Fix the checksum computation when HW checksum does not apply
    - ip6_gre: skb_push ipv6hdr before packing the header in ip6gre_header
    - net: emac: Fix napi poll list corruption
    - packet: hold bind lock when rebinding to fa...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Changed in systemd (Ubuntu Xenial):
status: Confirmed → In Progress
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Alkis, or anyone else affected,

Accepted systemd into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/systemd/229-4ubuntu21.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in systemd (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-xenial
removed: verification-done-xenial
Dimitri John Ledkov (xnox) wrote :

Confirmed with systemd 229-4ubuntu21.2 that dev-nbd4p1.device is only "plugged" after the client is running and all symlinks and partitions exist.

tags: added: verification-done verification-done-xenial
removed: verification-needed verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package systemd - 229-4ubuntu21.2

---------------
systemd (229-4ubuntu21.2) xenial; urgency=medium

  [ Dimitri John Ledkov ]
  * udev: Mark ndb devices as inactive until connected. (LP: #696435)
  * networkd: in dhcp, change UseMTU default to true, to accept DHCP provided MTU by default.
    (LP: #1717471)
  * sysctl: apply parameters in-order, instead of randomly. (LP: #1718444)
  * networkd: apply promote_secondaries, to make DHCP lease changes work.
    (LP: #1721223)
  * shutdown: sync filesystems, before going into a killing spree.
    (LP: #1722481)
  * sysctl: do not fail, when cannot apply sysctl changes due to read-only sysfs in containers.
    (LP: #1734409)
  * networkd,wait-online: add RequiredForOnline to mark mandatory/optional links for boot.
    (LP: #1737570)

  [ David Glasser ]
  * journald: don't reduce BurstRateLimit on low disk space (LP: #1732803)

 -- Dimitri John Ledkov <email address hidden> Wed, 21 Feb 2018 13:46:37 +0000

Changed in systemd (Ubuntu Xenial):
status: Fix Committed → Fix Released
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for systemd has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
