Race to mount seed device

Bug #1860046 reported by Dave Jones
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
livecd-rootfs (Ubuntu)
Fix Released
High
Unassigned
Bionic
Fix Released
Critical
Unassigned

Bug Description

[Impact]

We've just come across a situation where cloud-init races systemd to mount a partition, with the result that either cloud-init fails to load its seed, or a relatively important partition fails to mount.
This can basically lead to situations where the device will not boot.

[Test Case]

Boot the image built using the bionic-proposed livecd-rootfs multiple times and make sure it is booting up correctly.

TODO: find a better test case?

[Regression Potential]

The change seems relatively safe and doesn't seem like there are any obvious regressions it could cause, but one should be on a lookout for any cloud-init weirdness or issues on boot-time.

[Original Description]

We've just come across a situation where cloud-init races systemd to mount a partition, with the result that either cloud-init fails to load its seed, or a relatively important partition fails to mount. I'm not entirely sure this is a bug - it could well be argued this is mis-configuration on our part, and there's a trivial workaround, but I'll leave the cloud-init devs to make that call.

The configuration is as follows:

1. On recent Ubuntu classic images for the Raspberry Pi, cloud-init's seed is located on the boot partition (/dev/mmcblk0p1, labelled "system-boot"). This is desirable as it's a straight-forward FAT partition, and thus easily accessible and editable on any OS (as opposed to the original seed location on the root ext4 partition).

2. Cloud-init is configured to look for its seed with "fs_label: system-boot"

3. Systemd is configured (via /etc/fstab) to mount /dev/disk/by-label/system-boot (/dev/mmcblk0p1) on /boot/firmware

The result on boot-up is one of three situations:

1. cloud-init mounts /dev/mmcblk0p1 on a temp path. While attempting to read its seed, systemd attempts to mount /dev/mmcblk0p1 on /boot/firmware but fails because the device is already mounted elsewhere. cloud-init succeeds in reading its seed, umounts /dev/mmcblk0p1 and the system boots successfully, but /boot/firmware isn't mounted (leading to issues further down the line with other things like flash-kernel that expect it to be mounted).

2. cloud-init checks for /dev/mmcblk0p1 in /proc/mounts, and doesn't see it there. It goes to mount /dev/mmcblk0p1 on a temp path, but before it can do so, systemd mounts it on /boot/firmware. cloud-init's mount fails (device already mounted), so it fails to read its seed and uses a default config instead (which isn't entirely desirable for us as the default involves a long wait if ethernet is not connected; our default seed overrides that).

3. Everything works fine. This occurs when either a) systemd mounts /boot/firmware first, then cloud-init sees this mount in /proc/mounts and uses it or b) cloud-init mounts /dev/mmcblk0p1 on a temp path, reads its seed and unmounts it, *then* systemd mounts /boot/firmware.

Here's the relevant snippet of traceback from when the second result occurs:

...
2020-01-14 20:48:01,337 - util.py[DEBUG]: Read 386 bytes from /etc/os-release
2020-01-14 20:48:01,338 - util.py[DEBUG]: Running command ['blkid', '-odevice', '/dev/sr0'] with allowed return codes [0, 2] (shell=False, capture=True)
2020-01-14 20:48:01,348 - util.py[DEBUG]: Running command ['blkid', '-odevice', '/dev/sr1'] with allowed return codes [0, 2] (shell=False, capture=True)
2020-01-14 20:48:01,357 - util.py[DEBUG]: Running command ['blkid', '-tTYPE=vfat', '-odevice'] with allowed return codes [0, 2] (shell=False, capture=True)
2020-01-14 20:48:01,415 - util.py[DEBUG]: Running command ['blkid', '-tTYPE=iso9660', '-odevice'] with allowed return codes [0, 2] (shell=False, capture=True)
2020-01-14 20:48:01,432 - util.py[DEBUG]: Running command ['blkid', '-tLABEL=SYSTEM-BOOT', '-odevice'] with allowed return codes [0, 2] (shell=False, capture=True)
2020-01-14 20:48:01,448 - util.py[DEBUG]: Running command ['blkid', '-tLABEL=system-boot', '-odevice'] with allowed return codes [0, 2] (shell=False, capture=True)
2020-01-14 20:48:01,466 - DataSourceNoCloud.py[DEBUG]: Attempting to use data from /dev/mmcblk0p1
2020-01-14 20:48:01,467 - util.py[DEBUG]: Reading from /proc/mounts (quiet=False)
2020-01-14 20:48:01,467 - util.py[DEBUG]: Read 2062 bytes from /proc/mounts
2020-01-14 20:48:01,468 - util.py[DEBUG]: Fetched {'sysfs': {'fstype': 'sysfs', 'mountpoint': '/sys', 'opts': 'rw,nosuid,nodev,noexec,relatime'}, 'proc': {'fstype': 'proc', 'mountpoint': '/proc', 'opts': 'rw,nosuid,nodev,noexec,relatime'}, 'udev': {'fstype': 'devtmpfs', 'mountpoint': '/dev', 'opts': 'rw,nosuid,relatime,size=1913472k,nr_inodes=156992,mode=755'}, 'devpts': {'fstype': 'devpts', 'mountpoint': '/dev/pts', 'opts': 'rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000'}, 'tmpfs': {'fstype': 'tmpfs', 'mountpoint': '/etc/machine-id', 'opts': 'ro,relatime,size=391432k,mode=755'}, '/dev/mmcblk0p2': {'fstype': 'ext4', 'mountpoint': '/', 'opts': 'rw,relatime'}, 'securityfs': {'fstype': 'securityfs', 'mountpoint': '/sys/kernel/security', 'opts': 'rw,nosuid,nodev,noexec,relatime'}, 'cgroup': {'fstype': 'cgroup', 'mountpoint': '/sys/fs/cgroup/freezer', 'opts': 'rw,nosuid,nodev,noexec,relatime,freezer'}, 'pstore': {'fstype': 'pstore', 'mountpoint': '/sys/fs/pstore', 'opts': 'rw,nosuid,nodev,noexec,relatime'}, 'systemd-1': {'fstype': 'autofs', 'mountpoint': '/proc/sys/fs/binfmt_misc', 'opts': 'rw,relatime,fd=26,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=1728'}, 'debugfs': {'fstype': 'debugfs', 'mountpoint': '/sys/kernel/debug', 'opts': 'rw,relatime'}, 'mqueue': {'fstype': 'mqueue', 'mountpoint': '/dev/mqueue', 'opts': 'rw,relatime'}, 'tracefs': {'fstype': 'tracefs', 'mountpoint': '/sys/kernel/debug/tracing', 'opts': 'rw,relatime'}, 'configfs': {'fstype': 'configfs', 'mountpoint': '/sys/kernel/config', 'opts': 'rw,relatime'}, 'fusectl': {'fstype': 'fusectl', 'mountpoint': '/sys/fs/fuse/connections', 'opts': 'rw,relatime'}} mounts from proc
2020-01-14 20:48:01,468 - util.py[DEBUG]: Running command ['mount', '-o', 'ro', '-t', 'auto', '/dev/mmcblk0p1', '/run/cloud-init/tmp/tmp2z8rueu4'] with allowed return codes [0] (shell=False, capture=True)
2020-01-14 20:48:01,529 - util.py[DEBUG]: Failed mount of '/dev/mmcblk0p1' as 'auto': Unexpected error while running command.
Command: ['mount', '-o', 'ro', '-t', 'auto', '/dev/mmcblk0p1', '/run/cloud-init/tmp/tmp2z8rueu4']
Exit code: 32
Reason: -
Stdout:
Stderr: mount: /run/cloud-init/tmp/tmp2z8rueu4: /dev/mmcblk0p1 already mounted on /boot/firmware.
2020-01-14 20:48:01,530 - util.py[WARNING]: Failed to mount /dev/mmcblk0p1 when looking for data
2020-01-14 20:48:01,533 - util.py[DEBUG]: Failed to mount /dev/mmcblk0p1 when looking for data
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceNoCloud.py", line 128, in _get_data
    pp2d_kwargs)
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 1758, in mount_cb
    (device, tmpd, failure_reason))
cloudinit.util.MountFailedError: Failed mounting /dev/mmcblk0p1 to /run/cloud-init/tmp/tmp2z8rueu4 due to: Unexpected error while running command.
Command: ['mount', '-o', 'ro', '-t', 'auto', '/dev/mmcblk0p1', '/run/cloud-init/tmp/tmp2z8rueu4']
Exit code: 32
Reason: -
Stdout:
Stderr: mount: /run/cloud-init/tmp/tmp2z8rueu4: /dev/mmcblk0p1 already mounted on /boot/firmware.
2020-01-14 20:48:01,549 - __init__.py[DEBUG]: Datasource DataSourceNoCloud [seed=None][dsmode=net] not updated for events: New instance first boot
2020-01-14 20:48:01,549 - handlers.py[DEBUG]: finish: init-local/search-NoCloud: SUCCESS: no local data found from DataSourceNoCloud
2020-01-14 20:48:01,549 - main.py[DEBUG]: No local datasource found
...

The relevant code is mount_cb() in util.py which is calling mounts() to read /proc/mounts then attempting to mount the device when it's not found (leading to a classic race between the check and the action).

It might be argued that it should simply attempt the mount, and check /proc/mounts in the event of failure, but I'm wary that mounting FS' is a potentially costly exercise (can the seed be on a remote mount?), and there's probably edge cases here I'm unaware of (is it possible, perhaps dangerous, to double-mount certain devices?).

Perhaps it could be adjusted to check /proc/mounts, attempt the mount, then *re-check* /proc/mounts in the case of failure?

Nevertheless, there's a trivial work-around in our case: just add an override for "RequiresMountFor=/boot/firmware" to our systemd config for cloud-init-local, thus ensuring the mount is definitely available prior to cloud-init looking for its seed. This in turn argues the issue could simply be viewed as mis-configuration on our part: we *know* the seed is on /dev/mmcblk0p1 as that's where we've configured it to be; should we not also configure cloud-init in such a way that we guarantee the device containing its seed is definitely mounted prior to it running?

Revision history for this message
Łukasz Zemczak (sil2100) wrote :

So the mentioned workaround seems fair, let's proceed with it if possible. What I find mysterious is that we did not see it in eoan or focal so far, where since this is an obvious race condition, I'd expect it to pop up here and there.

tags: added: id-5db0502bcb3f2112497e1b92
Dan Watkins (oddbloke)
Changed in cloud-init (Ubuntu):
status: New → Triaged
Revision history for this message
Ryan Harper (raharper) wrote :

> Nevertheless, there's a trivial work-around in our case: just add an override for
> "RequiresMountFor=/boot/firmware" to our systemd config for cloud-init-local

Yes, this is the correct configuration that's needed, and it's related to your configuration change and where you're putting your seed.

> should we not also configure cloud-init in such a way that we guarantee the device containing its seed is definitely mounted prior to it running?

This is how cloud-init is configured by default which has:

RequiresMountsFor=/var/lib/cloud

Which is the default location for the NoCloud on-disk seed.

So I don't think this is a bug, however, let's take this bug as an opportunity to add to our documentation around using NoCloud's 'seedfrom' configuration

https://cloudinit.readthedocs.io/en/latest/topics/datasources/nocloud.html

Revision history for this message
Łukasz Zemczak (sil2100) wrote : Please test proposed package

Hello Dave, or anyone else affected,

Accepted livecd-rootfs into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/livecd-rootfs/2.525.39 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

description: updated
Changed in cloud-init (Ubuntu):
status: Triaged → Fix Committed
status: Fix Committed → Invalid
Changed in livecd-rootfs (Ubuntu):
status: New → Fix Committed
Changed in livecd-rootfs (Ubuntu Bionic):
status: New → In Progress
importance: Undecided → Critical
Changed in livecd-rootfs (Ubuntu):
importance: Undecided → High
Changed in livecd-rootfs (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-bionic
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package livecd-rootfs - 2.637

---------------
livecd-rootfs (2.637) focal; urgency=medium

  * Ensure seed partition is mounted on no-cloud images which use system-boot
    as their seed (LP: #1860046)
  * Have getty wait for cloud-init to complete to ensure that the default
    user exists before presenting a login prompt

 -- Dave Jones <email address hidden> Fri, 24 Jan 2020 15:17:56 +0000

Changed in livecd-rootfs (Ubuntu):
status: Fix Committed → Fix Released
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Let me triple-confirm with Paul, but from our talk yesterday it looks like this bug is no longer reproducible, which would make it verified.

tags: added: verification-done verification-done-bionic
removed: verification-needed verification-needed-bionic
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for livecd-rootfs has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package livecd-rootfs - 2.525.39

---------------
livecd-rootfs (2.525.39) bionic; urgency=medium

  * Ensure seed partition is mounted on no-cloud images which use system-boot
    as their seed (LP: #1860046)
  * Have getty wait for cloud-init to complete to ensure that the default
    user exists before presenting a login prompt

livecd-rootfs (2.525.38) bionic; urgency=medium

  * Support generating a .disk/info file via ubuntu-image from the passed-in
    datestamp parameter (using the $NOW environment variable). (LP: #1856684)
  * Switch raspi2-hwe-18.04-edge to raspi2-hwe-18.04. (LP: #1859478)

livecd-rootfs (2.525.37) bionic; urgency=medium

  * Use the raspi2-hwe-18.04-edge kernel flavour for the raspi3 images. This
    kernel is required for proper pi4 support. (LP: #1859478)

 -- Dave Jones <email address hidden> Fri, 24 Jan 2020 15:17:56 +0000

Changed in livecd-rootfs (Ubuntu Bionic):
status: Fix Committed → Fix Released
Mathew Hodson (mhodson)
no longer affects: cloud-init (Ubuntu Bionic)
no longer affects: cloud-init (Ubuntu)
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.