Comment 20 for bug 1602192

Martin Pitt (pitti) wrote : Re: deploy 30 nodes on lxd, machines never leave pending

Some observations/notes:

 * I can reproduce the bug using Scott's script on my laptop (x-013 fails). No ZFS, just one huge btrfs partition for everything, so it seems unrelated to the host file system.

 * The test script removes "working" instances; after that, rebooting the failed container in a loop always works. Also, given that fstab, the generated .mount etc. are all identical, I don't believe that it's an actual difference in the container root fs. This is more likely a race condition that shows up as things get slower with the number of containers running in parallel.

 * The root fs of a container is supposed to be mounted already -- after all, it is just a bind mount. So -.mount should immediately jump from inactive to active at container boot. In the case where it actually tries to "mount", it's doomed to fail, as /dev/disk/ does not really work in a container. So the bug is that in this case systemd thinks that the root fs is not mounted yet.

 * systemd-remount-fs.service fails in all cases ("mount: can't find LABEL=cloudimg-rootfs", plus "mount: permission denied"). This is a wart, but unrelated to this bug.
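
One way to check the "already mounted" observation from inside a container is to look at /proc/self/mountinfo, where the root fs shows up as an entry with mount point "/". The Python sketch below is only a diagnostic aid and not how systemd itself decides the state of -.mount; the function name and the sample lines (IDs, paths, device) are made up for illustration.

```python
def root_mount_entries(mountinfo_text):
    """Return (root, fstype, source) for every mountinfo entry mounted at "/"."""
    entries = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        # Six fixed fields, optional fields, a lone "-", then three more fields.
        if len(fields) < 10:
            continue
        try:
            sep = fields.index("-", 6)
        except ValueError:
            continue
        if len(fields) < sep + 3:
            continue
        if fields[4] == "/":  # field 5 is the mount point
            entries.append((fields[3], fields[sep + 1], fields[sep + 2]))
    return entries


# Hypothetical sample, shaped like what a bind-mounted container root could look like.
SAMPLE = (
    "120 95 0:45 /lxd/containers/x-013/rootfs / rw,relatime - btrfs /dev/sda2 rw,space_cache\n"
    "121 120 0:50 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw\n"
)

print(root_mount_entries(SAMPLE))
```

Inside a running container the same function can be fed the contents of /proc/self/mountinfo; a non-empty result means the kernel already lists a mount at "/", whatever state -.mount reports.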

For the failure case there is no journal, as the boot fails too early for that. Being able to set the systemd.log_target= and systemd.log_level= kernel/pid1 arguments for the container would be very useful, but I'm not sure if that actually works with lxd.
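
For reference, this is roughly what the pid 1 command line would need to look like. The Python sketch only assembles the arguments; the helper name and the /sbin/init path are assumptions, the list of targets is a subset of what systemd accepts, and how to hand the result to lxd is exactly the open question.

```python
# Syslog-style level names accepted by systemd.log_level=.
LOG_LEVELS = ("emerg", "alert", "crit", "err", "warning", "notice", "info", "debug")
# A subset of the values accepted by systemd.log_target=.
LOG_TARGETS = ("console", "journal", "kmsg", "journal-or-kmsg", "null")


def pid1_debug_args(log_level="debug", log_target="console"):
    """Build the extra pid 1 arguments (hypothetical helper)."""
    if log_level not in LOG_LEVELS:
        raise ValueError("unknown log level: %s" % log_level)
    if log_target not in LOG_TARGETS:
        raise ValueError("unknown log target: %s" % log_target)
    return ["systemd.log_level=" + log_level, "systemd.log_target=" + log_target]


# /sbin/init is an assumed init path inside the container.
print(" ".join(["/sbin/init"] + pid1_debug_args()))
```

Logging to the console rather than the journal is the point here, since in the failure case the boot does not get far enough to produce a journal.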

Next steps: I'll try to reproduce this with plain LXC (more easily observable/accessible/debuggable) and ask how to pass additional arguments to pid 1 with lxd.