cloud-init-nonet runs 'start networking' explicitly

Bug #1031065 reported by Scott Moser
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
cloud-init (Ubuntu)
Fix Released
Medium
Scott Moser
Precise
Fix Released
Medium
Unassigned

Bug Description

== Begin SRU Information ==
[Impact]
Cloud-init in 12.04 has an upstart job named 'cloud-init-nonet' that
calls 'start networking' explicitly. This was done to fix a boot
deadlock (bug 800824), but it was not the proper fix.

A much more correct fix is now possible because of improvements
that have been made in mountall (bug 643289) and have been brought
back to 12.04.

calling 'start networking' from cloud-init-nonet could cause issues
because other upstart 'start on' conditions might not be met at this
point in boot.

The fix here is the same as is now applied in quantal and raring.
It more correctly addresses the root issue, that network-device-added
events were not being emitted in a container. There is now a job named
cloud-init-container that will emit 'network-device-added' events if
it is inside a container *and* its sanity checks show that the given
device has not already been brought up.

[Test Case]
You can demonstrate why this change was necessary, and that the
provided change fixes the issue by doing the following:

## create a 'source' (pristine) root ##
sudo lxc-create -t ubuntu-cloud --name source-precise-amd64 -- \
   --release precise --arch amd64 --stream daily

# set up 2 copies of the pristine root
# * 'nostart' just has cloud-init's call to 'start-networking' disabled
# * 'patched' contains the full upgraded cloud-init
sudo lxc-clone -o source-precise-amd64 -n nostart
f="/var/lib/lxc/nostart/etc/init/cloud-init-nonet.conf"
if [ ! -e "$f.dist" ]; then
   # disable 'start networking' in the cloud-init-nonet job
   sudo cp "$f" "$f.dist"
   sudo sed -i 's,^\([ ]\+start networking.*\),#\1,' "$f"
fi

sudo lxc-clone -o source-precise-amd64 -n patched
deb="cloud-init_0.6.3-0ubuntu1.5~ppa1_all.deb"
rpath=/var/lib/lxc/patched/rootfs
sudo cp $deb "/var/lib/lxc/patched/rootfs/tmp"
sudo LANG=C chroot "/var/lib/lxc/patched/rootfs" dpkg -i "/tmp/$deb"

## Now, start both. the 'nostart' root will hang on
## cloud-init waiting for networking to come up.
## the 'patched' will come all the way up quickly.
## You can stop them with 'sudo lxc-stop -n <name>'
sudo lxc-start -n patched -- /sbin/init --verbose
sudo lxc-start -n nostart -- /sbin/init --verbose

[Regression Potential]
Regressions would be likely to occur in one of 2 places:
a.) inside a container, where cloud-init-container caused a
    problem by emitting its network-device-added events

    Here, the problem would be very limited to lxc containers
    that have cloud-init inside them. This is likely very small.

b.) outside a container, where 'start networking' was previously
    "fixing" a boot deadlock that could have occurred.
    Here, the changes are basically making precise work like 12.10
    and 13.04, so hopefully issues would have been shaken out there.
== End SRU Information ==

In development of 'overlayroot' package, I was mounting / as rw in the initramfs.
This was causing a different order of execution of mounts, and as a result a different order of networking resolvconf and networking bringup.

It exposed a bug in resolvconf, which was being called on boot in this order:
==== Mon Jul 30 19:08:58 UTC 2012 /sbin/resolvconf -a lo.inet ====
==== Mon Jul 30 19:08:59 UTC 2012 /sbin/dhclient-script ====
==== Mon Jul 30 19:08:59 UTC 2012 /sbin/resolvconf -a eth0.dhclient ====
==== Mon Jul 30 19:08:59 UTC 2012 /sbin/resolvconf -a eth0.inet ====
==== Mon Jul 30 19:08:59 UTC 2012 /sbin/resolvconf --enable-updates ====
==== Mon Jul 30 19:08:59 UTC 2012 /etc/resolvconf/update.d/libc -u ====

The normal order is for resolvconf --enable-updates to be called (from /etc/init/resolvconf.conf) before anything else. As a result, I was seeing errors like:
  resolvconf: Error: /run/resolvconf/interface either does not exist or is not a directory

This may be exposing a more grave issue, in that I believe the reason for dhclient coming up before resolvconf.conf started was that it was being run as a result of /etc/init/network-interface.conf. I don't immediately see how that is guarnateed to have /run ounted at all.

ProblemType: Bug
DistroRelease: Ubuntu 12.10
Package: resolvconf 1.67ubuntu1
ProcVersionSignature: User Name 3.5.0-6.6-generic 3.5.0
Uname: Linux 3.5.0-6-generic x86_64
Architecture: amd64
Date: Mon Jul 30 20:07:04 2012
PackageArchitecture: all
ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: resolvconf
UpgradeStatus: No upgrade log present (probably fresh install)

Related Bugs:
 * bug 800824: cloud-init-nonet times out in lxc
 * bug 925122: container's udevadm trigger --add affects the host
 * bug 643289: idmapd does not starts to work after system reboot

Related branches

Revision history for this message
Scott Moser (smoser) wrote :
Revision history for this message
Stéphane Graber (stgraber) wrote :

As suggested on IRC, I'm wondering if we couldn't fix the race by doing the following changes:

 - change network-interface to: "start on net-device-added and mounted MOUNTPOINT=/run"
 - change resolvconf to: "start on mounted MOUNTPOINT=/run or starting network-interface"

networking.conf should be fine as it's "local-filesystems and (stopped udevtrigger or container))" which should be triggered quite a bit after virtual-filesystems, so resolvconf should have run already.

Revision history for this message
Steve Langasek (vorlon) wrote :

I don't think there's been sufficient analysis of the root cause here to start changing start conditions on any of the standard jobs.

/etc/init/resolvconf.conf is:

  start on mounted MOUNTPOINT=/run

and /etc/init/udev.conf is:

  start on virtual-filesystems

The 'mounted' event is a hook, not a signal; mountall will not continue processing mounts, or emit any other signals, until all jobs depending on 'mounted' have finished.

This means that the resolvconf job is *guaranteed* to create the /run/resolvconf/interface directory before udev is started. And until udev is started, there should be no 'net-device-added' events, so no instances of /etc/init/network-interface.conf.

So something else is going on here. I don't see any bugs in resolvconf.

Revision history for this message
Stéphane Graber (stgraber) wrote :

Problem was tracked down to cloud-init calling "start networking" directly from one of its upstart jobs, which bypasses the start condition of the networking job and triggers the bug.

21:16 < stgraber> [ 4.598897] networking
21:16 < stgraber> [ 5.016864] init: cloud-init-nonet main process (241) killed by TERM signal
21:16 < stgraber> [ 5.632677] resolvconf
21:16 < stgraber> [ 5.955406] virtual
21:16 < stgraber> [ 5.989643] local
21:17 < stgraber> ubuntu@ip-10-202-249-151:~$ grep -r 'start networking' /etc/init/
21:17 < stgraber> /etc/init/cloud-init-nonet.conf: start networking >/dev/null
21:17 < stgraber> smoser: ^
21:18 < stgraber> "start on mounted MOUNTPOINT=/ and stopped cloud-init-local" with clout-init-local being "start on mounted MOUNTPOINT=/"
21:18 < stgraber> so basically as soon as / is mounted, "start networking" is directly triggered by cloud-init-nonet bypassing the start condition

http://paste.ubuntu.com/1120257/ shows the process tree at the time.

affects: resolvconf (Ubuntu) → cloud-init (Ubuntu)
Revision history for this message
Scott Moser (smoser) wrote :

Some conversation in ubuntu-devel yesterday [1]. This seems to have been caused by the fix for bug 800824. That change [2] was done to address failure of networking to appear as started a container. Per Steve, "that looks like a wrong workaround for bug #925122".

Hopefully what that means is we can remove this code and lxc containers still boot correctly even in precise (in which case we should probably SRU that).

--
[1] http://irclogs.ubuntu.com/2012/07/31/%23ubuntu-devel.html#t01:41
[2] http://bazaar.launchpad.net/~cloud-init-dev/cloud-init/trunk/revision/405

summary: - /sbin/resolvconf -a depends on /run/resolvconf/interface but it may not
- exist
+ cloud-init-nonet runs 'start networking' explicitly
Scott Moser (smoser)
description: updated
description: updated
Scott Moser (smoser)
Changed in cloud-init (Ubuntu):
assignee: nobody → Scott Moser (smoser)
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Scott Moser (smoser) wrote :

I was able to recreate the original problem of bug 800824 by doing the following on a precise instance:
 - setup the instance, get source roots of the ubuntu-cloud lxc root:
   for r in oneiric precise quantal; do \
     sudo lxc-create -t ubuntu-cloud -n source-$r-amd64 -- \
        --release $r --arch amd64 --stream daily; done
 - RELEASE="oneiric"
 - sudo apt-get install lxc
 - sudo lxc-clone -o source-$RELEASE-amd64 $RELEASE-amd64
 - edit /var/lib/lxc/$RELEASE-amd64/rootfs/etc/init/cloud-init-nonet.conf
   commenting out the 'start networking' line.
 - sudo lxc-start --name=$RELEASE-amd64
 - Here, you will see the cloud-init nonet timeout, and
   wait for 120 seconds before seeing a tty prompt.
 - sudo lxc-stop -n $RELEASE-amd64
 - sudo lxc-destroy --name=$RELEASE-amd64

If you do *not* comment the line out in cloud-init-nonet.conf, you will
see the cloud-init output and boot quickly.

Repeating the above process for RELEASE=precise will show both cases working
 RELEASE=oneiric with 'start networking' : boots quickly
 RELEASE=oneiric without 'start-networking': timeout
 RELEASE=precise with 'start networking' : boots quickly
 RELEASE=precise without 'start-networking': timeout
 RELEASE=quantal with 'start networking' : boots quickly
 RELEASE=quantal without 'start-networking': timeout

On quantal, the cloud-init output in the timeout case looks like:
  Cloud-init v. 0.7 running 'init-local' at Wed, 22 Aug 2012 20:09:07 +0000. Up 5131.22 seconds.
  cloud-init-nonet waiting 120 seconds for a network device.

What is the correct fix here?

Changed in cloud-init (Ubuntu):
status: Triaged → Confirmed
Revision history for this message
Scott Moser (smoser) wrote :

So, after much discussion [1] we came to the realization of why 'start networking' was here.
The issue was that the normal case for cloud-images (and ubuntu as a whole) is that / is mounted "ro" when /sbin/init starts. When this is the case, /run will be mounted before /. However, if / is mounted "rw" then "mountall first scans for all filesystems that don't require further processing and reports them mounted", so "mounted MOUNTPOINT=/" occurs before virtual-filesystems.

if mounted MOUNTPOINT=/ occurs before virtual-filesystems, then udev will not start because it depends on virtual-filesystems. That means means networking wont get started either, as ifup occurs via udev events for discovered network adapters. As a result, we hang in cloud-init-nonet waiting for networking devices to come up that never will.

Note, slangasek pointed out that while /run is guaranteed to be mounted before / (except withotu 'ro'), virtual-filesystems is *not* guaranteed. So, even with 'ro' root, a race condition could occur here. It would also, then imply that cloud-init cannot actually use some filesystems in cloud-init-local (such has /proc or /sys).

slangasek suggested that a fix for bug 643289 would improve the situation here as the virtual-filesystems could all move forward even while the 'mounted MOUNTPOINT=/' were processing (currently it blocks *everything*). That would mean we wouldn't deadlock here.

--
[1] http://irclogs.ubuntu.com/2012/09/05/%23ubuntu-devel.html#t18:16

description: updated
Revision history for this message
Steve Langasek (vorlon) wrote :

After reviewing the patches on bug #643289, I see that some further work will be needed because those patches only adjust the 'mounting' event, not the 'mounted' event. IMHO we should do both.

Steve Langasek (vorlon)
Changed in cloud-init (Ubuntu):
status: Confirmed → Triaged
Revision history for this message
Scott Moser (smoser) wrote :

Finding a better solution to the 'start networking' that we have here has lead me to open bug 1047712.
The case that I was debugging was actually not even improved by the 'start networking'. The issue that lead us here was that the ephemeral images for maas that we were trying to refresh have had issues coming up and woudl hit this timeout.

slangasek helped me to come up with a solution that has gotten us closer.
I've replaced 'start networking' with "virtual-filesystems-emitter" (see attached) in the cloud-init-nonet job.
Essentially that just gets virtual-filesystems event emitted if it hasnt already been emitted, which jump starts the networking code even while mounted MOUNTPOINT=/ is being processed.

Revision history for this message
Steve Langasek (vorlon) wrote :

Scott, can you please try a test package of mountall built from lp:ubuntu/mountall? I think I have both the mounting and mounted events handled correctly there (and in my tests, seemingly faster).

Testing in scenarios other than the cloud boot case would also certainly be helpful.

Revision history for this message
Scott Moser (smoser) wrote :

I built a deb from lp:ubuntu/mountall revno 415.
installed it into a ephemeral image for quantal
commented out 'start networking' from /etc/init/cloud-init-nonet.conf, and got the /var/log/kern.log at http://paste.ubuntu.com/1198891/ .

it seems something is blockign virutal-filesystems event from occuring still.

Revision history for this message
Scott Moser (smoser) wrote :

Steve,
I've also now tried your mountall in an lxc chroot, and it did not resolve my issue there either.
I've set up a lxc container as described above, commented out the 'start networking' and then:
$ sudo ln -sf console /path/to/rootfs/dev/kmsg
$ sudo lxc-start --name=$REL-amd64 --console console.log -- /sbin/init --verbose

The console.log is at http://paste.ubuntu.com/1199472/

Revision history for this message
Steve Langasek (vorlon) wrote :

Attaching plaintext console log.

Revision history for this message
Steve Langasek (vorlon) wrote :

The log for the container seems to show reasonable ordering for the events. It also shows:

cloud-init-nonet waiting 120 seconds for a network device.
cloud-init-nonet gave up waiting for a network device.
<6>init: cloud-init-nonet main process (31) exited normally
<6>init: cloud-init-nonet goal changed from start to stop
<6>init: cloud-init-nonet state changed from running to stopping
<6>init: Handling stopping event

However, this is a container so AFAICS there *isn't* a network interface to be brought up. Isn't that a different problem?

Revision history for this message
Steve Langasek (vorlon) wrote :

network-interface (lo) is the only interface job shown, so either something is going wrong when bringing up this interface that prevents /etc/network/if-up.d/upstart from emitting the static-network-up event, or there are other network interfaces configured in /etc/network/interfaces that have no corresponding udev events, but do come up successfully in response to ifup -a. Can you attach /etc/network/interfaces from this container, for reference?

Revision history for this message
Scott Moser (smoser) wrote :

There is an 'eth0' defined in /etc/network/interfaces.
And, there is a network device eth0

$ sed -e "/^$/d" -e '/^#/d' etc/network/interfaces
auto lo
iface lo inet loopback
auto eth0
iface eth0 inet dhcp

$ ls -l /run/network
total 4
-rw-r--r-- 1 root root 16 Sep 12 18:33 ifstate
-rw-r--r-- 1 root root 0 Sep 12 18:33 ifup.eth0
-rw-r--r-- 1 root root 0 Sep 12 18:31 ifup.lo
drwxr-xr-x 2 root root 40 Sep 12 18:33 static-network-up-emitted

Note, that eth0 came up 2 minutes (120 seconds) after lo.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.7.0~bzr659-0ubuntu1

---------------
cloud-init (0.7.0~bzr659-0ubuntu1) quantal; urgency=low

  * New upstream snapshot.
    * add signal handlers to more cleanly exit
    * add logging fallback in case logging fails
    * documentation fix for landscape config (LP: #1042764)
    * do not write a comment in /etc/hostname (LP: #1052664)
    * fix incorrect ubuntu mirrors for 'ports' arches [Robbie Basak]
    * fix generation of hostname based on ip address in datasource
      (LP: #1050962) [Andy Grimm]
    * remove 'start networking' from cloud-init-nonet and replace it
      with container specific fixes (LP: #1031065)
    * fix landscape configuration so client will run (LP: #1042758)
    * enable all available datasources (adding AltCloud and None)
    * fix bug where user data scripts re-ran on upgrade from 10.04 versions
      (LP: #1049146)
 -- Scott Moser <email address hidden> Wed, 19 Sep 2012 22:08:51 -0400

Changed in cloud-init (Ubuntu):
status: Triaged → Fix Released
Revision history for this message
Scott Moser (smoser) wrote :

Just a note, an SRU of this depends on SRU of mountall 643289.

Changed in cloud-init (Ubuntu Precise):
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Scott Moser (smoser) wrote :

I built a precise cloud-init with the 'start networking' removed and tried starting an lxxc instace with that from current precise daily (with mountall 2.36.3). This issue persisted.

Steve, did you tihnk this *should* be fixed in precise?

Revision history for this message
Steve Langasek (vorlon) wrote :

With the current mountall in precise, 2.36.3, yes - there shouldn't be any further need to call 'start networking' directly. Are the symptoms exactly the same as before the mountall fix?

Scott Moser (smoser)
description: updated
Revision history for this message
Clint Byrum (clint-fewbar) wrote : Please test proposed package

Hello Scott, or anyone else affected,

Accepted cloud-init into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/cloud-init/0.6.3-0ubuntu1.5 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in cloud-init (Ubuntu Precise):
status: Triaged → Fix Committed
tags: added: verification-needed
Revision history for this message
Scott Moser (smoser) wrote :

Verified. See below.

$ sudo apt-get update && sudo apt-get install -y lxc
$ lsb_release -c
Codename: precise
$ dpkg-query --show lxc
lxc 0.7.5-3ubuntu66

$ sudo lxc-clone -o source-precise-amd64 -n patched
$ cat /var/lib/lxc/source-precise-amd64/rootfs/etc/cloud/build.info
build_name: server
serial: 20130218

$ sudo lxc-clone -o source-precise-amd64 -n patched
$ url="https://launchpad.net/ubuntu/+source/cloud-init/0.6.3-0ubuntu1.5/+build/4312778/+files/cloud-init_0.6.3-0ubuntu1.5_all.deb"
$ deb="cloud-init_0.6.3-0ubuntu1.5_all.deb"

$ wget -O "$deb" "$url"
$ sudo cp $deb "/var/lib/lxc/patched/rootfs/tmp"
$ sudo LANG=C chroot "/var/lib/lxc/patched/rootfs" dpkg -i "/tmp/$deb"
$ sudo lxc-start -n patched -- /sbin/init --verbose

Verify can log in as ubuntu/ubuntu.

tags: added: verification-done
removed: verification-needed
Revision history for this message
Colin Watson (cjwatson) wrote : Update Released

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regresssions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.6.3-0ubuntu1.5

---------------
cloud-init (0.6.3-0ubuntu1.5) precise-proposed; urgency=low

  * debian/update-grub-legacy-ec2: consider kernels bootable on ec2
    that are named -generic, in addition to -virtual. This fixes a problem
    where the kernels installed by linux-lts-quantal were not added to
    /boot/grub/menu.lst (LP: #1005551)
  * debian/patches/lp-1077020-fix-ca-certificates-blanklines.patch: fix
    adding of empty lines in ca-certificates file (LP: #1077020)
  * debian/patches/lp-1031065-nonet-not-start-networking.patch: do not 'start
    networking' in cloud-init-nonet upstart job. Doing so can cause networking
    to be started earlier than it should be. Instead, add a
    cloud-init-container job that runs only in a container and emits
    net-device-added (LP: #1031065).
  * debian/patches/lp-1037567-add-config-drive-v2-support.conf:
    backport support for config-drive-v2 which is part of Openstack Nova in
    Folsom and later. (LP: #1037567) (LP: #1100545)
 -- Scott Moser <email address hidden> Wed, 16 Jan 2013 19:37:57 -0500

Changed in cloud-init (Ubuntu Precise):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.