cloud-init output does not get to console when booted with pv-grub and ramdisk

Bug #606373 reported by Scott Moser
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
cloud-init (Ubuntu)
Fix Released
High
Scott Moser
Maverick
Fix Released
High
Scott Moser
linux (Ubuntu)
Invalid
Undecided
John Johansen
Maverick
Invalid
Undecided
John Johansen
plymouth (Ubuntu)
New
Undecided
Unassigned
Maverick
Won't Fix
Undecided
Unassigned

Bug Description

Binary package hint: cloud-init

cloud-init is run from an upstart job like:
| start on (mounted MOUNTPOINT=/ and net-device-up IFACE=eth0)
| task
| console output
| exec /usr/bin/cloud-init start

Recently, we've switched to using grub as the loader [1], for the blueprint [2].

That change actually changed 3 things:
a.) used grub as the loader
b.) started using ramdisks again (and booting with root=LABEL=uec-rootfs)
c.) kernel command line changed from "root=/dev/sda1 ro 4" to "root=LABEL=uec-rootfs ro"

'b' and 'c' are because the kernel installed creates a ramdisk and update-grub writes the ramdisk stanza for grub to read.

One of these changes has caused output from cloud-init to not get to the console. (output is seen/collected with ec2-get-console-output).
cloud-init has code like:

    msg = "cloud-init running: %s. up %s seconds" % (now, uptime)
    sys.stderr.write(msg + "\n")
    sys.stderr.flush()

to report on the console how long it took to get to this point in the boot cycle. After the change above, it no longer is visible.

I've verified that :
 - ec2-run-instances ami-80e10ae9 --kernel aki-078e676e
   messages are seen
 - ec2-run-instances ami-80e10ae9
   messages are not.

aki-078e676e ubuntu-maverick-i386-linux-image-2.6.32-305-ec2-v-2.6.32-305.9-kernel
ami-80e10ae9 ubuntu-maverick-daily-i386-server-20100716

So, it definitely was either using a ramdisk or using grub as the loader.

Other notable information,
- output of cloud-config and cloud-run-user-script.conf *do* get to console.
- both cloud-config and cloud-run-user-script also have 'console output'
- cloud-config runs much later in boot: start on (filesystem and started rsyslogd)

I'm trying to determine which of the changes caused this regression.
--
[1] http://aws.typepad.com/aws/2010/07/use-your-own-kernel-with-amazon-ec2.html
[2] https://blueprints.launchpad.net/ubuntu/+spec/server-maverick-cloud-kernel-upgrades

ProblemType: Bug
DistroRelease: Ubuntu 10.10
Package: cloud-init 0.5.12-0ubuntu7
ProcVersionSignature: User Name 2.6.32-305.9-ec2 2.6.32.11+drm33.2
Uname: Linux 2.6.32-305-ec2 i686
Architecture: i386
Date: Fri Jul 16 16:27:28 2010
Ec2AMI: ami-80e10ae9
Ec2AMIManifest: ubuntu-images-testing-us/ubuntu-maverick-daily-i386-server-20100716.manifest.xml
Ec2AvailabilityZone: us-east-1c
Ec2InstanceType: m1.small
Ec2Kernel: aki-407d9529
Ec2Ramdisk: unavailable
PackageArchitecture: all
ProcEnviron:
 PATH=(custom, user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: cloud-init

Revision history for this message
Scott Moser (smoser) wrote :
Scott Moser (smoser)
summary: - cloud-init output does not get to console when booted with pv-grub
+ cloud-init output does not get to console when booted with pv-grub and
+ ramdisk
Revision history for this message
Scott Moser (smoser) wrote :

I modified /boot/grub/menu.lst with the following stanzas to boot with and without a ramdisk with the same kernel cmdline.

| title Ubuntu maverick (development branch), kernel 2.6.32-305-ec2
| root (hd0)
| kernel /boot/vmlinuz-2.6.32-305-ec2 root=/dev/sda1 ro smnord
| savedefault 0
|
| title Ubuntu maverick (development branch), kernel 2.6.32-305-ec2
| root (hd0)
| kernel /boot/vmlinuz-2.6.32-305-ec2 root=/dev/sda1 ro smrd
| initrd /boot/initrd.img-2.6.32-305-ec2
| savedefault 0

the 'smnord' or 'smrd' is just so i would definitively know that I had booted this stanza by looking at /proc/cmdline.

The result is that cloud-init output does get to the console with smnord and *does* get there with 'smrd'.

So, definitely, the use of ramdisk is stopping output from cloud-init from getting to the console.

Revision history for this message
Scott Moser (smoser) wrote :

On a whim, I modified /usr/share/initramfs-tools/init to not redirect output on exec of root's /sbin/init. Ie:

| # Chain to real filesystem
| echo "==== running init but not redirecting ===="
| #exec run-init ${rootmnt} ${init} "$@" <${rootmnt}/dev/console >${rootmnt}/dev/console 2>&1
| exec run-init ${rootmnt} ${init} "$@"

My intent was to leave stderr, stdout going to the same places they were when the ramdisk loaded, as *its* messages go to the console. However, the result is the same as without the change. In the console I see:

| Begin: Running /scripts/init-bottom ... done.^M
| ==== running init but not redirecting ====^M
| ^[%G2010-07-16 18:05:56,155 - cloud-init-run-module[ERROR]: user-scripts already ran once-per-instance
| user-scripts already ran once-per-instance

There should be 'cloud-init' output in after the 'running init' but before the cluod-init-run-module output.

Revision history for this message
Scott Moser (smoser) wrote :

I'm attaching output of a boot with '--verbose'.
It almost appears as if cloud-init is not running at all here. I know it *does* run though, as I see log messages with correct time stamps in /var/log/cloud-init.log like:
[CLOUDINIT] 2010-07-16 18:39:47,137 - cloud-init[INFO]: cloud-init running: Fri, 16 Jul 2010 18:39:47 +0000. up 4.56 seconds

Revision history for this message
Scott Moser (smoser) wrote :

This is presenting problems for alpha-3 testing. It seems that based on timing, other data is not getting to console. It is causing failures in the 'ud-multipart' test.

Revision history for this message
Scott Moser (smoser) wrote :

Even bigger problems, the ssh keys may not get to console output. It seems to affect x86_64 more than i386.

Changed in cloud-init (Ubuntu):
importance: Undecided → High
status: New → Confirmed
tags: added: iso-testing
Revision history for this message
Scott Moser (smoser) wrote :

apw suggests that bug 554172 may be related.

Thierry Carrez (ttx)
Changed in linux-ec2 (Ubuntu Maverick):
assignee: nobody → John Johansen (jjohansen)
tags: added: server-mro
Thierry Carrez (ttx)
Changed in linux-ec2 (Ubuntu Maverick):
importance: Undecided → High
status: New → Confirmed
Scott Moser (smoser)
affects: linux-ec2 (Ubuntu Maverick) → linux (Ubuntu Maverick)
Revision history for this message
Scott Moser (smoser) wrote :

The current belief from the kernel team is that this is fallout from bug 554172.
I've not been able to work around it so far.

It was suggested that we just need to make sure the references to /dev/console never goes to zero. I tried ensuring that with:

$ sudo mv /sbin/init /sbin/init.real
$ cat /sbin/init
#!/bin/sh

/bin/sh -c 'echo "$(date): hi world $$"; exec sleep 100' </dev/console >/dev/console 2>&1
exec /sbin/init.real "$@"
$ sudo chmod 755 /sbin/init

Then, rebooting.
I see the 'hi world' message in console output, and can ssh in while the 'sleep 100' is still running, but I do not get the cloud-init output to the console. I verify that cloud-init wrote to /var/log/cloud-init.log, data like:
[CLOUDINIT] 2010-09-14 19:48:11,763 - cloud-init[INFO]: cloud-init start running: Tue, 14 Sep 2010 19:48:11 +0000. up 2.01 seconds

That *should* also get to stderr, which *should* be /dev/console as the cloud-init job looks like:

# upstart job
start on (mounted MOUNTPOINT=/ and net-device-up IFACE=eth0 and \
   stopped cloud-init-local )
task
console output
exec /usr/bin/cloud-init start

/usr/bin/cloud-init does:
    msg = "cloud-init %s running: %s. up %s seconds" % (cmd, now, uptime)
    sys.stderr.write(msg + "\n")
    sys.stderr.flush()

Revision history for this message
Scott Moser (smoser) wrote :

I'm attaching an init wrapper for debugging purposes. I placed this script in /sbin/init, and moved upstart to /sbin/init.real.
Then, rebooted.

The console shows (snipped):
Loading, please wait...
[ 0.373808] udev[61]: starting version 162
Begin: Loading essential drivers ... done.
Begin: Running /scripts/init-premount ... done.
Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
Begin: Running /scripts/local-premount ... done.
[ 0.534353] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts:
(null)
Begin: Running /scripts/local-bottom ... done.
done.
Begin: Running /scripts/init-bottom ... done.
Tue Sep 14 21:03:16 UTC 2010: hi world 191 ()
 21:03:16 up 0 min, 0 users, load average: 0.00, 0.00, 0.00
total 0
lr-x------ 1 root root 64 Sep 14 21:03 0 -> /dev/null
lrwx------ 1 root root 64 Sep 14 21:03 1 -> /dev/console
lrwx------ 1 root root 64 Sep 14 21:03 2 -> /dev/console
mountall: Disconnected from Plymouth
191 reports i=3
191 reports i=4
191 reports i=5

Note, that 'i=3' this is missing output for 'i=0', 1, and 2;
For 3 seconds, this process was writing data to /dev/console that was getting lost.
echo "wrote to ${out} after $i"

(note, the above console came from a boot before I added redirections of /dev/console for
  echo "$i, could not write to ${out}"
and
  echo "wrote to ${out} after $i"
So, those messages would not have appeared.

Revision history for this message
Scott Moser (smoser) wrote :

Per Andy's advice, I removed the closing of the file descriptors in the wrapper script (attaching here), and rebooted.

The output is in the ec2 console output looks like:

| [ 0.676065] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
| Begin: Running /scripts/local-bottom ... done.
| done.
| Begin: Running /scripts/init-bottom ... done.
| [1]: Wed Sep 15 14:38:21 UTC 2010: hi world 185 ()
| [1]: 14:38:21 up 0 min, 0 users, load average: 0.00, 0.00, 0.00
| [1]: ls -l /proc/185/fd
| total 0
| lr-x------ 1 root root 64 Sep 15 14:38 0 -> /dev/null
| lrwx------ 1 root root 64 Sep 15 14:38 1 -> /dev/console
| lrwx------ 1 root root 64 Sep 15 14:38 2 -> /dev/console
| lr-x------ 1 root root 64 Sep 15 14:38 3 -> pipe:[5788]
| MY_HACK: line 13: /handler: Read-only file system
| [1]: 0, could not write to /handler
| MY_HACK: line 13: /handler: Read-only file system
| [1]: 1, could not write to /handler
| mountall: Disconnected from Plymouth
| [7]: 185 reports i=6
| [8]: 185 reports i=7
| [9]: 185 reports i=8
| [10]: 185 reports i=9

The interesting output information there is that the missing data is replaced by 'mountall: Disconnected from Plymouth'.

Given that, Keybuk suggested looking in /var/log/boot.log. It contained the missing messages.

I then ran:
sudo sh -c 'for x in /etc/init/plymouth*.conf; do mv $x $x.disabled; done'

and rebooted.

The rebooted console log contained all the data.

So, we're looking squarely at plymouth at the moment, not repeating data to the console.

Revision history for this message
Scott Moser (smoser) wrote :

On UEC instances, we boot with 'console=ttyS0'. On EC2 we boot with no 'console=' parameters. I believe that the hvc0 driver has somehow been modified so that it will get data written to /dev/console without any additional effort. I'm not sure of that, jjohansen would know more. One way or another, all "normal" messages ('echo HELLO WORLD | sudo tee /dev/console') correctly get to the ec2 console.

I enabled plymouth by adding '--attach-to-session --debug --debug-file=/var/log/plymouthd-debug.log' in /etc/init/plymouth.conf .

Looking in the output log, I see
EC2:
 [main.c] check_for_consoles:checking for consoles
 [main.c] check_for_consoles:There are currently 0 text displays
 [main.c] redirect_standard_io_to_device:redirecting stdio to /dev/tty7
 ..
 [main.c] check_for_consoles:There are currently 0 text displays

UEC:
 [main.c] check_for_consoles:checking for consoles
 [main.c] check_for_consoles:There are currently 0 text displays
 [main.c] redirect_standard_io_to_device:redirecting stdio to ttyS0
 ..
 [main.c] check_for_consoles:There are currently 1 text displays

I verified, that by adding 'console=hvc0' on ec2, all text correctly will correctly get to the EC2 console.

We do have the ability to boot UEC and EC2 images with different kernel command lines. On EC2 all our 10.10 images and later will be booted from pv-grub, and we generate/manage /boot/grub/menu.list via 'legacy-grub-ec2' package. On UEC, we can control the kernel command line in 1 of 2 situations (the case where we're booting with a grub loader, which only works on 10.10 hosts).

The point of the last paragraph, is that while I *can* change the kernel command line to have a specific console=XXXX at this point, I'd really rather not. Its somewhat happenstance that we could fix this bug. I would much rather be able to configure plymouth to write data to /dev/console just as it will write it to /dev/hvc0 if 'console=hvc0' is passed.

Revision history for this message
Scott Moser (smoser) wrote :

This is functional and tested in maverick AMIs with cloud-init 0.5.15-0ubuntu2 (20100916.1 and later).
I decided that I basically had 3 options:
a.) disable plymouth
b.) fix/add support to plymouth to, when configured as such, write data to /dev/console as it would have written to /dev/tty7 (or /dev/XXXX when it sees console=XXXX on the command line).
c.) boot with console=hvc0

The easiest solution was to boot with console=hvc0. If there comes a time when we are forced to merge multiple platforms (ec2 and eucalyptus) into a single pre-generated bootloader configuration file, then we'll cross that bridge then.
b.) boot

cloud-init (0.5.15-0ubuntu2) maverick; urgency=low
  * grub-legacy-ec2: boot with console=hvc0 (LP: #606373)
 -- Scott Moser <email address hidden> Wed, 15 Sep 2010 16:41:48 -0400

Changed in cloud-init (Ubuntu Maverick):
assignee: nobody → Scott Moser (smoser)
status: Confirmed → Fix Released
Changed in linux (Ubuntu Maverick):
importance: High → Undecided
status: Confirmed → Invalid
Revision history for this message
Colin Watson (cjwatson) wrote :

Since this has been worked around, I'll mark the plymouth task Won't Fix for Maverick, but we should revisit this later to find out why this hack was necessary.

Changed in plymouth (Ubuntu Maverick):
status: New → Won't Fix
Thierry Carrez (ttx)
tags: removed: server-mro
Revision history for this message
Scott Moser (smoser) wrote :

I've opened bug 682831 against plymouth, and nominated it for Lucid.
I believe it must be separate from this bug because booting with 'console=hvc0' does not fix the problem.

I'm only mentioning that new bug for reference.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.