precise kernels do not boot on ec2 without idle=halt

Bug #881076 reported by Scott Moser
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Fix Released
Importance: High
Assigned to: Stefan Bader
Milestone: (none listed)

Bug Description

I tried booting a precise kernel on EC2 today, and here's what I found. I tested
the ubuntu-precise-daily-amd64-server-20111024 and ubuntu-precise-daily-i386-server-20111024 builds, which boot a kernel from:
$ dpkg -S /boot/vmlinuz-3.1.0-1-virtual
linux-image-3.1.0-1-virtual: /boot/vmlinuz-3.1.0-1-virtual

PASS: amd64 | inst | m1.large
FAIL: amd64 | ebs_ | m1.large
FAIL: amd64 | ebs_ | t1.micro
FAIL: i386__| inst | m1.small
FAIL: i386__| ebs_ | t1.micro

That's all I bothered testing. I'll attach the successful log and as many of the other logs as I can, but some of the failed boots produced empty console output.

ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: linux-image-3.1.0-1-virtual 3.1.0-1.1
ProcVersionSignature: Ubuntu 3.1.0-1.1-virtual 3.1.0-rc10
Uname: Linux 3.1.0-1-virtual x86_64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Oct 24 19:49 seq
 crw-rw---- 1 root audio 116, 33 Oct 24 19:49 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 1.24-0ubuntu1
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
CurrentDmesg: [13893341.485666] eth0: no IPv6 routers present
Date: Mon Oct 24 20:04:00 2011
Ec2AMI: ami-db2ae5b2
Ec2AMIManifest: ubuntu-images-testing-us/ubuntu-precise-daily-amd64-server-20111024.manifest.xml
Ec2AvailabilityZone: us-east-1c
Ec2InstanceType: m1.large
Ec2Kernel: aki-825ea7eb
Ec2Ramdisk: unavailable
Lspci:

Lsusb: Error: command ['lsusb'] failed with exit code 1: unable to initialize libusb: -99
ProcEnviron:
 PATH=(custom, user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcKernelCmdLine: root=LABEL=cloudimg-rootfs ro console=hvc0
ProcModules: acpiphp 24080 0 - Live 0x0000000000000000
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)

Revision history for this message
Scott Moser (smoser) wrote :

t1.micro log of failed boot at: http://paste.ubuntu.com/718188/
My failed boots did not produce console output even after reboot or terminate.

Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 881076

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Scott Moser (smoser)
Changed in linux (Ubuntu):
importance: Undecided → High
status: Incomplete → Confirmed
Revision history for this message
Stefan Bader (smb) wrote : Re: precise kernels do not boot on ec2

Hm, wasn't there some issue related to incorrectly using mwait before...? Trying to find something there. The amd64 run that passed definitely does not try to use mwait in idle threads.

Revision history for this message
Stefan Bader (smb) wrote :

Thinking about it more, it may have been the use of some Intel-specific driver before. Local tests show no problem, but I am running an AMD host.

Revision history for this message
Stefan Bader (smb) wrote :

One more detail here, the Xen versions:
m1.large, passed: 3.4.3-2.6.18 (preserve-AD)
m1.large, failed: 3.0.3-rc5-8.1.14.f
t1.micro32, failed: 3.1.2-128.1.10.el5 (preserve-AD)

The primary bug is likely that domU selects mwait_idle. This should be hlt. It may be possible that some change tried to make mwait available for dom0 and does not fence it off correctly for domU. At least there seems to be some relation to the hypervisor version of the host as well.
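Whether a guest even sees the mwait capability can be checked from inside an instance that does boot. The following is a minimal sketch, assuming the guest exposes /proc/cpuinfo and that the MONITOR/MWAIT feature is listed there under the flag name "monitor" (the usual x86 Linux naming). The helper names cpu_flags and advertises_mwait are made up for illustration and are not part of any tool mentioned in this report.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def advertises_mwait(cpuinfo_text):
    """True if the CPU flags include 'monitor' (assumed name of the MONITOR/MWAIT bit)."""
    return "monitor" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("MWAIT advertised:", advertises_mwait(f.read()))
    except OSError as exc:
        # Not a Linux system, or /proc is not mounted.
        print("could not read /proc/cpuinfo:", exc)
```

A guest that reports the flag on an older hypervisor would be a candidate for the mwait_idle selection described above; this check only shows what the guest is told, not which idle routine the kernel actually picked.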

Revision history for this message
Stefan Bader (smb) wrote :

The following commit changes the callers of pm_idle() so that they first try cpuidle_call_idle() and, if that returns non-zero, fall back to
calling pm_idle().

commit a0bfa1373859e9d11dc92561a8667588803e42d8
Author: Len Brown <email address hidden>
Date: Fri Apr 1 19:34:59 2011 -0400

    cpuidle: stop depending on pm_idle

However, cpuidle_call_idle() returns -ENODEV when cpuidle is supposed to be disabled by cpuidle.off, which then causes pm_idle() to be called.

This interacts badly with the following change, which tries to use disabling cpuidle under Xen as the way to fall back to hlt.

commit d91ee5863b71e8c90eaf6035bff3078a85e2e7b5
Author: Len Brown <email address hidden>
Date: Fri Apr 1 18:28:35 2011 -0400

    cpuidle: replace xen access to x86 pm_idle and default_idle

The problem I see is that select_idle_routine() is called from arch/x86/kernel/cpu/common.c, and since the Xen setup does not set pm_idle anymore, the mwait_idle or amd_e400_idle functions can get selected.
In testing, amd_e400_idle in a PVM domU at least does not seem to cause problems immediately, but mwait_idle just causes crashes. From the reports I have, this may be related to older hypervisors (3.1 and older) not clearing the mwait capability. But overall there seems to be something wrong in the interaction.

I am not really sure whether the logic of calling pm_idle() on all errors from cpuidle_call_idle() is already flawed or the assumption in the Xen patch about being able to prevent the wrong idle function by turning cpuidle off is wrong.
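The interaction described in this comment can be sketched as a small model. This is not kernel code: it is a simplified Python illustration that reuses the function names from the comment (the upstream entry point appears to be spelled cpuidle_idle_call(), but the comment's spelling is kept here), and the helper idle_routine_used as well as the boolean parameters are assumptions made for the example.

```python
ENODEV = 19  # errno value for "no such device" on Linux


def select_idle_routine(pm_idle, has_mwait, has_e400_erratum):
    """Simplified model: keep a preset pm_idle, otherwise pick by CPU features."""
    if pm_idle is not None:
        return pm_idle
    if has_mwait:
        return "mwait_idle"
    if has_e400_erratum:
        return "amd_e400_idle"
    return "default_idle"


def cpuidle_call_idle(cpuidle_off):
    """Model of the cpuidle entry point: fails with -ENODEV when cpuidle is off."""
    return -ENODEV if cpuidle_off else 0


def idle_routine_used(pm_idle, cpuidle_off):
    """Model of the idle loop after a0bfa137: try cpuidle, fall back to pm_idle on error."""
    if cpuidle_call_idle(cpuidle_off) != 0:
        return pm_idle
    return "cpuidle"


# Xen PV guest after d91ee586: cpuidle is disabled and pm_idle is left unset,
# so the feature-based selection runs and the fallback ends up in mwait_idle.
pm_idle = select_idle_routine(None, has_mwait=True, has_e400_erratum=False)
print(idle_routine_used(pm_idle, cpuidle_off=True))  # prints "mwait_idle"
```

In this model, turning cpuidle off does not protect the guest by itself: the error path always lands in whatever pm_idle was selected, so the outcome depends entirely on whether the CPU advertises mwait.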

Stefan Bader (smb)
Changed in linux (Ubuntu):
assignee: nobody → Stefan Bader (stefan-bader-canonical)
status: Confirmed → Triaged
Revision history for this message
Stefan Bader (smb) wrote :

A suggested workaround is to add idle=halt to the pv-grub command line while the underlying issue is being resolved. Is that an option to get testing started?

Revision history for this message
Scott Moser (smoser) wrote :

Tomorrow's builds of precise should have 'ide=halt' turned on. I verified that adding that to /boot/grub/menu.lst allowed me to boot on a t1.micro.
ubuntu@ip-10-243-42-222:~$ uname -m
x86_64
ubuntu@ip-10-243-42-222:~$ cat /proc/version_signature
Ubuntu 3.1.0-2.3-virtual 3.1.0
ubuntu@ip-10-243-42-222:~$ cat /proc/cmdline
root=LABEL=cloudimg-rootfs ro console=hvc0 idle=halt
ubuntu@ip-10-243-42-222:~$ ec2metadata --instance-type
t1.micro

Revision history for this message
Scott Moser (smoser) wrote :

Also just verified that an amd64 m1.large still booted with 'ide=halt' so the temporary hack doesn't seem to just trade one for the other.

Revision history for this message
Scott Moser (smoser) wrote :

OK, so as you can see, I actually added 'ide=halt', not 'idle=halt', and somehow that allowed me to boot a t1.micro. So I must have just been lucky.

However, I just tested 'idle=halt' and that doesn't seem to make anything worse, so I'll check that change in.
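A typo like this is easy to miss by eye, so checking the command line programmatically can help. The following is a minimal sketch, assuming a command line made of whitespace-separated tokens with no quoted values; the helper name kernel_param is hypothetical and not part of any tool used in this report.

```python
def kernel_param(cmdline, name):
    """Return the value of the last `name=value` token on a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        # Split at the first '=' only, so values such as LABEL=cloudimg-rootfs stay intact.
        key, sep, val = token.partition("=")
        if sep and key == name:
            value = val
    return value


good = "root=LABEL=cloudimg-rootfs ro console=hvc0 idle=halt"
typo = "root=LABEL=cloudimg-rootfs ro console=hvc0 ide=halt"

print(kernel_param(good, "idle"))  # prints "halt"
print(kernel_param(typo, "idle"))  # prints "None"
```

On a running instance the same function could be applied to the contents of /proc/cmdline to confirm that the workaround is really in effect.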

Revision history for this message
Scott Moser (smoser) wrote :

Ok, testing with cmdline including 'idle=halt' on the following:
 us-east-1 ami-e789418e ubuntu-precise-daily-amd64-server-20111116.2
 us-east-1 ami-e98d4580 ebs/ubuntu-precise-daily-amd64-server-20111116.2
 us-east-1 ami-8f8840e6 ubuntu-precise-daily-i386-server-20111116.2
 us-east-1 ami-038d456a ebs/ubuntu-precise-daily-i386-server-20111116.2
 us-east-1 ami-658f470c hvm/ubuntu-precise-daily-amd64-server-20111116.2

PASS: amd64 | inst | m1.large
PASS: amd64 | ebs_ | m1.large
PASS: amd64 | ebs_ | t1.micro
PASS: i386__| inst | m1.small
PASS: i386__| ebs_ | t1.micro
PASS: amd64 | ebs_ | cc1.4xlarge (hvm)

I have attached a tarball of all the consoles for potential later reference.

In short, we have currently *worked around* this bug.

Scott Moser (smoser)
summary: - precise kernels do not boot on ec2
+ precise kernels do not boot on ec2 without idle=halt
Revision history for this message
Stefan Bader (smb) wrote :

Upstream 3.2-rc5 contains the following commit which is supposed to fix this:

commit e5fd47bfab2df0c2184cc0bf4245d8e1bb7724fb
Author: Konrad Rzeszutek Wilk <email address hidden>
Date: Mon Nov 21 18:02:02 2011 -0500

    xen/pm_idle: Make pm_idle be default_idle under Xen.

This should be in the rebase to rc5 which was done with 3.2.0-4.10. As far as I can see this should be uploaded already.
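The effect of that commit, as its title describes it, can be illustrated with a small model. This is a simplified sketch rather than the actual kernel change: it assumes that the Xen setup code runs before select_idle_routine() and that select_idle_routine() leaves an already set pm_idle alone. The function boot_pm_idle and its parameters are hypothetical names used only for this illustration.

```python
def select_idle_routine(pm_idle, has_mwait):
    """Simplified model: a preset pm_idle wins, otherwise choose by CPU feature."""
    if pm_idle is not None:
        return pm_idle
    return "mwait_idle" if has_mwait else "default_idle"


def boot_pm_idle(is_xen_pv, has_mwait, with_fix):
    """Return the idle routine a modeled boot ends up with."""
    pm_idle = None
    if is_xen_pv and with_fix:
        # Modeled fix: the Xen setup pins pm_idle to default_idle (hlt) up front.
        pm_idle = "default_idle"
    return select_idle_routine(pm_idle, has_mwait)


print(boot_pm_idle(is_xen_pv=True, has_mwait=True, with_fix=False))  # prints "mwait_idle"
print(boot_pm_idle(is_xen_pv=True, has_mwait=True, with_fix=True))   # prints "default_idle"
```

Under these assumptions the guest no longer depends on the hypervisor hiding the mwait capability, which matches the observation that booting without idle=halt works once the fix is in.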

Revision history for this message
Scott Moser (smoser) wrote :

per Stefan's comment just now in #ubuntu-server, this is in 3.2.0-4.10.
I just tested on a t1.micro in us-east-1, and reboot without idle=halt worked.

So, I'm going to back the workaround out of automated-ec2-builds and mark this fix-released.

Changed in linux (Ubuntu):
status: Triaged → Fix Committed
Revision history for this message
Stefan Bader (smb) wrote :

As it is uploaded I set the status to released.

Changed in linux (Ubuntu):
status: Fix Committed → Fix Released