Kernel Panic on EC2 After Upgrading from 14.04 to 16.04 via do-release-upgrade -d

Bug #1573231 reported by Will Buckner
This bug affects 23 people
Affects          Status         Importance   Assigned to        Milestone
linux (Ubuntu)   Fix Released   Critical     Joseph Salisbury
Xenial           Fix Released   Critical     Joseph Salisbury
Yakkety          Fix Released   Critical     Joseph Salisbury

Bug Description

[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Initializing cgroup subsys cpuacct
[ 0.000000] Linux version 4.4.0-21-generic (buildd@lgw01-21) (gcc version 5.3.1 20160413 (Ubuntu 5.3.1-14ubuntu2) ) #37-Ubuntu SMP Mon Apr 18 18:33:37 UTC 2016 (Ubuntu 4.4.0-21.37-generic 4.4.6)
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-21-generic root=UUID=8ea401db-b84b-4cd6-a628-d72f30bbf1e5 ro console=tty1 console=ttyS0
[ 0.000000] KERNEL supported cpus:
[ 0.000000] Intel GenuineIntel
[ 0.000000] AMD AuthenticAMD
[ 0.000000] Centaur CentaurHauls
[ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x01: 'x87 floating point registers'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x02: 'SSE registers'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x04: 'AVX registers'
[ 0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
[ 0.000000] x86/fpu: Using 'eager' FPU context switches.
[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000efffffff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000fc000000-0x00000000ffffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x00000001efffffff] usable
[ 0.000000] NX (Execute Disable) protection: active
[ 0.000000] SMBIOS 2.4 present.
[ 0.000000] Hypervisor detected: Xen
[ 0.000000] Xen version 4.2.
[ 0.000000] Netfront and the Xen platform PCI driver have been compiled for this kernel: unplug emulated NICs.
[ 0.000000] Blkfront and the Xen platform PCI driver have been compiled for this kernel: unplug emulated disks.
[ 0.000000] You might have to change the root device
[ 0.000000] from /dev/hd[a-d] to /dev/xvd[a-d]
[ 0.000000] in your root= kernel command line option
[ 0.000000] e820: last_pfn = 0x1f0000 max_arch_pfn = 0x400000000
[ 0.000000] x86/PAT: Configuration [0-7]: WB WC UC- UC WB WC UC- WT
[ 0.000000] e820: last_pfn = 0xf0000 max_arch_pfn = 0x400000000
[ 0.000000] found SMP MP-table at [mem 0x000fbba0-0x000fbbaf] mapped at [ffff8800000fbba0]
[ 0.000000] Scanning 1 areas for low memory corruption
[ 0.000000] RAMDISK: [mem 0x3407e000-0x36036fff]
[ 0.000000] ACPI: Early table checksum verification disabled
[ 0.000000] ACPI: RSDP 0x00000000000EA020 000024 (v02 Xen )
[ 0.000000] ACPI: XSDT 0x00000000FC00F5A0 000054 (v01 Xen HVM 00000000 HVML 00000000)
[ 0.000000] ACPI: FACP 0x00000000FC00F260 0000F4 (v04 Xen HVM 00000000 HVML 00000000)
[ 0.000000] ACPI: DSDT 0x00000000FC0035E0 00BBF6 (v02 Xen HVM 00000000 INTL 20090123)
[ 0.000000] ACPI: FACS 0x00000000FC0035A0 000040
[ 0.000000] ACPI: FACS 0x00000000FC0035A0 000040
[ 0.000000] ACPI: APIC 0x00000000FC00F360 0000D8 (v02 Xen HVM 00000000 HVML 00000000)
[ 0.000000] ACPI: HPET 0x00000000FC00F4B0 000038 (v01 Xen HVM 00000000 HVML 00000000)
[ 0.000000] ACPI: WAET 0x00000000FC00F4F0 000028 (v01 Xen HVM 00000000 HVML 00000000)
[ 0.000000] ACPI: SSDT 0x00000000FC00F520 000031 (v02 Xen HVM 00000000 INTL 20090123)
[ 0.000000] ACPI: SSDT 0x00000000FC00F560 000031 (v02 Xen HVM 00000000 INTL 20090123)
[ 0.000000] No NUMA configuration found
[ 0.000000] Faking a node at [mem 0x0000000000000000-0x00000001efffffff]
[ 0.000000] NODE_DATA(0) allocated [mem 0x1efff7000-0x1efffbfff]
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff]
[ 0.000000] DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
[ 0.000000] Normal [mem 0x0000000100000000-0x00000001efffffff]
[ 0.000000] Device empty
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009dfff]
[ 0.000000] node 0: [mem 0x0000000000100000-0x00000000efffffff]
[ 0.000000] node 0: [mem 0x0000000100000000-0x00000001efffffff]
[ 0.000000] Initmem setup node 0 [mem 0x0000000000001000-0x00000001efffffff]
[ 0.000000] ACPI: PM-Timer IO Port: 0xb008
[ 0.000000] IOAPIC[0]: apic_id 1, version 17, address 0xfec00000, GSI 0-47
[ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 low level)
[ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 low level)
[ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 low level)
[ 0.000000] Using ACPI (MADT) for SMP configuration information
[ 0.000000] ACPI: HPET id: 0x8086a201 base: 0xfed00000
[ 0.000000] smpboot: Allowing 15 CPUs, 13 hotplug CPUs
[ 0.000000] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
[ 0.000000] PM: Registered nosave memory: [mem 0x0009e000-0x0009ffff]
[ 0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000dffff]
[ 0.000000] PM: Registered nosave memory: [mem 0x000e0000-0x000fffff]
[ 0.000000] PM: Registered nosave memory: [mem 0xf0000000-0xfbffffff]
[ 0.000000] PM: Registered nosave memory: [mem 0xfc000000-0xffffffff]
[ 0.000000] e820: [mem 0xf0000000-0xfbffffff] available for PCI devices
[ 0.000000] Booting paravirtualized kernel on Xen HVM
[ 0.000000] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
[ 0.000000] setup_percpu: NR_CPUS:256 nr_cpumask_bits:256 nr_cpu_ids:15 nr_node_ids:1
[ 0.000000] PERCPU: Embedded 33 pages/cpu @ffff8801e7a00000 s98008 r8192 d28968 u262144
[ 0.000000] PV qspinlock hash table entries: 256 (order: 0, 4096 bytes)
[ 0.000000] Built 1 zonelists in Node order, mobility grouping on. Total pages: 1935240
[ 0.000000] Policy zone: Normal
[ 0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-21-generic root=UUID=8ea401db-b84b-4cd6-a628-d72f30bbf1e5 ro console=tty1 console=ttyS0
[ 0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
[ 0.000000] Memory: 7621704K/7863924K available (8356K kernel code, 1278K rwdata, 3920K rodata, 1476K init, 1292K bss, 242220K reserved, 0K cma-reserved)
[ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=15, Nodes=1
[ 0.000000] Hierarchical RCU implementation.
[ 0.000000] Build-time adjustment of leaf fanout to 64.
[ 0.000000] RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=15.
[ 0.000000] RCU: Adjusting geometry for rcu_fanout_leaf=64, nr_cpu_ids=15
[ 0.000000] NR_IRQS:16640 nr_irqs:952 16
[ 0.000000] xen:events: Using 2-level ABI
[ 0.000000] xen:events: Xen HVM callback vector for event delivery is enabled
[ 0.000000] Console: colour VGA+ 80x25
[ 0.000000] console [tty1] enabled
[ 0.000000] Cannot get hvm parameter CONSOLE_EVTCHN (18): -22!
[ 0.000000] console [ttyS0] enabled
[ 0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 30580167144 ns
[ 0.000000] tsc: Detected 2500.038 MHz processor
[ 0.012000] Calibrating delay loop (skipped), value calculated using timer frequency.. 5000.07 BogoMIPS (lpj=10000152)
[ 0.016003] pid_max: default: 32768 minimum: 301
[ 0.020008] ACPI: Core revision 20150930
[ 0.029145] ACPI: 3 ACPI AML tables successfully acquired and loaded
[ 0.032827] Security Framework initialized
[ 0.036002] Yama: becoming mindful.
[ 0.040021] AppArmor: AppArmor initialized
[ 0.044434] Dentry cache hash table entries: 1048576 (order: 11, 8388608 bytes)
[ 0.053113] Inode-cache hash table entries: 524288 (order: 10, 4194304 bytes)
[ 0.056758] Mount-cache hash table entries: 16384 (order: 5, 131072 bytes)
[ 0.064010] Mountpoint-cache hash table entries: 16384 (order: 5, 131072 bytes)
[ 0.068206] Initializing cgroup subsys io
[ 0.072005] Initializing cgroup subsys memory
[ 0.076009] Initializing cgroup subsys devices
[ 0.080004] Initializing cgroup subsys freezer
[ 0.082912] Initializing cgroup subsys net_cls
[ 0.084004] Initializing cgroup subsys perf_event
[ 0.088004] Initializing cgroup subsys net_prio
[ 0.092004] Initializing cgroup subsys hugetlb
[ 0.095088] Initializing cgroup subsys pids
[ 0.096059] CPU: Physical Processor ID: 0
[ 0.100763] mce: CPU supports 2 MCE banks
[ 0.104024] Last level iTLB entries: 4KB 512, 2MB 8, 4MB 8
[ 0.108002] Last level dTLB entries: 4KB 512, 2MB 0, 4MB 0, 1GB 4
[ 0.113286] Freeing SMP alternatives memory: 28K (ffffffff820b2000 - ffffffff820b9000)
[ 0.125920] ftrace: allocating 31878 entries in 125 pages
[ 0.160059] divide error: 0000 [#1] SMP
[ 0.163586] Modules linked in:
[ 0.164000] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.4.0-21-generic #37-Ubuntu
[ 0.164000] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/07/2015
[ 0.164000] task: ffff8801e6618000 ti: ffff8801e6620000 task.ti: ffff8801e6620000
[ 0.164000] RIP: 0010:[<ffffffff81f6f5de>] [<ffffffff81f6f5de>] smp_store_boot_cpu_info+0x51/0x17f
[ 0.164000] RSP: 0000:ffff8801e6623ea8 EFLAGS: 00010286
[ 0.164000] RAX: 000000000000000e RBX: ffffffff81f34f60 RCX: 0000000000000000
[ 0.164000] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8801e7a0a180
[ 0.164000] RBP: ffff8801e6623ec8 R08: 0000000000000000 R09: 0000000000007fff
[ 0.164000] R10: ffffffff81a11ee0 R11: ffffffff81a11ec0 R12: 00000000ffffffff
[ 0.164000] R13: 000000000000a0a0 R14: 000000000000a192 R15: 0000000000000000
[ 0.164000] FS: 0000000000000000(0000) GS:ffff8801e7a00000(0000) knlGS:0000000000000000
[ 0.164000] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.164000] CR2: ffff8801effff000 CR3: 0000000001e0a000 CR4: 00000000001406f0
[ 0.164000] Stack:
[ 0.164000] ffffffff81f34f60 0000000000000100 000000000000a0a0 0000000000000000
[ 0.164000] ffff8801e6623ef8 ffffffff81f6f763 ffffffff82089ef8 ffff8801e66186a8
[ 0.164000] 0000000000000001 0000000000000000 ffff8801e6623f08 ffffffff81f64e10
[ 0.164000] Call Trace:
[ 0.164000] [<ffffffff81f6f763>] native_smp_prepare_cpus+0x57/0x2eb
[ 0.164000] [<ffffffff81f64e10>] xen_hvm_smp_prepare_cpus+0x9/0x2e
[ 0.164000] [<ffffffff81f5a0e5>] kernel_init_freeable+0xb3/0x212
[ 0.164000] [<ffffffff81817f30>] ? rest_init+0x80/0x80
[ 0.164000] [<ffffffff81817f3e>] kernel_init+0xe/0xe0
[ 0.164000] [<ffffffff8182488f>] ret_from_fork+0x3f/0x70
[ 0.164000] [<ffffffff81817f30>] ? rest_init+0x80/0x80
[ 0.164000] Code: 53 41 83 cc ff 49 c7 c6 92 a1 00 00 48 89 c7 f3 a5 66 c7 80 da 00 00 00 00 00 0f b7 35 b4 36 fc ff 8b 05 16 88 27 00 8d 44 06 ff <f7> f6 31 d2 89 05 78 4a fc ff 8d 86 ff 7f 00 00 f7 f6 be c0 00
[ 0.164000] RIP [<ffffffff81f6f5de>] smp_store_boot_cpu_info+0x51/0x17f
[ 0.164000] RSP <ffff8801e6623ea8>
[ 0.336005] ---[ end trace 1a9aebc980234339 ]---
[ 0.340017] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[ 0.340017]

The system won't boot at all, so I'm limited in what additional information I can collect. If the log above isn't enough, let me know what else is needed and how to gather it. Thanks!

Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1573231/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
affects: ubuntu → linux (Ubuntu)
Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1573231

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: trusty
Revision history for this message
Will Buckner (wbuckner) wrote :

Sorry, I don't think there's any possible way to do it. The machine won't boot. I could try to reproduce the issue again on a different VM, but I still wouldn't be able to collect logs.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Rasmus Larsen (rla-2) wrote :

Currently also encountering this issue (the same kernel panic) with ami-3079f543 on eu-west-1 with a c3.large instance.

So this seems to be a general Ubuntu 16.04 on AWS issue rather than an upgrade-specific one. I haven't seen this issue with the beta2 AMIs.

Interestingly, I seem to be able to start the AMI on t2.micro instances, so this could well be instance-type specific.

Revision history for this message
Rasmus Larsen (rla-2) wrote :

I've tested some instance types with the ami-3079f543 image (i.e. Ubuntu 16.04 final with Ubuntu 4.4.0-21.37-generic 4.4.6).

The following instance types fail to start with a kernel panic:

r3.large
c3.large

The following instance types seem to work:

t2.micro
c4.large
m3.medium

I haven't tested with all sizes, but I suspect all t2, c4 and m3 instances are good.

Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-da-key
Revision history for this message
Lasse Westh-Nielsen (lassewesthneo) wrote :

FWIW, m3.large is also affected.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.6 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.6-rc5-wily/

Changed in linux (Ubuntu Xenial):
importance: Undecided → High
status: New → Triaged
Changed in linux (Ubuntu):
status: Confirmed → Triaged
Changed in linux (Ubuntu Xenial):
status: Triaged → Confirmed
Changed in linux (Ubuntu):
status: Triaged → Confirmed
tags: added: kernel-key
Revision history for this message
Rasmus Larsen (rla-2) wrote :

I can still reproduce this in 4.6-rc5. See attached boot log.

I might do a bisect on this if I find the time...

tags: added: kernel-bug-exists-upstream
Revision history for this message
Rasmus Larsen (rla-2) wrote :

Bisecting...

Current candidate seems to be 31c2013e4ea2e594522980acc3d20e88664b19f1

Revision history for this message
Rasmus Larsen (rla-2) wrote :

According to my bisect this commit is responsible:

commit 8ae10f463b7ae3455e9d0507176349c76580995f
Author: Vikas Shivappa <email address hidden>
Date: Thu Mar 10 15:32:09 2016 -0800

    perf/x86/mbm: Add Intel Memory B/W Monitoring enumeration and init

    BugLink: http://bugs.launchpad.net/bugs/1397880

    The MBM init patch enumerates the Intel MBM (Memory b/w monitoring)
    and initializes the perf events and datastructures for monitoring the
    memory b/w.

    Its based on original patch series by Tony Luck and Kanaka Juvva.

    Memory bandwidth monitoring (MBM) provides OS/VMM a way to monitor
    bandwidth from one level of cache to another. The current patches
    support L3 external bandwidth monitoring. It supports both 'local
    bandwidth' and 'total bandwidth' monitoring for the socket. Local
    bandwidth measures the amount of data sent through the memory controller
    on the socket and total b/w measures the total system bandwidth.

    Extending the cache quality of service monitoring (CQM) we add two
    more events to the perf infrastructure:

      intel_cqm_llc/local_bytes - bytes sent through local socket memory controller
      intel_cqm_llc/total_bytes - total L3 external bytes sent

    The tasks are associated with a Resouce Monitoring ID (RMID) just like
    in CQM and OS uses a MSR write to indicate the RMID of the task during
    scheduling.

    Signed-off-by: Vikas Shivappa <email address hidden>
    Signed-off-by: Peter Zijlstra (Intel) <email address hidden>
    Reviewed-by: Tony Luck <email address hidden>
    Acked-by: Thomas Gleixner <email address hidden>
    Cc: Alexander Shishkin <email address hidden>
    Cc: Andy Lutomirski <email address hidden>
    Cc: Arnaldo Carvalho de Melo <email address hidden>
    Cc: Borislav Petkov <email address hidden>
    Cc: Brian Gerst <email address hidden>
    Cc: David Ahern <email address hidden>
    Cc: Denys Vlasenko <email address hidden>
    Cc: H. Peter Anvin <email address hidden>
    Cc: Jiri Olsa <email address hidden>
    Cc: Linus Torvalds <email address hidden>
    Cc: Matt Fleming <email address hidden>
    Cc: Namhyung Kim <email address hidden>
    Cc: Peter Zijlstra <email address hidden>
    Cc: Stephane Eranian <email address hidden>
    Cc: Vince Weaver <email address hidden>
    Cc: <email address hidden>
    Cc: <email address hidden>
    Cc: <email address hidden>
    Cc: <email address hidden>
    Link: http://lkml<email address hidden>
    Signed-off-by: Ingo Molnar <email address hidden>
    (back ported from commit 33c3cc7acfd95968d74247f1a4e1b0727a07ed43)
    Signed-off-by: Tim Gardner <email address hidden>

     Conflicts:
        arch/x86/kernel/cpu/common.c

Revision history for this message
Rasmus Larsen (rla-2) wrote :

Verified it on multiple instances.

31c2013e4ea2e594522980acc3d20e88664b19f1 is the last good commit.
8ae10f463b7ae3455e9d0507176349c76580995f is the first bad commit.

Changed in linux (Ubuntu):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu):
status: Confirmed → In Progress
Changed in linux (Ubuntu Xenial):
status: Confirmed → In Progress
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built a test kernel with a revert of 33c3cc7 (8ae10f463 in Xenial). To revert this commit, I also had to revert e7ee3e8, 2d4de83 and 87f01cc.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1573231/

Can you test this kernel and see if it resolves this bug?

Revision history for this message
Rasmus Larsen (rla-2) wrote :

I get the same issue with this build, which seemed odd, so I retested. It looks like I mislabeled one of my builds: when I now build 31c2013e4ea2e594522980acc3d20e88664b19f1 I still get the issue, but if I go back to a6ebb4464659d35e0516900a8493f9720dd77d67 it works.
Somehow I must have swapped some builds...

a6ebb4464659d35e0516900a8493f9720dd77d67 Good
31c2013e4ea2e594522980acc3d20e88664b19f1 Bad
8ae10f463b7ae3455e9d0507176349c76580995f Bad

Really sorry about the detour Joseph.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built a test kernel with a revert of 1f12e32f4. To revert this commit, I also had to revert e7ee3e8, 2d4de83, 87f01cc and 33c3cc7.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1573231/

Can you test this kernel and see if it resolves this bug?

Revision history for this message
Rasmus Larsen (rla-2) wrote :

I can confirm that I can boot with this kernel on a c3.large instance.

If anyone else wants to verify this, the procedure is to boot one of the failing instances with the Beta 2 AMIs and install the build (and reboot).

Revision history for this message
Joseph Salisbury (jsalisbury) wrote : [v4.6-rc1 Regression] x86/topology: Create logical package id

Hi Thomas,

A kernel bug report was opened against Ubuntu [0]. After a kernel
bisect, it was found that reverting the following commit resolved this bug:

commit 1f12e32f4cd5243ae46d8b933181be0d022c6793
Author: Thomas Gleixner <email address hidden>
Date: Mon Feb 22 22:19:15 2016 +0000

    x86/topology: Create logical package id

To build successfully with this commit reverted, I also had to revert
commits: e7ee3e8,2d4de83,87f01cc and 33c3cc7.

The regression was introduced as of v4.6-rc1.

I was hoping to get your feedback, since you are the patch author. Do
you think gathering any additional data will help diagnose this issue,
or would it be best to submit a revert request?

Thanks,

Joe

[0] http://pad.lv/1573231

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Comment #16 is the message I sent upstream.

Revision history for this message
tglx (tglx) wrote :

On Fri, 6 May 2016, Joseph Salisbury wrote:
> A kernel bug report was opened against Ubuntu [0]. After a kernel
> bisect, it was found that reverting the following commit resolved this bug:
>
> commit 1f12e32f4cd5243ae46d8b933181be0d022c6793
> Author: Thomas Gleixner <email address hidden>
> Date: Mon Feb 22 22:19:15 2016 +0000
>
> x86/topology: Create logical package id
>
> To build successfully with this commit reverted, I also had to revert
> commits: e7ee3e8,2d4de83,87f01cc and 33c3cc7.
>
> The regression was introduced as of v4.6-rc1.
>
> I was hoping to get your feedback, since you are the patch author. Do
> you think gathering any additional data will help diagnose this issue,
> or would it be best to submit a revert request?

Yuck. That dies with a divide error. And that looks like XEN is supplying crap
data in the CPUID.

Does the patch below cure the issue?

Thanks,

        tglx

8<---------------

--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -332,6 +332,11 @@ static void __init smp_init_package_map(
 	 * primary cores.
 	 */
 	ncpus = boot_cpu_data.x86_max_cores;
+	if (!ncpus) {
+		pr_warn("x86_max_cores == zero !?!?");
+		ncpus = 1;
+	}
+
 	__max_logical_packages = DIV_ROUND_UP(total_cpus, ncpus);
 
 	/*

Revision history for this message
Nish Aravamudan (nacc) wrote :

@Joe, based upon the identified commit, I wonder if it would be worth testing backports of follow-on fixes specifically for that commit?

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=b5d5f27d938fb6fc8d3202704e699d2694a02da6
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=63d1e995be455ae9196270eb4b789de21afd42ed
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=b5d5f27d938fb6fc8d3202704e699d2694a02da6

Not sure if all of them are necessary but the second one esp. might be relevant as it changes the divisor.

Revision history for this message
Nish Aravamudan (nacc) wrote :

Err, apologies Joe. On a more careful reading of the bug report, it does seem that the second mentioned commit (comment #18) specifically introduced the regression? That seems unlikely at best given its contents, right? That is, if the divisor was zero after that commit, it was zero before it too, unless there's another, implicit division going on.

Revision history for this message
Boris Ostrovsky (boris-ostrovsky) wrote :

On 05/06/2016 02:48 PM, Thomas Gleixner wrote:
> On Fri, 6 May 2016, Joseph Salisbury wrote:
>> A kernel bug report was opened against Ubuntu [0]. After a kernel
>> bisect, it was found that reverting the following commit resolved this bug:
>>
>> commit 1f12e32f4cd5243ae46d8b933181be0d022c6793
>> Author: Thomas Gleixner <email address hidden>
>> Date: Mon Feb 22 22:19:15 2016 +0000
>>
>> x86/topology: Create logical package id
>>
>> To build successfully with this commit reverted, I also had to revert
>> commits: e7ee3e8,2d4de83,87f01cc and 33c3cc7.
>>
>> The regression was introduced as of v4.6-rc1.
>>
>> I was hoping to get your feedback, since you are the patch author. Do
>> you think gathering any additional data will help diagnose this issue,
>> or would it be best to submit a revert request?
> Yuck. That dies with a divide error. And that looks like XEN is supplying crap
> data in the CPUID.

Joe, do you have

ed6069b xen/apic: Provide Xen-specific version of cpu_present_to_apicid
APIC op

-boris


Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

On 05/06/2016 03:13 PM, Boris Ostrovsky wrote:
> On 05/06/2016 02:48 PM, Thomas Gleixner wrote:
>> On Fri, 6 May 2016, Joseph Salisbury wrote:
>>> A kernel bug report was opened against Ubuntu [0]. After a kernel
>>> bisect, it was found that reverting the following commit resolved this bug:
>>>
>>> commit 1f12e32f4cd5243ae46d8b933181be0d022c6793
>>> Author: Thomas Gleixner <email address hidden>
>>> Date: Mon Feb 22 22:19:15 2016 +0000
>>>
>>> x86/topology: Create logical package id
>>>
>>> To build successfully with this commit reverted, I also had to revert
>>> commits: e7ee3e8,2d4de83,87f01cc and 33c3cc7.
>>>
>>> The regression was introduced as of v4.6-rc1.
>>>
>>> I was hoping to get your feedback, since you are the patch author. Do
>>> you think gathering any additional data will help diagnose this issue,
>>> or would it be best to submit a revert request?
>> Yuck. That dies with a divide error. And that looks like XEN is supplying crap
>> data in the CPUID.
> Joe, do you have
>
> ed6069b xen/apic: Provide Xen-specific version of cpu_present_to_apicid
> APIC op
>
> -boris
Yes the commit is in the 4.4 based Ubuntu kernel. This bug also happens
with the vanilla 4.6-rc5 kernel, which also has that commit.


Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

On 05/06/2016 02:48 PM, Thomas Gleixner wrote:
> On Fri, 6 May 2016, Joseph Salisbury wrote:
>> A kernel bug report was opened against Ubuntu [0]. After a kernel
>> bisect, it was found that reverting the following commit resolved this bug:
>>
>> commit 1f12e32f4cd5243ae46d8b933181be0d022c6793
>> Author: Thomas Gleixner <email address hidden>
>> Date: Mon Feb 22 22:19:15 2016 +0000
>>
>> x86/topology: Create logical package id
>>
>> To build successfully with this commit reverted, I also had to revert
>> commits: e7ee3e8,2d4de83,87f01cc and 33c3cc7.
>>
>> The regression was introduced as of v4.6-rc1.
>>
>> I was hoping to get your feedback, since you are the patch author. Do
>> you think gathering any additional data will help diagnose this issue,
>> or would it be best to submit a revert request?
> Yuck. That dies with a divide error. And that looks like XEN is supplying crap
> data in the CPUID.
>
> Does the patch below cure the issue?
>
> Thanks,
>
> tglx
>
> 8<---------------
>
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -332,6 +332,11 @@ static void __init smp_init_package_map(
> * primary cores.
> */
> ncpus = boot_cpu_data.x86_max_cores;
> + if (!ncpus) {
> + pr_warn("x86_max_cores == zero !?!?");
> + ncpus = 1;
> + }
> +
> __max_logical_packages = DIV_ROUND_UP(total_cpus, ncpus);
>
> /*
I'll have this patch tested and report back.

Thanks,

Joe

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I built a test kernel with the patch suggested by Thomas Gleixner. The test kernel can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1573231/

Can you test this kernel and see if it resolves this bug?

Revision history for this message
Rasmus Larsen (rla-2) wrote :

The kernel boots correctly on a c3.large with Thomas' patch.

I've attached the output of the cpuid command (apt-get install cpuid) from the instance, in case it is of interest.

Revision history for this message
Rasmus Larsen (rla-2) wrote :

CPUID output posted before was truncated...

Revision history for this message
Boris Ostrovsky (boris-ostrovsky) wrote :

On 05/06/2016 03:38 PM, Joseph Salisbury wrote:
> On 05/06/2016 03:13 PM, Boris Ostrovsky wrote:
>> On 05/06/2016 02:48 PM, Thomas Gleixner wrote:
>>>
>>> Yuck. That dies with a divide error. And that looks like XEN is supplying crap
>>> data in the CPUID.
>> Joe, do you have
>>
>> ed6069b xen/apic: Provide Xen-specific version of cpu_present_to_apicid
>> APIC op
>>
>> -boris
> Yes the commit is in the 4.4 based Ubuntu kernel. This bug also happens
> with the vanilla 4.6-rc5 kernel, which also has that commit.

Can you post guest's cpuid -1 -r ? (I guess after you verify Thomas' patch)

Thanks.
-boris

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

On 05/06/2016 03:38 PM, Joseph Salisbury wrote:
> On 05/06/2016 02:48 PM, Thomas Gleixner wrote:
>> On Fri, 6 May 2016, Joseph Salisbury wrote:
>>> A kernel bug report was opened against Ubuntu [0]. After a kernel
>>> bisect, it was found that reverting the following commit resolved this bug:
>>>
>>> commit 1f12e32f4cd5243ae46d8b933181be0d022c6793
>>> Author: Thomas Gleixner <email address hidden>
>>> Date: Mon Feb 22 22:19:15 2016 +0000
>>>
>>> x86/topology: Create logical package id
>>>
>>> To build successfully with this commit reverted, I also had to revert
>>> commits: e7ee3e8,2d4de83,87f01cc and 33c3cc7.
>>>
>>> The regression was introduced as of v4.6-rc1.
>>>
>>> I was hoping to get your feedback, since you are the patch author. Do
>>> you think gathering any additional data will help diagnose this issue,
>>> or would it be best to submit a revert request?
>> Yuck. That dies with a divide error. And that looks like XEN is supplying crap
>> data in the CPUID.
>>
>> Does the patch below cure the issue?
>>
>> Thanks,
>>
>> tglx
>>
>> 8<---------------
>>
>> --- a/arch/x86/kernel/smpboot.c
>> +++ b/arch/x86/kernel/smpboot.c
>> @@ -332,6 +332,11 @@ static void __init smp_init_package_map(
>> * primary cores.
>> */
>> ncpus = boot_cpu_data.x86_max_cores;
>> + if (!ncpus) {
>> + pr_warn("x86_max_cores == zero !?!?");
>> + ncpus = 1;
>> + }
>> +
>> __max_logical_packages = DIV_ROUND_UP(total_cpus, ncpus);
>>
>> /*
> I'll have this patch tested and report back.
>
> Thanks,
>
> Joe
Yes, your patch does in fact fix the bug. Would you like any additional
information regarding the bug?

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

On 05/06/2016 04:46 PM, Boris Ostrovsky wrote:
> On 05/06/2016 03:38 PM, Joseph Salisbury wrote:
>> On 05/06/2016 03:13 PM, Boris Ostrovsky wrote:
>>> On 05/06/2016 02:48 PM, Thomas Gleixner wrote:
>>>> Yuck. That dies with a divide error. And that looks like XEN is supplying crap
>>>> data in the CPUID.
>>> Joe, do you have
>>>
>>> ed6069b xen/apic: Provide Xen-specific version of cpu_present_to_apicid
>>> APIC op
>>>
>>> -boris
>> Yes the commit is in the 4.4 based Ubuntu kernel. This bug also happens
>> with the vanilla 4.6-rc5 kernel, which also has that commit.
>
> Can you post guest's cpuid -1 -r ? (I guess after you verify Thomas' patch)
>
> Thanks.
> -boris
>
>
>
Thomas' patch does resolve the bug. The cpuid info can be seen here:
https://launchpadlibrarian.net/258234267/cpuid_full.txt

Thanks,

Joe

Boris Ostrovsky (boris-ostrovsky) wrote :

On 05/06/2016 04:51 PM, Joseph Salisbury wrote:
> Thomas' patch does resolve the bug. The cpuid info can be seen here:
> https://launchpadlibrarian.net/258234267/cpuid_full.txt

Any chance you could post it raw (cpuid -1 -r)?

Thanks.
-boris

Rasmus Larsen (rla-2) wrote :

Here you go, cpuid -1 -r output.

Changed in linux (Ubuntu Xenial):
importance: High → Critical
Changed in linux (Ubuntu Yakkety):
importance: High → Critical
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Yakkety):
status: In Progress → Fix Committed
Alexander Temerev (sorhed) wrote :

Hi — any idea when the AMIs will be updated?

Xenial can't be launched on half of AWS instances so far. :(

Rasmus Larsen (rla-2) wrote :

Well, they're probably waiting for the patch to hit the LTS kernel 4.4 branch, which will probably happen on the 18th May... Or something like that.

But you can easily run it until then...

Use the 16.04 Beta 2 AMIs, but just disable kernel updates (say, with "apt-mark hold") and update everything else. That will basically give you all the final stuff, just without the last few kernel updates, including the one that broke most EC2 instances.

When a working kernel gets here, just apt-mark unhold and you should be good.

You might want to temporarily disable automatic security updates until then if you do it though...

JP Barbosa (jpbarbosa) wrote :

Hello,

I'm getting a kernel panic on m3.large with these AMIs:

- ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-20160420.3 (ami-840910ee)
- ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-20160516.1 (ami-13be557e)

"Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b"

Both use HVM. With the Xenial paravirtual AMI I don't get a kernel panic.

Kamal Mostafa (kamalmostafa) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'.

If verification is not done within 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Rasmus Larsen (rla-2) wrote :

I've verified that the linux-image-4.4.0-23-generic image in xenial-proposed boots correctly on a c3.large ec2 instance.
uname -v: #41-Ubuntu SMP Mon May 16 23:04:25 UTC 2016

root@ip-10-33-53-135:~# apt-cache showpkg linux-image-4.4.0-23-generic
Package: linux-image-4.4.0-23-generic
Versions:
4.4.0-23.41 (/var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_xenial-proposed_main_binary-amd64_Packages)
 Description Language:
                 File: /var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_xenial-proposed_main_binary-amd64_Packages
                  MD5: 301993998617cbcae03feb3fe5d3aa55
 Description Language: en
                 File: /var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_xenial-proposed_main_i18n_Translation-en
                  MD5: 301993998617cbcae03feb3fe5d3aa55

tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-23.41

---------------
linux (4.4.0-23.41) xenial; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1582431

  * zfs: disable module checks for zfs when cross-compiling (LP: #1581127)
    - [Packaging] disable zfs module checks when cross-compiling

  * Xenial update to v4.4.10 stable release (LP: #1580754)
    - Revert "UBUNTU: SAUCE: (no-up) ACPICA: Dispatcher: Update thread ID for
      recursive method calls"
    - Revert "UBUNTU: SAUCE: nbd: ratelimit error msgs after socket close"
    - Revert: "powerpc/tm: Check for already reclaimed tasks"
    - RDMA/iw_cxgb4: Fix bar2 virt addr calculation for T4 chips
    - ipvs: handle ip_vs_fill_iph_skb_off failure
    - ipvs: correct initial offset of Call-ID header search in SIP persistence
      engine
    - ipvs: drop first packet to redirect conntrack
    - mfd: intel-lpss: Remove clock tree on error path
    - nbd: ratelimit error msgs after socket close
    - ata: ahci_xgene: dereferencing uninitialized pointer in probe
    - mwifiex: fix corner case association failure
    - CNS3xxx: Fix PCI cns3xxx_write_config()
    - clk-divider: make sure read-only dividers do not write to their register
    - soc: rockchip: power-domain: fix err handle while probing
    - clk: rockchip: free memory in error cases when registering clock branches
    - clk: meson: Fix meson_clk_register_clks() signature type mismatch
    - clk: qcom: msm8960: fix ce3_core clk enable register
    - clk: versatile: sp810: support reentrance
    - clk: qcom: msm8960: Fix ce3_src register offset
    - lpfc: fix misleading indentation
    - ath9k: ar5008_hw_cmn_spur_mitigate: add missing mask_m & mask_p
      initialisation
    - mac80211: fix statistics leak if dev_alloc_name() fails
    - tracing: Don't display trigger file for events that can't be enabled
    - MD: make bio mergeable
    - Minimal fix-up of bad hashing behavior of hash_64()
    - mm, cma: prevent nr_isolated_* counters from going negative
    - mm/zswap: provide unique zpool name
    - ARM: EXYNOS: Properly skip unitialized parent clock in power domain on
    - ARM: SoCFPGA: Fix secondary CPU startup in thumb2 kernel
    - xen: Fix page <-> pfn conversion on 32 bit systems
    - xen/balloon: Fix crash when ballooning on x86 32 bit PAE
    - xen/evtchn: fix ring resize when binding new events
    - HID: wacom: Add support for DTK-1651
    - HID: Fix boot delay for Creative SB Omni Surround 5.1 with quirk
    - Input: zforce_ts - fix dual touch recognition
    - proc: prevent accessing /proc/<PID>/environ until it's ready
    - mm: update min_free_kbytes from khugepaged after core initialization
    - batman-adv: fix DAT candidate selection (must use vid)
    - batman-adv: Check skb size before using encapsulated ETH+VLAN header
    - batman-adv: Fix broadcast/ogm queue limit on a removed interface
    - batman-adv: Reduce refcnt of removed router when updating route
    - writeback: Fix performance regression in wb_over_bg_thresh()
    - MAINTAINERS: Remove asterisk from EFI directory names
    - x86/tsc: Read all ratio bits from MSR_PLATFORM_INFO
    - ARM: cpuidle: Pass on arm_cpuidle_s...

Changed in linux (Ubuntu Yakkety):
status: Fix Committed → Fix Released
Brad Figg (brad-figg)
tags: added: jalapeno
cfelde (cfelde) wrote :

Any update on when this is going to get released for Xenial?

Travis Johnson (conslo) wrote :

The kernel version that contains this fix is in proposed, but it has urgency=low, which seems to contradict the "Critical" importance of this bug. I'm not sure whether that should be changed, or how.

Seth Arnold (seth-arnold) wrote :

Travis, the urgency= field is completely ignored in all Ubuntu tooling.

Andrey Kislyuk (weaver) wrote :

The Ubuntu 14.04 cloud images remain broken on m3.large and c3.large. Is there a way to expedite building them with the fix? It seems the Ubuntu release process is broken if bugs like this make it into the LTS images and then don't get a clear timeline or priority for a fix.

Travis Johnson (conslo) wrote :

According to this page, the fix is in proposed: https://launchpad.net/ubuntu/+source/linux. But when I enable the proposed repository, the version there is old. So the version on that page must be just the source version, and the binary package hasn't actually been published to the repo yet?

Clearly I don't understand how this proposed/update/release process works. I'd love to be able to install this from proposed and bake my own base AMI.

I'm going to find time to poke in IRC and see if someone can explain this process/stages to me, but regardless I'd love to know a timeline or at least where this sits in terms of priorities to be released (be it in proposed or preferably updates).

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-24.43

---------------
linux (4.4.0-24.43) xenial; urgency=low

  [ Kamal Mostafa ]

  * CVE-2016-1583 (LP: #1588871)
    - ecryptfs: fix handling of directory opening
    - SAUCE: proc: prevent stacking filesystems on top
    - SAUCE: ecryptfs: forbid opening files without mmap handler
    - SAUCE: sched: panic on corrupted stack end

  * arm64: statically link rtc-efi (LP: #1583738)
    - [Config] Link rtc-efi statically on arm64

 -- Kamal Mostafa <email address hidden> Fri, 03 Jun 2016 10:02:16 -0700

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released