[Regression] Bionic kernel 4.15.0-71.80 can not boot on ThunderX2 and Kunpeng920

Bug #1852723 reported by Ike Panhc
This bug affects 2 people
Affects          Status         Importance   Assigned to   Milestone
linux (Ubuntu)   Invalid        Undecided    Unassigned
Bionic           Fix Released   High         Ike Panhc

Bug Description

[Impact]
4.15.0-71-generic cannot boot on ThunderX2 and Kunpeng920. The kernel oopses during SMP init.

Console logs are available here:
https://pastebin.ubuntu.com/p/By4m3PtKsG/
https://pastebin.ubuntu.com/p/vN2c3CFVXR/

[Test Case]
Boot the kernel with earlycon and check whether the kernel oopses while booting.
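
A minimal sketch of how the check could be scripted against a captured console log. The marker strings are assumptions about what a failed boot typically prints; they are not taken from the pastebin logs above, and `find_oops_lines` is a hypothetical helper name.

```python
# Assumed marker strings for a failed boot; adjust them to match the
# messages that actually appear in your console log.
OOPS_MARKERS = (
    "Internal error: Oops",
    "Unable to handle kernel",
    "Kernel panic - not syncing",
)


def find_oops_lines(console_log: str) -> list[str]:
    """Return the console lines that contain any of the assumed markers."""
    return [
        line
        for line in console_log.splitlines()
        if any(marker in line for marker in OOPS_MARKERS)
    ]


if __name__ == "__main__":
    # Made-up sample for illustration only, not a real log from this bug.
    sample = (
        "[    0.000000] Booting Linux\n"
        "[    0.460300] Internal error: Oops: 96000004 [#1] SMP\n"
    )
    hits = find_oops_lines(sample)
    print("FAIL" if hits else "PASS")
    for line in hits:
        print(line)
```

An empty result only means that none of the assumed markers were seen; a boot that hangs silently would still need a manual look.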

[Regression Risk]
TBD

Ike Panhc (ikepanhc)
description: updated
Changed in linux (Ubuntu Bionic):
assignee: nobody → Ike Panhc (ikepanhc)
Changed in linux (Ubuntu):
status: New → Invalid
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu Bionic):
status: New → Confirmed
Po-Hsu Lin (cypressyew) wrote :

This issue can be reproduced on our ThunderX nodes (starmie-kernel / wright-kernel) used for SRU testing.

Ike Panhc (ikepanhc) wrote :

This issue can also be reproduced on HiSilicon Kunpeng 920 machines, but it is not reproducible on Hi1616 machines.

As far as I know, this can be reproduced on crb1, sabre and d06, but not on d05.

Ike Panhc (ikepanhc) wrote :

I am building test kernels to bisect the patches. I should find the root cause in a day or two.

dann frazier (dannf)
Changed in linux (Ubuntu Bionic):
importance: Undecided → Critical
dann frazier (dannf) wrote :

With earlycon on, I am able to get more output. This looks like a smoking gun:

[ 0.460249] CPU features: CPU1: Detected conflict for capability 11 (Virtualization Host Extensions), System: 0, CPU: 1

git blame for this error message points at:
e988af0188e30 arm64: capabilities: Unify the verification
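
For anyone grepping console logs for the same symptom, here is a small sketch that pulls the fields out of that message. The regular expression is written against the single line quoted above. Reading "System" as the system-wide capability state and the trailing "CPU" as what the secondary CPU detected is my interpretation, and `parse_conflict` is a hypothetical helper name.

```python
import re

# Pattern derived from the one log line quoted in this comment; other
# kernel versions may format the message differently.
CONFLICT_RE = re.compile(
    r"CPU features: CPU(?P<cpu>\d+): Detected conflict for capability "
    r"(?P<cap>\d+) \((?P<desc>[^)]*)\), "
    r"System: (?P<system>\d+), CPU: (?P<local>\d+)"
)


def parse_conflict(line: str):
    """Return the conflict fields as a dict, or None if the line does not match."""
    match = CONFLICT_RE.search(line)
    if match is None:
        return None
    return {
        "cpu": int(match.group("cpu")),
        "capability": int(match.group("cap")),
        "description": match.group("desc"),
        "system_value": int(match.group("system")),
        "cpu_value": int(match.group("local")),
    }


if __name__ == "__main__":
    line = (
        "[ 0.460249] CPU features: CPU1: Detected conflict for capability 11 "
        "(Virtualization Host Extensions), System: 0, CPU: 1"
    )
    print(parse_conflict(line))
```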

Ike Panhc (ikepanhc) wrote :

Thanks Dann,

It's very close.

The root cause is somewhere among these commits, and e988af01 is just a few commits below them.

5d0f174e40a6 <email address hidden> 2019-11-12 19:04:49 +0100 arm64: enable generic CPU vulnerabilites support
f94f9d3a3e8b <email address hidden> 2019-11-12 19:04:49 +0100 arm64: add sysfs vulnerability show for meltdown
c288f6b5788d <email address hidden> 2019-11-12 19:04:48 +0100 arm64: Add sysfs vulnerability show for spectre-v1
f2485ae5fd84 <email address hidden> 2019-11-12 19:04:48 +0100 arm64: fix SSBS sanitization
1931a913df7e <email address hidden> 2019-11-12 19:04:48 +0100 KVM: arm64: Set SCTLR_EL2.DSSBS if SSBD is forcefully disabled and !vhe
fd872fd82e12 <email address hidden> 2019-11-12 19:04:48 +0100 arm64: ssbd: Add support for PSTATE.SSBS rather than trapping to EL3
2a3135c3033c <email address hidden> 2019-11-12 19:04:48 +0100 arm64: cpufeature: Detect SSBS and advertise to userspace
78dc3acb34fa <email address hidden> 2019-11-12 19:04:48 +0100 arm64: Get rid of __smccc_workaround_1_hvc_*
5c43fb65359d <email address hidden> 2019-11-12 19:04:48 +0100 arm64: don't zero DIT on signal return
c6c07232325a <email address hidden> 2019-11-12 19:04:48 +0100 arm64: KVM: Use SMCCC_ARCH_WORKAROUND_1 for Falkor BP hardening
274adba3ccf6 <email address hidden> 2019-11-12 19:04:47 +0100 arm64: capabilities: Add support for checks based on a list of MIDRs
f34e57c35b72 <email address hidden> 2019-11-12 19:04:47 +0100 arm64: Add MIDR encoding for Arm Cortex-A55 and Cortex-A35
8d811d39465c <email address hidden> 2019-11-12 19:04:47 +0100 arm64: Add helpers for checking CPU MIDR against a range
b2eddaf65384 <email address hidden> 2019-11-12 19:04:47 +0100 arm64: capabilities: Clean up midr range helpers
628859e8621c <email address hidden> 2019-11-12 19:04:47 +0100 arm64: capabilities: Change scope of VHE to Boot CPU feature
3bf4ffd98cc4 <email address hidden> 2019-11-12 19:04:47 +0100 arm64: capabilities: Add support for features enabled early

dann frazier (dannf) wrote :

My bisect landed on:
628859e8621cb arm64: capabilities: Change scope of VHE to Boot CPU feature

I'm doing a build w/ that patch reverted to verify.
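
For readers unfamiliar with bisecting, the search can be sketched as a binary search for the first bad commit. The commit IDs below are the ones listed earlier in this thread, reordered oldest first; the real bisect range may have been wider. The `boot_fails` predicate is a hypothetical stand-in: in practice each step means building a test kernel at that commit and booting it on an affected board.

```python
# Commit IDs from the list earlier in this thread, ordered oldest to newest.
COMMITS = [
    "3bf4ffd98cc4", "628859e8621c", "b2eddaf65384", "8d811d39465c",
    "f34e57c35b72", "274adba3ccf6", "c6c07232325a", "5c43fb65359d",
    "78dc3acb34fa", "2a3135c3033c", "fd872fd82e12", "1931a913df7e",
    "f2485ae5fd84", "c288f6b5788d", "f94f9d3a3e8b", "5d0f174e40a6",
]


def first_bad(commits, is_bad):
    """Binary search for the first commit where is_bad() returns True.

    Assumes commits are ordered oldest to newest, the newest one is bad,
    and every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid          # first bad commit is at mid or earlier
        else:
            lo = mid + 1      # first bad commit is after mid
    return commits[lo]


def boot_fails(commit):
    """Hypothetical stand-in for 'build this commit and try to boot it'.

    Simulates the result reported in this thread: everything from
    628859e8621c onward fails to boot.
    """
    return COMMITS.index(commit) >= COMMITS.index("628859e8621c")


if __name__ == "__main__":
    print(first_bad(COMMITS, boot_fails))
```

With 16 candidates this takes four build-and-boot rounds instead of sixteen, which is the whole point of bisecting.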

dann frazier (dannf) wrote :

Yes, reverting just that change seems to fix it for me on a Sabre board.

Ike Panhc (ikepanhc) wrote :

Good news.

Patch 628859e8621cb has been in the upstream kernel since 4.16, but we have no problem on the 5.0 kernel. IIRC we don't have this issue on 4.18 either. This looks like a backport problem, and reverting this patch might be the fastest/best fix.

Ike Panhc (ikepanhc) wrote :

A test kernel with commit 628859e8 reverted is at

  https://kernel.ubuntu.com/~ikepanhc/lp1852723/

Stefan Bader (smb) wrote :

Just to give the reason for lowering the priority: I consider critical those issues which cause loss of data. Not being able to boot is "just" a loss of functionality, and so high.

Changed in linux (Ubuntu Bionic):
importance: Critical → High
Ike Panhc (ikepanhc)
description: updated
Ike Panhc (ikepanhc) wrote :

For more information:

1) 628859e8621cb is part of a large patchset that landed in the 4.16 kernel and the 4.14.151 stable release
2) The 4.14.151 mainline build kernel brings up the SMP CPUs fine (see #5: the kernel oopses during SMP init)
3) The 4.16 mainline build kernel brings up the SMP CPUs fine

So we cannot report that 4.14.151 and 4.16 have this issue. I am going to check whether anything is missing between 4.14 and 4.15.

Andrew Cloke (andrew-cloke) wrote : Re: [Regression] Bionic kernel 4.15.0-71.80 can not boot on ThunderX2, ThunderX and Kunpeng920

Updated title to reflect the fact that this impacts more than one ARM64 server SoC.

summary: - [Regression] Bionic kernel 4.15.0-71.80 can not boot on ThunderX2
+ [Regression] Bionic kernel 4.15.0-71.80 can not boot on ThunderX2,
+ ThunderX and Kunpeng920
Ike Panhc (ikepanhc) wrote :

During the v4.17 merge window, patch 830dcc9f9a7c "arm64: capabilities: Change scope of VHE to Boot CPU feature" was merged into the mainline kernel as part of a patchset with 37 other patches:

65896545b69f <email address hidden> 2018-03-28 15:25:44 +0100 arm64: uaccess: Fix omissions from usercopy whitelist
20b8547277a6 <email address hidden> 2018-03-28 15:20:17 +0100 arm64: fpsimd: Split cpu field out from struct fpsimd_state
7f170499f734 <email address hidden> 2018-03-28 15:20:17 +0100 arm64: tlbflush: avoid writing RES0 bits
2a58fca9a7b4 <email address hidden> 2018-03-27 13:15:49 +0100 arm64: cmpxchg: Include linux/compiler.h in asm/cmpxchg.h
c9406e514b95 <email address hidden> 2018-03-27 13:15:29 +0100 arm64: move percpu cmpxchg implementation from cmpxchg.h to percpu.h
e8a2d040fee5 <email address hidden> 2018-03-27 13:14:54 +0100 arm64: cmpxchg: Include build_bug.h instead of bug.h for BUILD_BUG
8a624f145c0d <email address hidden> 2018-03-27 13:14:49 +0100 arm64: lse: Include compiler_types.h and export.h for out-of-line LL/SC
b4f9b3907487 <email address hidden> 2018-03-27 13:14:43 +0100 arm64: fpsimd: include <linux/init.h> in fpsimd.h
65bd053fbf46 <email address hidden> 2018-03-27 13:13:27 +0100 drivers/perf: arm_pmu_platform: do not warn about affinity on uniprocessor
fcd9f8315e6a <email address hidden> 2018-03-27 13:13:11 +0100 perf: arm_spe: include linux/vmalloc.h for vmap()
3f251cf0abec <email address hidden> 2018-03-27 12:04:51 +0100 Revert "arm64: Revert L1_CACHE_SHIFT back to 6 (64-byte cache line size)"
12eb369125ab <email address hidden> 2018-03-27 11:51:12 +0100 arm64: cpufeature: Avoid warnings due to unused symbols
ece1397cbc89 <email address hidden> 2018-03-26 18:01:44 +0100 arm64: Add work around for Arm Cortex-A55 Erratum 1024718
05abb595bbac <email address hidden> 2018-03-26 18:01:44 +0100 arm64: Delay enabling hardware DBM feature
6e616864f211 <email address hidden> 2018-03-26 18:01:43 +0100 arm64: Add MIDR encoding for Arm Cortex-A55 and Cortex-A35
ba7d9233c219 <email address hidden> 2018-03-26 18:01:43 +0100 arm64: capabilities: Handle shared entries
be5b299830c6 <email address hidden> 2018-03-26 18:01:42 +0100 arm64: capabilities: Add support for checks based on a list of MIDRs
1df310505d6d <email address hidden> 2018-03-26 18:01:42 +0100 arm64: Add helpers for checking CPU MIDR against a range
5e7951ce19ab <email address hidden> 2018-03-26 18:01:42 +0100 arm64: capabilities: Clean up midr range helpers
830dcc9f9a7c <email address hidden> 2018-03-26 18:01:41 +0100 arm64: capabilities: Change scope of VHE to Boot CPU feature
fd9d63da17da <email address hidden> 2018-03-26 18:01:41 +0100 arm64: capabilities: Add support for features enabled early
d3aec8a28be3 <email address hidden> 2018-03-26 18:01:40 +0100 arm64: capabilities: Restrict KPTI detection to boot-time CPUs
5c137714dd8c <email address hidden> 2018-03-26 18:01:40 +0100 arm64: capabilities: Introduce weak features based on local CPU
ed478b3f9e4a <email address hidden> 2018-03-26 18:01:40 +0100 arm64: capabilities: Group handling of features and errata workarounds
fbd890b9b849 <email address hidden> 2018-03-26 18:01:39 +0100 arm64: capabilities: Allow features based on local CPU scope
d69fe9a7e721 suzuki.poul...


Ike Panhc (ikepanhc) wrote :

Compared to the mainline commits, these commits in the stable kernel repository form one patchset. All 19 patches are included.

197a83aaa821 <email address hidden> 2019-10-29 09:17:22 +0100 arm64: capabilities: Add support for checks based on a list of MIDRs
8a82aee7bdfd <email address hidden> 2019-10-29 09:17:22 +0100 arm64: Add MIDR encoding for Arm Cortex-A55 and Cortex-A35
5a5e2f938e2e <email address hidden> 2019-10-29 09:17:21 +0100 arm64: Add helpers for checking CPU MIDR against a range
41b3073644e3 <email address hidden> 2019-10-29 09:17:21 +0100 arm64: capabilities: Clean up midr range helpers
ee0ccd259b4c <email address hidden> 2019-10-29 09:17:20 +0100 arm64: capabilities: Change scope of VHE to Boot CPU feature
6527925caa7f <email address hidden> 2019-10-29 09:17:20 +0100 arm64: capabilities: Add support for features enabled early
808ab828e638 <email address hidden> 2019-10-29 09:17:19 +0100 arm64: capabilities: Restrict KPTI detection to boot-time CPUs
32354dd01c29 <email address hidden> 2019-10-29 09:17:19 +0100 arm64: capabilities: Introduce weak features based on local CPU
f1696036165b <email address hidden> 2019-10-29 09:17:19 +0100 arm64: capabilities: Group handling of features and errata workarounds
33236e444f1c <email address hidden> 2019-10-29 09:17:18 +0100 arm64: capabilities: Allow features based on local CPU scope
0a599aa7daca <email address hidden> 2019-10-29 09:17:18 +0100 arm64: capabilities: Split the processing of errata work arounds
59118c737b47 <email address hidden> 2019-10-29 09:17:17 +0100 arm64: capabilities: Prepare for grouping features and errata work arounds
9e3fa8a15596 <email address hidden> 2019-10-29 09:17:16 +0100 arm64: capabilities: Filter the entries based on a given mask
2a5313330993 <email address hidden> 2019-10-29 09:17:16 +0100 arm64: capabilities: Unify the verification
185b632259e8 <email address hidden> 2019-10-29 09:17:15 +0100 arm64: capabilities: Add flags to handle the conflicts on late CPU
6c21fc25e9b0 <email address hidden> 2019-10-29 09:17:14 +0100 arm64: capabilities: Prepare for fine grained capabilities
e89e2a26f996 <email address hidden> 2019-10-29 09:17:14 +0100 arm64: capabilities: Move errata processing code
d56e7aa41167 <email address hidden> 2019-10-29 09:17:13 +0100 arm64: capabilities: Move errata work around check on boot CPU
0e606f018d76 <email address hidden> 2019-10-29 09:17:12 +0100 arm64: capabilities: Update prototype for enable call back
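
Since the same patch has a different commit ID in each tree, the listings in this thread can be cross-referenced by subject line. Below is a rough sketch of that idea, assuming the one-line format used above (ID, hidden email placeholder, timestamp, subject). The sample lines are copied from the Bionic, mainline and stable listings in this bug; the helper name `subject_index` is hypothetical.

```python
import re

# One listing line: <12-hex id> <email address hidden> <date> <time> <tz> <subject>
LINE_RE = re.compile(
    r"^(?P<sha>[0-9a-f]{12}) <email address hidden> "
    r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4} (?P<subject>.+)$"
)


def subject_index(listing: str) -> dict[str, str]:
    """Map each commit subject to its commit ID; lines that do not parse are skipped."""
    index = {}
    for line in listing.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            index[match.group("subject")] = match.group("sha")
    return index


# Two lines from each listing in this thread.
BIONIC = """\
628859e8621c <email address hidden> 2019-11-12 19:04:47 +0100 arm64: capabilities: Change scope of VHE to Boot CPU feature
3bf4ffd98cc4 <email address hidden> 2019-11-12 19:04:47 +0100 arm64: capabilities: Add support for features enabled early
"""
MAINLINE = """\
830dcc9f9a7c <email address hidden> 2018-03-26 18:01:41 +0100 arm64: capabilities: Change scope of VHE to Boot CPU feature
fd9d63da17da <email address hidden> 2018-03-26 18:01:41 +0100 arm64: capabilities: Add support for features enabled early
"""
STABLE = """\
ee0ccd259b4c <email address hidden> 2019-10-29 09:17:20 +0100 arm64: capabilities: Change scope of VHE to Boot CPU feature
6527925caa7f <email address hidden> 2019-10-29 09:17:20 +0100 arm64: capabilities: Add support for features enabled early
"""

if __name__ == "__main__":
    trees = {
        "bionic": subject_index(BIONIC),
        "mainline": subject_index(MAINLINE),
        "stable": subject_index(STABLE),
    }
    subject = "arm64: capabilities: Change scope of VHE to Boot CPU feature"
    for name, index in trees.items():
        print(name, index.get(subject, "missing"))
```

Matching on the subject is only a heuristic: a backport whose subject was reworded would show up as missing even though the change is present.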

Ike Panhc (ikepanhc) wrote :

On ThunderX, we see a different error message and need to file another bug.

https://pastebin.ubuntu.com/p/DkdBqbBqqD/

summary: - [Regression] Bionic kernel 4.15.0-71.80 can not boot on ThunderX2,
- ThunderX and Kunpeng920
+ [Regression] Bionic kernel 4.15.0-71.80 can not boot on ThunderX2 and
+ Kunpeng920
Ike Panhc (ikepanhc) wrote :
description: updated
Ike Panhc (ikepanhc) wrote :

The patchset is bigger than I expected. There are 45 patches in a patchset for several vulnerability fixes.

$ git log --format=oneline 22a4045c5c58..f2549319bd62 | wc -l
45

Ike Panhc (ikepanhc) wrote :

Found another 2 patches that depend on the arm64 patchset in #18:

fc4eb08e5edc <email address hidden> 2019-11-13 18:47:34 -0500 arm64: kpti: Whitelist HiSilicon Taishan v110 CPUs
31140a50f228 <email address hidden> 2019-11-12 19:04:57 +0100 arm64: Enable workaround for Cavium TX2 erratum 219 when running SMT

Reverting everything from the stable update, including the KPTI and erratum patches, is not a good idea.

Stefan Bader (smb)
Changed in linux (Ubuntu Bionic):
status: Confirmed → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Po-Hsu Lin (cypressyew) wrote :

The 4.15 proposed kernel (4.15.0-72.81) can now be tested on a maas-deployed ThunderX system. Thanks

tags: added: verification-done-bionic
removed: verification-needed-bionic
Ike Panhc (ikepanhc) wrote :

4.15.0-72.81 kernel boots ok on ThunderX2 and Kunpeng 920 machines. Thanks.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-72.81

---------------
linux (4.15.0-72.81) bionic; urgency=medium

  * bionic/linux: 4.15.0-72.81 -proposed tracker (LP: #1854027)

  * [Regression] Bionic kernel 4.15.0-71.80 can not boot on ThunderX
    (LP: #1853326)
    - Revert "arm64: Use firmware to detect CPUs that are not affected by
      Spectre-v2"
    - Revert "arm64: Get rid of __smccc_workaround_1_hvc_*"

  * [Regression] Bionic kernel 4.15.0-71.80 can not boot on ThunderX2 and
    Kunpeng920 (LP: #1852723)
    - SAUCE: arm64: capabilities: Move setup_boot_cpu_capabilities() call to
      correct place

linux (4.15.0-71.80) bionic; urgency=medium

  * bionic/linux: 4.15.0-71.80 -proposed tracker (LP: #1852289)

  * Bionic update: upstream stable patchset 2019-10-29 (LP: #1850541)
    - panic: ensure preemption is disabled during panic()
    - f2fs: use EINVAL for superblock with invalid magic
    - [Config] updateconfigs for USB_RIO500
    - USB: rio500: Remove Rio 500 kernel driver
    - USB: yurex: Don't retry on unexpected errors
    - USB: yurex: fix NULL-derefs on disconnect
    - USB: usb-skeleton: fix runtime PM after driver unbind
    - USB: usb-skeleton: fix NULL-deref on disconnect
    - xhci: Fix false warning message about wrong bounce buffer write length
    - xhci: Prevent device initiated U1/U2 link pm if exit latency is too long
    - xhci: Check all endpoints for LPM timeout
    - usb: xhci: wait for CNR controller not ready bit in xhci resume
    - USB: adutux: fix use-after-free on disconnect
    - USB: adutux: fix NULL-derefs on disconnect
    - USB: adutux: fix use-after-free on release
    - USB: iowarrior: fix use-after-free on disconnect
    - USB: iowarrior: fix use-after-free on release
    - USB: iowarrior: fix use-after-free after driver unbind
    - USB: usblp: fix runtime PM after driver unbind
    - USB: chaoskey: fix use-after-free on release
    - USB: ldusb: fix NULL-derefs on driver unbind
    - serial: uartlite: fix exit path null pointer
    - USB: serial: keyspan: fix NULL-derefs on open() and write()
    - USB: serial: ftdi_sio: add device IDs for Sienna and Echelon PL-20
    - USB: serial: option: add Telit FN980 compositions
    - USB: serial: option: add support for Cinterion CLS8 devices
    - USB: serial: fix runtime PM after driver unbind
    - USB: usblcd: fix I/O after disconnect
    - USB: microtek: fix info-leak at probe
    - USB: dummy-hcd: fix power budget for SuperSpeed mode
    - usb: renesas_usbhs: gadget: Do not discard queues in
      usb_ep_set_{halt,wedge}()
    - usb: renesas_usbhs: gadget: Fix usb_ep_set_{halt,wedge}() behavior
    - USB: legousbtower: fix slab info leak at probe
    - USB: legousbtower: fix deadlock on disconnect
    - USB: legousbtower: fix potential NULL-deref on disconnect
    - USB: legousbtower: fix open after failed reset request
    - USB: legousbtower: fix use-after-free on release
    - staging: vt6655: Fix memory leak in vt6655_probe
    - iio: adc: ad799x: fix probe error handling
    - iio: adc: axp288: Override TS pin bias current for some models
    - iio: light: opt3001: fix mutex unlock race
    - efivar/ssdt: Don't iterate over EFI va...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released