amdgpu module crash after 5.15 kernel update

Bug #1981883 reported by Henry Goffin
This bug affects 1 person
Affects         Status        Importance  Assigned to  Milestone
linux (Ubuntu)  Invalid       Undecided   Unassigned
Jammy           Fix Released  High        Unassigned

Bug Description

[SRU Justification]

Impact: The 5.15 kernel series introduced a regression compared to 5.13 in the virtual display code for AMD GPUs. This causes crashes for cloud/VM users on releases as far back as Focal (when using rolling cloud or HWE kernels).

Fix: Upstream stable 5.15.61 contains a fix for this. It is needed only in the 5.15 series, since later kernel versions have that code rewritten again. This change would normally get picked up in the next stable cycle after 2022.09.19, but given the impact we would apply the following patch from 5.15.61 ahead of time:

  commit 27f8f5219fe4658537ba28fd01657e1062ac3960 linux-5.15.y
  "drm/amdgpu: fix check in fbdev init"

Test case: Booting the affected kernel on a VM which passes through AMD GPU hardware.

Regression potential: Worst case some driver functions will not get disabled where they did before.

--- original description ---

The kernel 5.15 amdgpu module crashes on load with a “BUG: kernel NULL pointer dereference” on Amazon EC2 G4ad hardware (custom AMD Radeon V520 Pro datacenter GPU) on focal (HWE) and jammy with kernel 5.15.0-1011 and possibly earlier, up through latest (revision 1015.19). This crash bug did not exist in any of the focal HWE 5.13 kernels.

This is probably an upstream kernel bug, but I am also filing it here because existing focal users on EC2 will suddenly stop having access to their AMD GPUs after a reboot once the new 5.15 HWE kernel is installed.

The full backtrace from dmesg is below. The offending function call which crashes in the 5.15 kernel corresponds to this source (sorry, not the right source tree, but the same driver) https://github.com/torvalds/linux/blob/8bb7eca972ad531c9b149c0a51ab43a417385813/drivers/gpu/drm/amd/amdgpu/amdgpu_fb.c#L345

A workaround that I have discovered is adding “options amdgpu virtual_display=;” to a new modprobe.d configuration file - something which shouldn’t be required, but is at least harmless.

Here is the relevant BUG message and backtrace from dmesg:

[ 318.111721] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 318.115443] #PF: supervisor instruction fetch in kernel mode
[ 318.118443] #PF: error_code(0x0010) - not-present page
[ 318.121177] PGD 0 P4D 0
[ 318.122688] Oops: 0010 [#1] SMP NOPTI
[ 318.124592] CPU: 6 PID: 13667 Comm: modprobe Tainted: G W 5.15.0-1015-aws #19~20.04.1-Ubuntu
[ 318.129711] Hardware name: Amazon EC2 g4ad.2xlarge/, BIOS 1.0 10/16/2017
[ 318.133167] RIP: 0010:0x0
[ 318.134704] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[ 318.138291] RSP: 0018:ffff9841828d78e0 EFLAGS: 00010246
[ 318.140938] RAX: 0000000000000000 RBX: ffff8a4f16ae8000 RCX: 0000000000000001
[ 318.144604] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffff8a4f16ae8000
[ 318.148319] RBP: ffff9841828d7908 R08: ffff8a4f02460278 R09: ffff8a4f06422c40
[ 318.152151] R10: c01c42494f8affff R11: ffff8a4f01dcb5b8 R12: ffff8a4f024602e8
[ 318.155929] R13: ffffffffc107e4a0 R14: 0000000000000000 R15: ffff8a4f02460010
[ 318.159685] FS: 00007f8afdc1c740(0000) GS:ffff8a55f0980000(0000) knlGS:0000000000000000
[ 318.163897] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 318.167038] CR2: ffffffffffffffd6 CR3: 000000011817e000 CR4: 00000000003506e0
[ 318.170758] Call Trace:
[ 318.173964] <TASK>
[ 318.176822] __drm_helper_disable_unused_functions+0xe7/0x100 [drm_kms_helper]
[ 318.184230] drm_helper_disable_unused_functions+0x44/0x50 [drm_kms_helper]
[ 318.189761] amdgpu_fbdev_init+0x104/0x110 [amdgpu]
[ 318.194264] amdgpu_device_init.cold+0x7cc/0xc48 [amdgpu]
[ 318.199061] ? pci_read_config_byte+0x27/0x40
[ 318.203206] amdgpu_driver_load_kms+0x1e/0x270 [amdgpu]
[ 318.207901] amdgpu_pci_probe+0x1ea/0x290 [amdgpu]
[ 318.212445] local_pci_probe+0x4b/0x90
[ 318.216386] pci_device_probe+0x182/0x1f0
[ 318.220407] really_probe.part.0+0xcb/0x370
[ 318.224460] really_probe+0x40/0x80
[ 318.228232] __driver_probe_device+0x115/0x190
[ 318.232412] driver_probe_device+0x23/0xa0
[ 318.236436] __driver_attach+0xbd/0x160
[ 318.240348] ? __device_attach_driver+0x110/0x110
[ 318.244637] bus_for_each_dev+0x7e/0xc0
[ 318.248570] driver_attach+0x1e/0x20
[ 318.252474] bus_add_driver+0x161/0x200
[ 318.256412] driver_register+0x74/0xd0
[ 318.260332] __pci_register_driver+0x68/0x70
[ 318.264496] amdgpu_init+0x7c/0x1000 [amdgpu]
[ 318.268841] ? 0xffffffffc146e000
[ 318.272521] do_one_initcall+0x48/0x1d0
[ 318.276446] ? __cond_resched+0x19/0x30
[ 318.280378] ? kmem_cache_alloc_trace+0x15a/0x420
[ 318.284736] do_init_module+0x52/0x230
[ 318.288644] load_module+0x1372/0x1600
[ 318.292529] __do_sys_finit_module+0xbf/0x120
[ 318.296706] ? __do_sys_finit_module+0xbf/0x120
[ 318.300947] __x64_sys_finit_module+0x1a/0x20
[ 318.305265] do_syscall_64+0x5c/0xc0
[ 318.308984] ? do_syscall_64+0x69/0xc0
[ 318.312841] ? do_syscall_64+0x69/0xc0
[ 318.316693] ? do_syscall_64+0x69/0xc0
[ 318.320545] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 318.324975] RIP: 0033:0x7f8afdd6273d
[ 318.328700] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 23 37 0d 00 f7 d8 64 89 01 48
[ 318.343601] RSP: 002b:00007ffe48e5c8c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[ 318.351197] RAX: ffffffffffffffda RBX: 0000560c79e39260 RCX: 00007f8afdd6273d
[ 318.356678] RDX: 0000000000000000 RSI: 0000560c78e96358 RDI: 000000000000000f
[ 318.362340] RBP: 0000000000040000 R08: 0000000000000000 R09: 0000000000000000
[ 318.367743] R10: 000000000000000f R11: 0000000000000246 R12: 0000560c78e96358
[ 318.373241] R13: 0000000000000000 R14: 0000560c79e39390 R15: 0000560c79e39260
[ 318.378699] </TASK>
[ 318.381783] Modules linked in: amdgpu(+) iommu_v2 gpu_sched drm_ttm_helper ttm drm_kms_helper cec rc_core i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt btrfs blake2b_generic xor zstd_compress raid6_pq ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs libcrc32c nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ppdev crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel psmouse crypto_simd parport_pc input_leds parport cryptd ena serio_raw sch_fq_codel ipmi_devintf ipmi_msghandler msr drm ip_tables x_tables autofs4
[ 318.418449] CR2: 0000000000000000
[ 318.422130] ---[ end trace d6b9efffe55f5322 ]---
[ 318.426391] RIP: 0010:0x0
[ 318.429681] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[ 318.434994] RSP: 0018:ffff9841828d78e0 EFLAGS: 00010246
[ 318.439489] RAX: 0000000000000000 RBX: ffff8a4f16ae8000 RCX: 0000000000000001
[ 318.444896] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffff8a4f16ae8000
[ 318.450518] RBP: ffff9841828d7908 R08: ffff8a4f02460278 R09: ffff8a4f06422c40
[ 318.455937] R10: c01c42494f8affff R11: ffff8a4f01dcb5b8 R12: ffff8a4f024602e8
[ 318.461581] R13: ffffffffc107e4a0 R14: 0000000000000000 R15: ffff8a4f02460010
[ 318.466999] FS: 00007f8afdc1c740(0000) GS:ffff8a55f0980000(0000) knlGS:0000000000000000
[ 318.474744] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 318.479525] CR2: ffffffffffffffd6 CR3: 000000011817e000 CR4: 00000000003506e0
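
For readers who want to triage a similar dmesg capture, here is a minimal Python sketch (not part of the original report) that looks for the NULL pointer BUG line and lists the reliable frames of the first call trace along with their module tags. The function name `summarize_oops` and the `SAMPLE` excerpt are illustrative assumptions; the sample lines are copied from the log above.

```python
import re

# Matches a call-trace frame such as
#   "[ 318.189761] amdgpu_fbdev_init+0x104/0x110 [amdgpu]"
# Frames prefixed with "? " (unreliable) do not match and are skipped.
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?P<sym>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<mod>\w+)\])?\s*$"
)


def summarize_oops(dmesg_text):
    """Return (found_null_deref, frames) for the first call trace.

    frames is a list of (symbol, module_or_None) tuples, innermost first.
    """
    found = False
    in_trace = False
    frames = []
    for line in dmesg_text.splitlines():
        if "BUG: kernel NULL pointer dereference" in line:
            found = True
        elif "Call Trace:" in line:
            in_trace = True
        elif "</TASK>" in line:
            break  # end of the first trace
        elif in_trace:
            m = FRAME_RE.match(line)
            if m:
                frames.append((m.group("sym"), m.group("mod")))
    return found, frames


# Excerpt of the log in this report, used only as sample input.
SAMPLE = """\
[ 318.111721] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 318.170758] Call Trace:
[ 318.173964] <TASK>
[ 318.176822] __drm_helper_disable_unused_functions+0xe7/0x100 [drm_kms_helper]
[ 318.184230] drm_helper_disable_unused_functions+0x44/0x50 [drm_kms_helper]
[ 318.189761] amdgpu_fbdev_init+0x104/0x110 [amdgpu]
[ 318.199061] ? pci_read_config_byte+0x27/0x40
[ 318.212445] local_pci_probe+0x4b/0x90
[ 318.378699] </TASK>
"""

if __name__ == "__main__":
    found, frames = summarize_oops(SAMPLE)
    print("NULL deref:", found)
    for sym, mod in frames:
        print(f"  {sym} [{mod or 'core kernel'}]")
```

In this trace the first frame tagged `[amdgpu]` is `amdgpu_fbdev_init`, which is the caller that the rest of this report focuses on.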

Henry Goffin (amzn-hgoffin) wrote :

Before this crash occurs, the kernel also fires several warnings with backtraces:

[ 318.108644] WARNING: CPU: 6 PID: 13667 at drivers/gpu/drm/drm_crtc_helper.c:221 drm_helper_disable_unused_functions+0x32/0x50 [drm_kms_helper]

[ 318.109727] WARNING: CPU: 6 PID: 13667 at drivers/gpu/drm/drm_crtc_helper.c:101 drm_helper_encoder_in_use+0x4d/0xe0 [drm_kms_helper]

[ 318.110742] WARNING: CPU: 6 PID: 13667 at drivers/gpu/drm/drm_crtc_helper.c:141 drm_helper_crtc_in_use+0x3c/0xb0 [drm_kms_helper]

All of these warnings are the same code check, which warns when the driver reports atomic modesetting but calls legacy modesetting functions. Beyond that summary, I am way out of my depth, and someone from AMD will probably have to untangle this for 5.15 (later in 5.17 this entire custom fbdev implementation was removed and replaced with common code).

affects: linux-meta-aws-5.15 (Ubuntu) → linux-aws-5.15 (Ubuntu)
Henry Goffin (amzn-hgoffin) wrote :

Alexander Deucher at AMD came up with a quick fix for the 5.15 branch, which I was able to test and verify - amdgpu_fb.c, line 344:

- if (!amdgpu_device_has_dc_support(adev) && !amdgpu_virtual_display)
+ if (!amdgpu_device_has_dc_support(adev) && !amdgpu_virtual_display && !amdgpu_sriov_vf(adev))

With this change, the amdgpu module loads and runs correctly on AWS EC2 G4ad instances. The change only affects datacenter cards with SR-IOV support and not consumer cards.
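
To see exactly which configurations the one-line change affects, the old and new conditions can be compared as a truth table. The following is a toy Python model, not kernel code. The three boolean parameters are hypothetical stand-ins for `amdgpu_device_has_dc_support(adev)`, a non-NULL `amdgpu_virtual_display` module parameter, and `amdgpu_sriov_vf(adev)`. It assumes, based on the backtrace, that a true condition leads to the `drm_helper_disable_unused_functions` call.

```python
from itertools import product


def old_check(has_dc_support, virtual_display_param_set, is_sriov_vf):
    """Condition before the fix; is_sriov_vf is not consulted."""
    return (not has_dc_support) and (not virtual_display_param_set)


def new_check(has_dc_support, virtual_display_param_set, is_sriov_vf):
    """Condition after the fix: SR-IOV virtual functions are excluded."""
    return (
        (not has_dc_support)
        and (not virtual_display_param_set)
        and (not is_sriov_vf)
    )


def changed_cases():
    """All (has_dc_support, virtual_display_param_set, is_sriov_vf)
    combinations where the two conditions disagree."""
    return [
        case
        for case in product([False, True], repeat=3)
        if old_check(*case) != new_check(*case)
    ]


if __name__ == "__main__":
    for case in changed_cases():
        print(dict(zip(
            ("has_dc_support", "virtual_display_param_set", "is_sriov_vf"),
            case,
        )))
```

The two conditions disagree in only one case: no DC support, no `virtual_display` parameter, and an SR-IOV virtual function. This matches the statement that only SR-IOV capable cards are affected. In this model the `virtual_display` workaround from the description has the same effect, since setting the parameter makes the second term false.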

Henry Goffin (amzn-hgoffin) wrote :

Patch now posted upstream for linux stable kernel 5.15

https://lore<email address hidden>/T/#u

Stefan Bader (smb)
affects: linux-aws-5.15 (Ubuntu) → linux (Ubuntu)
Changed in linux (Ubuntu):
status: New → Confirmed
Changed in linux (Ubuntu Jammy):
importance: Undecided → High
status: New → Confirmed
Changed in linux (Ubuntu):
status: Confirmed → Invalid
Stefan Bader (smb)
description: updated
Changed in linux (Ubuntu Jammy):
status: Confirmed → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/5.15.0-50.56 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

Henry Goffin (amzn-hgoffin) wrote :

Sorry I am late to the party on confirming the fix, was traveling. Confirming that everything is working properly with the latest jammy-proposed kernel.

tested package: linux-image-5.15.0-1021-aws 5.15.0-1021.25

ubuntu@ip-172-31-31-116:~$ sudo dmesg | grep 5.15.60
[ 0.000000] Linux version 5.15.0-1021-aws (buildd@lcy02-amd64-074) (gcc (Ubuntu 11.2.0-19ubuntu1) 11.2.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #25-Ubuntu SMP Fri Sep 23 12:20:42 UTC 2022 (Ubuntu 5.15.0-1021.25-aws 5.15.60)
ubuntu@ip-172-31-31-116:~$ sudo dmesg | grep "zed amd"
[ 5.170324] [drm] Initialized amdgpu 3.42.0 20150101 for 0000:00:1e.0 on minor 0
ubuntu@ip-172-31-31-116:~$ uname -a
Linux ip-172-31-31-116 5.15.0-1021-aws #25-Ubuntu SMP Fri Sep 23 12:20:42 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
ubuntu@ip-172-31-31-116:~$ ls /dev/dri
by-path card0 renderD128
ubuntu@ip-172-31-31-116:~$ apt list | grep linux- | grep installed | grep proposed

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

binutils-x86-64-linux-gnu/jammy-proposed,now 2.38-4ubuntu2 amd64 [installed,automatic]
linux-aws-headers-5.15.0-1021/jammy-proposed,now 5.15.0-1021.25 all [installed,automatic]
linux-aws/jammy-proposed,now 5.15.0.1021.21 amd64 [installed]
linux-firmware/jammy-proposed,now 20220329.git681281e4-0ubuntu3.6 all [installed]
linux-headers-5.15.0-1021-aws/jammy-proposed,now 5.15.0-1021.25 amd64 [installed,automatic]
linux-headers-aws/jammy-proposed,now 5.15.0.1021.21 amd64 [installed]
linux-image-5.15.0-1021-aws/jammy-proposed,now 5.15.0-1021.25 amd64 [installed,automatic]
linux-image-aws/jammy-proposed,now 5.15.0.1021.21 amd64 [installed]
linux-modules-5.15.0-1021-aws/jammy-proposed,now 5.15.0-1021.25 amd64 [installed,automatic]
linux-modules-extra-5.15.0-1021-aws/jammy-proposed,now 5.15.0-1021.25 amd64 [installed,automatic]
linux-modules-extra-aws/jammy-proposed,now 5.15.0.1021.21 amd64 [installed]
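
The manual checks in the session above (the driver's init line in dmesg, the device nodes under /dev/dri) could be scripted roughly as follows. This is a hedged Python sketch rather than an official verification procedure. The function name `check_amdgpu_loaded` and the result keys are made up for illustration, and the regular expression is modelled only on the single init line shown above.

```python
import re

# Modelled on the line from this report:
#   "[drm] Initialized amdgpu 3.42.0 20150101 for 0000:00:1e.0 on minor 0"
INIT_RE = re.compile(
    r"\[drm\] Initialized amdgpu (?P<version>\d+\.\d+\.\d+) \d+ "
    r"for (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) "
    r"on minor (?P<minor>\d+)"
)


def check_amdgpu_loaded(dmesg_text, dri_entries):
    """Summarize whether amdgpu appears to have initialized cleanly.

    dmesg_text  -- captured kernel log as one string
    dri_entries -- names found in /dev/dri, e.g. from os.listdir()
    """
    m = INIT_RE.search(dmesg_text)
    return {
        "initialized": m is not None,
        "driver_version": m.group("version") if m else None,
        "pci_address": m.group("pci") if m else None,
        "null_deref": "BUG: kernel NULL pointer dereference" in dmesg_text,
        "has_card": any(e.startswith("card") for e in dri_entries),
        "has_render_node": any(e.startswith("renderD") for e in dri_entries),
    }


if __name__ == "__main__":
    log = ("[ 5.170324] [drm] Initialized amdgpu 3.42.0 20150101 "
           "for 0000:00:1e.0 on minor 0")
    print(check_amdgpu_loaded(log, ["by-path", "card0", "renderD128"]))
```

On a live system the inputs would come from the output of `dmesg` and a directory listing of /dev/dri. The sketch takes them as plain arguments so it can be run against saved logs as well.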

tags: added: verification-done-jammy
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.15.0-50.56

---------------
linux (5.15.0-50.56) jammy; urgency=medium

  * jammy/linux: 5.15.0-50.56 -proposed tracker (LP: #1990148)

  * CVE-2022-3176
    - io_uring: refactor poll update
    - io_uring: move common poll bits
    - io_uring: kill poll linking optimisation
    - io_uring: inline io_poll_complete
    - io_uring: correct fill events helpers types
    - io_uring: clean cqe filling functions
    - io_uring: poll rework
    - io_uring: remove poll entry from list when canceling all
    - io_uring: bump poll refs to full 31-bits
    - io_uring: fail links when poll fails
    - io_uring: fix wrong arm_poll error handling
    - io_uring: fix UAF due to missing POLLFREE handling

  * ip/nexthop: fix default address selection for connected nexthop
    (LP: #1988809)
    - selftests/net: test nexthop without gw

  * ip/nexthop: fix default address selection for connected nexthop
    (LP: #1988809) // icmp_redirect.sh in ubuntu_kernel_selftests failed on
    Jammy 5.15.0-49.55 (LP: #1990124)
    - ip: fix triggering of 'icmp redirect'

linux (5.15.0-49.55) jammy; urgency=medium

  * jammy/linux: 5.15.0-49.55 -proposed tracker (LP: #1989785)

  * amdgpu module crash after 5.15 kernel update (LP: #1981883)
    - drm/amdgpu: fix check in fbdev init

  * scsi: hisi_sas: Increase debugfs_dump_index after dump is completed
    (LP: #1982070)
    - scsi: hisi_sas: Increase debugfs_dump_index after dump is completed

  * [UBUNTU 22.04] s390/qeth: cache link_info for ethtool (LP: #1984103)
    - s390/qeth: cache link_info for ethtool

  * WARN in trace_event_dyn_put_ref (LP: #1987232)
    - tracing/perf: Fix double put of trace event when init fails

  * Jammy update: v5.15.60 upstream stable release (LP: #1989221)
    - x86/speculation: Make all RETbleed mitigations 64-bit only
    - selftests/bpf: Extend verifier and bpf_sock tests for dst_port loads
    - selftests/bpf: Check dst_port only on the client socket
    - block: fix default IO priority handling again
    - tools/vm/slabinfo: Handle files in debugfs
    - ACPI: video: Force backlight native for some TongFang devices
    - ACPI: video: Shortening quirk list by identifying Clevo by board_name only
    - ACPI: APEI: Better fix to avoid spamming the console with old error logs
    - crypto: arm64/poly1305 - fix a read out-of-bound
    - KVM: x86: do not report a vCPU as preempted outside instruction boundaries
    - KVM: x86: do not set st->preempted when going back to user space
    - KVM: selftests: Make hyperv_clock selftest more stable
    - tools/kvm_stat: fix display of error when multiple processes are found
    - selftests: KVM: Handle compiler optimizations in ucall
    - KVM: x86/svm: add __GFP_ACCOUNT to __sev_dbg_{en,de}crypt_user()
    - arm64: set UXN on swapper page tables
    - btrfs: zoned: prevent allocation from previous data relocation BG
    - btrfs: zoned: fix critical section of relocation inode writeback
    - Bluetooth: hci_bcm: Add BCM4349B1 variant
    - Bluetooth: hci_bcm: Add DT compatible for CYW55572
    - dt-bindings: bluetooth: broadcom: Add BCM4349B1 DT binding
    - Bluetooth: btusb: Add support of IMC Netw...

Changed in linux (Ubuntu Jammy):
status: Fix Committed → Fix Released
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-gkeop-5.15/5.15.0-1005.7~20.04.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-bluefield/5.15.0-1010.12 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-bluefield verification-needed-jammy
removed: verification-done-jammy
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-nvidia/5.15.0-1011.11 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-nvidia
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-mtk/5.15.0-1030.34 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-mtk' to 'verification-done-jammy-linux-mtk'. If the problem still exists, change the tag 'verification-needed-jammy-linux-mtk' to 'verification-failed-jammy-linux-mtk'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-mtk-v2 verification-needed-jammy-linux-mtk