linux-nvidia-6.2 on DGX servers: "WARNING: CPU: 0 PID: 0 at init/main.c:1065 start_kernel+0x4da/0x540"
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux-nvidia-6.2 (Ubuntu) | Fix Released | Undecided | Tushar Dave |
Bug Description
We started testing the jammy/linux-
[ 7.690486] ------------[ cut here ]------------
[ 7.690487] Interrupts were enabled early
[ 7.690490] WARNING: CPU: 0 PID: 0 at init/main.c:1065 start_kernel+
[ 7.690498] Modules linked in:
[ 7.690501] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.2.0-1004-nvidia #4~22.04.1-Ubuntu
[ 7.690504] Hardware name: NVIDIA NVIDIA DGX-2/NVIDIA DGX-2, BIOS 0.29 06/07/2021
[ 7.690505] RIP: 0010:start_
[ 7.690508] Code: ff 48 c7 c7 e8 26 f0 97 e8 b3 59 a8 fd 0f 0b e9 96 fd ff ff e8 a7 1d 04 00 e9 7c fe ff ff 48 c7 c7 18 27 f0 97 e8 96 59 a8 fd <0f> 0b e9 ed fd ff ff 48 c7 c7 b0 26 f0 97 e8 83 59 a8 fd 0f 0b ff
[ 7.690510] RSP: 0000:ffffffff98
[ 7.690512] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 7.690513] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 7.690514] RBP: ffffffff98803f20 R08: 0000000000000000 R09: 0000000000000000
[ 7.690515] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000000e0
[ 7.690516] R13: 000000005a1ccde0 R14: 000000005a1c7469 R15: 000000005a1d7ee0
[ 7.690518] FS: 000000000000000
[ 7.690520] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7.690521] CR2: ffff970bfffff000 CR3: 000000ecd7810001 CR4: 00000000000606f0
[ 7.690522] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 7.690523] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 7.690524] Call Trace:
[ 7.690526] <TASK>
[ 7.690529] x86_64_
[ 7.690536] secondary_
[ 7.690544] </TASK>
[ 7.690544] ---[ end trace 0000000000000000 ]---
I also see pretty much the same thing on some Ampere-based arm64 servers:
[ 0.000519] ------------[ cut here ]------------
[ 0.000521] Interrupts were enabled early
[ 0.000525] WARNING: CPU: 0 PID: 0 at init/main.c:1065 start_kernel+
[ 0.000531] Modules linked in:
[ 0.000535] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.2.0-1004-nvidia #4~22.04.1-Ubuntu
[ 0.000538] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 0.000540] pc : start_kernel+
[ 0.000543] lr : start_kernel+
[ 0.000545] sp : ffffdec5ff733e60
[ 0.000546] x29: ffffdec5ff733e60 x28: 00000819aa09baac x27: 0000403ffdd124e0
[ 0.000549] x26: 00000000bfdf3788 x25: 000000009b6fc000 x24: 00000000001dba7b
[ 0.000552] x23: 00005ec57c980000 x22: 00000819ab2a0000 x21: ffffdec5ff749140
[ 0.000555] x20: ffffdec5ff73d9c0 x19: ffffdec5ffbe4000 x18: ffffdec5ff74a1c8
[ 0.000558] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 0.000560] x14: 0000000000000000 x13: 0a796c7261652064 x12: 656c62616e652065
[ 0.000563] x11: 656820747563205b x10: 2d2d2d2d2d2d2d2d x9 : 0000000000000000
[ 0.000565] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
[ 0.000568] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[ 0.000571] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000
[ 0.000573] Call trace:
[ 0.000574] start_kernel+
[ 0.000577] __primary_
[ 0.000580] ---[ end trace 0000000000000000 ]---
The warning does not appear on an older ThunderX2 server.
Changed in linux-nvidia-6.2 (Ubuntu):
assignee: nobody → Tushar Dave (tdavenvidia)
Changed in linux-nvidia-6.2 (Ubuntu):
status: New → In Progress
I ran through several kernels on our DGX-2 server; only the latest 6.2.0-1004-nvidia kernel emitted the warning. Here are all the kernels I tried:
Lunar 6.2.0-24.24 generic - PASS
Jammy 5.15.0-1028-nvidia - PASS
Jammy 5.19.0-46-generic - PASS
Jammy 5.19.0-1014-nvidia - PASS
Jammy 6.2.0-25-generic - PASS
Jammy 6.2.0-1003-nvidia - PASS
Jammy 6.2.0-1004-nvidia - FAIL
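For anyone repeating this sweep on other machines, the PASS/FAIL check can be scripted instead of read off the console. The following is only a minimal sketch, not what was actually used for the results above: it assumes the kernel log for each boot has already been saved as text (for example with `dmesg`), and the names `has_early_irq_warning` and `classify` are hypothetical helpers made up for illustration.

```python
import re

# Matches the warning message seen in the traces above, with or without a
# leading dmesg timestamp such as "[ 7.690487]".
EARLY_IRQ_WARNING = re.compile(
    r"^(?:\[\s*\d+\.\d+\]\s*)?Interrupts were enabled early\s*$",
    re.MULTILINE,
)


def has_early_irq_warning(dmesg_text: str) -> bool:
    """Return True if the saved kernel log contains the early-IRQ warning."""
    return EARLY_IRQ_WARNING.search(dmesg_text) is not None


def classify(dmesg_by_kernel: dict[str, str]) -> dict[str, str]:
    """Map each kernel label to "FAIL" (warning present) or "PASS"."""
    return {
        kernel: "FAIL" if has_early_irq_warning(text) else "PASS"
        for kernel, text in dmesg_by_kernel.items()
    }


if __name__ == "__main__":
    import sys

    # Hypothetical usage: pass one saved log file per kernel on the command
    # line; the file name is used as the kernel label.
    logs = {}
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as fh:
            logs[path] = fh.read()
    for kernel, verdict in classify(logs).items():
        print(f"{kernel} - {verdict}")
```

The check matches only the message line itself, so it does not depend on the source line number or the `start_kernel` offset, which may differ between builds.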