linux-nvidia-6.2 on DGX servers: "WARNING: CPU: 0 PID: 0 at init/main.c:1065 start_kernel+0x4da/0x540"

Bug #2026891 reported by Francis Ginther
This bug affects 1 person
Affects                   | Status       | Importance | Assigned to | Milestone
linux-nvidia-6.2 (Ubuntu) | Fix Released | Undecided  | Tushar Dave |

Bug Description

We started testing the jammy/linux-nvidia-6.2 kernels on the NVIDIA servers (DGX-1/DGX-2/H100) and hit the following warning during boot:

[ 7.690486] ------------[ cut here ]------------
[ 7.690487] Interrupts were enabled early
[ 7.690490] WARNING: CPU: 0 PID: 0 at init/main.c:1065 start_kernel+0x4da/0x540
[ 7.690498] Modules linked in:
[ 7.690501] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.2.0-1004-nvidia #4~22.04.1-Ubuntu
[ 7.690504] Hardware name: NVIDIA NVIDIA DGX-2/NVIDIA DGX-2, BIOS 0.29 06/07/2021
[ 7.690505] RIP: 0010:start_kernel+0x4da/0x540
[ 7.690508] Code: ff 48 c7 c7 e8 26 f0 97 e8 b3 59 a8 fd 0f 0b e9 96 fd ff ff e8 a7 1d 04 00 e9 7c fe ff ff 48 c7 c7 18 27 f0 97 e8 96 59 a8 fd <0f> 0b e9 ed fd ff ff 48 c7 c7 b0 26 f0 97 e8 83 59 a8 fd 0f 0b ff
[ 7.690510] RSP: 0000:ffffffff98803f08 EFLAGS: 00010246
[ 7.690512] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 7.690513] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 7.690514] RBP: ffffffff98803f20 R08: 0000000000000000 R09: 0000000000000000
[ 7.690515] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000000e0
[ 7.690516] R13: 000000005a1ccde0 R14: 000000005a1c7469 R15: 000000005a1d7ee0
[ 7.690518] FS: 0000000000000000(0000) GS:ffff964900600000(0000) knlGS:0000000000000000
[ 7.690520] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7.690521] CR2: ffff970bfffff000 CR3: 000000ecd7810001 CR4: 00000000000606f0
[ 7.690522] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 7.690523] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 7.690524] Call Trace:
[ 7.690526] <TASK>
[ 7.690529] x86_64_start_kernel+0x102/0x180
[ 7.690536] secondary_startup_64_no_verify+0xe5/0xeb
[ 7.690544] </TASK>
[ 7.690544] ---[ end trace 0000000000000000 ]---

I also see pretty much the same thing on some Ampere-based arm64 servers:

[ 0.000519] ------------[ cut here ]------------
[ 0.000521] Interrupts were enabled early
[ 0.000525] WARNING: CPU: 0 PID: 0 at init/main.c:1065 start_kernel+0x3ac/0x514
[ 0.000531] Modules linked in:
[ 0.000535] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.2.0-1004-nvidia #4~22.04.1-Ubuntu
[ 0.000538] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 0.000540] pc : start_kernel+0x3ac/0x514
[ 0.000543] lr : start_kernel+0x3ac/0x514
[ 0.000545] sp : ffffdec5ff733e60
[ 0.000546] x29: ffffdec5ff733e60 x28: 00000819aa09baac x27: 0000403ffdd124e0
[ 0.000549] x26: 00000000bfdf3788 x25: 000000009b6fc000 x24: 00000000001dba7b
[ 0.000552] x23: 00005ec57c980000 x22: 00000819ab2a0000 x21: ffffdec5ff749140
[ 0.000555] x20: ffffdec5ff73d9c0 x19: ffffdec5ffbe4000 x18: ffffdec5ff74a1c8
[ 0.000558] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 0.000560] x14: 0000000000000000 x13: 0a796c7261652064 x12: 656c62616e652065
[ 0.000563] x11: 656820747563205b x10: 2d2d2d2d2d2d2d2d x9 : 0000000000000000
[ 0.000565] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
[ 0.000568] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[ 0.000571] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000
[ 0.000573] Call trace:
[ 0.000574] start_kernel+0x3ac/0x514
[ 0.000577] __primary_switched+0xc0/0xc8
[ 0.000580] ---[ end trace 0000000000000000 ]---

The warning does not appear on an older ThunderX2 server.
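
Checking several machines for this warning by eye gets tedious, so here is a minimal Python sketch that scans captured dmesg text for it. It relies only on the two log lines shown in the traces above (the "Interrupts were enabled early" message followed directly by the WARNING line); the function name `find_early_irq_warnings` and the idea of reading from a saved log file are assumptions for illustration, not part of any existing tool.

```python
import re

# Matches the WARNING line printed by the kernel, e.g.
# "WARNING: CPU: 0 PID: 0 at init/main.c:1065 start_kernel+0x4da/0x540"
WARNING_RE = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) at "
    r"(?P<file>\S+):(?P<line>\d+) (?P<symbol>\S+)"
)
MESSAGE = "Interrupts were enabled early"


def find_early_irq_warnings(log_text):
    """Return one dict per 'Interrupts were enabled early' warning in log_text.

    Hypothetical helper: it only reports a WARNING line when the line
    immediately before it carries the message text.
    """
    hits = []
    saw_message = False
    for line in log_text.splitlines():
        if MESSAGE in line:
            saw_message = True
            continue
        match = WARNING_RE.search(line)
        if match and saw_message:
            info = match.groupdict()
            info["line"] = int(info["line"])  # source line number as an int
            hits.append(info)
        saw_message = False
    return hits
```

With a saved log (the file name `dmesg.txt` is a placeholder), `find_early_irq_warnings(open("dmesg.txt").read())` returns an empty list on a clean boot and one entry per warning otherwise.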

Changed in linux-nvidia-6.2 (Ubuntu):
assignee: nobody → Tushar Dave (tdavenvidia)
Francis Ginther (fginther) wrote :

I ran through several kernels on our DGX-2 server; only the latest 6.2.0-1004-nvidia kernel emitted the warning. Here are all the kernels I tried:

Lunar 6.2.0-24.24 generic - PASS
Jammy 5.15.0-1028-nvidia - PASS
Jammy 5.19.0-46-generic - PASS
Jammy 5.19.0-1014-nvidia - PASS
Jammy 6.2.0-25-generic - PASS
Jammy 6.2.0-1003-nvidia - PASS
Jammy 6.2.0-1004-nvidia - FAIL

Tushar Dave (tdavenvidia) wrote :

Yeah, I suspect that. There are a couple of irq patches in 6.2.0-1004-nvidia that could be the cause.
I will update here shortly!

Tushar Dave (tdavenvidia) wrote :

Can you try the commit below from Linus's "linux" tree and see if the warning goes away?

commit f5451547b8310868f5b5acff7cd4aa7c0267edb3
Author: Thomas Gleixner <email address hidden>
Date: Tue Feb 7 15:16:53 2023 +0100

    mm, slab/slub: Ensure kmem_cache_alloc_bulk() is available early

Francis Ginther (fginther) wrote :

I built and tested a 6.2.0-1004-nvidia based kernel with this patch applied and did not see the warning message on boot. I'll follow up further with Ian on Monday.

Tushar Dave (tdavenvidia) wrote :

Thanks. I see the same behavior (i.e. no warning) with the patch.
I will add commit f5451547b8310868f5b5acff7cd4aa7c0267edb3 to linux-nvidia-6.2 then.
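
The thread does not spell out why this commit silences the warning. One plausible reading, going only by the commit title, is that a bulk allocation path re-enabled interrupts unconditionally when it was called before start_kernel() had enabled them itself. The toy Python model below illustrates that general pattern and nothing more: `ToyCpu`, both method names, and `boot` are hypothetical, and this is not the kernel code or a description of the actual diff.

```python
class ToyCpu:
    """Toy stand-in for one CPU's interrupt-enable flag (not kernel code)."""

    def __init__(self):
        # Early boot: interrupts start out disabled.
        self.irqs_enabled = False

    def bulk_alloc_unconditional(self):
        # Assumed problematic shape: disable, do the work, then always enable,
        # regardless of what state the caller was in.
        self.irqs_enabled = False
        # ... allocation work would happen here ...
        self.irqs_enabled = True

    def bulk_alloc_save_restore(self):
        # Assumed safe shape: remember the caller's state and restore it.
        saved = self.irqs_enabled
        self.irqs_enabled = False
        # ... allocation work would happen here ...
        self.irqs_enabled = saved


def boot(alloc_name):
    """Run one early allocation, then report whether the boot check would warn."""
    cpu = ToyCpu()
    getattr(cpu, alloc_name)()
    # Models the check behind "Interrupts were enabled early": interrupts
    # are expected to still be off at this point in boot.
    return cpu.irqs_enabled
```

In this model `boot("bulk_alloc_unconditional")` reports a warning and `boot("bulk_alloc_save_restore")` does not, which matches the before and after behavior observed with the patch.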

Brad Figg (brad-figg) wrote :

The following changes since commit 3d28f6c10d6940b0c6a497482fe90cc4dbd5549a:

  UBUNTU: Ubuntu-nvidia-6.2-6.2.0-1004.4~22.04.1 (2023-07-03 10:01:31 -0700)

are available in the Git repository at:

  https://github.com/NVIDIA-BaseOS-6/linux-nvidia-6.2/pull/new/bfigg-lp2026891

for you to fetch changes up to 8029e7fc883e8a86076e1bb4379f8d6d0236ab97:

  mm, slab/slub: Ensure kmem_cache_alloc_bulk() is available early (2023-07-15 19:30:31 -0700)

----------------------------------------------------------------
Thomas Gleixner (1):
      mm, slab/slub: Ensure kmem_cache_alloc_bulk() is available early

 mm/slab.c | 18 ++++++++++--------
 mm/slub.c | 9 +++++----
 2 files changed, 15 insertions(+), 12 deletions(-)

dann frazier (dannf)
Changed in linux-nvidia-6.2 (Ubuntu):
status: New → In Progress
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-nvidia-6.2/6.2.0-1006.6~22.04.2 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!
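
Before running the verification it helps to confirm that the machine really booted the -proposed kernel. The Python sketch below compares a kernel release string against a minimum. It assumes the running release looks like `6.2.0-1006-nvidia`, a format inferred from the `6.2.0-1004-nvidia` string in the traces above; `parse_release` and `is_at_least` are hypothetical helper names.

```python
import re

# Assumed release format: <major>.<minor>.<patch>-<abi>-<flavour>
RELEASE_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d+)-(\S+)$")


def parse_release(release):
    """Split a release string into ((major, minor, patch, abi), flavour)."""
    match = RELEASE_RE.match(release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    numbers = tuple(int(group) for group in match.groups()[:4])
    return numbers, match.group(5)


def is_at_least(release, minimum="6.2.0-1006-nvidia"):
    """True if release has the same flavour as minimum and is not older."""
    version, flavour = parse_release(release)
    min_version, min_flavour = parse_release(minimum)
    return flavour == min_flavour and version >= min_version
```

On the test machine, `is_at_least(platform.release())` (after `import platform`) should be true before the boot log is checked for the warning.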

tags: added: kernel-spammed-jammy-linux-nvidia-6.2 verification-needed-jammy
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-nvidia-6.2 - 6.2.0-1009.9

---------------
linux-nvidia-6.2 (6.2.0-1009.9) jammy; urgency=medium

  * jammy/linux-nvidia-6.2: 6.2.0-1009.9 -proposed tracker (LP: #2031342)

  * Pull-request to address ARM SMMU issue (LP: #2031320)
    - NVIDIA: SAUCE: iommu/arm-smmu-v3: Allow default substream bypass with a
      pasid support

  * GDS: Add NFS patches to optimized kernel (LP: #1982519)
    - NVMe/NVMeOF: Patch NVMe/NVMeOF driver to support GDS on Linux 6.2 Kernel

  * Miscellaneous upstream changes
    - Revert "NVIDIA: SAUCE: Add NVMe Patches to enable GDS"
    - NVIDIA: [Config] CONFIG_NR_CPUS=512 for Grace
    - NVIDIA: [Config] CONFIG_MTD_SPI_NOR=y for Grace

 -- Ian May <email address hidden> Mon, 14 Aug 2023 18:45:28 -0500

Changed in linux-nvidia-6.2 (Ubuntu):
status: In Progress → Fix Released