Pull-request: Apply mm/mglru patches to fix soft lockup

Bug #2055060 reported by Brad Figg
This bug affects 1 person
Affects                    Status        Importance  Assigned to  Milestone
linux-nvidia-6.5 (Ubuntu)  Fix Released  Undecided   Unassigned

Bug Description

[ 1918.995157] watchdog: BUG: soft lockup - CPU#0 stuck for 1725s! [kswapd0:42]
[ 1919.002366] Modules linked in: raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor xor xor_neon async_tx raid6_pq raid1 raid0 multipath linear scsi_dh_alua scsi_dh_emc scsi_dh_rdac nvme nvme_core nvme_common
[ 1919.023319] CPU: 0 PID: 42 Comm: kswapd0 Tainted: G L 6.5.0-1011-nvidia #11-Ubuntu
[ 1919.032483] Hardware name: NVIDIA Grace Hopper x4 P4496/UT2.1 DP Chassis, BIOS 01.02.00 20240120
[ 1919.042180] pstate: 83400009 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 1919.049300] pc : __rcu_read_unlock+0x10/0x70
[ 1919.053666] lr : shrink_many+0x280/0x468
[ 1919.057675] sp : ffff80008149bb70
[ 1919.061060] x29: ffff80008149bb70 x28: ffff00003ddfa600 x27: ffffdeee387571e0
[ 1919.068366] x26: ffffdeee383154f8 x25: 0000000000000001 x24: ffff301c90bf4400
[ 1919.075671] x23: ffffffffffffffff x22: 0000000000000000 x21: ffff301c90bf4400
[ 1919.082975] x20: 00000000000002d3 x19: ffff80008149bd68 x18: 0000000000000000
[ 1919.090281] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 1919.097585] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 1919.104890] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffdeee355c2008
[ 1919.112195] x8 : ffff80008149bde0 x7 : 0000000000000000 x6 : 0000000000000020
[ 1919.119500] x5 : 0000000000000001 x4 : 0000000000000000 x3 : 0000000000000000
[ 1919.126804] x2 : 0000000000000000 x1 : 0000000000000001 x0 : ffff301c90bf4400
[ 1919.134109] Call trace:
[ 1919.136606] __rcu_read_unlock+0x10/0x70
[ 1919.140615] lru_gen_shrink_node+0x180/0x218
[ 1919.144979] shrink_node+0x400/0x470
[ 1919.148633] balance_pgdat+0x2c8/0x810
[ 1919.152464] kswapd+0x12c/0x268
[ 1919.155672] kthread+0x104/0x110
[ 1919.158970] ret_from_fork+0x10/0x20
[ 1942.995157] watchdog: BUG: soft lockup - CPU#0 stuck for 1747s! [kswapd0:42]
[ 1943.002366] Modules linked in: raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor xor xor_neon async_tx raid6_pq raid1 raid0 multipath linear scsi_dh_alua scsi_dh_emc scsi_dh_rdac nvme nvme_core nvme_common
[ 1943.023319] CPU: 0 PID: 42 Comm: kswapd0 Tainted: G L 6.5.0-1011-nvidia #11-Ubuntu
[ 1943.032483] Hardware name: NVIDIA Grace Hopper x4 P4496/UT2.1 DP Chassis, BIOS 01.02.00 20240120
[ 1943.042180] pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 1943.049300] pc : lru_gen_shrink_node+0x60/0x218
[ 1943.053932] lr : lru_gen_shrink_node+0x1b8/0x218
[ 1943.058651] sp : ffff80008149bbf0
[ 1943.062037] x29: ffff80008149bbf0 x28: ffff00003ddfa600 x27: ffffdeee387571e0
[ 1943.069342] x26: ffffdeee383154f8 x25: ffff80008149bde0 x24: 00000000000007c0
[ 1943.076647] x23: 0000000000000001 x22: 0000000000000000 x21: ffff00003ddfe600
[ 1943.083952] x20: ffff00003ddfa600 x19: ffff80008149bd68 x18: 0000000000000000
[ 1943.091257] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 1943.098562] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 1943.105867] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffdeee359e4bb8
[ 1943.113172] x8 : ffff80008149bde0 x7 : 0000000000000000 x6 : 0000000000000000
[ 1943.120477] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[ 1943.127782] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff301c90bf4400
[ 1943.135086] Call trace:
[ 1943.137583] lru_gen_shrink_node+0x60/0x218
[ 1943.141858] shrink_node+0x400/0x470
[ 1943.145511] balance_pgdat+0x2c8/0x810
[ 1943.149342] kswapd+0x12c/0x268
[ 1943.152551] kthread+0x104/0x110
[ 1943.155849] ret_from_fork+0x10/0x20
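
When checking a long console log for repeats of the reports above, it can help to pull out just the watchdog lines. The following is a minimal sketch, not part of the original report: the function name `find_soft_lockups` and the regular expression are illustrative assumptions, written only against the message format shown in this bug.

```python
import re

# Assumed format, taken from the log lines in this report, e.g.:
# [ 1918.995157] watchdog: BUG: soft lockup - CPU#0 stuck for 1725s! [kswapd0:42]
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(log_text):
    """Return one dict per soft-lockup report found in log_text (hypothetical helper)."""
    reports = []
    for line in log_text.splitlines():
        m = SOFT_LOCKUP.search(line)
        if m:
            reports.append({
                "timestamp": float(m.group("ts")),
                "cpu": int(m.group("cpu")),
                "stuck_seconds": int(m.group("secs")),
                "comm": m.group("comm"),
                "pid": int(m.group("pid")),
            })
    return reports
```

Feeding it the text of a saved kernel log returns a list of reports. For the two traces above, both entries would name `kswapd0` (PID 42) on CPU 0, with the stuck time growing from 1725s to 1747s.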

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-nvidia-6.5/6.5.0-1014.14 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-nvidia-6.5' to 'verification-done-jammy-linux-nvidia-6.5'. If the problem still exists, change the tag 'verification-needed-jammy-linux-nvidia-6.5' to 'verification-failed-jammy-linux-nvidia-6.5'.

If verification is not done within 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-nvidia-6.5-v2 verification-needed-jammy-linux-nvidia-6.5
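
Before testing, it is worth confirming that the machine is actually running the -proposed kernel rather than the affected 6.5.0-1011 build seen in the traces. The following is a minimal sketch under stated assumptions: the helper names `kernel_abi` and `has_fix` are hypothetical, the release string is assumed to look like the `6.5.0-1011-nvidia` shown in the log, and the flavour suffix (`-nvidia`) is not checked.

```python
import re


def kernel_abi(release):
    """Split a release string such as '6.5.0-1014-nvidia' into (6, 5, 0, 1014)."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(g) for g in m.groups())


def has_fix(release, fixed="6.5.0-1014"):
    """True if release is at or above the ABI that carries the mm/mglru patches.

    The default threshold comes from the 6.5.0-1014.14 package named in this bug.
    """
    return kernel_abi(release) >= kernel_abi(fixed)
```

The release string can be taken from `uname -r` or from Python's `platform.release()`, for example `has_fix(platform.release())`.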
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-nvidia-6.5 - 6.5.0-1014.14

---------------
linux-nvidia-6.5 (6.5.0-1014.14) jammy; urgency=medium

  * jammy/linux-nvidia-6.5: 6.5.0-1014.14 -proposed tracker (LP: #2055581)

  * Packaging resync (LP: #1786013)
    - [Packaging] update variants
    - [Packaging] drop ABI data
    - debian.nvidia-6.5/dkms-versions -- update from kernel-versions
      (main/2024.03.04)

  * Pull-request to address bug in mm/page_alloc.c (LP: #2055712)
    - mm/page_alloc: fix min_free_kbytes calculation regarding ZONE_MOVABLE

  * Pull-request: Apply mm/mglru patches to fix soft lockup (LP: #2055060)
    - mm/mglru: try to stop at high watermarks
    - mm/mglru: respect min_ttl_ms with memcgs

  * Pull request: Enable support of ETE and TRBE in ACPI environment.
    (LP: #2054984)
    - NVIDIA: [Config] CORESIGHT configuration changes
    - coresight: trbe: Add a representative coresight_platform_data for TRBE
    - coresight: trbe: Enable ACPI based TRBE devices
    - arm_pmu: acpi: Refactor arm_spe_acpi_register_device()
    - arm_pmu: acpi: Add a representative platform device for TRBE
    - perf cs-etm: Fix incorrect or missing decoder for raw trace

  [ Ubuntu: 6.5.0-27.28 ]

  * mantic/linux: 6.5.0-27.28 -proposed tracker (LP: #2055584)
  * Packaging resync (LP: #1786013)
    - [Packaging] drop ABI data
    - [Packaging] update annotations scripts
    - debian.master/dkms-versions -- update from kernel-versions (main/2024.03.04)
  * CVE-2024-26597
    - net: qualcomm: rmnet: fix global oob in rmnet_policy
  * CVE-2024-26599
    - pwm: Fix out-of-bounds access in of_pwm_single_xlate()
  * Drop ABI checks from kernel build (LP: #2055686)
    - [Packaging] Remove in-tree abi checks
  * Cranky update-dkms-versions rollout (LP: #2055685)
    - [Packaging] remove update-dkms-versions
    - Move debian/dkms-versions to debian.master/dkms-versions
    - [Packaging] Replace debian/dkms-versions with $(DEBIAN)/dkms-versions
  * linux: please move erofs.ko (CONFIG_EROFS for EROFS support) from linux-
    modules-extra to linux-modules (LP: #2054809)
    - UBUNTU [Packaging]: Include erofs in linux-modules instead of linux-modules-
      extra
  * performance: Scheduler: ratelimit updating of load_avg (LP: #2053251)
    - sched/fair: Ratelimit update to tg->load_avg
  * IB peer memory feature regressed in 6.5 (LP: #2055082)
    - SAUCE: RDMA/core: Introduce peer memory interface
  * linux-tools-common: man page of usbip[d] is misplaced (LP: #2054094)
    - [Packaging] rules: Put usbip manpages in the correct directory
  * CVE-2024-23851
    - dm: limit the number of targets and parameter size area
  * CVE-2024-23850
    - btrfs: do not ASSERT() if the newly created subvolume already got read
  * x86: performance: tsc: Extend watchdog check exemption to 4-Sockets platform
    (LP: #2054699)
    - x86/tsc: Extend watchdog check exemption to 4-Sockets platform
  * linux: please move dmi-sysfs.ko (CONFIG_DMI_SYSFS for SMBIOS support) from
    linux-modules-extra to linux-modules (LP: #2045561)
    - [Packaging] Move dmi-sysfs.ko into linux-modules
  * Fix AMD brightness issue on AUO panel (LP: #2054773)
    - drm/amdgpu: make dam...

Changed in linux-nvidia-6.5 (Ubuntu):
status: New → Fix Released