IB peer memory feature regressed in 6.5

Bug #2055082 reported by dann frazier
This bug affects 1 person
Affects         Status        Importance  Assigned to   Milestone
linux (Ubuntu)  Won't Fix     Undecided   Unassigned
  Mantic        Fix Released  Medium      dann frazier

Bug Description

[Impact]
The GPUDirect over InfiniBand feature of NVIDIA GPUs no longer works on jammy/hwe kernels that have migrated to 6.5, *except* for the -nvidia kernel, which pulled in support via bug 2049537.

We have carried this patch since 5.4 (see bug 1923104). We do not plan to carry it into 6.8 or later; we are working on a deprecation post to give users some time to migrate.
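
A quick way to see whether a running kernel carries the peer memory interface is to look for its registration hook in the kernel symbol table. The sketch below is illustrative only. It assumes the patch exports a symbol named ib_register_peer_memory_client (the name used by the out-of-tree peer memory API; this report does not state it), and the helper name has_peer_memory is hypothetical. If ib_core is built as a module, the symbol only appears once that module is loaded.

```shell
#!/bin/sh
# Sketch: succeed if a kallsyms-style symbol table lists the peer memory
# registration hook.
# ASSUMPTION: the SAUCE patch exports ib_register_peer_memory_client.
has_peer_memory() {
    symfile=${1:-/proc/kallsyms}
    # Each line is "<address> <type> <symbol> [module]"; compare field 3 exactly.
    awk '$3 == "ib_register_peer_memory_client" { found = 1 }
         END { exit !found }' "$symfile"
}
```

For example, `has_peer_memory && echo "peer memory interface present"` checks the running kernel. Passing a file name as the first argument checks a saved symbol table instead.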

[Test Case]
https://git.launchpad.net/~canonical-kernel-team/+git/autotest-client-tests/tree/ubuntu_performance_gpudirect_rdma

dann frazier (dannf)
Changed in linux (Ubuntu):
status: New → Won't Fix
Changed in linux (Ubuntu Mantic):
status: New → In Progress
assignee: nobody → dann frazier (dannf)
Stefan Bader (smb)
Changed in linux (Ubuntu Mantic):
importance: Undecided → Medium
status: In Progress → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/6.5.0-27.28 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-mantic-linux' to 'verification-done-mantic-linux'. If the problem still exists, change the tag 'verification-needed-mantic-linux' to 'verification-failed-mantic-linux'.

If verification is not done within 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!
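
Before running the test case, it can help to confirm that the machine actually booted the kernel under verification. The following is a rough sketch, not part of the official procedure. The helper name kernel_at_least is hypothetical, it relies on `sort -V` from GNU coreutils, and it only handles the generic flavour, because flavour kernels such as linux-raspi use a different ABI numbering.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the kernel release in $1 (as printed by
# `uname -r`) is at least $2, which defaults to 6.5.0-27, the ABI of
# linux 6.5.0-27.28. Returns 2 for non-generic flavours.
kernel_at_least() {
    case $1 in
        *-generic) running=${1%-generic} ;;   # drop the flavour suffix
        *) return 2 ;;                        # other flavours number ABIs differently
    esac
    wanted=${2:-6.5.0-27}
    # sort -V orders version strings; the running kernel is new enough when
    # the wanted version sorts first (or the two are equal).
    lowest=$(printf '%s\n%s\n' "$wanted" "$running" | sort -V | head -n 1)
    [ "$lowest" = "$wanted" ]
}
```

A typical call would be `kernel_at_least "$(uname -r)" && echo "running the kernel under test or newer"`.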

tags: added: kernel-spammed-mantic-linux-v2 verification-needed-mantic-linux
dann frazier (dannf) wrote :

= Verification =

$ cat /proc/version
Linux version 6.5.0-27-generic (buildd@lcy02-amd64-059) (x86_64-linux-gnu-gcc-13 (Ubuntu 13.2.0-4ubuntu3) 13.2.0, GNU ld (GNU Binutils for Ubuntu) 2.41) #28-Ubuntu SMP PREEMPT_DYNAMIC Thu Mar 7 18:21:00 UTC 2024

ubuntu@ubuntu:~/autotest-client-tests/ubuntu_performance_gpudirect_rdma/nvidia-peermem-test$ ./nvidia-peermem-test.sh -m peermem
Repository: 'Types: deb
URIs: https://ppa.launchpadcontent.net/canonical-nvidia/perftest+cuda/ubuntu/
Suites: mantic
Components: main
'
Description:
Used internal for kernel regression testing
More info: https://launchpad.net/~canonical-nvidia/+archive/ubuntu/perftest+cuda
Adding repository.
Found existing deb entry in /etc/apt/sources.list.d/canonical-nvidia-ubuntu-perftest_cuda-mantic.sources
Hit:1 http://archive.ubuntu.com/ubuntu mantic InRelease
Hit:2 http://archive.ubuntu.com/ubuntu mantic-updates InRelease
Hit:3 http://archive.ubuntu.com/ubuntu mantic-security InRelease
Hit:4 http://archive.ubuntu.com/ubuntu mantic-backports InRelease
Hit:5 http://archive.ubuntu.com/ubuntu mantic-proposed InRelease
Hit:6 https://ppa.launchpadcontent.net/canonical-nvidia/perftest+cuda/ubuntu mantic InRelease
Hit:7 https://ppa.launchpadcontent.net/dannf/dannf/ubuntu mantic InRelease
Reading package lists... Done
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
perftest is already the newest version (24.01.0+0.38-1+perftest+cuda.1~ubuntu23.10.1).
0 upgraded, 0 newly installed, 0 to remove and 10 not upgraded.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
opensm is already the newest version (3.3.23-2).
0 upgraded, 0 newly installed, 0 to remove and 10 not upgraded.
      --use_cuda=<cuda device id> Use CUDA specific device for GPUDirect RDMA testing
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0

************************************
* Waiting for client to connect... *
************************************
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0
initializing CUDA
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 07:00
CUDA device 1: PCIe address is 0F:00
CUDA device 2: PCIe address is 47:00
CUDA device 3: PCIe address is 4E:00
CUDA device 4: PCIe address is 87:00
CUDA device 5: PCIe address is 90:00
CUDA device 6: PCIe address is B7:00
CUDA device 7: PCIe address is BD:00

Picking device No. 1
[pid = 15582, dev = 1] device name = [NVIDIA A100-SXM4-40GB]
creating CUDA Ctx
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 07:00
CUDA device 1: PCIe address is 0F:00
CUDA device 2: PCIe address is 47:00
CUDA device 3: PCIe address is 4E:00
CUDA device 4: PCIe address is 87:00
CUDA device 5: PCIe address is 90:00
CUDA device 6: PCIe address is B7:00
CUDA device 7: PCIe address is BD:00

Picking device No. 0
[pid = 15576, dev = 0] device name = [NVIDIA A100-SXM4-40GB]
creating CUDA Ctx
making it the current CUDA Ctx
cuMemAlloc() of a 16777216 bytes GPU buffer
allocated GPU buffer address at 00007c0146000000 pointer=0x7c0146000000
----------------------------------------...


tags: added: verification-done-mantic-linux
removed: verification-needed-mantic-linux
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 6.5.0-27.28

---------------
linux (6.5.0-27.28) mantic; urgency=medium

  * mantic/linux: 6.5.0-27.28 -proposed tracker (LP: #2055584)

  * Packaging resync (LP: #1786013)
    - [Packaging] drop ABI data
    - [Packaging] update annotations scripts
    - debian.master/dkms-versions -- update from kernel-versions (main/2024.03.04)

  * CVE-2024-26597
    - net: qualcomm: rmnet: fix global oob in rmnet_policy

  * CVE-2024-26599
    - pwm: Fix out-of-bounds access in of_pwm_single_xlate()

  * Drop ABI checks from kernel build (LP: #2055686)
    - [Packaging] Remove in-tree abi checks

  * Cranky update-dkms-versions rollout (LP: #2055685)
    - [Packaging] remove update-dkms-versions
    - Move debian/dkms-versions to debian.master/dkms-versions
    - [Packaging] Replace debian/dkms-versions with $(DEBIAN)/dkms-versions

  * linux: please move erofs.ko (CONFIG_EROFS for EROFS support) from linux-
    modules-extra to linux-modules (LP: #2054809)
    - UBUNTU [Packaging]: Include erofs in linux-modules instead of linux-modules-
      extra

  * performance: Scheduler: ratelimit updating of load_avg (LP: #2053251)
    - sched/fair: Ratelimit update to tg->load_avg

  * IB peer memory feature regressed in 6.5 (LP: #2055082)
    - SAUCE: RDMA/core: Introduce peer memory interface

  * linux-tools-common: man page of usbip[d] is misplaced (LP: #2054094)
    - [Packaging] rules: Put usbip manpages in the correct directory

  * CVE-2024-23851
    - dm: limit the number of targets and parameter size area

  * CVE-2024-23850
    - btrfs: do not ASSERT() if the newly created subvolume already got read

  * x86: performance: tsc: Extend watchdog check exemption to 4-Sockets platform
    (LP: #2054699)
    - x86/tsc: Extend watchdog check exemption to 4-Sockets platform

  * linux: please move dmi-sysfs.ko (CONFIG_DMI_SYSFS for SMBIOS support) from
    linux-modules-extra to linux-modules (LP: #2045561)
    - [Packaging] Move dmi-sysfs.ko into linux-modules

  * Fix AMD brightness issue on AUO panel (LP: #2054773)
    - drm/amdgpu: make damage clips support configurable

  * Mantic update: upstream stable patchset 2024-02-28 (LP: #2055199)
    - f2fs: explicitly null-terminate the xattr list
    - pinctrl: lochnagar: Don't build on MIPS
    - ALSA: hda - Fix speaker and headset mic pin config for CHUWI CoreBook XPro
    - mptcp: fix uninit-value in mptcp_incoming_options
    - wifi: cfg80211: lock wiphy mutex for rfkill poll
    - wifi: avoid offset calculation on NULL pointer
    - wifi: mac80211: handle 320 MHz in ieee80211_ht_cap_ie_to_sta_ht_cap
    - debugfs: fix automount d_fsdata usage
    - nvme-core: fix a memory leak in nvme_ns_info_from_identify()
    - drm/amd/display: update dcn315 lpddr pstate latency
    - drm/amdgpu: Fix cat debugfs amdgpu_regs_didt causes kernel null pointer
    - smb: client, common: fix fortify warnings
    - blk-mq: don't count completed flush data request as inflight in case of
      quiesce
    - nvme-core: check for too small lba shift
    - hwtracing: hisi_ptt: Handle the interrupt in hardirq context
    - hwtracing: hisi_ptt: Don't try to attach a task
    - ASoC: wm8974:...

Changed in linux (Ubuntu Mantic):
status: Fix Committed → Fix Released
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-raspi/6.5.0-1014.17 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-mantic-linux-raspi' to 'verification-done-mantic-linux-raspi'. If the problem still exists, change the tag 'verification-needed-mantic-linux-raspi' to 'verification-failed-mantic-linux-raspi'.

If verification is not done within 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-mantic-linux-raspi-v2 verification-needed-mantic-linux-raspi