Ubuntu 22.04 raise abnormal NIC MSI-X requests with larger CPU cores (256)

Bug #2012335 reported by xijunli
This bug affects 1 person
Affects          Status         Importance   Assigned to               Milestone
linux (Ubuntu)   Fix Released   Undecided    Unassigned
  Jammy          Fix Released   Undecided    Luke Nowakowski-Krijger
  Kinetic        Fix Released   Undecided    Luke Nowakowski-Krijger

Bug Description

SRU Justification:

[Impact]

A user reports that on their 256-CPU system the ice driver fails to set up
an Intel E810 NIC, with error messages saying that the driver cannot
allocate enough MSI-X vectors.

The ice Ethernet driver appears to take an all-or-nothing approach to
allocating MSI-X vectors: it can request more vectors than the device has
available, in which case the driver fails to initialize and start.

[Fix]

The fix lets the driver keep working with fewer vectors: if the device does
not have enough MSI-X vectors for a full allocation, the driver reduces its
request and allocates as many as it can.
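
As a rough illustration of the idea (not the actual driver code), the
fallback can be sketched in Python. The per-feature counts (1 miscellaneous
vector, 2 for flow director, 4 extra for RDMA) and the way the remaining
vectors are split are assumptions made for this sketch; the real patch may
divide vectors differently.

```python
# Hypothetical sketch of "allocate what you can" MSI-X planning.
# The constants and the split policy are assumptions for illustration;
# they are not copied from the ice driver.
MISC_VECTORS = 1        # assumed: one miscellaneous/control vector
FDIR_VECTORS = 2        # assumed: flow director vectors
RDMA_EXTRA_VECTORS = 4  # assumed: RDMA vectors on top of one per CPU


def plan_msix(available, num_cpus, rdma_enabled=True):
    """Return a vector budget whose total never exceeds `available`."""
    fixed = MISC_VECTORS + FDIR_VECTORS
    if available < fixed + 1:
        raise ValueError("not enough vectors for even a minimal setup")

    remaining = available - fixed
    want_lan = num_cpus
    want_rdma = num_cpus + RDMA_EXTRA_VECTORS if rdma_enabled else 0

    if want_lan + want_rdma <= remaining:
        # Full allocation fits: grant everything that was wanted.
        lan, rdma = want_lan, want_rdma
    elif not rdma_enabled or remaining < RDMA_EXTRA_VECTORS + 2:
        # Too few vectors for RDMA: give what is left to LAN only.
        lan, rdma = min(want_lan, remaining), 0
    else:
        # Reduced allocation: split what is left between RDMA and LAN.
        half = (remaining - RDMA_EXTRA_VECTORS) // 2 + RDMA_EXTRA_VECTORS
        rdma = min(want_rdma, half)
        lan = min(want_lan, remaining - rdma)

    return {"misc": MISC_VECTORS, "fdir": FDIR_VECTORS, "lan": lan, "rdma": rdma}
```

With these assumptions, plan_msix(512, 256) returns a reduced budget of 253
LAN and 256 RDMA vectors (512 in total) instead of failing outright.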

[Backport]

Jammy does not carry the switchdev support patches for this driver, so the
backport does not allocate the switchdev MSI-X vector. The Jammy backport also
checks RDMA support the older way, by testing whether the RDMA bit is set,
rather than with the newer ice_is_rdma_ena helper that the upstream patch uses.

[Test Plan]

Install and load the ice driver on a system with an Intel 800 series NIC and
check that the following failure does not appear:

Not enough device MSI-X vectors, requested = 260, available = 253

and check that everything works as expected.

The backported patch for Jammy has been tested by the original reporter on
their high-CPU-count system, and they confirmed that no errors occur.
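
One way to automate the log check is a small Python helper that scans
captured kernel log text for the failure. The pattern is based on the message
quoted in this report; the exact wording and punctuation in a given kernel's
log may differ, so treat it as an assumption, and note that
find_msix_failures is a name made up for this sketch.

```python
import re

# Pattern modelled on the message quoted in this bug report. Matching is
# case-insensitive and tolerant of "." or "," after "vectors" because the
# exact log format is an assumption.
MSIX_ERROR = re.compile(
    r"not enough device MSI-X vectors[.,]?\s*"
    r"requested = (\d+), available = (\d+)",
    re.IGNORECASE,
)


def find_msix_failures(log_text):
    """Return a list of (requested, available) pairs found in log_text."""
    return [(int(req), int(avail)) for req, avail in MSIX_ERROR.findall(log_text)]
```

An empty list from find_msix_failures on the captured kernel log means the
failure message was not seen; it does not by itself prove the NIC is fully
functional.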

[Where problems could occur]

The logic that reduces MSI-X vector usage could itself introduce new
errors in the driver. Otherwise the regression potential is minimal,
since the change mostly refactors the initial MSI-X setup.

----------------------------------

System Configuration
    OS: Ubuntu 22.04 LTS
    Kernel: 5.15.0-25-generic
    CPUs: 256
    NIC: Intel E810 NIC with 512 MSI-X vectors per function

Errors
    Not enough device MSI-X vectors, requested = 260, available = 253

Findings
    (1) The current ice kernel driver (ice_main.c) tries to pre-allocate the full number of MSI-X vectors it wants, even when the device does not have enough for CPUs with many cores.
    (2) The commit https://github.com/torvalds/linux/commit/ce4626131112e1d0066a890371e14d8091323f99 improves this logic, and it appears to have been merged in kernel v6.1.

To support new CPUs with more than 252 vCPUs, will the Ubuntu kernel backport the above patch to the current kernel (v5.15)?
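
The numbers in the error line up with a simple accounting of the 512 vectors.
The Python sketch below assumes the driver reserves, in order, 1 miscellaneous
vector, one LAN vector per CPU, 2 flow-director vectors, and one RDMA vector
per CPU plus 4. That breakdown is an assumption rather than something stated
in this report, but it reproduces both the 260/253 figures and the 252-vCPU
limit.

```python
# Assumed per-feature vector demands (not taken from the report itself).
MISC = 1
FDIR = 2
RDMA_EXTRA = 4


def first_shortfall(device_vectors, num_cpus):
    """Walk the assumed requests in order; return the first one that
    does not fit as (feature, requested, available), or None if all fit."""
    left = device_vectors
    requests = (
        ("misc", MISC),
        ("lan", num_cpus),
        ("fdir", FDIR),
        ("rdma", num_cpus + RDMA_EXTRA),
    )
    for feature, needed in requests:
        if left < needed:
            return feature, needed, left
        left -= needed
    return None


def max_cpus_for_full_allocation(device_vectors):
    """Largest n with MISC + n + FDIR + (n + RDMA_EXTRA) <= device_vectors."""
    return (device_vectors - MISC - FDIR - RDMA_EXTRA) // 2
```

Under these assumptions, first_shortfall(512, 256) gives ("rdma", 260, 253),
and max_cpus_for_full_allocation(512) gives 252.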

xijunli (xijunli) wrote :
affects: ubuntu-realtime → linux (Ubuntu)
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 2012335

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
xijunli (xijunli) wrote :

It's not a system crash, but a kernel driver issue with a PCIe NIC.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
xijunli (xijunli) wrote :

Hello, is there anyone who can help here?

Changed in linux (Ubuntu Jammy):
status: New → Confirmed
Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Changed in linux (Ubuntu Jammy):
status: Confirmed → In Progress
assignee: nobody → Luke Nowakowski-Krijger (lukenow)
Luke Nowakowski-Krijger (lukenow) wrote :

Hey @xijunli! Could you please test that the driver now works as expected? I have placed the files here: https://kernel.ubuntu.com/~lukenow/lp2012335/ (you should only need to install the modules). Let me know if that works for you.

Thanks,
- Luke

xijunli (xijunli) wrote :

@lukenow, after installing the modules you provided, the issue is gone.

When the final kernel version that includes this fix is available, please update here. Thanks.

Luke Nowakowski-Krijger (lukenow) wrote :

Thank you for confirming, @xijunli. Once the kernel with the fix is released, a bot should post here which version contains the patch.

- Luke

Changed in linux (Ubuntu Kinetic):
status: New → Confirmed
assignee: nobody → Luke Nowakowski-Krijger (lukenow)
Changed in linux (Ubuntu Kinetic):
status: Confirmed → In Progress
description: updated
xijunli (xijunli) wrote :

@lukenow, do you have an estimate of when this ticket will be completed? It is currently "In Progress". Thanks.

Luke Nowakowski-Krijger (lukenow) wrote :

Hey @xijunli, the patches have been accepted but have yet to be committed to our trees. I'll personally make sure they get committed this cycle, which means they should be released with the next cycle's kernels. So they should be officially released around the beginning of June if everything goes well.

xijunli (xijunli) wrote :

Thank you for the confirmation. I will check again at that time (the beginning of June).

Changed in linux (Ubuntu Jammy):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Kinetic):
status: In Progress → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/5.15.0-74.81 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux verification-needed-jammy
xijunli (xijunli) wrote :

Verification done with linux/5.15.0-74.81; the result is PASS. Thanks.

tags: added: verification-done-jammy
removed: verification-needed-jammy
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/5.19.0-44.45 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-kinetic' to 'verification-done-kinetic'. If the problem still exists, change the tag 'verification-needed-kinetic' to 'verification-failed-kinetic'.

tags: added: kernel-spammed-kinetic-linux verification-needed-kinetic
xijunli (xijunli)
tags: added: verification-done-kinetic
removed: verification-needed-kinetic
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-nvidia-5.19/5.19.0-1014.14 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

tags: added: kernel-spammed-jammy-linux-nvidia-5.19 verification-needed-jammy
removed: verification-done-jammy
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-intel-iotg-5.15/5.15.0-1033.38~20.04.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

tags: added: kernel-spammed-focal-linux-intel-iotg-5.15 verification-needed-focal
xijunli (xijunli)
tags: added: verification-done-focal
removed: verification-needed-focal
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.15.0-75.82

---------------
linux (5.15.0-75.82) jammy; urgency=medium

  * jammy/linux: 5.15.0-75.82 -proposed tracker (LP: #2023065)

  * Jammy update: v5.15.102 upstream stable release (LP: #2020393)
    - wifi: cfg80211: Partial revert "wifi: cfg80211: Fix use after free for wext"

  * Packaging resync (LP: #1786013)
    - [Packaging] resync git-ubuntu-log
    - [Packaging] resync getabis

  * fix typo in config-checks invocation (LP: #2020413)
    - [Packaging] fix typo when calling the old config-check
    - [Packaging] fix typo in 4-checks.mk

  * support python < 3.9 with annotations (LP: #2020531)
    - [Packaging] kconfig/annotations.py: support older way of merging dicts

linux (5.15.0-74.81) jammy; urgency=medium

  * jammy/linux: 5.15.0-74.81 -proposed tracker (LP: #2019420)

  * smartpqi: Update 22.04 driver to include recent bug fixes and support
    current generation devices (LP: #1998643)
    - scsi: smartpqi: Switch to attribute groups
    - scsi: smartpqi: Fix rmmod stack trace
    - scsi: smartpqi: Add PCI IDs
    - scsi: smartpqi: Enable SATA NCQ priority in sysfs
    - scsi: smartpqi: Eliminate drive spin down on warm boot
    - scsi: smartpqi: Quickly propagate path failures to SCSI midlayer
    - scsi: smartpqi: Fix a name typo and cleanup code
    - scsi: smartpqi: Fix a typo in func pqi_aio_submit_io()
    - scsi: smartpqi: Resolve delay issue with PQI_HZ value
    - scsi: smartpqi: Avoid drive spin-down during suspend
    - scsi: smartpqi: Update volume size after expansion
    - scsi: smartpqi: Speed up RAID 10 sequential reads
    - scsi: smartpqi: Expose SAS address for SATA drives
    - scsi: smartpqi: Fix NUMA node not updated during init
    - scsi: smartpqi: Fix BUILD_BUG_ON() statements
    - scsi: smartpqi: Fix hibernate and suspend
    - scsi: smartpqi: Fix lsscsi -t SAS addresses
    - scsi: smartpqi: Update version to 2.1.14-035
    - scsi: smartpqi: Fix unused variable pqi_pm_ops for clang
    - scsi: smartpqi: Stop using the SCSI pointer
    - scsi: smartpqi: Fix typo in comment
    - scsi: smartpqi: Shorten drive visibility after removal
    - scsi: smartpqi: Add controller fw version to console log
    - scsi: smartpqi: Add PCI IDs for ramaxel controllers
    - scsi: smartpqi: Close write read holes
    - scsi: smartpqi: Add driver support for multi-LUN devices
    - scsi: smartpqi: Fix PCI control linkdown system hang
    - scsi: smartpqi: Add PCI ID for Adaptec SmartHBA 2100-8i
    - scsi: smartpqi: Add PCI IDs for Lenovo controllers
    - scsi: smartpqi: Stop logging spurious PQI reset failures
    - scsi: smartpqi: Fix RAID map race condition
    - scsi: smartpqi: Add module param to disable managed ints
    - scsi: smartpqi: Update deleting a LUN via sysfs
    - scsi: smartpqi: Add ctrl ready timeout module parameter
    - scsi: smartpqi: Update copyright to current year
    - scsi: smartpqi: Update version to 2.1.18-045
    - scsi: smartpqi: Convert to host_tagset
    - scsi: smartpqi: Add new controller PCI IDs
    - scsi: smartpqi: Correct max LUN number
    - scsi: smartpqi: Change sysfs raid_level attribute to N/A for controllers
    - scsi: smar...

Changed in linux (Ubuntu Jammy):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.19.0-45.46

---------------
linux (5.19.0-45.46) kinetic; urgency=medium

  * kinetic/linux: 5.19.0-45.46 -proposed tracker (LP: #2023057)

  * Kinetic update: upstream stable patchset 2023-05-23 (LP: #2020599)
    - wifi: cfg80211: Partial revert "wifi: cfg80211: Fix use after free for wext"

linux (5.19.0-44.45) kinetic; urgency=medium

  * kinetic/linux: 5.19.0-44.45 -proposed tracker (LP: #2019827)

  * Linux 5.19 amdgpu: NULL pointer on GCN2 and invalid load on GCN1
    (LP: #2018470)
    - drm/amdgpu: Fix for BO move issue

  * CVE-2023-32233
    - netfilter: nf_tables: deactivate anonymous set from preparation phase

  * CVE-2023-2612
    - SAUCE: shiftfs: prevent lock unbalance in shiftfs_create_object()

  * CVE-2023-31436
    - net: sched: sch_qfq: prevent slab-out-of-bounds in qfq_activate_agg

  * CVE-2023-1380
    - wifi: brcmfmac: slab-out-of-bounds read in brcmf_get_assoc_ies()

  * conntrack mark is not advertised via netlink (LP: #2016269)
    - netfilter: ctnetlink: revert to dumping mark regardless of event type

  * 5.19 not reporting cgroups v1 blkio.throttle.io_serviced (LP: #2016186)
    - SAUCE: blk-throttle: Fix io statistics for cgroup v1

  * [SRU] Backport request for hpwdt from upstream 6.1 to Jammy (LP: #2008751)
    - watchdog/hpwdt: Enable HP_WATCHDOG for ARM64 systems.
    - watchdog/hpwdt: Include nmi.h only if CONFIG_HPWDT_NMI_DECODING
    - [Config] Add arm64 option to CONFIG_HP_WATCHDOG

  * vmwgfx fails to reserve graphics buffer on aarch64 leading to blank display
    (LP: #2007001)
    - SAUCE: Revert "video/aperture: Disable and unregister sysfb devices via
      aperture helpers"

  * Ubuntu 22.04 raise abnormal NIC MSI-X requests with larger CPU cores (256)
    (LP: #2012335)
    - ice: Allow operation with reduced device MSI-X

  * Dell: Enable speaker mute hotkey LED indicator (LP: #2015972)
    - platform/x86: dell-laptop: Register ctl-led for speaker-mute

  * [SRU]With "Performance per Watt (DAPC)" enabled in the BIOS, Bootup time is
    taking longer than expected (LP: #2008527)
    - cpufreq: ACPI: Defer setting boost MSRs

  * [SRU][Jammy] CONFIG_PCI_MESON is not enabled (LP: #2007745)
    - [Config] arm64: Enable PCI_MESON module

  * Kinetic update: upstream stable patchset 2023-05-08 (LP: #2018948)
    - HID: asus: use spinlock to protect concurrent accesses
    - HID: asus: use spinlock to safely schedule workers
    - powerpc/mm: Rearrange if-else block to avoid clang warning
    - ARM: OMAP2+: Fix memory leak in realtime_counter_init()
    - arm64: dts: qcom: qcs404: use symbol names for PCIe resets
    - arm64: dts: qcom: msm8996-tone: Fix USB taking 6 minutes to wake up
    - arm64: dts: qcom: sm8150-kumano: Panel framebuffer is 2.5k instead of 4k
    - arm64: dts: qcom: sm6125: Reorder HSUSB PHY clocks to match bindings
    - arm64: dts: imx8m: Align SoC unique ID node unit address
    - ARM: zynq: Fix refcount leak in zynq_early_slcr_init
    - arm64: dts: mediatek: mt8183: Fix systimer 13 MHz clock description
    - arm64: dts: qcom: sdm845-db845c: fix audio codec interrupt pin name
    - arm64: dts: qcom: sc7180: correct SPMI bus addres...

Changed in linux (Ubuntu Kinetic):
status: Fix Committed → Fix Released
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-nvidia-tegra/5.15.0-1015.15 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

tags: added: kernel-spammed-jammy-linux-nvidia-tegra
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-nvidia-tegra-igx/5.15.0-1001.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

tags: added: kernel-spammed-jammy-linux-nvidia-tegra-igx
xijunli (xijunli)
tags: added: verification-done-jammy
removed: verification-needed-jammy
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure/5.15.0-1043.50 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

tags: added: kernel-spammed-jammy-linux-azure verification-needed-jammy
removed: verification-done-jammy
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-aws/5.15.0-1041.46 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

tags: added: kernel-spammed-jammy-linux-aws
xijunli (xijunli)
tags: added: verification-done-jammy
removed: verification-needed-jammy
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-aws-5.15/5.15.0-1046.51~20.04.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal-linux-aws-5.15' to 'verification-done-focal-linux-aws-5.15'. If the problem still exists, change the tag 'verification-needed-focal-linux-aws-5.15' to 'verification-failed-focal-linux-aws-5.15'.

tags: added: kernel-spammed-focal-linux-aws-5.15-v2 verification-needed-focal-linux-aws-5.15
xijunli (xijunli)
tags: added: verification-done-focal-linux-aws-5.15
removed: verification-needed-focal-linux-aws-5.15
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-mtk/5.15.0-1030.34 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-mtk' to 'verification-done-jammy-linux-mtk'. If the problem still exists, change the tag 'verification-needed-jammy-linux-mtk' to 'verification-failed-jammy-linux-mtk'.

tags: added: kernel-spammed-jammy-linux-mtk-v2 verification-needed-jammy-linux-mtk