NVMe devices fail to probe due to ACPI power state change

Bug #1942624 reported by Chris de CLAVERIE
This bug affects 7 people
Affects                     Status        Importance  Assigned to               Milestone
HWE Next                    Fix Released  Undecided   Unassigned
linux (Ubuntu)              Fix Released  Undecided   Unassigned
  Focal                     Invalid       Undecided   Unassigned
  Impish                    Fix Released  Medium      Heitor Alves de Siqueira
linux-oem-5.14 (Ubuntu)     Invalid       Medium      Heitor Alves de Siqueira
  Focal                     Fix Released  Medium      Heitor Alves de Siqueira
  Impish                    Invalid       Undecided   Unassigned

Bug Description

[Impact]
* Specific NVMe devices fail to probe and become unusable after boot
* Caused by an ACPI regression that turns off unused power resources unconditionally, even when their state is unknown
* Upstream regression commit:
  7e4fdeafa61f ACPI: power: Turn off unused power resources unconditionally
* Regression window for Ubuntu kernels includes 5.13 and 5.14

[Test Plan]
* Boot affected kernel and validate whether NVMe device is usable
* Check kernel logs for failed probe message:
  "can't change power state from D3cold to D0 (config space inaccessible)"

[Fix]
* Fixed by not turning off power resources in unknown state
* Fix was introduced by commit:
  bc2836859643 ACPI: PM: Do not turn off power resources in unknown state
* Kernels starting with 5.15 (e.g. Jammy) not affected, as they already contain the fix above
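
As a rough illustration of the window described above, the following sketch classifies a kernel release string by its upstream series. The function name and the AFFECTED_SERIES set are hypothetical, and the check is an assumption-laden simplification: it looks only at the major.minor series and does not recognise Ubuntu builds that already carry the backported fix.

```python
import re

# Upstream series named in this report as inside the regression window.
# Assumption: only major.minor is inspected, so Ubuntu builds that already
# include the backported fix are still reported as "in the window".
AFFECTED_SERIES = {(5, 13), (5, 14)}


def in_regression_window(release):
    """Return True if a release string such as '5.13.0-22-generic' is in an affected series."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) in AFFECTED_SERIES
```

The string returned by `uname -r` can be passed in directly; a True result only means the series is worth checking, not that the running kernel is actually broken.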

[Regression Potential]
* NVMe devices continue failing to probe
* Other devices become unusable after power state changes
* Further regressions would affect power state of devices, possibly after boot

--
[Original Description]
NVME "can't change power state from D3Cold to D0 (config space inaccessible)"

Bug with kernels after version 5.11.0-18 on Lenovo Ideapad 330-15ICH. The NVMe drive holding my root partition cannot be mounted at boot; it fails with the error "can't change power state from D3Cold to D0 (config space inaccessible)". I'm willing to help find a root cause, provided I don't need to spend too many hours on it. All Ubuntu kernels after 5.11.0-18 exhibit this bug, but I could boot properly with the official Linux kernel 5.13.0. Thanks a lot for your help.

ProblemType: Bug
DistroRelease: Ubuntu 21.04
Package: linux-image-5.11.0-18-generic 5.11.0-18.19
ProcVersionSignature: Ubuntu 5.11.0-18.19-generic 5.11.17
Uname: Linux 5.11.0-18-generic x86_64
ApportVersion: 2.20.11-0ubuntu65.1
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: chris 7503 F.... pulseaudio
 /dev/snd/pcmC0D0p: chris 7503 F...m pulseaudio
CasperMD5CheckResult: unknown
CurrentDesktop: ubuntu:GNOME
Date: Fri Sep 3 18:46:35 2021
InstallationDate: Installed on 2019-07-17 (779 days ago)
InstallationMedia: Ubuntu 19.04 "Disco Dingo" - Release amd64 (20190416)
Lsusb:
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 003: ID 5986:210e Acer, Inc EasyCamera
 Bus 001 Device 004: ID 8087:0a2a Intel Corp. Bluetooth wireless interface
 Bus 001 Device 007: ID 1050:0407 Yubico.com Yubikey 4/5 OTP+U2F+CCID
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: LENOVO 81FK
ProcFB: 0 i915drmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.11.0-18-generic root=UUID=9d963312-3a16-428d-8efd-f1323c6528f1 ro quiet nosplash crashkernel=512M-:192M
RelatedPackageVersions:
 linux-restricted-modules-5.11.0-18-generic N/A
 linux-backports-modules-5.11.0-18-generic N/A
 linux-firmware 1.197.3
SourcePackage: linux
UpgradeStatus: Upgraded to hirsute on 2021-06-03 (92 days ago)
dmi.bios.date: 10/24/2018
dmi.bios.release: 1.29
dmi.bios.vendor: LENOVO
dmi.bios.version: 7ZCN29WW
dmi.board.asset.tag: NO Asset Tag
dmi.board.name: LNVNB161216
dmi.board.vendor: LENOVO
dmi.board.version: SDK0J40709 WIN
dmi.chassis.asset.tag: NO Asset Tag
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: Lenovo ideapad 330-15ICH
dmi.ec.firmware.release: 1.29
dmi.modalias: dmi:bvnLENOVO:bvr7ZCN29WW:bd10/24/2018:br1.29:efr1.29:svnLENOVO:pn81FK:pvrLenovoideapad330-15ICH:rvnLENOVO:rnLNVNB161216:rvrSDK0J40709WIN:cvnLENOVO:ct10:cvrLenovoideapad330-15ICH:
dmi.product.family: ideapad 330-15ICH
dmi.product.name: 81FK
dmi.product.sku: LENOVO_MT_81FK_BU_idea_FM_ideapad 330-15ICH
dmi.product.version: Lenovo ideapad 330-15ICH
dmi.sys.vendor: LENOVO

Chris de CLAVERIE (declaverie) wrote :

Pictures of the error messages. I couldn't find a way to get the errors from any log, sorry!

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
description: updated
Ilari Nieminen (ilari-t-nieminen) wrote : Re: NVME "can't change power state from D3Cold to D0 (config space inaccessible)"

The upstream bug might be: https://bugzilla.kernel.org/show_bug.cgi?id=214035

The first NVMe drive is not activated, but a second NVMe drive, if one is present, works fine.

In other words, this works correctly in 5.11 but fails in 5.13.

Steven Clarkson (sclarkson) wrote :

I also encounter this bug in the impish 5.13 kernel. The machine boots, but without one of the NVMe devices. This is on a very recent Razer Blade 15 laptop (RZ09-0409x) with two NVMe drives, one with Windows and one with Ubuntu.

Upstream kernel 5.15 did not have this issue. I was able to bisect the fix back to

bc2836859643 ACPI: PM: Do not turn off power resources in unknown state

This patch is marked as fixing a commit that first landed in 5.14, but that commit was a rework; the actual broken logic was introduced earlier, in 5.13. As such, the fix requires a few additional cherry-picks.

On 5.13.0-22-generic

$ sudo dmesg | grep nvme
[sudo] password for ubuntu:
[ 1.266638] nvme nvme0: pci function 0000:02:00.0
[ 1.266646] nvme 0000:02:00.0: can't change power state from D3cold to D0 (config space inaccessible)
[ 1.266696] nvme nvme1: pci function 0000:03:00.0
[ 1.267189] nvme nvme0: Removing after probe failure status: -19
[ 1.273614] nvme nvme1: missing or invalid SUBNQN field.
[ 1.273633] nvme nvme1: Shutdown timeout set to 8 seconds
[ 1.290515] nvme nvme1: 16/0/0 default/read/poll queues
[ 1.293385] nvme1n1: p1 p2
[ 4.297893] EXT4-fs (nvme1n1p2): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
[ 4.584787] EXT4-fs (nvme1n1p2): re-mounted. Opts: errors=remount-ro. Quota mode: none.
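
The failing probe in a log like the one above can also be picked out programmatically. This is a minimal sketch, not part of the original report: the function name is hypothetical, and the pattern is written against the message text exactly as it appears in this log.

```python
import re

# Pattern for the PCI power-state message shown in the log above. Matching is
# case-insensitive because reports spell the state both "D3cold" and "D3Cold".
FAILED_PROBE = re.compile(
    r"nvme (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"can't change power state from D3cold to D0 \(config space inaccessible\)",
    re.IGNORECASE,
)


def find_failed_nvme_probes(log_text):
    """Return the PCI addresses of NVMe devices that logged the failed D3cold to D0 transition."""
    return [m.group("bdf") for m in FAILED_PROBE.finditer(log_text)]
```

Fed the dmesg output above, the function would report the device at 0000:02:00.0 and ignore nvme1, which probed normally.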

First nvme device does not show up in lsblk.
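
A device missing from lsblk can be cross-checked against the PCI bus. The sketch below is an illustration under assumptions: it relies on the standard sysfs layout (a "class" attribute per device and a "driver" symlink while a driver is bound), and it assumes the controller stays enumerated in sysfs after the failed probe. The function name and the sysfs_root parameter are hypothetical.

```python
import os

NVME_CLASS = "0x010802"  # PCI class code: mass storage, non-volatile memory, NVMe interface


def unbound_nvme_devices(sysfs_root="/sys/bus/pci/devices"):
    """Return sorted PCI addresses of NVMe controllers that have no driver bound."""
    unbound = []
    for bdf in sorted(os.listdir(sysfs_root)):
        dev_dir = os.path.join(sysfs_root, bdf)
        try:
            with open(os.path.join(dev_dir, "class")) as fh:
                pci_class = fh.read().strip()
        except OSError:
            # Entry without a readable class attribute; skip it.
            continue
        # The "driver" entry exists only while a driver is bound to the device.
        if pci_class == NVME_CLASS and not os.path.lexists(os.path.join(dev_dir, "driver")):
            unbound.append(bdf)
    return unbound
```

On a machine showing this bug, the controller whose probe failed would be expected to appear in the returned list, while the working one would not.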

After applying the following patch set to the master-next branch of the impish kernel:

587024b8210d ACPI: power: Use u8 as the power resource state data type
ca84f18798a4 ACPI: power: Save the last known state of each power resource
6381195ad7d0 ACPI: power: Rework turning off unused power resources
db9b6d87a8d4 ACPI: power: Use dev_dbg() to print some messages
fad40a624854 ACPI: power: Use acpi_handle_debug() to print debug messages
bc2836859643 ACPI: PM: Do not turn off power resources in unknown state

The nvme power state error is no longer present, and the device loads properly.

Also, the fixing patch was marked for inclusion in 5.14 stable, but never made it because it did not cherry-pick cleanly. Only the last two patches in the above list are needed for the 5.14 kernel.
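
To check which of the patches listed above a given tree still lacks, one option is to compare commit subjects rather than hashes, since backported commits get new hashes. The following is a hedged sketch: the function name is hypothetical, and it assumes its input is text in the "<hash> <subject>" form produced by `git log --oneline`.

```python
# Subjects of the patch set listed in the comment above, in the order given.
PATCH_SUBJECTS = [
    "ACPI: power: Use u8 as the power resource state data type",
    "ACPI: power: Save the last known state of each power resource",
    "ACPI: power: Rework turning off unused power resources",
    "ACPI: power: Use dev_dbg() to print some messages",
    "ACPI: power: Use acpi_handle_debug() to print debug messages",
    "ACPI: PM: Do not turn off power resources in unknown state",
]


def missing_patches(oneline_log, wanted=PATCH_SUBJECTS):
    """Return the wanted subjects that do not appear in `git log --oneline` style text."""
    present = set()
    for line in oneline_log.splitlines():
        # Each line is "<abbreviated hash> <subject>"; keep only the subject.
        _, _, subject = line.strip().partition(" ")
        present.add(subject)
    return [subject for subject in wanted if subject not in present]
```

For a 5.14 tree, passing `PATCH_SUBJECTS[-2:]` as `wanted` restricts the check to the last two patches mentioned above.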

tags: added: impish
Changed in linux-oem-5.14 (Ubuntu):
status: New → Confirmed
Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Changed in linux (Ubuntu Impish):
status: New → Confirmed
importance: Undecided → Medium
assignee: nobody → Heitor Alves de Siqueira (halves)
Changed in linux (Ubuntu Focal):
status: New → Invalid
Changed in linux-oem-5.14 (Ubuntu Impish):
status: New → Invalid
Changed in linux-oem-5.14 (Ubuntu Focal):
status: New → Confirmed
Changed in linux-oem-5.14 (Ubuntu):
importance: Undecided → Medium
Changed in linux-oem-5.14 (Ubuntu Focal):
importance: Undecided → Medium
Changed in linux-oem-5.14 (Ubuntu):
assignee: nobody → Heitor Alves de Siqueira (halves)
Changed in linux-oem-5.14 (Ubuntu Focal):
assignee: nobody → Heitor Alves de Siqueira (halves)
tags: added: sts
Changed in linux (Ubuntu Impish):
status: Confirmed → In Progress
Changed in linux-oem-5.14 (Ubuntu Focal):
status: Confirmed → In Progress
Changed in linux-oem-5.14 (Ubuntu):
status: Confirmed → In Progress
Heitor Alves de Siqueira (halves) wrote :

Hi all,

This seems to be related to a regression introduced by the following commit:
* 7e4fdeafa61f ACPI: power: Turn off unused power resources unconditionally

The fix seems to indeed be the commit @sclarkson mentioned:
* bc2836859643 ACPI: PM: Do not turn off power resources in unknown state

Looking at our kernels, it seems the affected version window is between v5.12 and v5.15. I've tagged the relevant packages in this bug, and have built a set of test kernels to validate the fixes. For the 5.13 kernels, we additionally require the commit below:
* 9b7ff25d129d ACPI: power: Refine turning off unused power resources

I'd greatly appreciate it if anyone affected by this could give these test kernels a try, as I don't have the appropriate hardware to test them myself. I've uploaded packages for Focal-HWE/Impish (5.13) and Focal-OEM (5.14) to a public PPA [0], but please consider these packages for testing purposes only. If you can reproduce this bug on a different kernel, please add a comment and I'll look into backporting the required patches there as well.

Cheers,
Heitor

[0] https://launchpad.net/~halves/+archive/ubuntu/test-1942624

Mario Limonciello (superm1) wrote :
summary: - NVME "can't change power state from D3Cold to D0 (config space
- inaccessible)"
+ NVMe devices fail to probe due to ACPI power state change
description: updated
Changed in linux (Ubuntu Impish):
status: In Progress → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/5.13.0-41.46 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-impish' to 'verification-done-impish'. If the problem still exists, change the tag 'verification-needed-impish' to 'verification-failed-impish'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-impish
Maciej Gołuchowski (valherupl) wrote :

It works ! ^_^ :D nice, thanks ! :D

Timo Aaltonen (tjaalton)
Changed in linux-oem-5.14 (Ubuntu Focal):
status: In Progress → Fix Committed
Changed in linux-oem-5.14 (Ubuntu):
status: In Progress → Invalid
AceLan Kao (acelankao)
tags: added: oem-priority originate-from-1969446 pygmy-possum
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-oem-5.14/5.14.0-1035.38 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-focal
Kleber Sacilotto de Souza (kleber-souza) wrote :

Tagging as verification-done-impish based on comment #12. Thank you!

tags: added: verification-done-impish
removed: verification-needed-impish
AceLan Kao (acelankao)
tags: added: verification-done-focal
removed: verification-needed-focal
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.13.0-41.46

---------------
linux (5.13.0-41.46) impish; urgency=medium

  * impish/linux: 5.13.0-41.46 -proposed tracker (LP: #1969014)

  * NVMe devices fail to probe due to ACPI power state change (LP: #1942624)
    - ACPI: power: Rework turning off unused power resources
    - ACPI: PM: Do not turn off power resources in unknown state

  * Recent 5.13 kernel has broken KVM support (LP: #1966499)
    - KVM: Add infrastructure and macro to mark VM as bugged
    - KVM: x86: Use KVM_BUG/KVM_BUG_ON to handle bugs that are fatal to the VM
    - KVM: VMX: prepare sync_pir_to_irr for running with APICv disabled

  * LRMv6: add multi-architecture support (LP: #1968774)
    - [Packaging] resync dkms-build{,--nvidia-N}

  * io_uring regression - lost write request (LP: #1952222)
    - io-wq: split bounded and unbounded work into separate lists

  * xfrm interface cannot be changed anymore (LP: #1968591)
    - xfrm: fix the if_id check in changelink

  * Use kernel-testing repo from launchpad for ADT tests (LP: #1968016)
    - [Debian] Use kernel-testing repo from launchpad

  * vmx_ldtr_test in ubuntu_kvm_unit_tests failed (FAIL: Expected 0 for L1 LDTR
    selector (got 50)) (LP: #1956315)
    - KVM: nVMX: Set LDTR to its architecturally defined value on nested VM-Exit

  * audio from external sound card is distorted (LP: #1966066)
    - ALSA: usb-audio: Fix packet size calculation regression

  * Impish update: upstream stable patchset 2022-04-12 (LP: #1968771)
    - cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug
    - btrfs: tree-checker: check item_size for inode_item
    - btrfs: tree-checker: check item_size for dev_item
    - clk: jz4725b: fix mmc0 clock gating
    - vhost/vsock: don't check owner in vhost_vsock_stop() while releasing
    - parisc/unaligned: Fix fldd and fstd unaligned handlers on 32-bit kernel
    - parisc/unaligned: Fix ldw() and stw() unalignment handlers
    - KVM: x86/mmu: make apf token non-zero to fix bug
    - drm/amdgpu: disable MMHUB PG for Picasso
    - drm/i915: Correctly populate use_sagv_wm for all pipes
    - sr9700: sanity check for packet length
    - USB: zaurus: support another broken Zaurus
    - CDC-NCM: avoid overflow in sanity checking
    - x86/fpu: Correct pkru/xstate inconsistency
    - tee: export teedev_open() and teedev_close_context()
    - optee: use driver internal tee_context for some rpc
    - ping: remove pr_err from ping_lookup
    - perf data: Fix double free in perf_session__delete()
    - bnx2x: fix driver load from initrd
    - bnxt_en: Fix active FEC reporting to ethtool
    - hwmon: Handle failure to register sensor with thermal zone correctly
    - bpf: Do not try bpf_msg_push_data with len 0
    - selftests: bpf: Check bpf_msg_push_data return value
    - bpf: Add schedule points in batch ops
    - io_uring: add a schedule point in io_add_buffers()
    - net: __pskb_pull_tail() & pskb_carve_frag_list() drop_monitor friends
    - tipc: Fix end of loop tests for list_for_each_entry()
    - gso: do not skip outer ip header in case of ipip and net_failover
    - openvswitch: Fix setting ipv6 fields causing hw csum failure
   ...

Changed in linux (Ubuntu Impish):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-oem-5.14 - 5.14.0-1036.40

---------------
linux-oem-5.14 (5.14.0-1036.40) focal; urgency=medium

  * focal/linux-oem-5.14: 5.14.0-1036.40 -proposed tracker (LP: #1971982)

  * AMD APU s2idle is broken after the ASIC reset fix (LP: #1972134)
    - drm/amdgpu: unify BO evicting method in amdgpu_ttm
    - drm/amdgpu: explicitly check for s0ix when evicting resources

  * amd_gpio AMDI0030:00: Failed to translate GPIO pin 0x0000 to IRQ, err -517
    (LP: #1971597)
    - gpio: Request interrupts after IRQ is initialized

linux-oem-5.14 (5.14.0-1035.38) focal; urgency=medium

  * focal/linux-oem-5.14: 5.14.0-1035.38 -proposed tracker (LP: #1969056)

  * Packaging resync (LP: #1786013)
    - [Packaging] resync getabis

  * Mute/mic LEDs no function on EliteBook G9 platfroms (LP: #1970552)
    - ALSA: hda/realtek: Enable mute/micmute LEDs support for HP Laptops

  * Mute/mic LEDs no function on HP EliteBook 845/865 G9 (LP: #1970178)
    - ALSA: hda/realtek: Enable mute/micmute LEDs and limit mic boost on EliteBook
      845/865 G9

  * Focal update: upstream stable patchset 2022-04-22 (LP: #1969892)
    - Revert "swiotlb: rework "fix info leak with DMA_FROM_DEVICE""
    - USB: serial: pl2303: add IBM device IDs
    - dt-bindings: usb: hcd: correct usb-device path
    - USB: serial: pl2303: fix GS type detection
    - USB: serial: simple: add Nokia phone driver
    - mm: kfence: fix missing objcg housekeeping for SLAB
    - HID: logitech-dj: add new lightspeed receiver id
    - HID: Add support for open wheel and no attachment to T300
    - xfrm: fix tunnel model fragmentation behavior
    - ARM: mstar: Select HAVE_ARM_ARCH_TIMER
    - virtio_console: break out of buf poll on remove
    - vdpa/mlx5: should verify CTRL_VQ feature exists for MQ
    - tools/virtio: fix virtio_test execution
    - ethernet: sun: Free the coherent when failing in probing
    - gpio: Revert regression in sysfs-gpio (gpiolib.c)
    - spi: Fix invalid sgs value
    - net:mcf8390: Use platform_get_irq() to get the interrupt
    - Revert "gpio: Revert regression in sysfs-gpio (gpiolib.c)"
    - spi: Fix erroneous sgs value with min_t()
    - Input: zinitix - do not report shadow fingers
    - af_key: add __GFP_ZERO flag for compose_sadb_supported in function
      pfkey_register
    - net: dsa: microchip: add spi_device_id tables
    - selftests: vm: fix clang build error multiple output files
    - locking/lockdep: Avoid potential access of invalid memory in lock_class
    - drm/amdgpu: move PX checking into amdgpu_device_ip_early_init
    - drm/amdgpu: only check for _PR3 on dGPUs
    - iommu/iova: Improve 32-bit free space estimate
    - tpm: fix reference counting for struct tpm_chip
    - usb: typec: tipd: Forward plug orientation to typec subsystem
    - USB: usb-storage: Fix use of bitfields for hardware data in ene_ub6250.c
    - xhci: fix garbage USBSTS being logged in some cases
    - xhci: fix runtime PM imbalance in USB2 resume
    - xhci: make xhci_handshake timeout for xhci_reset() adjustable
    - xhci: fix uninitialized string returned by xhci_decode_ctrl_ctx()
    - mei: me: disable driver on the ign firmware
    - mei: ...

Changed in linux-oem-5.14 (Ubuntu Focal):
status: Fix Committed → Fix Released
Timo Aaltonen (tjaalton)
Changed in hwe-next:
status: New → Fix Released