Bionic/linux-azure: Call trace on Ubuntu 18.04 VM with Standard NV24

Bug #1952621 reported by Tim Gardner
This bug affects 1 person
Affects                    Status         Importance  Assigned to  Milestone
linux-azure (Ubuntu)       Invalid        Medium      Unassigned
  Bionic                   Invalid        Undecided   Unassigned
  Focal                    Fix Released   Undecided   Tim Gardner
linux-azure-5.4 (Ubuntu)   New            Undecided   Unassigned
  Bionic                   Fix Released   Medium      Tim Gardner
  Focal                    Invalid        Undecided   Unassigned

Bug Description

SRU Justification

[Impact]
During large-scale deployment testing, we observed the call trace below while provisioning an Ubuntu 18.04 VM of size Standard_NV24. An engineer deployed the instance 10 times and hit the issue once.

It looks like a race condition during device probing, but all devices are eventually probed successfully.

[ 4.938162] sysfs: cannot create duplicate filename '/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0003-0000-3130-444531334632/pci0003:00/0003:00:00.0/config'
[ 4.944816] sr 5:0:0:0: [sr0] scsi3-mmc drive: 0x/0x tray
[ 4.951818] CPU: 0 PID: 135 Comm: kworker/0:2 Not tainted 5.4.0-1061-azure #64~18.04.1-Ubuntu
[ 4.951820] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 06/02/2017
[ 4.958943] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 4.955812] Workqueue: hv_pri_chan vmbus_add_channel_work
[ 4.955812] Call Trace:
[ 4.955812] dump_stack+0x57/0x6d
[ 4.955812] sysfs_warn_dup+0x5b/0x70
[ 4.955812] sysfs_add_file_mode_ns+0x158/0x180
[ 4.955812] sysfs_create_bin_file+0x64/0x90
[ 4.955812] pci_create_sysfs_dev_files+0x72/0x270
[ 4.955812] pci_bus_add_device+0x30/0x80
[ 4.955812] pci_bus_add_devices+0x31/0x70
[ 4.955812] hv_pci_probe+0x48c/0x650
[ 4.955812] vmbus_probe+0x3e/0x90
[ 4.955812] really_probe+0xf5/0x440
[ 4.955812] driver_probe_device+0x11b/0x130
[ 4.955812] __device_attach_driver+0x7b/0xe0
[ 4.955812] ? driver_allows_async_probing+0x60/0x60
[ 4.955812] bus_for_each_drv+0x6e/0xb0
[ 4.955812] __device_attach+0xe4/0x160
[ 4.955812] device_initial_probe+0x13/0x20
[ 4.955812] bus_probe_device+0x92/0xa0
[ 4.955812] device_add+0x402/0x690
[ 4.955812] device_register+0x1a/0x20
[ 4.955812] vmbus_device_register+0x5e/0xf0
[ 4.955812] vmbus_add_channel_work+0x2c4/0x640
[ 4.955812] process_one_work+0x209/0x400
[ 4.955812] worker_thread+0x34/0x400
[ 4.955812] kthread+0x121/0x140
[ 4.955812] ? process_one_work+0x400/0x400
[ 4.955812] ? kthread_park+0x90/0x90
[ 4.955812] ret_from_fork+0x35/0x40
[ 5.043612] hv_pci 47505500-0004-0001-3130-444531334632: PCI VMBus probing: Using version 0x10002
[ 5.260563] hv_pci 47505500-0004-0001-3130-444531334632: PCI host bridge to bus 0004:00
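
When triaging many deployment logs, it can help to pick out this warning automatically. The following is a minimal sketch, assuming the kernel log is available as text (for example, captured dmesg output); the function name find_sysfs_duplicates and the regular expression are illustrative and not part of any existing tool.

```python
import re

# Matches the "duplicate filename" warning seen in the log above.
# The "[ seconds.micros]" timestamp prefix is treated as optional.
DUP_RE = re.compile(
    r"^(?:\[\s*(?P<ts>\d+\.\d+)\]\s*)?"
    r"sysfs: cannot create duplicate filename '(?P<path>[^']+)'"
)


def find_sysfs_duplicates(log_text):
    """Return a list of (timestamp, path) tuples, one per warning found."""
    hits = []
    for line in log_text.splitlines():
        match = DUP_RE.match(line.strip())
        if match:
            ts = float(match.group("ts")) if match.group("ts") else None
            hits.append((ts, match.group("path")))
    return hits


# Two lines taken from the trace in this report.
sample = (
    "[ 4.938162] sysfs: cannot create duplicate filename "
    "'/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/"
    "47505500-0003-0000-3130-444531334632/pci0003:00/0003:00:00.0/config'\n"
    "[ 4.944816] sr 5:0:0:0: [sr0] scsi3-mmc drive: 0x/0x tray\n"
)

hits = find_sysfs_duplicates(sample)
for ts, path in hits:
    print(ts, path.rsplit("/", 1)[-1])
```

Only the first sample line matches; the unrelated sr0 line is ignored.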

Dexuan did some research, and this looks like a longstanding race condition in the generic PCI subsystem: depending on timing, more than one place in the PCI code can try to create the same ‘config’ sysfs file:
https://patchwork.kernel.org/project/linux-pci/patch/20200716110423.xtfyb3n6tn5ixedh@pali/#23669641
The bug was reported on 7/16/2020 and the last reply was on 6/25/2021, so it appears to have remained unfixed for more than a year.
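
The failure mode can be illustrated with a toy model. This is not kernel code: ToySysfsDir is a hypothetical stand-in for a sysfs directory, and the "after" case assumes the fix behaves as the patch title in the changelog below suggests ("PCI/sysfs: Convert "config" to static attribute"), i.e. the file exists from the moment the device directory is registered, so no code path has to create it explicitly.

```python
class ToySysfsDir:
    """Toy stand-in for a sysfs directory: file names must be unique."""

    def __init__(self, static_attrs=()):
        # Static attributes are registered once, together with the directory.
        self.files = set(static_attrs)
        self.warnings = []

    def create_file(self, name):
        """Dynamically create a file; record a warning if it already exists."""
        if name in self.files:
            self.warnings.append(f"cannot create duplicate filename '{name}'")
            return False
        self.files.add(name)
        return True


# Before: two independent code paths both try to create "config".
# Whichever runs second triggers the duplicate warning.
before = ToySysfsDir()
results_before = [before.create_file("config"), before.create_file("config")]

# After (assumed): "config" is a static attribute, so neither path
# needs to call create_file() for it and no warning can be produced.
after = ToySysfsDir(static_attrs=["config"])

print(results_before, before.warnings)
print(sorted(after.files), after.warnings)
```

In the model the second creation fails regardless of ordering, which matches the observation that the trace is only a warning and the device is still usable afterwards.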

[Test Case]

Repeated deployment on a Standard_NV24 instance. MS reported the reproduction rate is 3/551 before the patch, and 0/838 with the patch.
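
As a rough sanity check on those numbers, the sketch below estimates how likely 838 clean runs would be if the patch had no effect. It assumes each deployment is an independent trial with a fixed failure probability equal to the pre-patch rate, which is a simplification and not something stated in the report.

```python
# Figures reported by MS in the test case above.
fail_before, runs_before = 3, 551
fail_after, runs_after = 0, 838

# Observed failure rate before the patch (about 0.5%).
rate_before = fail_before / runs_before

# Probability of seeing zero failures in 838 runs if the true
# rate were still the pre-patch rate (independent trials assumed).
p_zero_if_unfixed = (1 - rate_before) ** runs_after

print(f"pre-patch failure rate: {rate_before:.4%}")
print(f"P(0 failures in {runs_after} runs | unfixed): {p_zero_if_unfixed:.4f}")
```

Under these assumptions the chance of 838 consecutive clean deployments without a fix is roughly 1%, so the result is reasonably strong evidence that the patch helps, although the pre-patch rate itself rests on only three observed failures.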

[Where things could go wrong]

Deployments could fail for other reasons.

[Other info]

SF: #00321027


Tim Gardner (timg-tpi)
affects: linux (Ubuntu) → linux-azure (Ubuntu)
Changed in linux-azure (Ubuntu):
assignee: nobody → Tim Gardner (timg-tpi)
importance: Undecided → Medium
Tim Gardner (timg-tpi)
Changed in linux-azure-5.4 (Ubuntu Focal):
status: New → Invalid
Changed in linux-azure-5.4 (Ubuntu Bionic):
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → Tim Gardner (timg-tpi)
Changed in linux-azure (Ubuntu Bionic):
status: New → Invalid
Changed in linux-azure (Ubuntu):
assignee: Tim Gardner (timg-tpi) → nobody
status: New → Invalid
Changed in linux-azure (Ubuntu Focal):
status: New → In Progress
assignee: nobody → Tim Gardner (timg-tpi)
Tim Gardner (timg-tpi)
Changed in linux-azure (Ubuntu Focal):
status: In Progress → Fix Committed
Changed in linux-azure-5.4 (Ubuntu Bionic):
status: In Progress → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure/5.4.0-1065.68 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-focal
Tim Gardner (timg-tpi) wrote :

Verified by Microsoft.

tags: added: verification-done-focal
removed: verification-needed-focal
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure-5.4 - 5.4.0-1065.68~18.04.1

---------------
linux-azure-5.4 (5.4.0-1065.68~18.04.1) bionic; urgency=medium

  * bionic/linux-azure-5.4: 5.4.0-1065.68~18.04.1 -proposed tracker
    (LP: #1952289)

  [ Ubuntu: 5.4.0-1065.68 ]

  * focal/linux-azure: 5.4.0-1065.68 -proposed tracker (LP: #1952290)
  * Re-enable DEBUG_INFO_BTF where it was disabled (LP: #1945632)
    - [Config] azure: enable CONFIG_DEBUG_INFO_BTF
  * Support builtin revoked certificates (LP: #1932029)
    - [Config] azure: set CONFIG_SYSTEM_REVOCATION_KEYS
  * Bionic/linux-azure: Call trace on Ubuntu 18.04 VM with Standard NV24
    (LP: #1952621)
    - PCI/sysfs: Convert "config" to static attribute
  * linux-azure: add Icelake servers support in no-HWP mode to
    cpufreq/intel_pstate driver (LP: #1952234)
    - cpufreq: intel_pstate: Add Icelake servers support in no-HWP mode
  * focal/linux: 5.4.0-92.103 -proposed tracker (LP: #1952316)
  * Packaging resync (LP: #1786013)
    - [Packaging] resync update-dkms-versions helper
    - debian/dkms-versions -- update from kernel-versions (main/2021.11.29)
  * CVE-2021-4002
    - tlb: mmu_gather: add tlb_flush_*_range APIs
    - hugetlbfs: flush TLBs correctly after huge_pmd_unshare
  * Re-enable DEBUG_INFO_BTF where it was disabled (LP: #1945632)
    - [Config] Enable CONFIG_DEBUG_INFO_BTF on all arches
  * Focal linux-azure: Vm crash on Dv5/Ev5 (LP: #1950462)
    - KVM: VMX: eVMCS: make evmcs_sanitize_exec_ctrls() work again
    - jump_label: Fix usage in module __init
  * Support builtin revoked certificates (LP: #1932029)
    - Revert "UBUNTU: SAUCE: (lockdown) Make get_cert_list() not complain about
      cert lists that aren't present."
    - integrity: Move import of MokListRT certs to a separate routine
    - integrity: Load certs from the EFI MOK config table
    - certs: Add ability to preload revocation certs
    - integrity: Load mokx variables into the blacklist keyring
    - certs: add 'x509_revocation_list' to gitignore
    - SAUCE: Dump stack when X.509 certificates cannot be loaded
    - [Packaging] build canonical-revoked-certs.pem from branch/arch certs
    - [Packaging] Revoke 2012 UEFI signing certificate as built-in
    - [Config] Configure CONFIG_SYSTEM_REVOCATION_KEYS with revoked keys
  * Support importing mokx keys into revocation list from the mok table
    (LP: #1928679)
    - efi: Support for MOK variable config table
    - efi: mokvar-table: fix some issues in new code
    - efi: mokvar: add missing include of asm/early_ioremap.h
    - efi/mokvar: Reserve the table only if it is in boot services data
    - SAUCE: integrity: add informational messages when revoking certs
  * Support importing mokx keys into revocation list from the mok table
    (LP: #1928679) // CVE-2020-26541 when certificates are revoked via
    MokListXRT.
    - SAUCE: integrity: Load mokx certs from the EFI MOK config table
  * Focal update: v5.4.157 upstream stable release (LP: #1951883)
    - ARM: 9133/1: mm: proc-macros: ensure *_tlb_fns are 4B aligned
    - ARM: 9134/1: remove duplicate memcpy() definition
    - ARM: 9139/1: kprobes: fix arch_init_kprobes() prototype
    - ARM: 9141/1:...

Changed in linux-azure-5.4 (Ubuntu Bionic):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 5.4.0-1065.68

---------------
linux-azure (5.4.0-1065.68) focal; urgency=medium

  * focal/linux-azure: 5.4.0-1065.68 -proposed tracker (LP: #1952290)

  * Re-enable DEBUG_INFO_BTF where it was disabled (LP: #1945632)
    - [Config] azure: enable CONFIG_DEBUG_INFO_BTF

  * Support builtin revoked certificates (LP: #1932029)
    - [Config] azure: set CONFIG_SYSTEM_REVOCATION_KEYS

  * Bionic/linux-azure: Call trace on Ubuntu 18.04 VM with Standard NV24
    (LP: #1952621)
    - PCI/sysfs: Convert "config" to static attribute

  * linux-azure: add Icelake servers support in no-HWP mode to
    cpufreq/intel_pstate driver (LP: #1952234)
    - cpufreq: intel_pstate: Add Icelake servers support in no-HWP mode

  [ Ubuntu: 5.4.0-92.103 ]

  * focal/linux: 5.4.0-92.103 -proposed tracker (LP: #1952316)
  * Packaging resync (LP: #1786013)
    - [Packaging] resync update-dkms-versions helper
    - debian/dkms-versions -- update from kernel-versions (main/2021.11.29)
  * CVE-2021-4002
    - tlb: mmu_gather: add tlb_flush_*_range APIs
    - hugetlbfs: flush TLBs correctly after huge_pmd_unshare
  * Re-enable DEBUG_INFO_BTF where it was disabled (LP: #1945632)
    - [Config] Enable CONFIG_DEBUG_INFO_BTF on all arches
  * Focal linux-azure: Vm crash on Dv5/Ev5 (LP: #1950462)
    - KVM: VMX: eVMCS: make evmcs_sanitize_exec_ctrls() work again
    - jump_label: Fix usage in module __init
  * Support builtin revoked certificates (LP: #1932029)
    - Revert "UBUNTU: SAUCE: (lockdown) Make get_cert_list() not complain about
      cert lists that aren't present."
    - integrity: Move import of MokListRT certs to a separate routine
    - integrity: Load certs from the EFI MOK config table
    - certs: Add ability to preload revocation certs
    - integrity: Load mokx variables into the blacklist keyring
    - certs: add 'x509_revocation_list' to gitignore
    - SAUCE: Dump stack when X.509 certificates cannot be loaded
    - [Packaging] build canonical-revoked-certs.pem from branch/arch certs
    - [Packaging] Revoke 2012 UEFI signing certificate as built-in
    - [Config] Configure CONFIG_SYSTEM_REVOCATION_KEYS with revoked keys
  * Support importing mokx keys into revocation list from the mok table
    (LP: #1928679)
    - efi: Support for MOK variable config table
    - efi: mokvar-table: fix some issues in new code
    - efi: mokvar: add missing include of asm/early_ioremap.h
    - efi/mokvar: Reserve the table only if it is in boot services data
    - SAUCE: integrity: add informational messages when revoking certs
  * Support importing mokx keys into revocation list from the mok table
    (LP: #1928679) // CVE-2020-26541 when certificates are revoked via
    MokListXRT.
    - SAUCE: integrity: Load mokx certs from the EFI MOK config table
  * Focal update: v5.4.157 upstream stable release (LP: #1951883)
    - ARM: 9133/1: mm: proc-macros: ensure *_tlb_fns are 4B aligned
    - ARM: 9134/1: remove duplicate memcpy() definition
    - ARM: 9139/1: kprobes: fix arch_init_kprobes() prototype
    - ARM: 9141/1: only warn about XIP address when not compile testing
    - ipv6: use siphash in rt6_exception_hash()
    - i...

Changed in linux-azure (Ubuntu Focal):
status: Fix Committed → Fix Released