[linux-azure] Two Fixes For kdump Over Network

Bug #1883261 reported by Joseph Salisbury
This bug affects 1 person
Affects               Status         Importance  Assigned to  Milestone
linux-azure (Ubuntu)  Fix Released   Undecided   Unassigned
  Bionic              Fix Committed  Undecided   Unassigned
  Focal               Fix Released   Undecided   Unassigned

Bug Description

[Impact]

Microsoft would like to request two kdump-related fixes in all releases supported on Azure. The two commits are:

c81992e7f4aa1 ("PCI: hv: Retry PCI bus D0 entry on invalid device state")
83cc3508ffaa6 ("PCI: hv: Fix the PCI HyperV probe failure path to release resource properly")

Both commits are in the virtual PCI driver for Hyper-V. The customer-visible symptom is that the network is not functional in the kdump kernel, so the dump file must be stored on the local disk and cannot be written over the network.

The problem only occurs when Accelerated Networking is enabled. It’s a relatively obscure scenario, which is why the problem has not surfaced before now. But we have an important customer who wants the “dump-file-over-the-network” functionality to work.

[Test Case]

- Apply requested patches and boot into updated kernel
- Verify Accelerated Networking is enabled
- Set up kdump
- Configure kdump to use SSH
- Test the crash dump mechanism and verify the kernel crash dump appears on the selected remote server

Further details for setting up kdump through testing can be found here:
https://ubuntu.com/server/docs/kernel-crash-dump
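Before triggering a test crash, it can help to sanity-check the kdump-tools settings. The following is a minimal sketch, assuming the settings live in a shell-style defaults file such as /etc/default/kdump-tools and that the relevant variables are USE_KDUMP and SSH (these names reflect my understanding of the kdump-tools packaging and should be checked against the guide above). The helper names and the sample host are hypothetical.

```python
import shlex


def parse_shell_defaults(text):
    """Parse simple KEY=VALUE lines from a shell-style defaults file."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # shlex strips surrounding quotes and trailing comments like a shell.
        parts = shlex.split(value, comments=True)
        settings[key.strip()] = parts[0] if parts else ""
    return settings


def remote_dump_problems(settings):
    """Return a list of reasons why a dump over SSH is unlikely to work."""
    problems = []
    # Assumption: USE_KDUMP=1 enables kdump in kdump-tools.
    if settings.get("USE_KDUMP") != "1":
        problems.append("USE_KDUMP is not 1")
    # Assumption: SSH holds the remote target as user@host.
    ssh_target = settings.get("SSH", "")
    if not ssh_target:
        problems.append("SSH is not set")
    elif "@" not in ssh_target:
        problems.append("SSH should look like user@host")
    return problems


# Hypothetical sample content, not taken from a real system.
SAMPLE = '''
USE_KDUMP=1
# SSH="ubuntu@old-host"
SSH="ubuntu@kdump-netcrash"
'''

if __name__ == "__main__":
    for problem in remote_dump_problems(parse_shell_defaults(SAMPLE)):
        print("problem:", problem)
```

On a real system the text would be read from the defaults file instead of SAMPLE. An empty problem list only means the settings look plausible; it does not prove the kdump kernel can reach the network.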

[Regression Potential]

Patches are only targeted to azure kernels.

The patches are designed to release allocated resources left behind after
error cases in hv_pci_probe() or after PCI devices were not shut down
properly. If those resources are still not released correctly,
entering the D0 state in the kdump kernel could continue to fail.
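
The control flow described here can be illustrated with a toy model. This is a sketch based on the commit titles and the description above, not the driver code: FakeBus, InvalidDeviceState, and probe() are hypothetical names, and the real logic lives in C in the Hyper-V virtual PCI driver.

```python
class InvalidDeviceState(Exception):
    """Stand-in for the host rejecting D0 entry (hypothetical)."""


class FakeBus:
    """Toy model of a virtual PCI bus whose host side may hold stale state."""

    def __init__(self, stale, recoverable=True):
        self.stale = stale              # previous kernel never shut the bus down
        self.recoverable = recoverable  # whether a bus exit clears the stale state
        self.allocated = []             # resources held by the current probe
        self.log = []                   # order of operations, for inspection

    def query_relations(self):
        self.log.append("query_relations")
        self.allocated.append("child-devices")

    def enter_d0(self):
        self.log.append("enter_d0")
        if self.stale:
            raise InvalidDeviceState()

    def bus_exit(self):
        # Ask the host to release everything, including our own bookkeeping.
        self.log.append("bus_exit")
        self.allocated.clear()
        if self.recoverable:
            self.stale = False


def probe(bus):
    """Try to bring the bus to D0, retrying once after a forced bus exit."""
    retried = False
    while True:
        bus.query_relations()
        try:
            bus.enter_d0()
            return True
        except InvalidDeviceState:
            if retried:
                # Failure path: release what this probe allocated.
                bus.allocated.clear()
                return False
            retried = True
            bus.bus_exit()


if __name__ == "__main__":
    bus = FakeBus(stale=True)
    print(probe(bus), bus.log)
```

The model retries only once so that a host that keeps rejecting D0 entry cannot trap the probe in a loop, and the failing path leaves nothing allocated.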

Possible regressions are errors when freeing resources, or D0 entry still failing in the kdump kernel even after all resources have been released.

Build & boot tested. Verified kdump works as intended over SSH after patches are applied.

Both 5.4 and 4.15 test kernels were sent to Microsoft. Both kernels were signed off on and verified to resolve the problem.

Changed in linux-azure (Ubuntu):
status: New → Confirmed
Joseph Salisbury (jsalisbury) wrote :

The following link holds test kernels for 5.4, 5.3, and 4.15:

https://kernel.ubuntu.com/~kms/azure/lp1883261/

The patches applied cleanly to 5.4; 5.3 and 4.15 required some changes. Please test to verify that the added patches resolve the issue.

Joseph Salisbury (jsalisbury) wrote :

I've been testing with the 4.15 kernel, following this guide to set up kdump over SSH:
https://ubuntu.com/server/docs/kernel-crash-dump
I first confirmed that I could kdump to local disk. Next, I configured kdump to use SSH as described in the guide.

However, every time I cause a crash, the kexec kernel hangs and appears unable to reach the network to write the crash file. Here is the error I see:
[ 387.778745] kdump-tools[735]: Starting kdump-tools:
[ 387.790249] kdump-tools[763]: Connection closed by 13.77.154.182 port 22
[ 387.794756] kdump-tools[744]: * Network not reachable; will try 15 more times

I'm not sure whether this is due to a configuration error on my part or to the test kernel. Could you provide any information to confirm that my configuration is correct? I'll attach my /etc/default/kexec file.

Joseph Salisbury (jsalisbury) wrote :

An investigation is currently underway for the issue with the 4.15 kernel.

description: updated
description: updated
Ian May (ian-may)
Changed in linux-azure (Ubuntu Bionic):
status: New → Fix Committed
Ian May (ian-may)
Changed in linux-azure (Ubuntu Focal):
status: New → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 5.4.0-1032.33

---------------
linux-azure (5.4.0-1032.33) focal; urgency=medium

  * focal/linux-azure: 5.4.0-1032.33 -proposed tracker (LP: #1903162)

  * Focal update: v5.4.66 upstream stable release (LP: #1896824)
    - [Config] azure: updateconfigs for VGACON_SOFT_SCROLLBACK

  * [linux-azure][hibernation] Mellanox CX4 NIC's TX/RX packets stop increasing
    after hibernation/resume (LP: #1894896)
    - hv_netvsc: Fix hibernation for mlx5 VF driver

  * [linux-azure][hibernation] GPU device no longer working after resume from
    hibernation in NV6 VM size (LP: #1894893)
    - PCI: hv: Fix hibernation in case interrupts are not re-created

  * linux-azure: build and include the tcm_loop module to the main kernel
    package (LP: #1791794)
    - [Config] linux-azure: CONFIG_LOOPBACK_TARGET=m (tcm_loop)

  * [linux-azure] Two Fixes For kdump Over Network (LP: #1883261)
    - PCI: hv: Fix the PCI HyperV probe failure path to release resource properly
    - PCI: hv: Retry PCI bus D0 entry on invalid device state

  [ Ubuntu: 5.4.0-55.61 ]

  * focal/linux: 5.4.0-55.61 -proposed tracker (LP: #1903175)
  * Update kernel packaging to support forward porting kernels (LP: #1902957)
    - [Debian] Update for leader included in BACKPORT_SUFFIX
  * Avoid double newline when running insertchanges (LP: #1903293)
    - [Packaging] insertchanges: avoid double newline
  * EFI: Fails when BootCurrent entry does not exist (LP: #1899993)
    - efivarfs: Replace invalid slashes with exclamation marks in dentries.
  * CVE-2020-14351
    - perf/core: Fix race in the perf_mmap_close() function
  * raid10: Block discard is very slow, causing severe delays for mkfs and
    fstrim operations (LP: #1896578)
    - md: add md_submit_discard_bio() for submitting discard bio
    - md/raid10: extend r10bio devs to raid disks
    - md/raid10: pull codes that wait for blocked dev into one function
    - md/raid10: improve raid10 discard request
    - md/raid10: improve discard request for far layout
    - dm raid: fix discard limits for raid1 and raid10
    - dm raid: remove unnecessary discard limits for raid10
  * Bionic: btrfs: kernel BUG at /build/linux-
    eTBZpZ/linux-4.15.0/fs/btrfs/ctree.c:3233! (LP: #1902254)
    - btrfs: drop unnecessary offset_in_page in extent buffer helpers
    - btrfs: extent_io: do extra check for extent buffer read write functions
    - btrfs: extent-tree: kill BUG_ON() in __btrfs_free_extent()
    - btrfs: extent-tree: kill the BUG_ON() in insert_inline_extent_backref()
    - btrfs: ctree: check key order before merging tree blocks
  * Ethernet no link lights after reboot (Intel i225-v 2.5G) (LP: #1902578)
    - igc: Add PHY power management control
  * Undetected Data corruption in MPI workloads that use VSX for reductions on
    POWER9 DD2.1 systems (LP: #1902694)
    - powerpc: Fix undetected data corruption with P9N DD2.1 VSX CI load emulation
    - selftests/powerpc: Make alignment handler test P9N DD2.1 vector CI load
      workaround
  * [20.04 FEAT] Support/enhancement of NVMe IPL (LP: #1902179)
    - s390: nvme ipl
    - s390: nvme reipl
    - s390/ipl: support NVMe IPL kernel para...

Changed in linux-azure (Ubuntu Focal):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 4.15.0-1100.111~16.04.1

---------------
linux-azure (4.15.0-1100.111~16.04.1) xenial; urgency=medium

  * xenial/linux-azure: 4.15.0-1100.111~16.04.1 -proposed tracker (LP: #1903121)

  * Packaging resync (LP: #1786013)
    - [Packaging] update update.conf

  [ Ubuntu: 4.15.0-1100.111 ]

  * bionic/linux-azure-4.15: 4.15.0-1100.111 -proposed tracker (LP: #1903123)
  * CVE-2020-12351 // CVE-2020-12352 // CVE-2020-24490
    - [Config] azure-4.15: Disable BlueZ highspeed support
  * Bionic update: upstream stable patchset 2020-09-30 (LP: #1897977)
    - [Config] azure-4.15: updateconfigs for VGACON_SOFT_SCROLLBACK
  * [linux-azure] Request for two CIFS commits in 16.04 (LP: #1882268)
    - CIFS: Only send SMB2_NEGOTIATE command on new TCP connections
    - cifs: Fix potential softlockups while refreshing DFS cache
  * linux-azure: build and include the tcm_loop module to the main kernel
    package (LP: #1791794)
    - [Config] linux-azure: Ensure CONFIG_LOOPBACK_TARGET=m (tcm_loop)
  * [linux-azure] Two Fixes For kdump Over Network (LP: #1883261)
    - PCI: hv: Reorganize the code in preparation of hibernation
    - PCI: hv: Fix the PCI HyperV probe failure path to release resource properly
    - PCI: hv: Retry PCI bus D0 entry on invalid device state
  * bionic/linux: 4.15.0-125.128 -proposed tracker (LP: #1903137)
  * Update kernel packaging to support forward porting kernels (LP: #1902957)
    - [Debian] Update for leader included in BACKPORT_SUFFIX
  * Avoid double newline when running insertchanges (LP: #1903293)
    - [Packaging] insertchanges: avoid double newline
  * EFI: Fails when BootCurrent entry does not exist (LP: #1899993)
    - efivarfs: Replace invalid slashes with exclamation marks in dentries.
  * CVE-2020-14351
    - perf/core: Fix race in the perf_mmap_close() function
  * raid10: Block discard is very slow, causing severe delays for mkfs and
    fstrim operations (LP: #1896578)
    - md: add md_submit_discard_bio() for submitting discard bio
    - md/raid10: extend r10bio devs to raid disks
    - md/raid10: pull codes that wait for blocked dev into one function
    - md/raid10: improve raid10 discard request
    - md/raid10: improve discard request for far layout
  * Bionic: btrfs: kernel BUG at /build/linux-
    eTBZpZ/linux-4.15.0/fs/btrfs/ctree.c:3233! (LP: #1902254)
    - btrfs: use offset_in_page instead of open-coding it
    - btrfs: use BUG() instead of BUG_ON(1)
    - btrfs: drop unnecessary offset_in_page in extent buffer helpers
    - btrfs: extent_io: do extra check for extent buffer read write functions
    - btrfs: extent-tree: kill BUG_ON() in __btrfs_free_extent()
    - btrfs: extent-tree: kill the BUG_ON() in insert_inline_extent_backref()
    - btrfs: ctree: check key order before merging tree blocks
  * Bionic update: upstream stable patchset 2020-11-04 (LP: #1902943)
    - USB: gadget: f_ncm: Fix NDP16 datagram validation
    - gpio: tc35894: fix up tc35894 interrupt configuration
    - vsock/virtio: use RCU to avoid use-after-free on the_virtio_vsock
    - vsock/virtio: stop workers during the .remove()
    - vsock/virtio: add transport parameter to the
 ...

Changed in linux-azure (Ubuntu):
status: Confirmed → Fix Released