LIO hanging in iscsit_free_session and iscsit_stop_session

Bug #1871688 reported by Matt Coleman
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Fix Released
Undecided
Unassigned
Bionic
Fix Released
Undecided
Unassigned
Focal
Fix Released
Undecided
Unassigned

Bug Description

The target subsystem (LIO) can hang if multiple threads try to destroy iSCSI sessions simultaneously. This is reproducible on systems that have multiple targets with initiators regularly connecting/disconnecting.

This may happen when a "targetcli iscsi/iqn.../tpg1 disable" command is executed when a logout operation is underway.

The iscsi target doesn't handle such events in a correct way: two or more threads may end up sleeping while waiting for the driver to close the remaining connections on the session. When the connections are closed, the driver wakes up only the first thread that will then proceed to destroy the session structure. The remaining threads are blocked there forever, waiting on a completion synchronization mechanism that doesn't exist in memory anymore because it has been freed by the first thread.

Note that if the blocked threads are somehow forced to wake up, they will try to free the same iSCSI session structure destroyed by the first thread, causing double frees, memory corruptions, etc.

The driver has been reorganized so the concurrent threads will set a flag in the session structure to notify the driver that the session should be destroyed; then, they wait for the driver to close the remaining connections. When the connections are all closed, the driver will wake up all the threads and will wait for the refcount of the iSCSI session structure to reach zero. When the last thread wakes up, the refcount is decreased to zero and the driver can proceed to destroy the session structure because no one is referencing it anymore.

I've witnessed this happening on hundreds of Ubuntu 16.04.5 systems. It is a regression, because this did not occur several years ago. Unfortunately, I don't have detailed records from that far back to determine exactly which kernel I was running that was not affected by this bug (I believe it was either 4.8.x or 4.10.x).

I've attached the requested uname, version_signature, dmesg, and lspci from my system. However, I've seen this happen on a wide array of hardware: 2 to 24 cores, 8GB to 256GB RAM, both AMD and Intel CPUs, onboard storage and PCIe SAS cards, etc.

This has been fixed in the upstream master branch, but it hasn't yet been backported to "-stable".

To fix this in the Ubuntu kernel, these three commits should be backported:
* https://github.com/torvalds/linux/commit/e49a7d994379278d3353d7ffc7994672752fb0ad
* https://github.com/torvalds/linux/commit/57c46e9f33da530a2485fa01aa27b6d18c28c796
* https://github.com/torvalds/linux/commit/626bac73371eed79e2afa2966de393da96cf925e

I'd like these commits to be added to the xenial, bionic, and focal kernels.

Revision history for this message
Matt Coleman (mcoleman) wrote :
Revision history for this message
Matt Coleman (mcoleman) wrote :
Revision history for this message
Matt Coleman (mcoleman) wrote :
Revision history for this message
Matt Coleman (mcoleman) wrote :
Matt Coleman (mcoleman)
tags: added: bionic focal
description: updated
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Changed in linux (Ubuntu Focal):
status: Confirmed → In Progress
Changed in linux (Ubuntu Bionic):
status: New → In Progress
Revision history for this message
Kelsey Steele (kelsey-steele) wrote :

Target to Xenial was meant for Xenial-hwe (derivative of Bionic).

Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Focal):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Xenial):
status: New → Won't Fix
no longer affects: linux (Ubuntu Xenial)
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Revision history for this message
Matt Coleman (mcoleman) wrote :

The only machines I've reproduced this bug on are running Xenial. For the past few months, I've been applying these patches to the Xenial HWE kernels and building packages for my dev/test systems.

These systems frequently add and remove targets (around 125 per hour) with separate initiators connecting to each target. Initiators connect for varying durations, so creating/destroying and connecting/disconnecting happen for different targets in unpredictable order.

With an unpatched kernel, the systems will hit this bug within a day (very occasionally stretches into a second day), but it's usually only hours.

With the patches applied, the systems remain stable.

I attempted to install the proposed Bionic kernel on one of my Xenial systems but it depends on libssl1.1, which isn't available in Xenial.

Revision history for this message
Matt Coleman (mcoleman) wrote :

I built and installed the package on one of my Xenial systems from the source available here: https://launchpad.net/ubuntu/+source/linux/4.15.0-100.101

If the system is still stable a day from now, is it okay for me to set the 'verification-done-bionic' tag, even though I'm testing on Xenial?

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-focal
Revision history for this message
Matt Coleman (mcoleman) wrote :

My Xenial system is still stable after almost two days with the kernel package I built from the proposed Bionic kernel's source.

Revision history for this message
Matt Coleman (mcoleman) wrote :

I installed the proposed Bionic packages on my Xenial system and have confirmed that basic functionality is working properly. If the system remains stable for >24h, I will consider the issue to be fixed in this kernel.

Revision history for this message
Matt Coleman (mcoleman) wrote :

The Xenial system that I installed the binary packages from the proposed Bionic repository on has been running properly and stable since the 7th.

Revision history for this message
Khaled El Mously (kmously) wrote :

Thanks very much @Matt Coleman for reporting the bug and verifying the fix! I will mark the bug as verified based on your feedback.

tags: added: verification-done-bionic verification-done-focal
removed: verification-needed-bionic verification-needed-focal
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (12.7 KiB)

This bug was fixed in the package linux - 4.15.0-101.102

---------------
linux (4.15.0-101.102) bionic; urgency=medium

  * bionic/linux: 4.15.0-101.102 -proposed tracker (LP: #1877262)

  * 4.15.0-100.101 breaks userspace builds due to a bug in the headers
    /usr/include/linux/swab.h of linux-libc-dev (LP: #1877123)
    - include/uapi/linux/swab.h: fix userspace breakage, use __BITS_PER_LONG for
      swap

  * bionic snapdragon 4.15 snap failed Certification testing (LP: #1877657)
    - Revert "drm/msm: Use the correct dma_sync calls in msm_gem"
    - Revert "drm/msm: stop abusing dma_map/unmap for cache"

linux (4.15.0-100.101) bionic; urgency=medium

  * bionic/linux: 4.15.0-100.101 -proposed tracker (LP: #1875878)

  * built-using constraints preventing uploads (LP: #1875601)
    - temporarily drop Built-Using data

  * Add debian/rules targets to compile/run kernel selftests (LP: #1874286)
    - [Packaging] add support to compile/run selftests

  * getitimer returns it_value=0 erroneously (LP: #1349028)
    - [Config] CONTEXT_TRACKING_FORCE policy should be unset

  * QEMU/KVM display is garbled when booting from kernel EFI stub due to missing
    bochs-drm module (LP: #1872863)
    - [Config] Enable CONFIG_DRM_BOCHS as module for all archs

  * Backport MPLS patches from 5.3 to 4.15 (LP: #1851446)
    - net/mlx5e: Report netdevice MPLS features
    - net: vlan: Inherit MPLS features from parent device
    - net: bonding: Inherit MPLS features from slave devices
    - net/mlx5e: Move to HW checksumming advertising

  * LIO hanging in iscsit_free_session and iscsit_stop_session (LP: #1871688)
    - scsi: target: remove boilerplate code
    - scsi: target: fix hang when multiple threads try to destroy the same iscsi
      session
    - scsi: target: iscsi: calling iscsit_stop_session() inside
      iscsit_close_session() has no effect

  * Add hw timestamps to received skbs in peak_canfd (LP: #1874124)
    - can: peak_canfd: provide hw timestamps in rx skbs

  * Bionic update: upstream stable patchset 2020-04-23 (LP: #1874502)
    - ARM: dts: sun8i-a83t-tbs-a711: HM5065 doesn't like such a high voltage
    - bus: sunxi-rsb: Return correct data when mixing 16-bit and 8-bit reads
    - net: vxge: fix wrong __VA_ARGS__ usage
    - hinic: fix a bug of waitting for IO stopped
    - hinic: fix wrong para of wait_for_completion_timeout
    - cxgb4/ptp: pass the sign of offset delta in FW CMD
    - qlcnic: Fix bad kzalloc null test
    - i2c: st: fix missing struct parameter description
    - firmware: arm_sdei: fix double-lock on hibernate with shared events
    - null_blk: Fix the null_add_dev() error path
    - null_blk: Handle null_add_dev() failures properly
    - null_blk: fix spurious IO errors after failed past-wp access
    - xhci: bail out early if driver can't accress host in resume
    - x86: Don't let pgprot_modify() change the page encryption bit
    - block: keep bdi->io_pages in sync with max_sectors_kb for stacked devices
    - irqchip/versatile-fpga: Handle chained IRQs properly
    - sched: Avoid scale real weight down to zero
    - selftests/x86/ptrace_syscall_32: Fix no-vDSO segfault
    - PCI/switchtec: Fix init_completio...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (25.9 KiB)

This bug was fixed in the package linux - 5.4.0-31.35

---------------
linux (5.4.0-31.35) focal; urgency=medium

  * focal/linux: 5.4.0-31.35 -proposed tracker (LP: #1877253)

  * Intermittent display blackouts on event (LP: #1875254)
    - drm/i915: Limit audio CDCLK>=2*BCLK constraint back to GLK only

  * Unable to handle kernel pointer dereference in virtual kernel address space
    on Eoan (LP: #1876645)
    - SAUCE: overlayfs: fix shitfs special-casing

linux (5.4.0-30.34) focal; urgency=medium

  * focal/linux: 5.4.0-30.34 -proposed tracker (LP: #1875385)

  * ubuntu/focal64 fails to mount Vagrant shared folders (LP: #1873506)
    - [Packaging] Move virtualbox modules to linux-modules
    - [Packaging] Remove vbox and zfs modules from generic.inclusion-list

  * linux-image-5.0.0-35-generic breaks checkpointing of container
    (LP: #1857257)
    - SAUCE: overlayfs: use shiftfs hacks only with shiftfs as underlay

  * shiftfs: broken shiftfs nesting (LP: #1872094)
    - SAUCE: shiftfs: record correct creator credentials

  * Add debian/rules targets to compile/run kernel selftests (LP: #1874286)
    - [Packaging] add support to compile/run selftests

  * shiftfs: O_TMPFILE reports ESTALE (LP: #1872757)
    - SAUCE: shiftfs: fix dentry revalidation

  * LIO hanging in iscsit_free_session and iscsit_stop_session (LP: #1871688)
    - scsi: target: iscsi: calling iscsit_stop_session() inside
      iscsit_close_session() has no effect

  * [ICL] TC port in legacy/static mode can't be detected due TCCOLD
    (LP: #1868936)
    - SAUCE: drm/i915: Align power domain names with port names
    - SAUCE: drm/i915/display: Move out code to return the digital_port of the aux
      ch
    - SAUCE: drm/i915/display: Add intel_legacy_aux_to_power_domain()
    - SAUCE: drm/i915/display: Split hsw_power_well_enable() into two
    - SAUCE: drm/i915/tc/icl: Implement TC cold sequences
    - SAUCE: drm/i915/tc: Skip ref held check for TC legacy aux power wells
    - SAUCE: drm/i915/tc/tgl: Implement TC cold sequences
    - SAUCE: drm/i915/tc: Catch TC users accessing FIA registers without enable
      aux
    - SAUCE: drm/i915/tc: Do not warn when aux power well of static TC ports
      timeout

  * alsa/sof: external mic can't be deteced on Lenovo and HP laptops
    (LP: #1872569)
    - SAUCE: ASoC: intel/skl/hda - set autosuspend timeout for hda codecs

  * amdgpu kernel errors in Linux 5.4 (LP: #1871248)
    - drm/amd/display: Stop if retimer is not available

  * Focal update: v5.4.34 upstream stable release (LP: #1874111)
    - amd-xgbe: Use __napi_schedule() in BH context
    - hsr: check protocol version in hsr_newlink()
    - l2tp: Allow management of tunnels and session in user namespace
    - net: dsa: mt7530: fix tagged frames pass-through in VLAN-unaware mode
    - net: ipv4: devinet: Fix crash when add/del multicast IP with autojoin
    - net: ipv6: do not consider routes via gateways for anycast address check
    - net: phy: micrel: use genphy_read_status for KSZ9131
    - net: qrtr: send msgs from local of same id as broadcast
    - net: revert default NAPI poll timeout to 2 jiffies
    - net: tun: record RX queue in skb before do_xdp_gener...

Changed in linux (Ubuntu Focal):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.4.0-42.46

---------------
linux (5.4.0-42.46) focal; urgency=medium

  * focal/linux: 5.4.0-42.46 -proposed tracker (LP: #1887069)

  * linux 4.15.0-109-generic network DoS regression vs -108 (LP: #1886668)
    - SAUCE: Revert "netprio_cgroup: Fix unlimited memory leak of v2 cgroups"

linux (5.4.0-41.45) focal; urgency=medium

  * focal/linux: 5.4.0-41.45 -proposed tracker (LP: #1885855)

  * Packaging resync (LP: #1786013)
    - update dkms package versions

  * CVE-2019-19642
    - kernel/relay.c: handle alloc_percpu returning NULL in relay_open

  * CVE-2019-16089
    - SAUCE: nbd_genl_status: null check for nla_nest_start

  * CVE-2020-11935
    - aufs: do not call i_readcount_inc()

  * ip_defrag.sh in net from ubuntu_kernel_selftests failed with 5.0 / 5.3 / 5.4
    kernel (LP: #1826848)
    - selftests: net: ip_defrag: ignore EPERM

  * Update lockdown patches (LP: #1884159)
    - SAUCE: acpi: disallow loading configfs acpi tables when locked down

  * seccomp_bpf fails on powerpc (LP: #1885757)
    - SAUCE: selftests/seccomp: fix ptrace tests on powerpc

  * Introduce the new NVIDIA 418-server and 440-server series, and update the
    current NVIDIA drivers (LP: #1881137)
    - [packaging] add signed modules for the 418-server and the 440-server
      flavours

 -- Khalid Elmously <email address hidden> Thu, 09 Jul 2020 19:50:26 -0400

Changed in linux (Ubuntu):
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.