nvme - Polling on timeout

Bug #1807393 reported by Guilherme G. Piccoli
26
This bug affects 3 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Fix Released
High
Guilherme G. Piccoli
Xenial
Fix Released
High
Guilherme G. Piccoli

Bug Description

[Impact]
* NVMe controllers potentially could miss to send an interrupt, specially
due to bugs in virtual devices(which are common those days - qemu has its
own NVMe virtual device, so does AWS). This would be a difficult to
debug situation, because NVMe driver only reports the request timeout,
not the reason.

* The upstream patch proposed to SRU here here, 7776db1ccc12
("NVMe/pci: Poll CQ on timeout") was designed to provide more information
in these cases, by pro-actively polling the CQEs on request timeouts, to
check if the specific request was completed and some issue (probably a
missed interrupt) prevented the driver to notice, or if the request really
wasn't completed, which indicates more severe issues.

* Although quite useful for debugging, this patch could help to mitigate
issues in cloud environments like AWS, in case we may have jitter in
request completion and the i/o timeout was set to low values, or even
in case of atypical bugs in the virtual NVMe controller. With this patch,
if polling succeeds the NVMe driver will continue working instead of
trying a reset controller procedure, which may lead to fails in the
rootfs - refer to https://launchpad.net/bugs/1788035.

[Test Case]

* It's a bit tricky to artificially create a situation of missed interrupt;
one idea was to implement a small hack in the NVMe qemu virtual device
that given a trigger in guest kernel, will induce the virtual device to
skip an interrupt. The hack patch is present in comment #1 below.

* To trigger such hack from guest kernel, all is needed is to issue a
raw admin command from nvme-cli: "nvme admin-passthru -o 0xff /dev/nvme0"
After that, just perform some I/Os to see one of them aborting - one could
use dd for this goal, like "dd if=/dev/zero of=/dev/nvme0n1 count=5".

[Regression Potential]

* There are no clear risks in adding such polling mechanism to the NVMe driver; one bad thing that was neverreported but could happen with this patch is the device could be in a bad state IRQ-wise that a reset would fix, but
the patch could cause all requests to be completed via polling, which
prevents the adapter reset. This is however a very hypothetical situation,
which would also happen in the mainline kernel (since it has the patch).

Revision history for this message
Guilherme G. Piccoli (gpiccoli) wrote :
description: updated
Changed in linux (Ubuntu Xenial):
status: New → Confirmed
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
Revision history for this message
Guilherme G. Piccoli (gpiccoli) wrote :
Changed in linux (Ubuntu):
status: Confirmed → In Progress
Changed in linux (Ubuntu Xenial):
status: Confirmed → In Progress
importance: Undecided → High
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Revision history for this message
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Revision history for this message
Guilherme G. Piccoli (gpiccoli) wrote :

The bug was verified in Xenial kernel 4.4.0-142.168 available in -proposed.

I'm running in an AWS 2-cpu instance, which exhibits the issue if we run a small reproducer script (a loop that basically changes IRQ affinity for the NVMe MSIs/legacy interrupt among the CPUs and performs a 4K write to the device + sync):

$ uptime
 17:42:11 up 4:00, 2 users, load average: 0.19, 0.14, 0.08

$ uname -rmv
4.4.0-142-generic #168-Ubuntu SMP Wed Jan 16 21:00:45 UTC 2019 x86_64

$ dmesg
[ 2218.252634] nvme 0000:00:04.0: I/O 6 QID 2 timeout, completion polled
[ 2451.245962] nvme 0000:00:04.0: I/O 22 QID 2 timeout, completion polled
[ 6672.249406] nvme 0000:00:04.0: I/O 3 QID 1 timeout, completion polled
[ 8425.253863] nvme 0000:00:04.0: I/O 28 QID 2 timeout, completion polled

<notice I've cleared the dmesg after the boot>

Cheers,

Guilherme

tags: added: verification-done-xenial
removed: verification-needed-xenial
Revision history for this message
Guilherme G. Piccoli (gpiccoli) wrote :

Also verified that in an AWS instance running the aws kernel in -proposed:

$ uname -rv
4.4.0-1075-aws #85-Ubuntu SMP Thu Jan 17 17:15:12 UTC 2019

$ dmesg #cleared that before testing

[ 5429.330967] nvme 0000:00:04.0: I/O 10 QID 1 timeout, completion polled
[39630.417191] nvme 0000:00:04.0: I/O 0 QID 2 timeout, completion polled
[52869.430017] nvme 0000:00:04.0: I/O 17 QID 1 timeout, completion polled
[69245.492236] nvme 0000:00:04.0: I/O 3 QID 2 timeout, completion polled
[111366.577241] nvme 0000:00:04.0: I/O 20 QID 2 timeout, completion polled
[120978.583000] nvme 0000:00:04.0: I/O 11 QID 2 timeout, completion polled
[136204.614939] nvme 0000:00:04.0: I/O 7 QID 1 timeout, completion polled
[142200.642476] nvme 0000:00:04.0: I/O 3 QID 1 timeout, completion polled
[145549.661249] nvme 0000:00:04.0: I/O 29 QID 2 timeout, completion polled

$ uptime
 10:57:48 up 1 day, 20:22, 2 users, load average: 0.00, 0.03, 0.02

Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (33.2 KiB)

This bug was fixed in the package linux - 4.4.0-142.168

---------------
linux (4.4.0-142.168) xenial; urgency=medium

  * linux: 4.4.0-142.168 -proposed tracker (LP: #1811846)

  * Packaging resync (LP: #1786013)
    - [Packaging] update helper scripts

  * iptables connlimit allows more connections than the limit when using
    multiple CPUs (LP: #1811094)
    - netfilter: xt_connlimit: don't store address in the conn nodes
    - SAUCE: netfilter: xt_connlimit: remove the 'addr' parameter in add_hlist()
    - netfilter: nf_conncount: expose connection list interface
    - netfilter: nf_conncount: Fix garbage collection with zones
    - netfilter: nf_conncount: fix garbage collection confirm race
    - netfilter: nf_conncount: don't skip eviction when age is negative

  * CVE-2017-5715
    - SAUCE: x86/speculation: Cleanup IBPB runtime control handling
    - SAUCE: x86/speculation: Cleanup IBRS runtime control handling
    - SAUCE: x86/speculation: Use x86_spec_ctrl_base in entry/exit code
    - SAUCE: x86/speculation: Move RSB_CTXSW hunk

  * Xenial update: 4.4.167 upstream stable release (LP: #1811077)
    - media: em28xx: Fix use-after-free when disconnecting
    - Revert "wlcore: Add missing PM call for
      wlcore_cmd_wait_for_event_or_timeout()"
    - rapidio/rionet: do not free skb before reading its length
    - s390/qeth: fix length check in SNMP processing
    - usbnet: ipheth: fix potential recvmsg bug and recvmsg bug 2
    - kvm: mmu: Fix race in emulated page table writes
    - xtensa: enable coprocessors that are being flushed
    - xtensa: fix coprocessor context offset definitions
    - Btrfs: ensure path name is null terminated at btrfs_control_ioctl
    - ALSA: wss: Fix invalid snd_free_pages() at error path
    - ALSA: ac97: Fix incorrect bit shift at AC97-SPSA control write
    - ALSA: control: Fix race between adding and removing a user element
    - ALSA: sparc: Fix invalid snd_free_pages() at error path
    - ext2: fix potential use after free
    - dmaengine: at_hdmac: fix memory leak in at_dma_xlate()
    - dmaengine: at_hdmac: fix module unloading
    - btrfs: release metadata before running delayed refs
    - USB: usb-storage: Add new IDs to ums-realtek
    - usb: core: quirks: add RESET_RESUME quirk for Cherry G230 Stream series
    - misc: mic/scif: fix copy-paste error in scif_create_remote_lookup
    - Kbuild: suppress packed-not-aligned warning for default setting only
    - exec: avoid gcc-8 warning for get_task_comm
    - disable stringop truncation warnings for now
    - kobject: Replace strncpy with memcpy
    - unifdef: use memcpy instead of strncpy
    - kernfs: Replace strncpy with memcpy
    - ip_tunnel: Fix name string concatenate in __ip_tunnel_create()
    - drm: gma500: fix logic error
    - scsi: bfa: convert to strlcpy/strlcat
    - staging: rts5208: fix gcc-8 logic error warning
    - kdb: use memmove instead of overlapping memcpy
    - iser: set sector for ambiguous mr status errors
    - uprobes: Fix handle_swbp() vs. unregister() + register() race once more
    - MIPS: ralink: Fix mt7620 nd_sd pinmux
    - mips: fix mips_get_syscall_arg o32 check
    - drm/ast: Fix incorrect free on ioregs
 ...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Brad Figg (brad-figg)
tags: added: cscc
Changed in linux (Ubuntu):
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.