Intel i40e PF reset under load

Bug #1700834 reported by Jay Vosburgh
This bug affects 7 people
Affects         Status         Importance  Assigned to   Milestone
linux (Ubuntu)  Confirmed      Undecided   Jay Vosburgh
Xenial          Fix Released   Undecided   Unassigned

Bug Description

SRU Justification:

Impact:

 An Intel i40e network device under heavy traffic load with TSO
enabled will spontaneously reset itself and log errors similar to the
following:

Jun 14 14:09:51 hostname kernel: [4253913.851053] i40e 0000:05:00.1: TX driver issue detected, PF reset issued
Jun 14 14:09:53 hostname kernel: [4253915.476283] i40e 0000:05:00.1: TX driver issue detected, PF reset issued
Jun 14 14:09:54 hostname kernel: [4253917.411264] i40e 0000:05:00.1: TX driver issue detected, PF reset issued

 Each of these events is a full reset of the PF, which interrupts
traffic flow.
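
 To gauge how often a host is hitting this, the messages quoted above can be counted from a saved kernel log. The following is only a minimal sketch: the function name find_pf_resets and the SAMPLE list are illustrative, and the pattern assumes the log keeps the bracketed kernel timestamp and PCI address in the same layout as the sample lines.

```python
import re

# Pattern built from the sample lines in this report: a bracketed kernel
# timestamp, the driver name, a PCI address, then the fixed message text.
# Assumption: other hosts log the message in this same layout.
PF_RESET_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+i40e\s+"
    r"(?P<pci>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]):\s+"
    r"TX driver issue detected, PF reset issued"
)


def find_pf_resets(lines):
    """Return (kernel_timestamp, pci_address) for every PF reset line."""
    events = []
    for line in lines:
        match = PF_RESET_RE.search(line)
        if match:
            events.append((float(match.group("ts")), match.group("pci")))
    return events


# The three log lines quoted in the bug description.
SAMPLE = [
    "Jun 14 14:09:51 hostname kernel: [4253913.851053] i40e 0000:05:00.1: TX driver issue detected, PF reset issued",
    "Jun 14 14:09:53 hostname kernel: [4253915.476283] i40e 0000:05:00.1: TX driver issue detected, PF reset issued",
    "Jun 14 14:09:54 hostname kernel: [4253917.411264] i40e 0000:05:00.1: TX driver issue detected, PF reset issued",
]

if __name__ == "__main__":
    for ts, pci in find_pf_resets(SAMPLE):
        print(f"{pci}: PF reset at kernel time {ts:.6f}")
```

 Feeding it the lines of a syslog or dmesg dump instead of SAMPLE should give one entry per reset, grouped by PCI address if more than one port is affected.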

 In this case, these errors arise from a bug in the i40e device
driver introduced by commit:

commit 584a837e26408c66e87df87a022faa6a54c2b020
Author: Alexander Duyck <email address hidden>
Date: Wed Feb 17 11:02:50 2016 -0800

    i40e/i40evf: Rewrite logic for 8 descriptor per packet check

 This patch was added to the Xenial kernel beginning with version
4.4.0-8.23. This bug does not manifest on any other Ubuntu kernel series.
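
 As a rough aid for checking a package version against the one named above, the comparison can be sketched as follows. The helper names are hypothetical, and the sketch assumes versions are written in the form used in this report (for example 4.4.0-8.23). It only says whether a Xenial 4.4 kernel is at or after the version that picked up the commit; it says nothing about whether a given kernel also carries the fix.

```python
import re


def parse_pkg_version(version):
    """Split a string such as '4.4.0-8.23' into a tuple of integers."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)-(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())


# First Xenial kernel version that included commit 584a837e2640,
# as stated in this report.
FIRST_WITH_COMMIT = parse_pkg_version("4.4.0-8.23")


def carries_rewrite_commit(version):
    """True for 4.4-series versions at or after 4.4.0-8.23."""
    parsed = parse_pkg_version(version)
    # Integer tuples compare numerically, so 92 sorts after 8 as intended.
    return parsed[:2] == (4, 4) and parsed >= FIRST_WITH_COMMIT


if __name__ == "__main__":
    for candidate in ("4.4.0-7.22", "4.4.0-8.23", "4.4.0-92.115"):
        print(candidate, carries_rewrite_commit(candidate))
```

 Comparing integer tuples rather than raw strings avoids ordering mistakes such as "92" sorting before "8".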

Fix:

 This error is resolved upstream by:

commit 3f3f7cb875c0f621485644d4fd7453b0d37f00e4
Author: Alexander Duyck <email address hidden>
Date: Wed Mar 30 16:15:37 2016 -0700

    i40e/i40evf: Limit TSO to 7 descriptors for payload instead of 8 per packet

 This fix was never backported into the Xenial 4.4 kernel series.

Testcase:

 In this case, the issue occurs at a customer site using i40e-based
Intel network cards with SR-IOV enabled. Under heavy load, the card will
reset itself as described. The customer has tested the 3f3f7cb875c patch
in their environment and confirmed that it resolves the issue.

Jay Vosburgh (jvosburgh)
Changed in linux (Ubuntu):
assignee: nobody → Jay Vosburgh (jvosburgh)
Joseph Salisbury (jsalisbury) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1700834

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Jay Vosburgh (jvosburgh)
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu Xenial):
status: New → Confirmed
Changed in linux (Ubuntu Xenial):
status: Confirmed → Fix Committed
Kleber Sacilotto de Souza (kleber-souza) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-89.112

---------------
linux (4.4.0-89.112) xenial; urgency=low

  * CVE-2017-7533
    - dentry name snapshots

linux (4.4.0-88.111) xenial; urgency=low

  * linux: 4.4.0-88.111 -proposed tracker (LP: #1705270)

  * [Xenial] nvme: Quirks for PM1725 controllers (LP: #1704435)
    - nvme: Quirks for PM1725 controllers

  * Upgrade Redpine WLAN/BT driver to ver. 1.2 (production release)
    (LP: #1697829)
    - SAUCE: Redpine: Upgrade to ver. 1.2 production release

  * ubuntu/rsi driver has several issues as picked up by static analysis
    (LP: #1694733)
    - SAUCE: Redpine: Upgrade to ver. 1.2 production release

  * Redpine vendor driver - Switching to AP mode causes kernel panic
    (LP: #1700941)
    - SAUCE: Redpine: Upgrade to ver. 1.2 production release

  * CVE-2017-10810
    - drm/virtio: don't leak bo on drm_gem_object_init failure

  * Ath10k to read different board data file if specify in SMBIOS (LP: #1666742)
    - ath10k: search SMBIOS for OEM board file extension

  * make snap-pkg support (LP: #1700747)
    - SAUCE: make snap-pkg support

  * ISST-LTE: Briggs:Stratton:UbuntuKVM: ics_opal_set_affinity on host kernel
    log using Intel X710 (i40e driver) (LP: #1703663)
    - i40e: use valid online CPU on q_vector initialization

  * Update snapcraft.yaml (LP: #1700480)
    - snapcraft.yaml: various improvements

  * Xenial update to 4.4.76 stable release (LP: #1702863)
    - ipv6: release dst on error in ip6_dst_lookup_tail
    - net: don't call strlen on non-terminated string in dev_set_alias()
    - decnet: dn_rtmsg: Improve input length sanitization in
      dnrmg_receive_user_skb
    - net: Zero ifla_vf_info in rtnl_fill_vfinfo()
    - af_unix: Add sockaddr length checks before accessing sa_family in bind and
      connect handlers
    - Fix an intermittent pr_emerg warning about lo becoming free.
    - net: caif: Fix a sleep-in-atomic bug in cfpkt_create_pfx
    - igmp: acquire pmc lock for ip_mc_clear_src()
    - igmp: add a missing spin_lock_init()
    - ipv6: fix calling in6_ifa_hold incorrectly for dad work
    - net/mlx5: Wait for FW readiness before initializing command interface
    - decnet: always not take dst->__refcnt when inserting dst into hash table
    - net: 8021q: Fix one possible panic caused by BUG_ON in free_netdev
    - sfc: provide dummy definitions of vswitch functions
    - ipv6: Do not leak throw route references
    - rtnetlink: add IFLA_GROUP to ifla_policy
    - netfilter: xt_TCPMSS: add more sanity tests on tcph->doff
    - netfilter: synproxy: fix conntrackd interaction
    - NFSv4: fix a reference leak caused WARNING messages
    - drm/ast: Handle configuration without P2A bridge
    - mm, swap_cgroup: reschedule when neeed in swap_cgroup_swapoff()
    - MIPS: Avoid accidental raw backtrace
    - MIPS: pm-cps: Drop manual cache-line alignment of ready_count
    - MIPS: Fix IRQ tracing & lockdep when rescheduling
    - ALSA: hda - Fix endless loop of codec configure
    - ALSA: hda - set input_path bitmap to zero after moving it to new place
    - drm/vmwgfx: Free hash table allocated by cmdbuf managed res mgr
    - usb: gadget: f_fs: Fix possi...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Jay Vosburgh (jvosburgh)
tags: added: verification-done-xenial
removed: verification-needed-xenial
Björn Zettergren (bjozet) wrote :

I'm running Xenial with kernel 4.4.0-92.115 on a Dell R330 with an Intel X710 NIC.
Under load it fails with the message "TX driver issue detected, PF reset issued", as the original report says.

To me it looks like this issue isn't completely solved since I'm running a more recent kernel than the "fix" was committed to.

Björn Zettergren (bjozet) wrote :

I forgot to mention in my previous comment that this happens within an hour (usually within just a few minutes) of adding the server to production loads. I can provide more information and test patches if necessary.

Dan Streetman (ddstreet) wrote :

> To me it looks like this issue isn't completely solved since I'm running a more
> recent kernel than the "fix" was committed to.

There is one additional upstream commit required to fully fix this, please see bug 1713553.

Björn Zettergren (bjozet) wrote : Re: [Bug 1700834] Re: Intel i40e PF reset under load

>
> There is one additional upstream commit required to fully fix this,
> please see bug 1713553.
>

Ah, nice! Thanks for pointing it out, I had not found it myself. I solved
my problems temporarily by installing the i40e 2.0.30 driver via DKMS on my
current system and will follow the other bug (the i40e 2.1.26 driver leaked
memory at an alarming rate, but that's a story for a different bug report maybe).

Doug Parrish (dparrish)
Changed in linux (Ubuntu Xenial):
milestone: none → ubuntu-16.04.4
milestone: ubuntu-16.04.4 → none