i40e Intel X710 error during device probe prevents link set up and ip association

Bug #1672550 reported by bugproxy
Affects          Status         Importance   Assigned to
linux (Ubuntu)   Fix Released   High         Canonical Kernel Team
Xenial           Fix Released   High         Seth Forshee

Bug Description

== Comment: #0 - Mauro Sergio Martins Rodrigues - 2017-02-22 06:48:42 ==
While investigating bug #145959, I was blocked during reproduction by the following issue when bringing up the interface link:

[ 1.590591] i40e 0045:01:00.0: AQ command Config VSI BW allocation per TC failed = 14
[ 1.590661] i40e 0045:01:00.0: Failed configuring TC map 255 for VSI 399
[ 1.590669] i40e 0045:01:00.0: failed to configure TCs for main VSI tc_map 0x000000ff, err I40E_ERR_INVALID_QP_ID aq_err I40E_AQ_RC_EINVAL

which prevented me from bringing the interface up and assigning an IP address to it.
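
The last message prints the same value twice, as "TC map 255" and as tc_map 0x000000ff. Assuming the value is a bitmask with one bit per traffic class (the report later calls it a traffic class mask), a small helper can list which TCs a given value enables. This is only an illustrative sketch; the function name is hypothetical and is not part of the driver.

```python
def enabled_tcs(tc_map: int) -> list[int]:
    """Return the traffic class indices whose bit is set in tc_map."""
    # Assumption: bit i of the mask corresponds to traffic class i.
    return [i for i in range(tc_map.bit_length()) if tc_map & (1 << i)]


if __name__ == "__main__":
    value = 0x000000FF  # tc_map from the failing probe above
    print(f"tc_map {value:#x} ({value}) -> TCs {enabled_tcs(value)}")
```

Under that assumption, 0xff asks for all eight traffic classes, TC0 through TC7.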

== Comment: #2 - Mauro Sergio Martins Rodrigues - 2017-02-22 07:26:36 ==
Some missing information: the kernel is Ubuntu's 4.4.0-62-generic.

When testing with 4.8.0-36-generic (from xenial's proposed) device probe works fine, no similar message is seen.

To obtain more data on this, I added some debug statements to see which TC MAP is applied in a healthy probe (note that the other functions, such as function 1, work fine, but those functions have no cable attached).

root@yangtze-lp1:~/_maurosr/linux-4.4.0/drivers/net/ethernet/intel/i40e# dmesg
[52448.914605] i40e 0045:01:00.3: i40e_ptp_stop: removed PHC on enP69p1s0f3
[52448.981801] i40e 0045:01:00.2: i40e_ptp_stop: removed PHC on enP69p1s0f2
[52449.069793] i40e 0045:01:00.1: i40e_ptp_stop: removed PHC on enP69p1s0f1
[52449.173834] i40e 0045:01:00.0: i40e_ptp_stop: removed PHC on enP69p1s0f0
[52449.264462] i40e: Intel(R) Ethernet Connection XL710 Network Driver - version 1.4.25-k
[52449.264468] i40e: Copyright (c) 2013 - 2014 Intel Corporation.
[52449.264625] i40e 0045:01:00.0: Using 64-bit DMA iommu bypass
[52449.286138] i40e 0045:01:00.0: fw 5.0.40043 api 1.5 nvm 5.02 0x80002284 0.0.0
[52449.505657] i40e 0045:01:00.0: MAC address: 68:05:ca:2d:e9:08
[52449.508977] i40e 0045:01:00.0: SAN MAC: 68:05:ca:2d:e9:0c
[52449.529200] i40e 0045:01:00.0: DEBUG DATA vsi > 399;enabled_tc > 255
[52449.531210] i40e 0045:01:00.0: AQ command Config VSI BW allocation per TC failed = 14
[52449.531213] i40e 0045:01:00.0: Failed configuring TC map 255 for VSI 399
[52449.531217] i40e 0045:01:00.0: failed to configure TCs for main VSI tc_map 0x000000ff, err I40E_ERR_INVALID_QP_ID aq_err I40E_AQ_RC_EINVAL
[52449.544642] i40e 0045:01:00.0 enP69p1s0f0: renamed from eth0
[52449.697424] i40e 0045:01:00.0: PCI-Express: Speed 8.0GT/s Width x8
[52449.727043] i40e 0045:01:00.0: Features: PF-id[0] VFs: 32 VSIs: 34 QP: 0 RX: 1BUF RSS FD_ATR DCB VxLAN Geneve PTP VEPA
[52449.727098] i40e 0045:01:00.1: Using 64-bit DMA iommu bypass
[52449.748667] i40e 0045:01:00.1: fw 5.0.40043 api 1.5 nvm 5.02 0x80002284 0.0.0
[52449.976665] i40e 0045:01:00.1: MAC address: 68:05:ca:2d:e9:09
[52449.980685] i40e 0045:01:00.1: SAN MAC: 68:05:ca:2d:e9:0d
[52449.994982] i40e 0045:01:00.1: DEBUG DATA vsi > 398;enabled_tc > 1
[52450.015610] i40e 0045:01:00.1 enP69p1s0f1: renamed from eth0
[52450.074479] i40e 0045:01:00.1: PCI-Express: Speed 8.0GT/s Width x8
[52450.080516] i40e 0045:01:00.1: Features: PF-id[1] VFs: 32 VSIs: 34 QP: 128 RX: 1BUF RSS FD_ATR DCB VxLAN Geneve PTP VEPA

Comparing function 0:
[52449.529200] i40e 0045:01:00.0: DEBUG DATA vsi > 399;enabled_tc > 255
and function 1:
[52449.994982] i40e 0045:01:00.1: DEBUG DATA vsi > 398;enabled_tc > 1
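
The same comparison can be done mechanically when the log is long. The sketch below assumes the ad hoc "DEBUG DATA" lines shown in this report (they come from the reporter's own debug statements, not from the stock driver) and collects the last enabled_tc value logged per PCI function. The helper name and regular expression are illustrative only.

```python
import re

# Matches both debug formats that appear in this report:
#   "DEBUG DATA vsi > 399;enabled_tc > 255" and "DEBUG DATA vsi > 399, enabled_tc 1"
DEBUG_RE = re.compile(
    r"i40e (?P<pci>\S+): DEBUG DATA vsi > (?P<vsi>\d+)"
    r"[;,]\s*enabled_tc(?:\s*>)?\s*(?P<tc>\d+)"
)


def tc_by_function(dmesg_text: str) -> dict[str, int]:
    """Map each PCI function to the last enabled_tc value it logged."""
    result = {}
    for line in dmesg_text.splitlines():
        match = DEBUG_RE.search(line)
        if match:
            result[match.group("pci")] = int(match.group("tc"))
    return result


SAMPLE = """\
[52449.529200] i40e 0045:01:00.0: DEBUG DATA vsi > 399;enabled_tc > 255
[52449.994982] i40e 0045:01:00.1: DEBUG DATA vsi > 398;enabled_tc > 1
"""

if __name__ == "__main__":
    for pci, tc in tc_by_function(SAMPLE).items():
        print(pci, tc)
```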

Then looking at 4.8:
[ 123.425399] i40e: loading out-of-tree module taints kernel.
[ 123.428958] i40e: module verification failed: signature and/or required key missing - tainting kernel
[ 123.430690] i40e: Intel(R) Ethernet Connection XL710 Network Driver - version 1.6.11-k
[ 123.430691] i40e: Copyright (c) 2013 - 2014 Intel Corporation.
[ 123.430918] i40e 0045:01:00.0: Using 64-bit DMA iommu bypass
[ 123.450445] i40e 0045:01:00.0: fw 5.0.40043 api 1.5 nvm 5.02 0x80002284 0.0.0
[ 123.664088] i40e 0045:01:00.0: MAC address: 68:05:ca:2d:e9:08
[ 123.667878] i40e 0045:01:00.0: SAN MAC: 68:05:ca:2d:e9:0c
[ 123.681915] Non-contiguous TC - Disabling DCB
[ 123.690177] i40e 0045:01:00.0: DEBUG DATA vsi > 399, enabled_tc 1
[ 123.713262] i40e 0045:01:00.0 enP69p1s0f0: renamed from eth0
[ 123.864601] i40e 0045:01:00.0: Added LAN device PF0 bus=0x00 func=0x00
[ 123.864611] i40e 0045:01:00.0: PCI-Express: Speed 8.0GT/s Width x8
[ 123.893254] i40e 0045:01:00.0: Features: PF-id[0] VFs: 32 VSIs: 34 QP: 128 RSS FD_ATR DCB VxLAN Geneve PTP VEPA
[ 123.893321] i40e 0045:01:00.1: Using 64-bit DMA iommu bypass
[ 123.914829] i40e 0045:01:00.1: fw 5.0.40043 api 1.5 nvm 5.02 0x80002284 0.0.0
[ 124.152980] i40e 0045:01:00.1: MAC address: 68:05:ca:2d:e9:09
[ 124.156999] i40e 0045:01:00.1: SAN MAC: 68:05:ca:2d:e9:0d
[ 124.171266] i40e 0045:01:00.1: DEBUG DATA vsi > 398, enabled_tc 1
[ 124.196080] i40e 0045:01:00.1 enP69p1s0f1: renamed from eth0
[ 124.253353] i40e 0045:01:00.1: Added LAN device PF1 bus=0x00 func=0x01
[ 124.253387] i40e 0045:01:00.1: PCI-Express: Speed 8.0GT/s Width x8
[ 124.263908] i40e 0045:01:00.1: Features: PF-id[1] VFs: 32 VSIs: 34 QP: 128 RSS FD_ATR DCB VxLAN Geneve PTP VEPA

These 2 lines are important here:
[ 123.681915] Non-contiguous TC - Disabling DCB
[ 123.690177] i40e 0045:01:00.0: DEBUG DATA vsi > 399, enabled_tc 1

First it decided to disable the DCB feature due to the lack of contiguous traffic classes, and then it used a TC MAP (enabled_tc in the device driver code) of 1, the same value we already knew works. With that information in hand, I forced enabled_tc (TC MAP) to 1 in 4.4's code and it worked, so I suspect a bad TC mask caused by DCB being enabled.
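
The "Non-contiguous TC - Disabling DCB" message suggests a check of roughly the following shape: collect the traffic classes that are actually referenced, and fall back to a single TC when they do not form a contiguous run starting at TC0. The sketch below is an illustration in Python, not the driver's code; the priority-to-TC tables, the function names, and the limit of 8 TCs are assumptions made for the example.

```python
MAX_TRAFFIC_CLASS = 8  # assumption: 8 TCs, matching the 8-bit map 0xff in the logs


def used_tc_mask(priority_table: list[int]) -> int:
    """Build a bitmask of the TCs referenced by a priority -> TC table."""
    mask = 0
    for tc in priority_table:
        mask |= 1 << tc
    return mask


def contiguous_tc_count(mask: int) -> int:
    """Return the number of TCs if they are contiguous from TC0, else 1.

    Returning 1 models the fallback to a single traffic class (DCB disabled).
    """
    count = 0
    gap_seen = False
    for i in range(MAX_TRAFFIC_CLASS):
        if mask & (1 << i):
            if gap_seen:
                return 1  # a TC appears after a gap: non-contiguous
            count += 1
        else:
            gap_seen = True
    return max(count, 1)  # there is always at least TC0


def tc_map_for(priority_table: list[int]) -> int:
    """Mask with the low N bits set, N being the contiguous TC count."""
    return (1 << contiguous_tc_count(used_tc_mask(priority_table))) - 1


if __name__ == "__main__":
    # Hypothetical table: seven priorities on TC0, one on TC7.
    print(tc_map_for([0, 0, 0, 0, 0, 0, 0, 7]))
```

With that hypothetical table the used TCs are {0, 7}, the check treats them as non-contiguous, and the resulting map is 1, the value this report shows working. Whether the switch in this setup advertised such a table is not established here.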

== Comment: #3 - Mauro Sergio Martins Rodrigues - 2017-02-23 11:24:41 ==
I tried 4.4's version of i40e with dcbx disabled on the switch port: traffic class setup and function bring-up worked fine! It used a TC MAP (or traffic class mask) of 1. I understand this is just a workaround, though; the device driver should handle the case where the switch has this feature enabled instead of leaving the device 'broken':

[ 199.762738] i40e 0045:01:00.0: Using 64-bit DMA iommu bypass
[ 199.786589] i40e 0045:01:00.0: fw 5.0.40043 api 1.5 nvm 5.02 0x80002284 0.0.0
[ 200.045270] i40e 0045:01:00.0: MAC address: 68:05:ca:2d:e9:08
[ 200.048955] i40e 0045:01:00.0: SAN MAC: 68:05:ca:2d:e9:0c
[ 200.069228] i40e 0045:01:00.0: DEBUG DATA >> dcb not enabled - first if
[ 200.069232] i40e 0045:01:00.0: DEBUG DATA vsi > 399;enabled_tc > 1
[ 200.088056] i40e 0045:01:00.0 enP69p1s0f0: renamed from eth0
[ 200.240641] i40e 0045:01:00.0: PCI-Express: Speed 8.0GT/s Width x8
[ 200.270717] i40e 0045:01:00.0: Features: PF-id[0] VFs: 32 VSIs: 34 QP: 128 RX: 1BUF RSS FD_ATR DCB VxLAN Geneve PTP VEPA

The line
[ 200.069228] i40e 0045:01:00.0: DEBUG DATA >> dcb not enabled - first if
corresponds to the piece of code where the traffic class is defined (see: http://lxr.free-electrons.com/source/drivers/net/ethernet/intel/i40e/i40e_main.c?v=4.4#L4563)

Another interesting discovery is that the device behaves well when we turn dcbx on in the switch after it's already probed:

[ 609.566786] i40e 0045:01:00.0: DEBUG DATA >> dcb not enabled - first if
[ 609.566794] i40e 0045:01:00.0: DEBUG DATA >> dcb not enabled - first if
[ 611.574987] i40e 0045:01:00.0: DEBUG DATA >> SFP - second if
[ 611.574990] i40e 0045:01:00.0: DEBUG DATA >> SFP - second if
[ 611.574994] i40e 0045:01:00.0: DEBUG DATA vsi > 399;enabled_tc > 31

This transition set the traffic class mask to 31 instead of 255. If we unload and reload the module, it returns to the original bad state we experienced in this bug:

[ 746.151068] i40e 0045:01:00.0: Using 64-bit DMA iommu bypass
[ 746.174695] i40e 0045:01:00.0: fw 5.0.40043 api 1.5 nvm 5.02 0x80002284 0.0.0
[ 746.433649] i40e 0045:01:00.0: MAC address: 68:05:ca:2d:e9:08
[ 746.437552] i40e 0045:01:00.0: SAN MAC: 68:05:ca:2d:e9:0c
[ 746.457815] i40e 0045:01:00.0: DEBUG DATA >> SFP - second if
[ 746.457819] i40e 0045:01:00.0: DEBUG DATA vsi > 399;enabled_tc > 255
[ 746.459537] i40e 0045:01:00.0: AQ command Config VSI BW allocation per TC failed = 14
[ 746.459541] i40e 0045:01:00.0: Failed configuring TC map 255 for VSI 399
[ 746.459550] i40e 0045:01:00.0: failed to configure TCs for main VSI tc_map 0x000000ff, err I40E_ERR_INVALID_QP_ID aq_err I40E_AQ_RC_EINVAL

== Comment: #4 - Mauro Sergio Martins Rodrigues - 2017-02-23 14:25:30 ==
Things go smoothly in kernel 4.8 even if dcbx is enabled on the port, thanks to commit https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=fbfe12c which disables dcbx when the TCs are not contiguous (non-contiguous TCs are not supported by the device).

We should ask for a backport into 4.4.0, but I'm still investigating whether something else should be included, since comment #3 shows the device transitioning into a valid state when dcbx is enabled on the switch.

== Comment: #5 - Mauro Sergio Martins Rodrigues - 2017-03-13 13:41:19 ==
Even though it was already clear that this is related to kernel code, since it works on 4.8 and not on 4.4, I decided to perform an NVM update; it didn't change the scenario.

Comment #2 shows the NVM version as:
> [ 123.450445] i40e 0045:01:00.0: fw 5.0.40043 api 1.5 nvm 5.02 0x80002284 0.0.0

Current version is:
firmware-version: 5.05 0x8000289d 1.1568.0

and the issue is still reproducible.

As stated in comment #4, now I can confirm we need to backport https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=fbfe12c to 4.4 to avoid getting into the broken state when probing Intel x710 (driver i40e).

bugproxy (bugproxy) wrote : a more complete log of i40e steps

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-151930 severity-high targetmilestone-inin---
Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → linux (Ubuntu)
Michael Hohnbaum (hohnbaum) wrote : Re: [Bug 1672550] [NEW] i40e Intel X710 error during device probe prevents link set up and ip association

Leann,

This looks like a kernel patch for your team to evaluate.

Thanks.

                     Michael


Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
status: New → Triaged
Changed in linux (Ubuntu Xenial):
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
status: New → Triaged
Seth Forshee (sforshee)
Changed in linux (Ubuntu):
status: Triaged → Fix Released
Changed in linux (Ubuntu Xenial):
assignee: Canonical Kernel Team (canonical-kernel-team) → Seth Forshee (sforshee)
status: Triaged → In Progress
Seth Forshee (sforshee) wrote :
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
bugproxy (bugproxy)
tags: added: targetmilestone-inin16041
removed: targetmilestone-inin---
Kleber Sacilotto de Souza (kleber-souza) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
bugproxy (bugproxy)
tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-75.96

---------------
linux (4.4.0-75.96) xenial; urgency=low

  * linux: 4.4.0-75.96 -proposed tracker (LP: #1684441)

  * [Hyper-V] hv: util: move waiting for release to hv_utils_transport itself
    (LP: #1682561)
    - Drivers: hv: util: move waiting for release to hv_utils_transport itself

linux (4.4.0-74.95) xenial; urgency=low

  * linux: 4.4.0-74.95 -proposed tracker (LP: #1682041)

  * [Hyper-V] hv: vmbus: Raise retry/wait limits in vmbus_post_msg()
    (LP: #1681893)
    - Drivers: hv: vmbus: Raise retry/wait limits in vmbus_post_msg()

linux (4.4.0-73.94) xenial; urgency=low

  * linux: 4.4.0-73.94 -proposed tracker (LP: #1680416)

  * CVE-2017-6353
    - sctp: deny peeloff operation on asocs with threads sleeping on it

  * vfat: missing iso8859-1 charset (LP: #1677230)
    - [Config] NLS_ISO8859_1=y

  * Regression: KVM modules should be on main kernel package (LP: #1678099)
    - [Config] powerpc: Add kvm-hv and kvm-pr to the generic inclusion list

  * linux-lts-xenial 4.4.0-63.84~14.04.2 ADT test failure with linux-lts-xenial
    4.4.0-63.84~14.04.2 (LP: #1664912)
    - SAUCE: apparmor: fix link auditing failure due to, uninitialized var

  * regession tests failing after stackprofile test is run (LP: #1661030)
    - SAUCE: fix regression with domain change in complain mode

  * Permission denied and inconsistent behavior in complain mode with 'ip netns
    list' command (LP: #1648903)
    - SAUCE: fix regression with domain change in complain mode

  * unexpected errno=13 and disconnected path when trying to open /proc/1/ns/mnt
    from a unshared mount namespace (LP: #1656121)
    - SAUCE: apparmor: null profiles should inherit parent control flags

  * apparmor refcount leak of profile namespace when removing profiles
    (LP: #1660849)
    - SAUCE: apparmor: fix ns ref count link when removing profiles from policy

  * tor in lxd: apparmor="DENIED" operation="change_onexec"
    namespace="root//CONTAINERNAME_<var-lib-lxd>" profile="unconfined"
    name="system_tor" (LP: #1648143)
    - SAUCE: apparmor: Fix no_new_privs blocking change_onexec when using stacked
      namespaces

  * apparmor oops in bind_mnt when dev_path lookup fails (LP: #1660840)
    - SAUCE: apparmor: fix oops in bind_mnt when dev_path lookup fails

  * apparmor auditing denied access of special apparmor .null file
    (LP: #1660836)
    - SAUCE: apparmor: Don't audit denied access of special apparmor .null file

  * apparmor label leak when new label is unused (LP: #1660834)
    - SAUCE: apparmor: fix label leak when new label is unused

  * apparmor reference count bug in label_merge_insert() (LP: #1660833)
    - SAUCE: apparmor: fix reference count bug in label_merge_insert()

  * apparmor's raw_data file in securityfs is sometimes truncated (LP: #1638996)
    - SAUCE: apparmor: fix replacement race in reading rawdata

  * unix domain socket cross permission check failing with nested namespaces
    (LP: #1660832)
    - SAUCE: apparmor: fix cross ns perm of unix domain sockets

  * Xenial update to v4.4.59 stable release (LP: #1678960)
    - xfrm: policy: init locks early
    - virtio_balloon: init ...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
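
To check whether a given Xenial machine already runs a kernel at or beyond the package named in the Launchpad Janitor message (linux 4.4.0-75.96), the release string reported by the running kernel can be compared against that ABI number. This is a rough sketch: the function name is hypothetical, and using ABI 75 as the threshold is an assumption taken from that package version. It deliberately makes no claim about kernel series other than 4.4.0.

```python
import platform
import re
from typing import Optional

FIXED_ABI = 75  # assumption: ABI number taken from package version 4.4.0-75.96


def has_fix(release: str) -> Optional[bool]:
    """Return True/False for 4.4.0 Xenial kernels, None for anything else."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if not match:
        return None
    major, minor, patch, abi = map(int, match.groups())
    if (major, minor, patch) != (4, 4, 0):
        return None  # this sketch makes no claim about other kernel series
    return abi >= FIXED_ABI


if __name__ == "__main__":
    # platform.release() returns the same string as `uname -r`.
    print(has_fix(platform.release()))
```

For example, the 4.4.0-62-generic kernel from the original report falls below the threshold, so the function returns False for it.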