mlx4 not recovering from EEH in Ubuntu 15.04 (Mellanox)

Bug #1422481 reported by bugproxy
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Fix Released
Importance: High
Assigned to: Leann Ogasawara

Bug Description

---Problem Description---
EEH recovery is not working with the mlx4 driver. When the driver tries to recover, it hits another EEH.

---uname output---
Linux ubuntu 3.18.0-12-generic #13 SMP Mon Feb 9 16:31:42 CST 2015 ppc64le ppc64le ppc64le GNU/Linux

---Additional Hardware Info---
Needs a Mellanox adapter such as a ConnectX-3.

Machine Type = P8
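
For reference, the "PCI device/vendor" and "PCI cmd/status register" values that EEH prints in the log in this report are raw 32-bit config-space words. Below is a minimal Python sketch for splitting them, assuming the standard PCI layout (vendor ID and command register in the low 16 bits, device ID and status register in the high 16 bits). The helper names are made up for illustration and are not part of any kernel or Mellanox tool.

```python
# Hypothetical helpers for reading the EEH log values in this report.
# Assumes the standard PCI config-space layout; only a few well-known
# command-register bits are named here.
COMMAND_BITS = {
    0: "I/O space",
    1: "memory space",
    2: "bus master",
    6: "parity error response",
    8: "SERR# enable",
}


def split_config_dword(value):
    """Return (high 16 bits, low 16 bits) of a 32-bit config word."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF


def command_flags(command):
    """Names of the known command-register bits that are set."""
    return [name for bit, name in sorted(COMMAND_BITS.items())
            if command & (1 << bit)]


# "EEH: PCI device/vendor: 100715b3"
device_id, vendor_id = split_config_dword(0x100715B3)
# "EEH: PCI cmd/status register: 00100146"
status, command = split_config_dword(0x00100146)

print(hex(device_id), hex(vendor_id))
print(hex(status), hex(command), command_flags(command))
```

With the values from this report, 100715b3 splits into device 0x1007 and vendor 0x15b3. The command half reads 0x0146 in the first EEH dump and 0x0142 in the dump after the failed recovery, so the bus master bit is no longer set at that point.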

---Steps to Reproduce---
 Inject an EEH error into the mlx4 device.

Stack trace output:
 During EEH recovery, it hits this:
[ 188.747571] EEH: Collect temporary log
[ 188.748330] EEH: of node=/pci@800000020000007/ethernet@3
[ 188.748339] EEH: PCI device/vendor: 100715b3
[ 188.748361] EEH: PCI cmd/status register: 00100146
[ 188.748362] EEH: PCI-E capabilities and status follow:
[ 188.748459] EEH: PCI-E 00: 00020010 10008e02 0001200e 0843f483
[ 188.748537] EEH: PCI-E 10: 10830000 00000000 00000000 00000000
[ 188.748539] EEH: PCI-E 20: 00000000
[ 188.748540] EEH: PCI-E AER capability register set follows:
[ 188.748625] EEH: PCI-E AER 00: 00020001 00000000 00000000 00062010
[ 188.748704] EEH: PCI-E AER 10: 00002000 00002000 000001e0 00000000
[ 188.748783] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 188.748805] EEH: PCI-E AER 30: 00000000 00000000
[ 188.748813] EEH: Reset without hotplug activity
[ 193.833245] EEH: Notify device drivers the completion of reset
[ 193.833257] mlx4_core: Initializing 0001:00:03.0
[ 193.833317] mlx4_core 0001:00:03.0: BAR 0: can't reserve [mem 0x170b0000000-0x170b00fffff]
[ 193.833321] mlx4_core 0001:00:03.0: Couldn't get PCI resources, aborting
[ 193.833395] EEH: Not recovered
[ 193.833397] EEH: Unable to recover from failure from PHB#1-PE#1.
Please try reseating or replacing it
[ 193.834531] EEH: of node=/pci@800000020000007/ethernet@3
[ 193.834547] EEH: PCI device/vendor: 100715b3
[ 193.834580] EEH: PCI cmd/status register: 00100142
[ 193.834582] EEH: PCI-E capabilities and status follow:
[ 193.834728] EEH: PCI-E 00: 00020010 10008e02 0000200e 0843f483
[ 193.834846] EEH: PCI-E 10: 10830000 00000000 00000000 00000000
[ 193.834849] EEH: PCI-E 20: 00000000
[ 193.834850] EEH: PCI-E AER capability register set follows:
[ 193.834981] EEH: PCI-E AER 00: 00020001 00000000 00000000 00062010
[ 193.835101] EEH: PCI-E AER 10: 00002000 00002000 000001e0 00000000
[ 193.835219] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 193.835252] EEH: PCI-E AER 30: 00000000 00000000
[ 193.835289] Unable to handle kernel paging request for data at address 0x00000388
[ 193.835356] Faulting instruction address: 0xd000000001f3231c
[ 193.835415] Oops: Kernel access of bad area, sig: 11 [#1]
[ 193.835460] SMP NR_CPUS=2048 NUMA pSeries
[ 193.835509] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc rtc_generic mlx4_en vxlan ip6_udp_tunnel udp_tunnel mlx4_core
[ 193.835886] CPU: 6 PID: 50 Comm: eehd Not tainted 3.18.0-12-generic #13
[ 193.835942] task: c0000003f72ca880 ti: c0000003f707c000 task.ti: c0000003f707c000
[ 193.836009] NIP: d000000001f3231c LR: d000000001f32790 CTR: d000000001f32760
[ 193.836076] REGS: c0000003f707f790 TRAP: 0300 Not tainted (3.18.0-12-generic)
[ 193.836141] MSR: 8000000100009033 <SF,EE,ME,IR,DR,RI,LE> CR: 44000048 XER: 20000000
[ 193.836302] CFAR: c0000000000a7be0 DAR: 0000000000000388 DSISR: 40000000 SOFTE: 1
GPR00: d000000001f32790 c0000003f707fa10 d000000001f66310 c0000003fe0ad000
GPR04: 0000000000000003 0000000000000000 0000000000000000 c0000003fd000000
GPR08: 0000000000000001 d000000001f32760 00000000fffffffa 0000000100001001
GPR12: d000000001f32760 c00000000fb83600 c0000000000d9118 c0000003f90e56c0
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000000c4ab90
GPR24: c000000000c4ab68 0000000000100100 c0000003fe068580 c0000003fe068580
GPR28: c0000003fe0ad000 c0000003fe0685e0 d000000001f5da50 0000000000000000
[ 193.837205] NIP [d000000001f3231c] mlx4_unload_one+0x3c/0x480 [mlx4_core]
[ 193.837269] LR [d000000001f32790] mlx4_pci_err_detected+0x30/0x60 [mlx4_core]
[ 193.837336] Call Trace:
[ 193.837361] [c0000003f707fa10] [c0000003fe068580] 0xc0000003fe068580 (unreliable)
[ 193.837447] [c0000003f707faa0] [d000000001f32790] mlx4_pci_err_detected+0x30/0x60 [mlx4_core]
[ 193.837528] [c0000003f707fae0] [c00000000003ac64] eeh_report_failure+0xb4/0xf0
[ 193.837606] [c0000003f707fb10] [c0000000000393b4] eeh_pe_dev_traverse+0x94/0x160
[ 193.837685] [c0000003f707fba0] [c00000000003b148] eeh_handle_normal_event+0xa8/0x400
[ 193.837764] [c0000003f707fc20] [c00000000003b6b4] eeh_handle_event+0x54/0x360
[ 193.837832] [c0000003f707fcd0] [c00000000003bae4] eeh_event_handler+0x124/0x1d0
[ 193.837911] [c0000003f707fd80] [c0000000000d9220] kthread+0x110/0x130
[ 193.837980] [c0000003f707fe30] [c000000000009568] ret_from_kernel_thread+0x5c/0x74
[ 193.838057] Instruction dump:
[ 193.838094] fb41ffd0 fb61ffd8 fb81ffe0 fba1ffe8 fbc1fff0 fbe1fff8 f8010010 f821ff71
[ 193.838217] 7c7c1b78 48000008 e8410018 ebfc0138 <813f0388> 2f890000 409e020c e93f0008
[ 193.838341] ---[ end trace 7cd21329722bcbd1 ]---
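
The oops can be cross-checked against the instruction dump: the faulting word is the one in angle brackets (813f0388), and DAR is 0x388. Below is a small Python sketch, assuming the standard PowerPC D-form encoding (6-bit primary opcode, RT, RA, 16-bit signed displacement). The function name is hypothetical.

```python
def decode_dform(word):
    """Split a 32-bit PowerPC D-form instruction into its fields.

    Returns (primary opcode, RT, RA, signed displacement).
    """
    opcode = (word >> 26) & 0x3F
    rt = (word >> 21) & 0x1F
    ra = (word >> 16) & 0x1F
    disp = word & 0xFFFF
    if disp & 0x8000:          # sign-extend the 16-bit displacement
        disp -= 0x10000
    return opcode, rt, ra, disp


# Faulting instruction from the "Instruction dump" above.
opcode, rt, ra, disp = decode_dform(0x813F0388)
# Primary opcode 32 is lwz (load word and zero).
print(opcode, rt, ra, hex(disp))
```

This decodes to lwz r9,0x388(r31). GPR31 is 0 in the register dump, so the effective address is 0x388, which matches DAR: mlx4_unload_one is loading from offset 0x388 of a NULL pointer. That would be consistent with the driver's private data no longer being valid after the failed re-initialization earlier in the log, though the trace alone does not prove it.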

The series of patches at this link should resolve this issue:
http://permalink.gmane.org/gmane.linux.network/347777
I applied these to an upstream kernel and it works, but let me double-check with the Ubuntu 15.04 kernel whether these are the patches we need to resolve this bug.

I used the Ubuntu 15.04 kernel 3.18.0-13.14.
To make EEH work, I had to apply all of the following patches first, just to be able to apply the first 2 patches of that series:

From ca9f9f703950e5cb300526549b4f1b0a6605a5c5 Mon Sep 17 00:00:00 2001
From: Amir Vadai <email address hidden>
Date: Tue, 25 Feb 2014 18:17:52 +0200
Subject: net/mlx4_en: Fix bad use of dev_id

From adbc7ac5c15eb5e9d70393428345e72a1a897d6a Mon Sep 17 00:00:00 2001
From: Saeed Mahameed <email address hidden>
Date: Mon, 27 Oct 2014 11:37:37 +0200
Subject: net/mlx4_core: Introduce ACCESS_REG CMD and eth_prot_ctrl dev cap

From a53e3e8c1db547981e13d1ebf24a659bd4e87710 Mon Sep 17 00:00:00 2001
From: Saeed Mahameed <email address hidden>
Date: Mon, 27 Oct 2014 11:37:38 +0200
Subject: net/mlx4_core: Add ethernet backplane autoneg device capability

From d475c95b4bcff983ac76e8522bfd2d29bcc567d0 Mon Sep 17 00:00:00 2001
From: Matan Barak <email address hidden>
Date: Sun, 2 Nov 2014 16:26:17 +0200
Subject: net/mlx4_core: Add retrieval of CONFIG_DEV parameters

From dd65beac48a5259945846956d4b27344dfb73bd9 Mon Sep 17 00:00:00 2001
From: Shani Michaeli <email address hidden>
Date: Sun, 9 Nov 2014 13:51:52 +0200
Subject: net/mlx4_en: Extend usage of napi_gro_frags

From f8c6455bb04b944edb69e9b074e28efee2c56bdd Mon Sep 17 00:00:00 2001
From: Shani Michaeli <email address hidden>
Date: Sun, 9 Nov 2014 13:51:53 +0200
Subject: net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE

From ffc39f6d6fff2878c55ffa5ffb1828d7618c0a29 Mon Sep 17 00:00:00 2001
From: Matan Barak <email address hidden>
Date: Thu, 13 Nov 2014 14:45:29 +0200
Subject: net/mlx4_core: Refactor mlx4_cmd_init and mlx4_cmd_cleanup

From a0eacca948d2d4531a393d82a736ff19b7b8fa0b Mon Sep 17 00:00:00 2001
From: Matan Barak <email address hidden>
Date: Thu, 13 Nov 2014 14:45:30 +0200
Subject: net/mlx4_core: Refactor mlx4_load_one

From e8c4265bea8437f5583d0c2f272058200ebc10ff Mon Sep 17 00:00:00 2001
From: Matan Barak <email address hidden>
Date: Thu, 13 Nov 2014 14:45:31 +0200
Subject: net/mlx4_core: Add QUERY_FUNC firmware command

From 7ae0e400cd9396c41fe596d35dcc34feaa89a04f Mon Sep 17 00:00:00 2001
From: Matan Barak <email address hidden>
Date: Thu, 13 Nov 2014 14:45:32 +0200
Subject: net/mlx4_core: Flexible (asymmetric) allocation of EQs and MSI-X
 vectors for PF/VFs

From da315679e80635021e98de1306ff4eee0759ba57 Mon Sep 17 00:00:00 2001
From: Matan Barak <email address hidden>
Date: Sun, 14 Dec 2014 16:18:04 +0200
Subject: net/mlx4_core: Fixed memory leak and incorrect refcount in

With those patches in place, I can apply these from the series I pointed to:

==> 0001-net-mlx4_core-Maintain-a-persistent-memory-for-mlx4-.patch <==
From 872bf2fb69d90e3619befee842fc26db39d8e475 Mon Sep 17 00:00:00 2001
From: Yishai Hadas <email address hidden>
Date: Sun, 25 Jan 2015 16:59:35 +0200
Subject: net/mlx4_core: Maintain a persistent memory for mlx4 device

==> 0002-net-mlx4_core-Set-device-configuration-data-to-be-pe.patch <==
From dd0eefe3abbf47442db296bf68f27eb2860c1cdf Mon Sep 17 00:00:00 2001
From: Yishai Hadas <email address hidden>
Date: Sun, 25 Jan 2015 16:59:36 +0200
Subject: net/mlx4_core: Set device configuration data to be persistent across
 reset

==> 0003-net-mlx4_core-Refactor-the-catas-flow-to-work-per-de.patch <==
From ad9a0bf08ffbf32b8f292c3bb78ca0f24bb8f6b2 Mon Sep 17 00:00:00 2001
From: Yishai Hadas <email address hidden>
Date: Sun, 25 Jan 2015 16:59:37 +0200
Subject: net/mlx4_core: Refactor the catas flow to work per device

==> 0004-net-mlx4_core-Enhance-the-catas-flow-to-support-devi.patch <==
From f6bc11e42646e661e699a5593cbd1e9dba7191d0 Mon Sep 17 00:00:00 2001
From: Yishai Hadas <email address hidden>
Date: Sun, 25 Jan 2015 16:59:38 +0200
Subject: net/mlx4_core: Enhance the catas flow to support device reset

==> 0005-net-mlx4_core-Activate-reset-flow-upon-fatal-command.patch <==
From f5aef5aa35063f2b45c3605871cd525d0cb7fb7a Mon Sep 17 00:00:00 2001
From: Yishai Hadas <email address hidden>
Date: Sun, 25 Jan 2015 16:59:39 +0200
Subject: net/mlx4_core: Activate reset flow upon fatal command cases

==> 0006-net-mlx4_core-Manage-interface-state-for-Reset-flow-.patch <==
From c69453e294c9f16da977b68e658a8028b854c209 Mon Sep 17 00:00:00 2001
From: Yishai Hadas <email address hidden>
Date: Sun, 25 Jan 2015 16:59:40 +0200
Subject: net/mlx4_core: Manage interface state for Reset flow cases

==> 0007-net-mlx4_core-Handle-AER-flow-properly.patch <==
From 2ba5fbd62b2534335f4e3b844ecc7860115525a3 Mon Sep 17 00:00:00 2001
From: Yishai Hadas <email address hidden>
Date: Sun, 25 Jan 2015 16:59:41 +0200
Subject: net/mlx4_core: Handle AER flow properly

But to apply the whole series, including SRIOV EEH, I need these extra patches:
==> 0008-g-mlx4.patch <==
From 225c6c8c6bbbc32455df3d1c0fb1e1e1fb51c533 Mon Sep 17 00:00:00 2001
From: Matan Barak <email address hidden>
Date: Thu, 13 Nov 2014 14:45:28 +0200
Subject: net/mlx4_core: Use correct variable type for mlx4_slave_cap

==> 0008-l-mlx4.patch <==
From de966c5928026b100a989c8cef761d306310a184 Mon Sep 17 00:00:00 2001
From: Matan Barak <email address hidden>
Date: Thu, 13 Nov 2014 14:45:33 +0200
Subject: net/mlx4_core: Support more than 64 VFs

==> 0008-m-mlx4.patch <==
From 383677da43fa83b390888cf7d25885166b2a6812 Mon Sep 17 00:00:00 2001
From: Or Gerlitz <email address hidden>
Date: Thu, 11 Dec 2014 10:57:52 +0200
Subject: net/mlx4_core: Mask out host side virtualization features for guests

==> 0008-net-mlx4_core-Enable-device-recovery-flow-with-SRIOV.patch <==
From 55ad359225b2232b9b8f04a0dfa169bd3a7d86d2 Mon Sep 17 00:00:00 2001
From: Yishai Hadas <email address hidden>
Date: Sun, 25 Jan 2015 16:59:42 +0200
Subject: net/mlx4_core: Enable device recovery flow with SRIOV

==> 0008-n-mlx4.patch <==
From ddae0349fdb78bcc5e7219061847012aa1a29069 Mon Sep 17 00:00:00 2001
From: Eugenia Emantayev <email address hidden>
Date: Thu, 11 Dec 2014 10:57:54 +0200
Subject: net/mlx4: Change QP allocation scheme

==> 0008-o-mlx4.patch <==
From 431df8c7e9708433459fd806a08308997de43121 Mon Sep 17 00:00:00 2001
From: Matan Barak <email address hidden>
Date: Thu, 11 Dec 2014 10:57:59 +0200
Subject: net/mlx4: Refactor QUERY_PORT

==> 0008-p-mlx4.patch <==
From ab256e5ad02b36951f01bf6b5cfda25f14820847 Mon Sep 17 00:00:00 2001
From: Dotan Barak <email address hidden>
Date: Thu, 11 Dec 2014 10:57:55 +0200
Subject: net/mlx4: Add a check if there are too many reserved QPs

==> 0008-r-mlx4.patch <==
From d57febe1a47801ef8a55dbf10672850523dfaa60 Mon Sep 17 00:00:00 2001
From: Matan Barak <email address hidden>
Date: Thu, 11 Dec 2014 10:57:57 +0200
Subject: net/mlx4: Add A0 hybrid steering

==> 0008-s-mlx4.patch <==
From 7d077cd34eabb2ffd05abe0f2cad01da1ef11712 Mon Sep 17 00:00:00 2001
From: Matan Barak <email address hidden>
Date: Thu, 11 Dec 2014 10:58:00 +0200
Subject: net/mlx4: Add support for A0 steering

==> 0008-z-mlx4.patch <==
From 7a89399ffad7b7c47b43afda010309b3b88538c0 Mon Sep 17 00:00:00 2001
From: Matan Barak <email address hidden>
Date: Thu, 11 Dec 2014 10:57:56 +0200
Subject: net/mlx4: Add mlx4_bitmap zone allocator

After that, I can apply these:
From 55ad359225b2232b9b8f04a0dfa169bd3a7d86d2 Mon Sep 17 00:00:00 2001
From: Yishai Hadas <email address hidden>
Date: Sun, 25 Jan 2015 16:59:42 +0200
Subject: net/mlx4_core: Enable device recovery flow with SRIOV

==> 0009-net-mlx4_core-Reset-flow-activation-upon-SRIOV-fatal.patch <==
From 0cd9302734111abc0b5912b695336f2ee63cb22b Mon Sep 17 00:00:00 2001
From: Yishai Hadas <email address hidden>
Date: Sun, 25 Jan 2015 16:59:43 +0200
Subject: net/mlx4_core: Reset flow activation upon SRIOV fatal command cases

So basically, applying the series will require a lot of patches, and the driver will probably need to be retested.

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-121681 severity-high targetmilestone-inin1504
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1422481/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
Luciano Chavez (lnx1138)
affects: ubuntu → linux (Ubuntu)
tags: added: kernel-da-key
Leann Ogasawara (leannogasawara) wrote :

I've done an audit of all of the patches noted above. All of them are officially upstream. Some landed in 3.19, while the rest are in 4.0-rc1. It breaks down as follows:

git describe --contains ca9f9f703950e5cb300526549b4f1b0a6605a5c5
v3.15-rc1~113^2~263^2

  commit ca9f9f703950e5cb300526549b4f1b0a6605a5c5
  Author: Amir Vadai <email address hidden>
  Date: Tue Feb 25 18:17:52 2014 +0200

    net/mlx4_en: Fix bad use of dev_id

git describe --contains adbc7ac5c15eb5e9d70393428345e72a1a897d6a
v3.19-rc1~118^2~332^2~10

  commit adbc7ac5c15eb5e9d70393428345e72a1a897d6a
  Author: Saeed Mahameed <email address hidden>
  Date: Mon Oct 27 11:37:37 2014 +0200

    net/mlx4_core: Introduce ACCESS_REG CMD and eth_prot_ctrl dev cap

git describe --contains a53e3e8c1db547981e13d1ebf24a659bd4e87710
v3.19-rc1~118^2~332^2~9

  commit a53e3e8c1db547981e13d1ebf24a659bd4e87710
  Author: Saeed Mahameed <email address hidden>
  Date: Mon Oct 27 11:37:38 2014 +0200

    net/mlx4_core: Add ethernet backplane autoneg device capability

git describe --contains d475c95b4bcff983ac76e8522bfd2d29bcc567d0
v3.19-rc1~118^2~294^2

  commit d475c95b4bcff983ac76e8522bfd2d29bcc567d0
  Author: Matan Barak <email address hidden>
  Date: Sun Nov 2 16:26:17 2014 +0200

    net/mlx4_core: Add retrieval of CONFIG_DEV parameters

git describe --contains dd65beac48a5259945846956d4b27344dfb73bd9
v3.19-rc1~118^2~228^2~1

  commit dd65beac48a5259945846956d4b27344dfb73bd9
  Author: Shani Michaeli <email address hidden>
  Date: Sun Nov 9 13:51:52 2014 +0200

    net/mlx4_en: Extend usage of napi_gro_frags

git describe --contains f8c6455bb04b944edb69e9b074e28efee2c56bdd
v3.19-rc1~118^2~228^2

  commit f8c6455bb04b944edb69e9b074e28efee2c56bdd
  Author: Shani Michaeli <email address hidden>
  Date: Sun Nov 9 13:51:53 2014 +0200

    net/mlx4_en: Extend checksum offloading by CHECKSUM COMPLETE

git describe --contains ffc39f6d6fff2878c55ffa5ffb1828d7618c0a29
v3.19-rc1~118^2~192^2~4

  commit ffc39f6d6fff2878c55ffa5ffb1828d7618c0a29
  Author: Matan Barak <email address hidden>
  Date: Thu Nov 13 14:45:29 2014 +0200

    net/mlx4_core: Refactor mlx4_cmd_init and mlx4_cmd_cleanup

git describe --contains a0eacca948d2d4531a393d82a736ff19b7b8fa0b
v3.19-rc1~118^2~192^2~3

  commit a0eacca948d2d4531a393d82a736ff19b7b8fa0b
  Author: Matan Barak <email address hidden>
  Date: Thu Nov 13 14:45:30 2014 +0200

    net/mlx4_core: Refactor mlx4_load_one

git describe --contains e8c4265bea8437f5583d0c2f272058200ebc10ff
v3.19-rc1~118^2~192^2~2

  commit e8c4265bea8437f5583d0c2f272058200ebc10ff
  Author: Matan Barak <email address hidden>
  Date: Thu Nov 13 14:45:31 2014 +0200

    net/mlx4_core: Add QUERY_FUNC firmware command

git describe --contains 7ae0e400cd9396c41fe596d35dcc34feaa89a04f
v3.19-rc1~118^2~192^2~1

  commit 7ae0e400cd9396c41fe596d35dcc34feaa89a04f
  Author: Matan Barak <email address hidden>
  Date: Thu Nov 13 14:45:32 2014 +0200

    net/mlx4_core: Flexible (asymmetric) allocation of EQs and MSI-X vectors for PF/VFs

git describe --contains da315679e80635021e98de1306ff4eee0759ba57
v3.19-rc1~32^2~28^2~1

  commit da315679e80635021e98de1306ff4eee0759ba57
  Author: Matan Barak <m...
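
The `git describe --contains` lines in this audit name the first tag that contains each commit, followed by a `~`/`^` path from that tag back to the commit. Below is a minimal Python sketch for pulling out just the tag, assuming output in the form shown above. The function name is made up for illustration.

```python
import re


def containing_tag(describe_output):
    """Return the tag part of a `git describe --contains` result.

    Everything from the first '~' or '^' onward is the path from the
    tag back to the commit and is dropped.
    """
    return re.split(r"[~^]", describe_output.strip(), maxsplit=1)[0]


# Values copied from the audit in this comment.
for line in ("v3.15-rc1~113^2~263^2",
             "v3.19-rc1~118^2~332^2~10",
             "v3.19-rc1~32^2~28^2~1"):
    print(containing_tag(line))
```

Grouping the commits by this tag gives a quick view of which upstream release each prerequisite patch first appeared in.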


Changed in linux (Ubuntu):
assignee: nobody → Leann Ogasawara (leannogasawara)
importance: Undecided → High
status: New → In Progress
Andy Whitcroft (apw)
Changed in linux (Ubuntu):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.19.0-8.8

---------------
linux (3.19.0-8.8) vivid; urgency=low

  [ Andy Whitcroft ]

  * ubuntu: vbox -- elide the new symlinks and reconstruct on clean:
    - LP: #1426113
  * rebase to stable v3.19.1

  [ John Johansen ]

  * SAUCE: (no-up): apparmor: fix mediation of fs unix sockets
    - LP: #1408833

  [ Leann Ogasawara ]

  * Release Tracking Bug
    - LP: #1429940

  [ Upstream Kernel Changes ]

  * xen: correct bug in p2m list initialization
  * net/mlx5_core: Fix configuration of log_uar_page_sz
    - LP: #1419938
  * tpm/ibmvtpm: Additional LE support for tpm_ibmvtpm_send
    - LP: #1420575
  * net/mlx4_core: Maintain a persistent memory for mlx4 device
    - LP: #1422481
  * net/mlx4_core: Set device configuration data to be persistent across
    reset
    - LP: #1422481
  * net/mlx4_core: Refactor the catas flow to work per device
    - LP: #1422481
  * net/mlx4_core: Enhance the catas flow to support device reset
    - LP: #1422481
  * net/mlx4_core: Activate reset flow upon fatal command cases
    - LP: #1422481
  * net/mlx4_core: Manage interface state for Reset flow cases
    - LP: #1422481
  * net/mlx4_core: Handle AER flow properly
    - LP: #1422481
  * net/mlx4_core: Enable device recovery flow with SRIOV
    - LP: #1422481
  * net/mlx4_core: Reset flow activation upon SRIOV fatal command cases
    - LP: #1422481
  * tg3: Hold tp->lock before calling tg3_halt() from tg3_init_one()
    - LP: #1428111
  * rebase to v3.19.1
    - LP: #1410704
    - LP: #1411193
    - LP: #1400215
 -- Leann Ogasawara <email address hidden> Mon, 09 Mar 2015 10:08:29 -0700

Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2015-03-11 19:15 EDT-------
This looks fixed with 3.19.0-8-generic #8-Ubuntu; the device was able to recover from EEH.

[ 2694.622586] EEH: Notify device drivers to shutdown
[ 2694.622587] mlx4_core 0004:01:00.0: device was reset successfully
[ 2694.622589] mlx4_core 0004:01:00.0: mlx4_pci_err_detected was called
[ 2694.622594] mlx4_en 0004:01:00.0: Internal error detected, restarting device
[ 2694.622786] mlx4_en: eth14: Close port called
[ 2694.846830] mlx4_en 0004:01:00.0: removed PHC
[ 2694.874036] EEH: Collect temporary log
[ 2694.879101] EEH: of node=/pciex@3fffe42000000/pci@0/ethernet@0
[ 2694.879465] EEH: PCI device/vendor: 100715b3
[ 2694.879478] EEH: PCI cmd/status register: 00100142
[ 2694.879479] EEH: PCI-E capabilities and status follow:
[ 2694.879544] EEH: PCI-E 00: 00020010 10008e02 0020204e 0843f483
[ 2694.879597] EEH: PCI-E 10: 10830040 00000000 00000000 00000000
[ 2694.879598] EEH: PCI-E 20: 00000000
[ 2694.879599] EEH: PCI-E AER capability register set follows:
[ 2694.879666] EEH: PCI-E AER 00: 18c20001 00000000 00000000 00062010
[ 2694.879719] EEH: PCI-E AER 10: 00000000 00002000 000001e0 00000000
[ 2694.879772] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 2694.879785] EEH: PCI-E AER 30: 00000000 00000000
[ 2694.879787] PHB3 PHB#4 Diag-data (Version: 1)
[ 2694.879789] brdgCtl: 00000002
[ 2694.879790] UtlSts: 00200000 00000000 00000000
[ 2694.879791] RootSts: 00000040 00400000 f0830048 00100147 00000000
[ 2694.879792] PhbSts: 0000001c00000000 0000001c00000000
[ 2694.879793] Lem: 0000000000100000 42498e327f502eae 0000000000000000
[ 2694.879795] InAErr: 8000000000000000 8000000000000000 0402008000000000 0000000000000000
[ 2694.879796] PE[ 1] A/B: 8480002b00000000 8000000000000000
[ 2694.879797] PE[ 2] A/B: 8000000000000000 8000000000000000
[ 2694.879798] PE[ 3] A/B: 8000000000000000 8000000000000000
[ 2694.879799] PE[ 4] A/B: 8000000000000000 8000000000000000
[ 2694.879800] PE[ 5] A/B: 8000000000000000 8000000000000000
[ 2694.879801] EEH: Reset without hotplug activity
[ 2698.898176] EEH: Notify device drivers the completion of reset
[ 2698.898181] mlx4_core 0004:01:00.0: mlx4_pci_slot_reset was called
[ 2698.898218] mlx4_core 0004:01:00.0: enabling device (0140 -> 0142)
[ 2705.396286] mlx4_core 0004:01:00.0: PCIe link speed is 8.0GT/s, device supports 8.0GT/s
[ 2705.396288] mlx4_core 0004:01:00.0: PCIe link width is x8, device supports x8
[ 2706.143789] mlx4_en 0004:01:00.0: registered PHC clock
[ 2706.143864] mlx4_en 0004:01:00.0: Activating port:1
[ 2706.159496] mlx4_en: eth11: Using 256 TX rings
[ 2706.159504] mlx4_en: eth11: Using 8 RX rings
[ 2706.159506] mlx4_en: eth11: frag:0 - size:1518 prefix:0 stride:1536
[ 2706.159722] mlx4_en: eth11: Initializing port
[ 2706.160022] mlx4_en 0004:01:00.0: Activating port:2
[ 2706.165214] mlx4_core 0004:01:00.0 eth14: renamed from eth11
[ 2706.188419] mlx4_en: eth11: Using 256 TX rings
[ 2706.188427] mlx4_en: eth11: Using 8 RX rings
[ 2706.188430] mlx4_en: eth11: frag:0 - size:1518 prefix:0 stride:1536
[ 2706.188660] mlx4_en: eth11: Initializing port
[ 2706.197316] EEH: Notify device driver to resume...
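
When retesting, the two outcomes seen in this report can be told apart from the kernel log alone. Below is a rough Python sketch, assuming only the message strings that appear in the logs here ("EEH: Not recovered", "EEH: Unable to recover", "EEH: Notify device driver to resume"). Other kernel versions may word these differently, and the function name is hypothetical.

```python
def eeh_outcome(log_text):
    """Classify an EEH event from dmesg text.

    Returns "failed", "recovered" or "unknown". Matches only the
    message strings seen in this bug report.
    """
    if "EEH: Not recovered" in log_text or "EEH: Unable to recover" in log_text:
        return "failed"
    if "EEH: Notify device driver to resume" in log_text:
        return "recovered"
    return "unknown"


# Lines copied from the failing and the fixed kernel in this report.
before = "[ 193.833395] EEH: Not recovered\n"
after = "[ 2706.197316] EEH: Notify device driver to resume\n"
print(eeh_outcome(before), eeh_outcome(after))
```

The failure check comes first on purpose, so a log containing both a failed and a later successful recovery is still flagged.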


bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-03-12 13:31 EDT-------
Closing it per previous comment.
