OPRASHB:Habanero:EEH: Opal not calling out slot number for failing adapter behind plx switch

Bug #1538909 reported by bugproxy
This bug affects 1 person
Affects         Status        Importance  Assigned to  Milestone
linux (Ubuntu)  Fix Released  High        Tim Gardner
  Vivid         Won't Fix     Undecided   Unassigned
  Wily          New           Undecided   Unassigned
  Xenial        Fix Released  High        Tim Gardner

Bug Description

== Comment: #0 - MAMATHA INAMDAR <email address hidden> - 2016-01-14 04:51:41 ==
---Problem Description---
Working with Chad (IO team), we were able to inject an EEH recoverable error into the Broadcom network adapter (PE #2) behind the PLX switch. It looks like OPAL calls out the Backplane PLX (Planar) instead of the adapter slot.

In this defect we want to focus primarily on why the adapter slot (behind the PLX) didn't get called out.

Problem:
========
>> Working with Chad (IO team) we were able to inject an EEH recoverable error
>> to the broadcom network adapter (PE #2) and noticed that we are not getting
>> the adapter slot called out, instead we get the location pointing to the
>> backplane PLX.
>
>If what you injected is a PCIe error message, I think those cause the
>switch leg to freeze, but I will need Gavin to confirm.
>
>> Per Chad:
>> They're logging the right PE (#2--which corresponds to the Broadcom
>> adapter)--they're just not pointing to its slot explicitly.
>>
>>
>>
>> Here is a snippet from /var/log/messages:
>>
>> Nov 11 12:17:18 habmc8p01 kernel: EEH: Frozen PHB#1-PE#2 detected
>> Nov 11 12:17:18 habmc8p01 kernel: bnx2x: [bnx2x_timer:5750(net0)]MFW seems hanged: drv_pulse (0x1c1) != mcp_pulse (0x7fff)
>> Nov 11 12:17:18 habmc8p01 kernel: EEH: PE location: Backplane PLX, PHB location: Backplane PLX
>>
>> injection:
>> setpci -s 0001:0c:00.2 COMMAND
>> setpci -s 0001:0c:00.2 COMMAND=0540

It seems you're disabling the memory BAR and then issuing an MMIO load, which
results in an "unsupported request" response from the adapter. In response to
that, PE#2, as shown in the kernel log, is put into the frozen state. Nothing is
wrong at this point. I think the only question is why the location code doesn't
make sense.
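
To see why that setpci write disables MMIO decoding, the value can be checked against the standard PCI Command register layout. The following is a minimal sketch, not something taken from the original report: the helper names are hypothetical, and the bit names follow the PCI specification's Command register at config-space offset 0x04.

```python
# Bit positions of the 16-bit PCI Command register (config-space offset 0x04).
COMMAND_BITS = {
    0: "I/O Space",
    1: "Memory Space",
    2: "Bus Master",
    3: "Special Cycles",
    4: "Memory Write and Invalidate",
    5: "VGA Palette Snoop",
    6: "Parity Error Response",
    8: "SERR# Enable",
    9: "Fast Back-to-Back Enable",
    10: "Interrupt Disable",
}


def decode_command(value):
    """Return the names of the bits set in a PCI Command register value."""
    return [name for bit, name in sorted(COMMAND_BITS.items()) if value & (1 << bit)]


def memory_space_enabled(value):
    """True if the device will respond to memory-space (MMIO) accesses."""
    return bool(value & (1 << 1))


injected = 0x0540  # setpci interprets the value in "COMMAND=0540" as hex
print(decode_command(injected))
print(memory_space_enabled(injected))
```

With 0x0540 the Memory Space bit is clear, so a later MMIO load to the function's BAR is not claimed by the adapter, which matches the "unsupported request" explanation above.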

Contact Information = -----

---uname output---
3.19.0-43-generic

Machine Type = ----

---Debugger---
A debugger is not configured

---Steps to Reproduce---
 This bug is a follow-up to bug 133061. It was opened to backport to Ubuntu the kernel patch that fixes bug 133061.

Stack trace output:
 no

Oops output:
 no

System Dump Info:
  The system is not configured to capture a system dump.

*Additional Instructions for -----:
-Post a private note with access information to the machine that the bug is occurring on.
-Attach sysctl -a output to the bug.

Firestone server ( Ubuntu Host )
=================================

I have built the Ubuntu kernel with the patch and created *.deb files.

Please find the same in the following path for installing and testing your test case

==== State: Assigned by: thalerj on 13 January 2016 12:10:58 ====

The patched kernel is working great for both Firestone and Habanero. It resolves the issue and all slot numbers are called out properly.
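
Checking the location call-out after an injection can be scripted against the kernel log. This is only a sketch under stated assumptions: it assumes the two EEH message formats quoted in the problem description ("EEH: Frozen PHB#…-PE#… detected" and "EEH: PE location: …, PHB location: …"), each on a single line. The function names are hypothetical, and treating "PE location equals PHB location" as the symptom is a heuristic drawn from this report, not a documented rule.

```python
import re

# Patterns modelled on the log lines quoted in this bug report.
FROZEN_RE = re.compile(
    r"EEH: Frozen PHB#(?P<phb>[0-9a-fA-F]+)-PE#(?P<pe>[0-9a-fA-F]+) detected"
)
LOCATION_RE = re.compile(
    r"EEH: PE location: (?P<pe_loc>.+?), PHB location: (?P<phb_loc>.+)$"
)


def summarize_eeh(lines):
    """Pair each 'Frozen PHB#x-PE#y' event with the location line that follows it."""
    events = []
    current = None
    for line in lines:
        match = FROZEN_RE.search(line)
        if match:
            current = {"phb": match["phb"], "pe": match["pe"],
                       "pe_loc": None, "phb_loc": None}
            events.append(current)
            continue
        match = LOCATION_RE.search(line)
        if match and current is not None:
            current["pe_loc"] = match["pe_loc"].strip()
            current["phb_loc"] = match["phb_loc"].strip()
    return events


def slot_not_called_out(event):
    """Heuristic (assumption): the PE location merely repeats the PHB location."""
    return event["pe_loc"] is not None and event["pe_loc"] == event["phb_loc"]


sample = [
    "Nov 11 12:17:18 habmc8p01 kernel: EEH: Frozen PHB#1-PE#2 detected",
    "Nov 11 12:17:18 habmc8p01 kernel: EEH: PE location: Backplane PLX, "
    "PHB location: Backplane PLX",
]
for event in summarize_eeh(sample):
    print(event, slot_not_called_out(event))
```

Run against the unpatched log excerpt from the description, the heuristic flags the PHB#1-PE#2 event, since both locations read "Backplane PLX".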

== Comment: #1 - MAMATHA INAMDAR <email address hidden> - 2016-01-28 01:02:23 ==
Patch is now available in the following branch

https://git.kernel.org/cgit/linux/kernel/git/powerpc/linux.git/commit/?h=fixes&id=7e56f627768da4e6480986b5145dc3422bc448a5

== Comment: #3 - MAMATHA INAMDAR <email address hidden> - 2016-01-28 01:09:14 ==

bugproxy (bugproxy) wrote : Kernel patch to fix issue

Default Comment by Bridge

tags: added: architecture-ppc64 bugnameltc-135219 severity-high targetmilestone-inin---
Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1538909/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
affects: ubuntu → linux (Ubuntu)
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
status: New → Triaged
Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Xenial):
assignee: Canonical Kernel Team (canonical-kernel-team) → Tim Gardner (timg-tpi)
status: Triaged → Fix Committed
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-02-07 23:57 EDT-------
Hello Canonical,

Any updates on this?

tags: added: severity-critical
removed: severity-high
Tim Gardner (timg-tpi) wrote :

This fix will be released with Ubuntu-4.4.0-4.19 which is in proposed. It has also been added to the 4.2.y-ckt stable tree.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-4.19

---------------
linux (4.4.0-4.19) xenial; urgency=low

  * update ZFS and SPL to 0.6.5.4 (LP: #1542296)
    - [Config] update spl/zfs version
    - SAUCE: (noup) Update spl to 0.6.5.4-0ubuntu2, zfs to 0.6.5.4-0ubuntu1
    - [Config] reconstruct -- drop links for zfs userspace components
    - [Config] reconstruct -- drop links for zfs userspace components -- restore spec links

  * recvmsg() fails SCM_CREDENTIALS request with EOPNOTSUPP. (LP: #1540731)
    - Revert "af_unix: Revert 'lock_interruptible' in stream receive code"

  * lxc: ADT exercise test failing with linux-4.4.0-3.17 (LP: #1542049)
    - Revert "UBUNTU: SAUCE: apparmor: fix sleep from invalid context"

  * WARNING: at /build/linux-lts-wily-W0lTWH/linux-lts-wily-4.2.0/net/core/skbuff.c:4174 (Travis IB) (LP: #1541326)
    - SAUCE: IB/IPoIB: Do not set skb truesize since using one linearskb

  * backport Microsoft Precision Touchpad palm rejection patch (LP: #1541671)
    - HID: multitouch: enable palm rejection if device implements confidence usage

  * [Ubuntu 16.04] Update qla2xxx driver for POWER (QLogic) (LP: #1541456)
    - qla2xxx: Remove unavailable firmware files
    - qla2xxx: Enable Extended Logins support
    - qla2xxx: Enable Exchange offload support.
    - qla2xxx: Enable Target counters in DebugFS.
    - qla2xxx: Add FW resource count in DebugFS.
    - qla2xxx: Added interface to send explicit LOGO.
    - qla2xxx: Delete session if initiator is gone from FW
    - qla2xxx: Wait for all conflicts before ack'ing PLOGI
    - qla2xxx: Replace QLA_TGT_STATE_ABORTED with a bit.
    - qla2xxx: Remove dependency on hardware_lock to reduce lock contention.
    - qla2xxx: Add irq affinity notification
    - qla2xxx: Add selective command queuing
    - qla2xxx: Move atioq to a different lock to reduce lock contention
    - qla2xxx: Disable ZIO at start time.
    - qla2xxx: Set all queues to 4k
    - qla2xxx: Check for online flag instead of active reset when transmitting responses
    - scsi: qla2xxxx: avoid type mismatch in comparison

  * [Hyper-V] PCI Passthrough (LP: #1541120)
    - x86/irq: Export functions to allow MSI domains in modules
    - genirq/msi: Export functions to allow MSI domains in modules

  * Update lpfc driver to 11.0.0.10 (LP: #1541592)
    - lpfc: Fix FCF Infinite loop in lpfc_sli4_fcf_rr_next_index_get.
    - lpfc: Fix the FLOGI discovery logic to comply with T11 standards
    - lpfc: Fix RegLogin failed error seen on Lancer FC during port bounce
    - lpfc: Fix driver crash when module parameter lpfc_fcp_io_channel set to 16
    - lpfc: Fix crash in fcp command completion path.
    - lpfc: Modularize and cleanup FDMI code in driver
    - lpfc: Fix RDP Speed reporting.
    - lpfc: Fix RDP ACC being too long.
    - lpfc: Make write check error processing more resilient
    - lpfc: Use new FDMI speed definitions for 10G, 25G and 40G FCoE.
    - lpfc: Fix mbox reuse in PLOGI completion
    - lpfc: Fix external loopback failure.
    - lpfc: Add logging for misconfigured optics.
    - lpfc: Delete unnecessary checks before the function call "mempool_destroy"
    - lpfc: Use kzalloc instead of kmalloc
...


Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-02-22 21:48 EDT-------
Will we see this fix release in a 14.04 SRU?

tags: added: targetmilestone-inin14044
removed: targetmilestone-inin---
Tim Gardner (timg-tpi) wrote :

commit 7e56f627768da4e6480986b5145dc3422bc448a5 (powerpc/eeh: Fix PE location code) does not appear to do anything other than add a function that is unused.

Andy Whitcroft (apw) wrote : Closing unsupported series nomination.

This bug was nominated against a series that is no longer supported, ie vivid. The bug task representing the vivid nomination is being closed as Won't Fix.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu Vivid):
status: New → Won't Fix