powerpc: opal machine checks lead to kernel oops and application SIGSEGV

Bug #1301424 reported by Andy Whitcroft
Affects          Status         Importance   Assigned to      Milestone
linux (Ubuntu)   Fix Released   High         Andy Whitcroft
Trusty           Fix Released   High         Andy Whitcroft

Bug Description

We're suffering kernel Oopses on multiple Power machines running
Ubuntu 14.04 beta levels, with kernels ranging from (at least) 3.13.0-16
through 3.13.0-19. We are running Ubuntu directly on top of OPAL,
without KVM.

Andy Whitcroft (apw)
Changed in linux (Ubuntu):
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Andy Whitcroft (apw)
status: Confirmed → In Progress
Andy Whitcroft (apw) wrote :

After discussion, the recommendation was as follows:

"Below is the list of upstream commits that rewrite the machine check:

a68c33f powerpc: Fix endian issues in power7/8 machine check handler
30c8263 Move precessing of MCE queued event out from syscall exit path.
4e243b7 powerpc: Fix "attempt to move .org backwards" error
b63a0ff powerpc/powernv: Machine check exception handling.
28446de powerpc/powernv: Remove machine check handling in OPAL.
b5ff421 powerpc/book3s: Queue up and process delayed MCE events.
36df96f powerpc/book3s: Decode and save machine check event.
ae744f3 powerpc/book3s: Flush SLB/TLBs if we get SLB/TLB machine check errors on
e22a227 powerpc/book3s: Flush SLB/TLBs if we get SLB/TLB machine check errors on
0440705 powerpc/book3s: Add flush_tlb operation in cpu_spec.
4c70341 powerpc/book3s: Introduce a early machine check hook in cpu_spec.
1c51089 powerpc/book3s: Return from interrupt if coming from evil context.
1e9b450 powerpc/book3s: handle machine check in Linux host.
729b0f7 powerpc/book3s: Introduce exclusive emergency stack for machine check ex
b14a7253 powerpc/book3s: Split the common exception prolog logic into two sectio

Additional commits that are in linux-next:

ece980f powerpc/book3s: Fix CFAR clobbering issue in machine check handler.
55672ec powerpc/book3s: Recover from MC in sapphire on SCOM read via MMIO.

Below is the link to a critical fix that I posted to ppc-devel yesterday:

https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-March/116447.html

There is one more critical fix in the machine handling path on its way, which
I will post to ppc-devel soon, after the tests. I will send you the link as
soon as I post it to ppc-devel, and the commit IDs once they are available
upstream."

Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Trusty):
status: In Progress → Fix Committed
Andy Whitcroft (apw) wrote :

The above is not in strict chronological order; the patches were applied in the following order:

 powerpc/book3s: Split the common exception prolog logic into two section.
 powerpc/book3s: Introduce exclusive emergency stack for machine check exception.
 powerpc/book3s: handle machine check in Linux host.
 powerpc/book3s: Return from interrupt if coming from evil context.
 powerpc/book3s: Introduce a early machine check hook in cpu_spec.
 powerpc/book3s: Add flush_tlb operation in cpu_spec.
 powerpc/book3s: Flush SLB/TLBs if we get SLB/TLB machine check errors on power7.
 powerpc/book3s: Flush SLB/TLBs if we get SLB/TLB machine check errors on power8.
 powerpc/book3s: Decode and save machine check event.
 powerpc/book3s: Queue up and process delayed MCE events.
 powerpc/powernv: Remove machine check handling in OPAL.
 powerpc/powernv: Machine check exception handling.
 powerpc: Fix "attempt to move .org backwards" error
 powerpc: Fix endian issues in power7/8 machine check handler
 Move precessing of MCE queued event out from syscall exit path.
 UBUNTU: SAUCE: powerpc/book3s: Fix CFAR clobbering issue in machine check handler.
 UBUNTU: SAUCE: powerpc/book3s: Recover from MC in sapphire on SCOM read via MMIO.
 UBUNTU: SAUCE: powerpc/book3s: Fix mc_recoverable_range buffer overrun issue.

Andy Whitcroft (apw) wrote :

Testing reported that this combination avoided the oopses and allowed the kernel to recover from various MC errors, not exploding being the primary gain.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.13.0-22.44

---------------
linux (3.13.0-22.44) trusty; urgency=low

  [ Tim Gardner ]

  * Release Tracking Bug
    - LP: #1301562

  [ dann frazier ]

  * [Config] enable linux-tools on arm64
    https://lists.ubuntu.com/archives/kernel-team/2014-April/041332.html

  [ Greg Kurz ]

  * SAUCE: powerpc/le: Big endian arguments for ppc_rtas()
    - LP: #1289518

  [ Mahesh Salgaonkar ]

  * SAUCE: powerpc/book3s: Fix CFAR clobbering issue in machine check
    handler.
    - LP: #1301424
  * SAUCE: powerpc/book3s: Recover from MC in sapphire on SCOM read via
    MMIO.
    - LP: #1301424
  * SAUCE: powerpc/book3s: Fix mc_recoverable_range buffer overrun issue.
    - LP: #1301424

  [ Paolo Pisati ]

  * [Config] armhf: USB_STORAGE=y
    https://lists.ubuntu.com/archives/kernel-team/2014-April/041349.html

  [ Stefan Bader ]

  * SAUCE: kvm: Force preempt folding in kvm on i386
    - LP: #1268906

  [ Tim Gardner ]

  * SAUCE: Drop lttng in favor of lttng-modules
    The kernel version was down rev on an rc release.

  [ Tomas Winkler ]

  * SAUCE: (no-up) mei: me: do not load the driver if the FW doesn't
    support MEI interface
    - LP: #1301118

  [ Upstream Kernel Changes ]

  * drm/i915: Deprecated UMS support
    - LP: #1284816
  * powerpc/book3s: Split the common exception prolog logic into two
    section.
    - LP: #1301424
  * powerpc/book3s: Introduce exclusive emergency stack for machine check
    exception.
    - LP: #1301424
  * powerpc/book3s: handle machine check in Linux host.
    - LP: #1301424
  * powerpc/book3s: Return from interrupt if coming from evil context.
    - LP: #1301424
  * powerpc/book3s: Introduce a early machine check hook in cpu_spec.
    - LP: #1301424
  * powerpc/book3s: Add flush_tlb operation in cpu_spec.
    - LP: #1301424
  * powerpc/book3s: Flush SLB/TLBs if we get SLB/TLB machine check errors
    on power7.
    - LP: #1301424
  * powerpc/book3s: Flush SLB/TLBs if we get SLB/TLB machine check errors
    on power8.
    - LP: #1301424
  * powerpc/book3s: Decode and save machine check event.
    - LP: #1301424
  * powerpc/book3s: Queue up and process delayed MCE events.
    - LP: #1301424
  * powerpc/powernv: Remove machine check handling in OPAL.
    - LP: #1301424
  * powerpc/powernv: Machine check exception handling.
    - LP: #1301424
  * powerpc: Fix "attempt to move .org backwards" error
    - LP: #1301424
  * powerpc: Fix endian issues in power7/8 machine check handler
    - LP: #1301424
  * Move precessing of MCE queued event out from syscall exit path.
    - LP: #1301424
 -- Andy Whitcroft <email address hidden> Wed, 02 Apr 2014 15:58:48 +0100

Changed in linux (Ubuntu Trusty):
status: Fix Committed → Fix Released
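
For anyone checking whether a given machine still carries an affected kernel, here is a minimal sketch that compares a linux package version against the fixed version named above (3.13.0-22.44). The helper names `version_key` and `has_fix` are hypothetical, and the parsing assumes the simple `X.Y.Z-ABI.UPLOAD` scheme seen in this report; it is not a full Debian version comparison (`dpkg --compare-versions` is the authoritative tool for that).

```python
import re

# First package version carrying the fix, as stated in this bug report.
FIXED_VERSION = "3.13.0-22.44"

# Assumed scheme: X.Y.Z-ABI with an optional .UPLOAD part (e.g. 3.13.0-22.44).
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d+)(?:\.(\d+))?")


def version_key(version):
    """Turn '3.13.0-22.44' into (3, 13, 0, 22, 44) for tuple comparison.

    A missing upload number is treated as 0, which errs on the side of
    reporting "not fixed" when only the ABI number is known.
    """
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    return tuple(int(g) if g is not None else 0 for g in match.groups())


def has_fix(version, fixed=FIXED_VERSION):
    """Return True if `version` is at or beyond the fixed package version."""
    return version_key(version) >= version_key(fixed)


if __name__ == "__main__":
    # The two affected kernel levels quoted in the bug description,
    # plus the fixed package version itself.
    for candidate in ("3.13.0-16", "3.13.0-19", "3.13.0-22.44"):
        print(candidate, "fixed" if has_fix(candidate) else "affected")
```

The two kernel levels quoted in the bug description (3.13.0-16 and 3.13.0-19) come out as affected, and 3.13.0-22.44 as fixed. A bare ABI string such as 3.13.0-22 is deliberately reported as affected, since the ABI number alone does not identify the upload.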