17.10 fails to boot on POWER9 DD2.0 with Deep stop states

Bug #1715064 reported by bugproxy
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
The Ubuntu-power-systems project
Fix Released
High
Canonical Kernel Team
linux (Ubuntu)
Fix Released
High
Seth Forshee
Artful
Fix Released
High
Seth Forshee

Bug Description

== Comment: #0 - Ranjal G. Shenoy
On Boston DD2.0 system, where deep stop states such as stop4 are enabled, the 17.10 kernel Ubuntu-4.12.0-12.13 fails to boot.

It requires the following upstream fixes to be backported.

1) commit 5f221c3ca13d ("powerpc/powernv/idle: Correctly initialize core_idle_state_ptr")
2) commit ec4867355244 ("powerpc/powernv/idle: Decouple Timebase restore & Per-core SPRs restore")
3) commit cb0be7ec0307 ("powerpc/powernv/idle: Restore LPCR on wakeup from deep-stop")
4) commit 1e1601b38e6e ("powerpc/powernv/idle: Restore SPRs for deep idle states via stop API.")
5) commit 22c6663dc69a ("powerpc/powernv/idle: Use Requested Level for restoring state on P9 DD1")
6) commit f9122ee4f558 ("cpuidle-powernv: Allow Deep stop states that don't stop time")
7) commit 785a12afdb4a ("powerpc/powernv/idle: Disable LOSE_FULL_CONTEXT states when stop-api fails")
8) commit e1c1cfed5432 ("powerpc/powernv: Save/Restore additional SPRs for stop4 cpuidle")
9) commit 24be85a23d1f ("powerpc/powernv: Clear PECE1 in LPCR via stop-api only on Hotplug")
10) https://patchwork.ozlabs.org/patch/808233/ ("powerpc/powernv: Clear LPCR[PECE1] via stop-api only for deep state offline")

Of these 1-7 are in Linux Kernel 4.13. 8 and 9 are in powerpc/linux.git -next branch. and 10) is posted upstream which fixes 9).

These patches have been backported on top of Ubuntu-4.12.0-12.13 and tested on Boston where they are working as expected.

== Comment: #1 - Ranjal G. Shenoy

    The lower 8 bits of core_idle_state_ptr tracks the number of non-idle
    threads in the core. This is supposed to be initialized to bit-map
    corresponding to the threads_per_core. However, currently it is
    initialized to PNV_CORE_IDLE_THREAD_BITS (0xFF). This is correct for
    POWER8 which has 8 threads per core, but not for POWER9 which has 4
    threads per core.

    As a result, on POWER9, core_idle_state_ptr gets initialized to
    0xFF. In case when all the threads of the core are idle, the bits
    corresponding tracking the idle-threads are non-zero. As a result, the
    idle entry/exit code fails to save/restore per-core hypervisor state
    since it assumes that there are threads in the cores which are still
    active.

    Fix this by correctly initializing the lower bits of the
    core_idle_state_ptr on the basis of threads_per_core.

    Cherry-picked from commit 5f221c3ca13d ("powerpc/powernv/idle:
    Correctly initialize core_idle_state_ptr")

== Comment: #2 - Ranjal G. Shenoy
   On POWER8, in case of
       - nap: both timebase and hypervisor state is retained.
       - fast-sleep: timebase is lost. But the hypervisor state is retained.
       - winkle: timebase and hypervisor state is lost.

    Hence, the current code for handling exit from a idle state assumes
    that if the timebase value is retained, then so is the hypervisor
    state. Thus, the current code doesn't restore per-core hypervisor
    state in such cases.

    But that is no longer the case on POWER9 where we do have stop states
    in which timebase value is retained, but the hypervisor state is
    lost. So we have to ensure that the per-core hypervisor state gets
    restored in such cases.

    Fix this by ensuring that even in the case when timebase is retained,
    we explicitly check if we are waking up from a deep stop that loses
    per-core hypervisor state (indicated by cr4 being eq or gt), and if
    this is the case, we restore the per-core hypervisor state.

    Cherry-picked from commit ec4867355244 ("powerpc/powernv/idle:
    Decouple Timebase restore & Per-core SPRs restore")

== Comment: #3 - Ranjal G. Shenoy
   On wakeup from a deep stop state which is supposed to lose the
    hypervisor state, we don't restore the LPCR to the old value but set
    it to a "sane" value via cur_cpu_spec->cpu_restore().

    The problem is that the "sane" value doesn't include UPRT and the HR
    bits which are required to run correctly in Radix mode.

    Fix this on POWER9 onwards by restoring the LPCR value whatever it was
    before executing the stop instruction.

    Cherry-picked from commit cb0be7ec0307 ("powerpc/powernv/idle: Restore
    LPCR on wakeup from deep-stop")

== Comment: #4 - Ranjal G. Shenoy
   Some of the SPR values (HID0, MSR, SPRG0) don't change during the run
    time of a booted kernel, once they have been initialized.

    The contents of these SPRs are lost when the CPUs enter deep stop
    states. So instead saving and restoring SPRs from the kernel, use the
    stop-api provided by the firmware by which the firmware can restore
    the contents of these SPRs to their initialized values after wakeup
    from a deep stop state.

    Apart from these, program the PSSCR value to that of the deepest stop
    state via the stop-api. This will be used to indicate to the
    underlying firmware as to what stop state to put the threads that have
    been woken up by a special-wakeup.

    And while we are at programming SPRs via stop-api, ensure that HID1,
    HID4 and HID5 registers which are only available on POWER8 are not
    requested to be restored by the firware on POWER9.

    Cherry-picked from commit 1e1601b38e6e ("powerpc/powernv/idle: Restore
    SPRs for deep idle states via stop API.")

== Comment: #5 - Ranjal G. Shenoy
   On Power9 DD1 due to a hardware bug the Power-Saving Level Status
    field (PLS) of the PSSCR for a thread waking up from a deep state can
    under-report if some other thread in the core is in a shallow stop
    state. The scenario in which this can manifest is as follows:

       1) All the threads of the core are in deep stop.
       2) One of the threads is woken up. The PLS for this thread will
          correctly reflect that it is waking up from deep stop.
       3) The thread that has woken up now executes a shallow stop.
       4) When some other thread in the core is woken, its PLS will reflect
          the shallow stop state.

    Thus, the subsequent thread for which the PLS is under-reporting the
    wakeup state will not restore the hypervisor resources.

    Hence, on DD1 systems, use the Requested Level (RL) field as a
    workaround to restore the contents of the hypervisor resources on the
    wakeup from the stop state.

    Cherry-picked from commit 22c6663dc69a ("powerpc/powernv/idle: Use
    Requested Level for restoring state on P9 DD1")

== Comment: #6 - Ranjal G. Shenoy
   The current code in the cpuidle-powernv intialization only allows deep
    stop states (indicated by OPAL_PM_STOP_INST_DEEP) which lose timebase
    (indicated by OPAL_PM_TIMEBASE_STOP). This assumption goes back to
    POWER8 time where deep states used to lose the timebase. However, on
    POWER9, we do have stop states that are deep (they lose hypervisor
    state) but retain the timebase.

    Fix the initialization code in the cpuidle-powernv driver to allow
    such deep states.

    Further, there is a bug in cpuidle-powernv driver with
    CONFIG_TICK_ONESHOT=n where we end up incrementing the nr_idle_states
    even if a platform idle state which loses time base was not added to
    the cpuidle table.

    Fix this by ensuring that the nr_idle_states variable gets incremented
    only when the platform idle state was added to the cpuidle table.

    Cherry-picked from commit f9122ee4f558 ("cpuidle-powernv: Allow Deep
    stop states that don't stop time")

== Comment: #7 - Ranjal G. Shenoy
   Currently, we use the opal call opal_slw_set_reg() to inform the
    Sleep-Winkle Engine (SLW) to restore the contents of some of the
    Hypervisor state on wakeup from deep idle states that lose full
    hypervisor context (characterized by the flag
    OPAL_PM_LOSE_FULL_CONTEXT).

    However, the current code has a bug in that if opal_slw_set_reg()
    fails, we don't disable the use of these deep states (winkle on
    POWER8, stop4 onwards on POWER9).

    This patch fixes this bug by ensuring that if programing the
    sleep-winkle engine to restore the hypervisor states in
    pnv_save_sprs_for_deep_states() fails, then we exclude such states by
    clearing the OPAL_PM_LOSE_FULL_CONTEXT flag from
    supported_cpuidle_states. As a result POWER8 will be prevented from
    using winkle for CPU-Hotplug, and POWER9 will put the offlined CPUs to
    the default stop state when available.

    Further, we ensure in the initialization of the cpuidle-powernv driver
    to only include those states whose flags are present in
    supported_cpuidle_states, thereby skipping OPAL_PM_LOSE_FULL_CONTEXT
    states when they have been disabled due to stop-api failure.

    Fixes: 1e1601b38e6 ("powerpc/powernv/idle: Restore SPRs for deep idle
    states via stop API.")

    Cherry-picked from commit 785a12afdb4a ("powerpc/powernv/idle: Disable
    LOSE_FULL_CONTEXT states when stop-api fails")

== Comment: #8 - Ranjal G. Shenoy
   The stop4 idle state on POWER9 is a deep idle state which loses
    hypervisor resources, but whose latency is low enough that it can be
    exposed via cpuidle.

    Until now, the deep idle states which lose hypervisor resources (eg:
    winkle) were only exposed via CPU-Hotplug. Hence currently on wakeup
    from such states, barring a few SPRs which need to be restored to
    their older value, rest of the SPRS are reinitialized to their values
    corresponding to that at boot time.

    When stop4 is used in the context of cpuidle, we want these additional
    SPRs to be restored to their older value, to ensure that the context
    on the CPU coming back from idle is same as it was before going idle.

    In this patch, we define a SPR save area in PACA (since we have used
    up the volatile register space in the stack) and on POWER9, we restore
    SPRN_PID, SPRN_LDBAR, SPRN_FSCR, SPRN_HFSCR, SPRN_MMCRA, SPRN_MMCR1,
    SPRN_MMCR2 to the values they had before entering stop.

    Cherry-picked from commit e1c1cfed5432 ("powerpc/powernv: Save/Restore
    additional SPRs for stop4 cpuidle")

== Comment: #9 - Ranjal G. Shenoy
   Currently we use the stop-api provided by the firmware to program the
    SLW engine to restore the values of hypervisor resources that get lost
    on deeper idle states (such as winkle). Since the deep states were
    only used for CPU-Hotplug on POWER8 systems, we would program the LPCR
    to have the PECE1 bit since Hotplugged CPUs shouldn't be spuriously
    woken up by decrementer.

    On POWER9, some of the deep platform idle states such as stop4 can be
    used in cpuidle as well. In this case, we want the CPU in stop4 to be
    woken up by the decrementer when some timer on the CPU expires.

    In this patch, we program the stop-api for LPCR with PECE1
    bit cleared only when we are offlining the CPU and set it
    back once the CPU is online.

    Cherry-picked from commit 24be85a23d1f ("powerpc/powernv: Clear PECE1
    in LPCR via stop-api only on Hotplug")

== Comment: #10 - Ranjal G. Shenoy
   commit 24be85a23d1f ("powerpc/powernv: Clear PECE1 in LPCR via
    stop-api only on Hotplug") clears the PECE1 bit of the LPCR via
    stop-api during CPU-Hotplug to prevent wakeup due to a decrementer on
    an offlined CPU which is in a deep stop state.

    In the case where the stop-api support is found to be lacking, the
    commit 785a12afdb4a ("powerpc/powernv/idle: Disable LOSE_FULL_CONTEXT
    states when stop-api fails") disables deep states that lose hypervisor
    context. Thus in this case, the offlined CPU will be put to some
    shallow idle state.

    However, we currently unconditionally clear the PECE1 in LPCR via
    stop-api during CPU-Hotplug even when deep states are disabled due to
    stop-api failure.

    Fix this by clearing PECE1 of LPCR via stop-api during CPU-Hotplug
    *only* when the offlined CPU will be put to a deep state that loses
    hypervisor context.

    Fixes: commit 24be85a23d1f ("powerpc/powernv: Clear PECE1 in LPCR via
    stop-api only on Hotplug")

    upstream reference: https://patchwork.ozlabs.org/patch/808233/

CVE References

Revision history for this message
bugproxy (bugproxy) wrote : 1/10: powerpc/powernv/idle: Correctly initialize core_idle_state_ptr

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-158257 severity-high targetmilestone-inin1710
Revision history for this message
bugproxy (bugproxy) wrote : 2/10: powerpc/powernv/idle: Decouple Timebase restore & Per-core SPRs restore

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : 3/10: powerpc/powernv/idle: Restore LPCR on wakeup from deep-stop

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : 4/10: powerpc/powernv/idle: Restore SPRs for deep idle states via stop API.

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : 5/10: powerpc/powernv/idle: Use Requested Level for restoring state on P9 DD1

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : 6/10: cpuidle-powernv: Allow Deep stop states that don't stop time

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : 7/10: powerpc/powernv/idle: Disable LOSE_FULL_CONTEXT states when stop-api fails

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : 8/10: powerpc/powernv: Save/Restore additional SPRs for stop4 cpuidle

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : 9/10: powerpc/powernv: Clear PECE1 in LPCR via stop-api only on Hotplug

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : 10/10: powerpc/powernv: Clear LPCR[PECE1] via stop-api only for deep state offline

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Changed in ubuntu-power-systems:
importance: Undecided → High
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Changed in linux (Ubuntu):
importance: Undecided → High
status: New → Triaged
Revision history for this message
Seth Forshee (sforshee) wrote :

Applied to artful/master-next, and also cherry picked the linux-next commits and applied the mailing list patch to the 4.13 unstable kernel. Note that 785a12afdb4a52903447fd890633c82fdda4b6f7 and 24be85a23d1fcdc72264a062a2e4ebaaea48feab were backports in 4.12 and not clean cherry picks.

I'll also mention that we expect the current artful-proposed kernel to be the last 4.12 kernel for artful before we switch to 4.13, so if that holds we won't actually release a 4.12 with these patches (but they are applied in case we do release another 4.12). Depending how things go the patches will be in either the first or the second 4.13 kernel in artful-release.

Changed in linux (Ubuntu Artful):
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Seth Forshee (sforshee)
status: Triaged → Fix Committed
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-09-06 06:05 EDT-------
Hi,

Yes, those two commits didn't apply cleanly. However, I have resolved the conflicts and the attached patches for those two commits apply on Ubuntu 4.12.0-12.13

Revision history for this message
Seth Forshee (sforshee) wrote :

Right, I used the backports in those commits. I just added the note because those patches say "cherry picked" in the commit message and our convention is to change that to say "backported" instead if the patch has to be fixed up manually.

Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: New → Fix Committed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.13.0-11.12

---------------
linux (4.13.0-11.12) artful; urgency=low

  * linux: 4.13.0-11.12 -proposed tracker (LP: #1716699)

  * kernel panic -not syncing: Fatal exception: panic_on_oops (LP: #1708399)
    - s390/mm: fix local TLB flushing vs. detach of an mm address space
    - s390/mm: fix race on mm->context.flush_mm

  * CVE-2017-1000251
    - Bluetooth: Properly check L2CAP config option output buffer length

 -- Seth Forshee <email address hidden> Tue, 12 Sep 2017 10:18:38 -0500

Changed in linux (Ubuntu Artful):
status: Fix Committed → Fix Released
Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released
Brad Figg (brad-figg)
tags: added: cscc
To post a comment you must log in.