opal-prd: Have a worker process handle page offlining (Fixes "PlatServices: dyndealloc memory_error() failed" is getting reported in error log (opal-prd))
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
The Ubuntu-power-systems project | Fix Released | Critical | Ubuntu on IBM Power Systems Bug Triage |
skiboot (Ubuntu) | Fix Released | High | Matthieu Clemenceau |
Xenial | Fix Released | High | Matthieu Clemenceau |
Bionic | Fix Released | High | Matthieu Clemenceau |
Focal | Fix Released | High | Matthieu Clemenceau |
Groovy | Fix Released | High | Matthieu Clemenceau |
Hirsute | Fix Released | High | Matthieu Clemenceau |
Bug Description
[Impact]
This impacts the opal-prd userspace command from the skiboot package.
The memory_error() hservice interface expects the memory_error() call to
just accept the offline request and return without actually offlining the
memory. Currently we attempt to offline the marked pages before
returning to HBRT, which can result in an excessively long time spent in
the memory_error() hservice call, and that blocks HBRT from processing other
errors.
[Test Case]
Unfortunately, because of the specific hardware requirement, I wasn't able to reproduce this problem and provide a test case for it. However, I was able to build this package in a PPA, and the IBM team confirmed that the problem is resolved for groovy, focal, bionic and xenial (see comments #4 and #6).
Another verification test will be done (as part of the SRU process) by the IBM Power team.
[What could go wrong]
To avoid long delays (which may block HBRT from processing other errors), memory offlining is now handled in the background by a dedicated worker process.
If this is broken, it can introduce further issues, such as hangs in the worker process, calls that do not return, processes that pile up or, in the worst case, memory pages that are reported as offlined but are not.
The latter would be a significant memory management problem that could even break the system entirely over time.
But the code seems to have taken this into account with 'sigaction', the return-
The fix was prepared back in September and was accepted upstream, so regressions are unlikely; in the meantime it has already landed in hirsute.
In addition, a PPA with a patched skiboot package was created, shared, and eventually tested successfully by IBM (the initial bug reporter).
[Original Description]
https:/
opal-prd: Have a worker process handle page offlining
The memory_error() hservice interface expects the memory_error() call to
just accept the offline request and return without actually offlining the
memory. Currently we will attempt to offline the marked pages before
returning to HBRT which can result in an excessively long time spent in the
memory_error() hservice call which blocks HBRT from processing other
errors. Fix this by adding a worker process which performs the page
offlining via the sysfs memory error interfaces.
Reviewed-by: Vasant Hegde - <email address hidden>
Signed-off-by: Oliver O'Halloran - <email address hidden>
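To illustrate the idea in the commit message above, here is a minimal sketch of the hand-off pattern, written in Python for brevity. It is not the upstream opal-prd code (which is C) and makes no claim about how the actual patch is structured. The request layout, the SOFT/HARD type codes, the 64 KiB page size, the long-lived worker fed through a pipe, and all function names are assumptions made for this example. The only part taken from the kernel is the pair of sysfs files soft_offline_page and hard_offline_page under /sys/devices/system/memory.

```python
import os
import struct

REQUEST = struct.Struct("=QQI")  # start address, end address, error type
SOFT, HARD = 0, 1                # hypothetical type codes for this sketch
PAGE_SIZE = 0x10000              # assumption: 64 KiB pages
SYSFS_DIR = "/sys/devices/system/memory"


def sysfs_offline(addr, kind):
    """Ask the kernel to offline one page through the sysfs interface."""
    name = "hard_offline_page" if kind == HARD else "soft_offline_page"
    with open(os.path.join(SYSFS_DIR, name), "w") as f:
        f.write("0x%x" % addr)


def worker_loop(read_fd, offline_page=sysfs_offline):
    """Drain queued requests; returns the number of pages attempted."""
    attempted = 0
    with os.fdopen(read_fd, "rb") as requests:
        while True:
            record = requests.read(REQUEST.size)
            if len(record) < REQUEST.size:  # EOF: the parent closed its end
                return attempted
            start, end, kind = REQUEST.unpack(record)
            for addr in range(start, end + 1, PAGE_SIZE):
                try:
                    offline_page(addr, kind)
                except OSError:
                    pass  # a real daemon would log the failure
                attempted += 1


def start_worker(offline_page=sysfs_offline):
    """Fork the worker; the parent gets back (worker_pid, write_fd)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: only ever runs the offlining loop
        os.close(write_fd)
        try:
            worker_loop(read_fd, offline_page)
        finally:
            os._exit(0)
    os.close(read_fd)
    return pid, write_fd


def memory_error(write_fd, start, end, kind):
    """Queue the request and return at once; no offlining happens here."""
    os.write(write_fd, REQUEST.pack(start, end, kind))
    return 0
```

In this sketch the caller of memory_error() only pays for one small pipe write, while the potentially slow sysfs writes happen in the forked worker. A real daemon would also have to reap the worker (for example from a SIGCHLD handler) and cope with a full pipe, both of which are left out here.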
Thanks in advance for your support.
Machine Type = Power8 and Power9 OPAL systems
---Steps to Reproduce---
* Inject memory error (UE)
* Verify that opal-prd doesn't return asynchronously to the platform after requesting the memory offlining operation
Userspace tool common name: opal-prd
We need this fix for the 16.04.x and 18.04.x LTS releases.
The fix is also needed for 20.04 and 20.10.
tags: | added: architecture-ppc64le bugnameltc-189252 severity-critical targetmilestone-inin--- |
Changed in ubuntu: | |
assignee: | nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) |
affects: | ubuntu → skiboot (Ubuntu) |
Changed in ubuntu-power-systems: | |
assignee: | nobody → Canonical Foundations Team (canonical-foundations) |
importance: | Undecided → Critical |
tags: |
added: targetmilestone-inin18045 removed: targetmilestone-inin--- |
Changed in skiboot (Ubuntu): | |
assignee: | Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Canonical Foundations Team (canonical-foundations) |
Changed in ubuntu-power-systems: | |
assignee: | Canonical Foundations Team (canonical-foundations) → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) |
status: | New → Triaged |
tags: | added: fr-935 |
Changed in skiboot (Ubuntu): | |
assignee: | Canonical Foundations Team (canonical-foundations) → Matthieu Clemenceau (mclemenceau) |
Changed in skiboot (Ubuntu Groovy): | |
assignee: | nobody → Matthieu Clemenceau (mclemenceau) |
Changed in skiboot (Ubuntu Focal): | |
assignee: | nobody → Matthieu Clemenceau (mclemenceau) |
Changed in skiboot (Ubuntu Bionic): | |
assignee: | nobody → Matthieu Clemenceau (mclemenceau) |
Changed in skiboot (Ubuntu Xenial): | |
assignee: | nobody → Matthieu Clemenceau (mclemenceau) |
Changed in skiboot (Ubuntu Hirsute): | |
status: | New → Confirmed |
status: | Confirmed → In Progress |
Changed in ubuntu-power-systems: | |
status: | Triaged → In Progress |
Changed in skiboot (Ubuntu Focal): | |
milestone: | none → ubuntu-20.04.2 |
Changed in skiboot (Ubuntu Xenial): | |
importance: | Undecided → High |
Changed in skiboot (Ubuntu Focal): | |
importance: | Undecided → High |
Changed in skiboot (Ubuntu Hirsute): | |
importance: | Undecided → High |
Changed in skiboot (Ubuntu Bionic): | |
importance: | Undecided → High |
Changed in skiboot (Ubuntu Groovy): | |
importance: | Undecided → High |
description: | updated |
description: | updated |
Changed in skiboot (Ubuntu Focal): | |
milestone: | ubuntu-20.04.2 → focal-updates |
tags: | removed: verification-needed-xenial |
Changed in ubuntu-power-systems: | |
status: | In Progress → Fix Committed |
tags: | added: verification-needed-xenial |
tags: | removed: verification-needed |
Changed in ubuntu-power-systems: | |
status: | Fix Committed → Fix Released |
Hello,
I've uploaded a new version of skiboot for hirsute to this PPA: ppa:mclemenceau
Can you confirm that this resolves the issue in this LP bug? I'll then start the release process for hirsute and the other impacted series.
Thanks
Matthieu