opal-prd: Have a worker process handle page offlining (Fixes "PlatServices: dyndealloc memory_error() failed" is getting reported in error log (opal-prd))

Bug #1904585 reported by bugproxy
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
The Ubuntu-power-systems project
Fix Released
Critical
Ubuntu on IBM Power Systems Bug Triage
skiboot (Ubuntu)
Fix Released
High
Matthieu Clemenceau
Xenial
Fix Released
High
Matthieu Clemenceau
Bionic
Fix Released
High
Matthieu Clemenceau
Focal
Fix Released
High
Matthieu Clemenceau
Groovy
Fix Released
High
Matthieu Clemenceau
Hirsute
Fix Released
High
Matthieu Clemenceau

Bug Description

[Impact]

This impacts the opal-prd userspace command from the skiboot package
The memory_error() hservice interface expects the memory_error() call to
just accept the offline request and return without actually offlining the
memory. Currently we will attempt to offline the marked pages before
returning to HBRT which can result in an excessively long time spent in the memory_error() hservice call which blocks HBRT from processing other
errors.

[Test Case]

Unfortunately due to the specific hardware requirement I wasn't able to reproduce this problem and provide a test case for it. However I was able to build this package into a ppa and got the IBM team to confirm this problem was resolved for groovy focal, bionic, xenial see comment #4 and #6

Another verification test will be done (as part of the SRU process) again by the IBM Power team.

[What could go wrong]

To avoid long delays (that may blocks HBRT from processing other errors) the memory offlining process is now separated in a dedicated worker process, that can now be handled in the background.

If broken this can introduce further issues, like hangs in the worker process, not returning, and processes that pile up or in worst case memory pages that are not offlined, but reported otherwise.
The latter one would be a significant memory management problem, that even may break the system over time entirely.

But the code seem to have taken this into account with 'sigaction', the return-code/exit-status check and the reaping of the worker threads.

The fix was prepared back in September and was upstream accepted, hence it's unlikely that regressions are in and in between it already landed in hirsute.

On top a PPA with a patched skiboot package was created, shared and eventually successfully tested by IBM (the initial bug reporter).

[Original Description]

https://github.com/open-power/skiboot/commit/8cbd0de88d162e387f11569eee1bdecef8fad2e3

opal-prd: Have a worker process handle page offlining

The memory_error() hservice interface expects the memory_error() call to
just accept the offline request and return without actually offlining the
memory. Currently we will attempt to offline the marked pages before
returning to HBRT which can result in an excessively long time spent in the
memory_error() hservice call which blocks HBRT from processing other
errors. Fix this by adding a worker process which performs the page
offlining via the sysfs memory error interfaces.

Reviewed-by: Vasant Hegde - <email address hidden>
Signed-off-by: Oliver O'Halloran - <email address hidden>

Thanks in advance for your support.

Machine Type = Power8 and Power9 OPAL systems

---Steps to Reproduce---
* Inject memory error (UE)
* Verify that opal-prd doesn't return asynchronously to the platform after requesting the memory offlining operation

Userspace tool common name: opal-prd

We need this fix for 16.04.x and 18.04.x LTS releases.

Fix also is needed for 20.04 and 20.10.

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-189252 severity-critical targetmilestone-inin---
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → skiboot (Ubuntu)
Changed in ubuntu-power-systems:
assignee: nobody → Canonical Foundations Team (canonical-foundations)
importance: Undecided → Critical
bugproxy (bugproxy)
tags: added: targetmilestone-inin18045
removed: targetmilestone-inin---
Frank Heimes (fheimes)
Changed in skiboot (Ubuntu):
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Canonical Foundations Team (canonical-foundations)
Changed in ubuntu-power-systems:
assignee: Canonical Foundations Team (canonical-foundations) → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
status: New → Triaged
Steve Langasek (vorlon)
tags: added: fr-935
Changed in skiboot (Ubuntu):
assignee: Canonical Foundations Team (canonical-foundations) → Matthieu Clemenceau (mclemenceau)
Frank Heimes (fheimes)
Changed in skiboot (Ubuntu Groovy):
assignee: nobody → Matthieu Clemenceau (mclemenceau)
Changed in skiboot (Ubuntu Focal):
assignee: nobody → Matthieu Clemenceau (mclemenceau)
Changed in skiboot (Ubuntu Bionic):
assignee: nobody → Matthieu Clemenceau (mclemenceau)
Changed in skiboot (Ubuntu Xenial):
assignee: nobody → Matthieu Clemenceau (mclemenceau)
Revision history for this message
Matthieu Clemenceau (mclemenceau) wrote :

Hello,
I've uploaded a new version of skiboot for hirsute to this ppa ppa:mclemenceau/distro-work
Can you confirm this resolve the issue on this LP and I'll start release process for hirsute and other impacted series
Thanks
Matthieu

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2020-12-10 02:01 EDT-------
(In reply to comment #7)
> Hello,
> I've uploaded a new version of skiboot for hirsute to this ppa
> ppa:mclemenceau/distro-work
> Can you confirm this resolve the issue on this LP and I'll start release
> process for hirsute and other impacted series
> Thanks
> Matthieu

Looks good.

-Vasant

Revision history for this message
Matthieu Clemenceau (mclemenceau) wrote :

Excellent, Can you also validate the solution for groovy and focal from this ppa?

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2020-12-11 00:22 EDT-------
(In reply to comment #9)
> Excellent, Can you also validate the solution for groovy and focal from this
> ppa?

Looks good.

-Vasant

Changed in skiboot (Ubuntu Hirsute):
status: New → Confirmed
status: Confirmed → In Progress
Revision history for this message
Matthieu Clemenceau (mclemenceau) wrote :

Excellent.

Finally Can you verify the latest update for bionic and xenial on this ppa.
Once confirmed we should be able to formally release everything

Also in preparation for the SRU, it is required to have some steps to reproduce the problem and explain how to confirm it is resolved, Can you provide me those steps?

Thank you!
Matthieu

Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: Triaged → In Progress
Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2020-12-14 06:24 EDT-------
(In reply to comment #11)
> Excellent.
>
> Finally Can you verify the latest update for bionic and xenial on this ppa.
> Once confirmed we should be able to formally release everything

Looks good.

>
> Also in preparation for the SRU, it is required to have some steps to
> reproduce the problem and explain how to confirm it is resolved, Can you
> provide me those steps?

This is bit tricky to verify on customer environment. In lab we inject errors and verify whether memory offlining worked fine or not, and also verify whether it created forked processor or not.

-Vasant

Frank Heimes (fheimes)
Changed in skiboot (Ubuntu Focal):
milestone: none → ubuntu-20.04.2
Revision history for this message
Matthieu Clemenceau (mclemenceau) wrote :
Changed in skiboot (Ubuntu Groovy):
status: New → In Progress
Changed in skiboot (Ubuntu Focal):
status: New → In Progress
Changed in skiboot (Ubuntu Bionic):
status: New → In Progress
Changed in skiboot (Ubuntu Xenial):
status: New → In Progress
Revision history for this message
Matthieu Clemenceau (mclemenceau) wrote :
Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "This is the debdiff for Bionic" seems to be a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. If the attachment isn't a patch, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are member of the ~ubuntu-sponsors, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issue please contact him.]

tags: added: patch
Mathew Hodson (mhodson)
Changed in skiboot (Ubuntu Xenial):
importance: Undecided → High
Changed in skiboot (Ubuntu Focal):
importance: Undecided → High
Changed in skiboot (Ubuntu Hirsute):
importance: Undecided → High
Changed in skiboot (Ubuntu Bionic):
importance: Undecided → High
Changed in skiboot (Ubuntu Groovy):
importance: Undecided → High
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package skiboot - 6.6.2-1ubuntu1

---------------
skiboot (6.6.2-1ubuntu1) hirsute; urgency=medium

  * opal-prd: mmap(range:ibm,hbrt-code-image,...) fails with EPERM
     (LP: #1905393) opal-prd fails to start
     d/patches/0005-fix-opal-prd-fail-with-EPERM.patch
  * opal-prd: Have a worker process handle page offlining
     (LP: #1904585) Have a worker process handle page offlining
     d/patches/0006-fix-opal-prd-have-worker-process-handle-page-offlining.patch

 -- Matthieu Clemenceau <email address hidden> Mon, 07 Dec 2020 17:50:16 -0600

Changed in skiboot (Ubuntu Hirsute):
status: In Progress → Fix Released
description: updated
description: updated
Changed in skiboot (Ubuntu Focal):
milestone: ubuntu-20.04.2 → focal-updates
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

I'm not entirely happy with the regression potential section, so let me work a bit on that.

"Hopefully not much" is not a valid regression potential analysis as it gives us, the SRU team, no insight into what the actual situation is like. We all have wishful thinking about our changes! This section requires looking and understanding the code changes, looking at which parts of the code are modified and doing a though exercise: hmmm, if this code is changed, what other parts can it affect? Where should we expect any problems appear, just in case?

Only then we, as the SRU team, can make a decision whether it is safe enough to go forward with the SRU or not. Also, it gives everyone a better understanding what additional dogfooding we might want to perform to minimize regression risk. Lastly, it's also makes easier bisecting whenever some regression is found later, when trying to identify which change is responsible.

Another thing that I'm not entirely happy with is the lack of any DEP-3 headers on any of the patches in the actual uploads. Because of that I don't know where the patches come from, have those been upstreamed?

I will gladly re-visit those once these issues are resolved.

Changed in skiboot (Ubuntu Groovy):
status: In Progress → Incomplete
Revision history for this message
Frank Heimes (fheimes) wrote :

I hope you don't mind if I jump in and help updating the SRU justification ... - specially Risk and Test ...

description: updated
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Please test proposed package

Hello bugproxy, or anyone else affected,

Accepted skiboot into groovy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/skiboot/6.5.2-1ubuntu0.20.10.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-groovy to verification-done-groovy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-groovy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in skiboot (Ubuntu Groovy):
status: Incomplete → Fix Committed
tags: added: verification-needed verification-needed-groovy
Changed in skiboot (Ubuntu Focal):
status: In Progress → Fix Committed
tags: added: verification-needed-focal
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Hello bugproxy, or anyone else affected,

Accepted skiboot into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/skiboot/6.5.2-1ubuntu0.20.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in skiboot (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed-bionic
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Hello bugproxy, or anyone else affected,

Accepted skiboot into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/skiboot/5.10~rc4-1ubuntu1.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in skiboot (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed-xenial
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Hello bugproxy, or anyone else affected,

Accepted skiboot into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/skiboot/5.4.3-1ubuntu0.16.04.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

bugproxy (bugproxy)
tags: removed: verification-needed-xenial
Changed in ubuntu-power-systems:
status: In Progress → Fix Committed
Frank Heimes (fheimes)
tags: added: verification-needed-xenial
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2021-01-13 05:23 EDT-------
(In reply to comment #20)
> Hello bugproxy, or anyone else affected,
>
> Accepted skiboot into groovy-proposed. The package will build now and be
> available at
> https://launchpad.net/ubuntu/+source/skiboot/6.5.2-1ubuntu0.20.10.1 in a few
> hours, and then in the -proposed repository.
>
> Please help us by testing this new package. See
> https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to
> enable and use -proposed. Your feedback will aid us getting this update out
> to other Ubuntu users.
>
> If this package fixes the bug for you, please add a comment to this bug,
> mentioning the version of the package you tested, what testing has been
> performed on the package and change the tag from verification-needed-groovy
> to verification-done-groovy. If it does not fix the bug for you, please add
> a comment stating that, and change the tag to verification-failed-groovy. In
> either case, without details of your testing we will not be able to proceed.

We have verified above package and fixes looks good.
We verified whether daemon starts fine or not. Also whether it spawns separate child to handle page offlining process.

-Vasant

------- Comment From <email address hidden> 2021-01-13 05:24 EDT-------
(In reply to comment #21)
> Hello bugproxy, or anyone else affected,
>
> Accepted skiboot into xenial-proposed. The package will build now and be
> available at
> https://launchpad.net/ubuntu/+source/skiboot/5.4.3-1ubuntu0.16.04.2 in a few
> hours, and then in the -proposed repository.

We have verified above package and fixes looks good.

-Vasant

tags: added: verification-done-focal verification-done-groovy verification-done-xenial
removed: verification-needed-focal verification-needed-groovy verification-needed-xenial
Revision history for this message
Patricia Domingues (patriciasd) wrote :

Vasant, thanks for the verification on this bug.
Could you also check the other opal-prd bug which also needs verification, please:

```
"Ubuntu 20.04: opal-prd fails to start on 20.04"
https://bugs.launchpad.net/bugs/1905393
```

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2021-01-15 01:23 EDT-------
(In reply to comment #24)
> Vasant, thanks for the verification on this bug.
> Could you also check the other opal-prd bug which also needs verification,
> please:
>
> ```
> "Ubuntu 20.04: opal-prd fails to start on 20.04"
> https://bugs.launchpad.net/bugs/1905393
> ```

I have updated that above defect. Thanks!

-Vasant

Mathew Hodson (mhodson)
tags: removed: verification-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package skiboot - 6.5.2-1ubuntu0.20.10.1

---------------
skiboot (6.5.2-1ubuntu0.20.10.1) groovy; urgency=medium

  * opal-prd: mmap(range:ibm,hbrt-code-image,...) fails with EPERM
     (LP: #1905393) opal-prd fails to start
     d/patches/0005-fix-opal-prd-fail-with-EPERM.patch
  * opal-prd: Have a worker process handle page offlining
     (LP: #1904585) Have a worker process handle page offlining
     d/patches/0006-fix-opal-prd-have-worker-process-handle-page-offlining.patch

 -- Matthieu Clemenceau <email address hidden> Mon, 07 Dec 2020 16:03:03 -0600

Changed in skiboot (Ubuntu Groovy):
status: Fix Committed → Fix Released
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for skiboot has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package skiboot - 6.5.2-1ubuntu0.20.04.1

---------------
skiboot (6.5.2-1ubuntu0.20.04.1) focal; urgency=medium

  * opal-prd: mmap(range:ibm,hbrt-code-image,...) fails with EPERM
     (LP: #1905393) opal-prd fails to start
     d/patches/0005-fix-opal-prd-fail-with-EPERM.patch
  * opal-prd: Have a worker process handle page offlining
     (LP: #1904585) Have a worker process handle page offlining
     d/patches/0006-fix-opal-prd-have-worker-process-handle-page-offlining.patch

 -- Matthieu Clemenceau <email address hidden> Mon, 07 Dec 2020 16:03:03 -0600

Changed in skiboot (Ubuntu Focal):
status: Fix Committed → Fix Released
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

What about verification of the bionic packages? I only see tags for groovy, focal and xenial updated, but I assume bionic has also been verified?
Could someone confirm?

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2021-01-18 09:32 EDT-------
(In reply to comment #28)
> What about verification of the bionic packages? I only see tags for groovy,
> focal and xenial updated, but I assume bionic has also been verified?
> Could someone confirm?

Done. I missed to update the keyword.

-Vasant

tags: added: verification-done-bionic
removed: verification-needed-bionic
Revision history for this message
Łukasz Zemczak (sil2100) wrote :

Awesome! Thank you.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package skiboot - 5.10~rc4-1ubuntu1.3

---------------
skiboot (5.10~rc4-1ubuntu1.3) bionic; urgency=medium

  * opal-prd: Have a worker process handle page offlining
     (LP: #1904585) Have a worker process handle page offlining
     d/patches/fix-opal-prd-have-worker-process-handle-page-offlining.patch

 -- Matthieu Clemenceau <email address hidden> Fri, 11 Dec 2020 13:26:18 -0600

Changed in skiboot (Ubuntu Bionic):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package skiboot - 5.4.3-1ubuntu0.16.04.2

---------------
skiboot (5.4.3-1ubuntu0.16.04.2) xenial; urgency=medium

  * opal-prd: Have a worker process handle page offlining
     (LP: #1904585) Have a worker process handle page offlining
     d/patches/fix-opal-prd-have-worker-process-handle-page-offlining.patch

 -- Matthieu Clemenceau <email address hidden> Fri, 11 Dec 2020 14:22:13 -0600

Changed in skiboot (Ubuntu Xenial):
status: Fix Committed → Fix Released
Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.