CryptoExpress EP11 cards are going offline

Bug #1939618 reported by bugproxy
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Ubuntu on IBM z Systems
Fix Released
High
Skipper Bug Screeners
linux (Ubuntu)
Fix Released
Undecided
Canonical Kernel Team
Focal
Fix Released
Undecided
Unassigned

Bug Description

SRU Justification:

[Impact]

* With current focal kernels IBM Z CryptoExpress adapters in EP11 mode go offline in case of unknown error indications from the hardware.

* This does not only lead to a software fallback, but can also lead to errors and crashes,
  if certain crypto operations are currently ongoing.

* A rework of the AP bus and zcrypt device driver, as it was done in 5.11, fixes the situation.

* From the below range of commits, the last 1/3 are the ones that fix the issue mentioned here
  and the others are pre-requisites to get the relevant ones applied.

* In theory the patch set could have been made smaller,
  but with the cost that the code would be a mix between old and new, with maybe some new code snippets,
  hence it would divert from what's upstream accepted (in 5.11 and above), the risk would increase,
  increased effort to maintain and less test coverage.

[Fix]

* The SRU request was created as pull request,
  so please pull f904c400c9c4^..f6d9ab1de03a (means starting at f904c400c9c4 {incl.} to head/f6d9ab1de03a {incl.})
  from here: https://code.launchpad.net/~fheimes/+git/lp1939618

[Test Case]

* An Ubuntu Server 20.04 on IBM Z or LinuxONE installation is required,
  with ideally three attached CryptoExpress adapters running in CCA, EP11 and accelerator mode.

* Run stress test on these three CryptoExpress adapters.

* IBM has such stress tests and ran these based on a patched Ubuntu 20.04 kernel.
  The tests come with a specially focus on error path tests,
  since this patch set mainly focuses on doing a better error patch handling.

* Note: A a new config option for the zcrypt driver was introduced
  that enables the possibility to inject erroneous messages.

* An application exists that generates such messages and thus tests these error paths.

* Canonical's focus will mainly be on regression testing.

[Regression Potential]

* Like with all modification there is a certain risk of regressions,
  especially with bigger patch sets.

* But the modifications here are limited to the s390x platform,
  and there again largely to the s390x hardware crypto stack and driver
  (CryptoExpress adapter) which is optional hardware.
  (See the diff stat in the comment below.)

* The crypto-specific tools (located at the s390-tools package) may no longer work with this patched driver.
  But this got tested by IBM with the result that the changes are fully backward compatible.
  The 'older' s390 tools package (from focal) can just not show and control the new (config state) feature,
  but the functionality covered by the older s390 tools package is utterly covered by this patch set.

* The core of this patch set went into the 5.11 kernel upstream,
  hence is in hirsute (and has also been picked by other distros).

* Since this patch set is a rework of the AP bus and zcrypt driver code,
  it may now show new errors that were never thrown before, like for or example memory leaks.
  However, this is not unique to this patch set, it the same for upstream, Hirsute and Impish (and other distros).

* The patches are all upstream and all needed upstream commits could just be cherry-picked,
  hence no modifications were needed.

* So the commits were not only tested by IBM upfront,
  but a patched focal master-next kernel is also available as PPA (see comment below) for further testing.

* This patch set was also tested on 5.11, where two issues were found that are already part of this set.

[Other]

* I iterated through all commits and found that that the latest ones got upstream with 5.13,
  hence Impish includes all commits needed and is not affected!

* Looks like all commits, expect three, are even upstream with 5.11,
  but the missing three came in on top via upstream stable,
  hence Hirsute master-next includes all commits needed too and is also not affected!

* But non of the commits could be found in current Focal master-next (aot: 5.4.0-84),
  the first commits from this set started to land with 5.7,
  hence this SRU request is for focal only.
__________

Here is the backport against current git for ubuntu 20.04.
It is a zip file with a patches subdir and all the patches in there together with a series file. So just unpack it and apply with quilt.

Revision history for this message
bugproxy (bugproxy) wrote : backport sec1814 for Ubuntu 20.04

Default Comment by Bridge

tags: added: architecture-s39064 bugnameltc-193902 severity-high targetmilestone-inin2004
Changed in ubuntu:
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
affects: ubuntu → linux (Ubuntu)
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2021-08-11 15:55 EDT-------
Additional background information:
'EP11 cards going offline' bug needs to be fixed in U20.04LTS via backport.

The problem has been described in RTC line item SEC1814 which was fixed & released
whith Ubuntu 21.04.
-> see Bugzilla 189018 (LP#1902866 : [21.04 FEAT] [SEC1814] zcrypt DD: add state for offline due to error - kernel)
and Bugzilla 189019 (LP#1902865 : [21.04 FEAT] [SEC1814] zcrypt DD: add state for offline due to error - s390-tools.

For details re. the provided fix / backport to U20.04LTS see Comment#2

Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
importance: Undecided → High
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
Revision history for this message
Frank Heimes (fheimes) wrote : Re: EP11 cards going offline => Fix backport to U20.04LTS

Well, this is a patch set of 34 commits, which is significant, and looks to me that they cover much more than just fixing "EP11 cards going offline", rather than bringing the entire driver to the latest level. The usual way is to pick the minimal set of patches that fix a particular issue and get these SRUed.

And btw. this is a relatively unusual way of sharing patches or backports -- if about a handful of patches (or two) are needed, we would do cherry-picks or apply selective backport patch-files.
In case of more patches the usual way is a branch with a pull-request.

But the patches seem to apply cleanly to the current focal master-next tree and touch s390x specific files only (either in /arch/s390/ or in /drivers/s390/).

Please notice that hirsute's 5.11 kernel would need to be on the same level, to avoid any potential regressions on upgrades.

I'll bring this to the kernel teams attention and we'll discuss if this is doable as SRU.

Changed in ubuntu-z-systems:
status: New → Triaged
Changed in linux (Ubuntu):
assignee: Skipper Bug Screeners (skipper-screen-team) → Canonical Kernel Team (canonical-kernel-team)
Revision history for this message
Krzysztof Kozlowski (krzk) wrote :

As Frank mentioned, the patches cannot be accepted in that form. There is no information where these came from and what was changed in the backport. This is a series of 33 commits, it's enormous. Are these all fixes? Looks like not because only two patches have stable tags, +three patches have Fixes tag. What's with the rest? Are they going to stable trees?

Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: Triaged → Incomplete
Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2021-08-13 10:04 EDT-------
The gist of the 'cards going offline' patches are:

27c4f6738bdc535e42dfc1295dadc78ab7582939 27c4f6738bdc s390/zcrypt: Introduce Failure Injection feature
3730f5300b450bc89306c3ab79c254e6039d9197 3730f5300b45 s390/zcrypt: move ap_msg param one level up the call chain
e0332629e33d1926c93348d918aaaf451ef9a16b e0332629e33d s390/ap/zcrypt: revisit ap and zcrypt error handling
5caa2af97118308c79f29cc9876aec3ed504f9b0 5caa2af97118 s390/ap: Support AP card SCLP config and deconfig operations
0671cc1048744c9a6f1c896baa85966a5abc42a0 0671cc104874 s390/sclp: Add support for SCLP AP adapter config/deconfig
4f2fcccdb547b09a4532c705078811e672fb9235 4f2fcccdb547 s390/ap: add card/queue deconfig state
2ea2a6099ae3d1708f90f43c81a98cba3d4bb74c 2ea2a6099ae3 s390/ap: add error response code field for ap queue devices
0b641cbd24445e56073c69dd046be488dcf1965b 0b641cbd2444 s390/ap: split ap queue state machine state from device state
0ae88ccf4c160e02316e054db67156230568cf49 0ae88ccf4c16 s390/zcrypt: New config switch CONFIG_ZCRYPT_DEBUG
91ffc519c1997520ff3435ee227d86cfaa30d037 91ffc519c199 s390/zcrypt: introduce msg tracking in zcrypt functions

Which is a rework of the ap bus core code and the zcrypt device driver.

All the other patches are cherry pick of the patches needed to have these applied cleanly.

Please note also, that all the patches I attached together with the series file are extracts from the upstream kernel and the patch header includes all information to match them to the original upstream commit. Most of the patches do not even have any modifications.

I can't think of any more reductions to provide the fix for the 'EP11 cards going offline' issue.

Revision history for this message
Frank Heimes (fheimes) wrote : Re: EP11 cards going offline => Fix backport to U20.04LTS

Sorry for coming back to this with a little delay, due to the 21.10 FF and the 20.04.3 RCs ...

I'm not sure if I understood you correctly.
Does that mean that the commits 27c4f6738bdc to 91ffc519c199 from comment #5 all needed to be touched and are backports?
And that the remaining ones are all just plain exports - w/o any modification, hence can simply be cherry-picked?

This distinction is very crucial and need to be added to each of the commits, so that the entire provenance of the commit/patch is traceable.

So in other words all commits that had to be touches ('backported') need to have lines like this:
"
(backported from commit 5620ae29f1eabe655f44335231b580a78c8364ea)
Signed-off-by: First-Name Last-Name <email address hidden>
"
(like described here: https://wiki.ubuntu.com/Kernel/Dev/StablePatchFormat
 this is not unique to the Canonical kernel team procedure, but a kind of a de-facto standard ...)
Here is an example LP bug with backports attached:
https://bugs.launchpad.net/bugs/1925452

Generally the others would need lines like this:
"
(cherry picked from commit d4e0018e3e4dd685af25d300fd26a0d5a984482e)
Signed-off-by: First-Name Last-Name <email address hidden>
"

But the 'cherry-picked' lines do not need to be added manually, since git helps on doing this.

(If you grep for 'backport' or 'cherry-picked' though the list of patches in your zip file I get no hits.)

---

So what we need is:
1) separate 'backport' files of the modified commits only,
   that have a 'backported' AND an additional 'Signed-off-by' line included by the backporter
   (no files are needed for commits that can just be cherry picked)
2) a list / sequence of the backport files AND cherry pick commits,
   that allows to get the entire set of needed commits/patches (cherry-picks and backports)
   applied to the Ubuntu focal master-next git tree
   [git://kernel.ubuntu.com/ubuntu/ubuntu-focal --branch master-next --single-branch]
   (we need to be able to re-run and re-apply this again)

The rest can( and is) usually done by us - means reapplying them, pushing then to a remote git branch on Launchpad and finally sending a pull request (incl. cover letter) to the Canonical kernel teams mailing list.

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2021-08-24 11:41 EDT-------
hm, ok - thats a lot of effort, but I'll do it ...

Revision history for this message
Frank Heimes (fheimes) wrote : Re: EP11 cards going offline => Fix backport to U20.04LTS

Hi Harald, well, I hope it's not too much effort, since:
- you already have a proper sequence (it's more or less your series file in your zip)
  (however, if this list could be minimized to the bare minimum, that would be highly desired)
- then I guess you remember which of the exported commits you had to touch and modify;
  these are the ones that need the additional 'backported' and 'Signed-off-by' lines
- then just attach these official backport files and the sequence
(The cherry-picked lines do not need to be added manually, since git will do that for us, once we'll process the sequence...)

We will also need to add a "SRU justification" (like you can see in the bug description of this bug LP#1925452) - but once there, I can create a draft, but may ask you to help on some details (like how this can be tested, regression risk, etc.) ...

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2021-08-26 06:11 EDT-------
Ok, I double checked the list and none of the patches needs any modifications. You can directly pick from upstream. This is the sequence:

c4f762ff6b77 s390/zcrypt: Support for CCA protected key block version 2
fa226f1d81e2 s390: Replace zero-length array with flexible-array member
eb3e064b8dd1 s390/zcrypt: Use scnprintf() for avoiding potential buffer
40501c70e3f0 s390/zcrypt: replace snprintf/sprintf with scnprintf
3cc7c927102d s390/ap: Remove ap device suspend and resume callbacks
fcf0220abc5b s390/zcrypt: use fallthrough;
34515df25d7e s390/zcrypt: use kvmalloc instead of kmalloc for 256k alloc
41677b1d9415 s390/ap: remove power management code from ap bus and drivers
bc4b295e87a8 s390/ap: introduce new ap function ap_get_qdev()
79d6c5022710 s390/zcrypt: use kzalloc
47c07bffeb32 s390/zcrypt: fix smatch warnings
74ecbef7b908 s390/zcrypt: code beautification and struct field renames
7e202acb5c43 s390/zcrypt: split ioctl function into smaller code units
dc4b6ded3c17 s390/ap: rename and clarify ap state machine related stuff
a303e88743f6 s390/zcrypt: provide cex4 cca sysfs attributes for cex3
c8337c47deb9 s390/ap: rework crypto config info and default domain code
0d574ad33e5b s390/zcrypt: simplify cca_findcard2 loop code
52f72feba9db s390/zcrypt: remove set_fs() invocation in zcrypt device
48175fed1dea s390/ap: remove unnecessary spin_lock_init()
32ca04bba6fd s390/zcrypt: Support for CCA APKA master keys
91ffc519c199 s390/zcrypt: introduce msg tracking in zcrypt functions
0b641cbd2444 s390/ap: split ap queue state machine state from device state
2ea2a6099ae3 s390/ap: add error response code field for ap queue devices
4f2fcccdb547 s390/ap: add card/queue deconfig state
0671cc104874 s390/sclp: Add support for SCLP AP adapter config/deconfig
5caa2af97118 s390/ap: Support AP card SCLP config and deconfig operations
e0332629e33d s390/ap/zcrypt: revisit ap and zcrypt error handling
3730f5300b45 s390/zcrypt: move ap_msg param one level up the call chain
27c4f6738bdc s390/zcrypt: Introduce Failure Injection feature
4366dd725125 s390/zcrypt: fix wrong format specifications
29c2680fd2bf s390/ap: fix ap devices reference counting
d39fae45c97c s390/zcrypt: return EIO when msg retry limit reached
70fac8088cfa s390/zcrypt: fix zcard and zqueue hot-unplug memleak
e73a99f3287a s390/ap: Fix hanging ioctl caused by wrong msg counter

Frank Heimes (fheimes)
summary: - EP11 cards going offline => Fix backport to U20.04LTS
+ CryptoExpress EP11 cards are going offline
Revision history for this message
Frank Heimes (fheimes) wrote :

A patched kernel (based on 5.4.0-84, current focal master-next) got built and is available for further testing here:
https://launchpad.net/~fheimes/+archive/ubuntu/lp1939618

Revision history for this message
Frank Heimes (fheimes) wrote :

diff stat graph that provides an indication where the changes mainly are:

 arch/s390/appldata/appldata_os.c | 2 +-
 arch/s390/include/asm/sclp.h | 2 +
 arch/s390/include/uapi/asm/zcrypt.h | 140 +++---
 drivers/s390/block/dasd_diag.c | 2 +-
 drivers/s390/block/dasd_eckd.h | 2 +-
 drivers/s390/char/Makefile | 2 +
 drivers/s390/char/raw3270.h | 2 +-
 drivers/s390/char/sclp.h | 2 +-
 drivers/s390/char/sclp_ap.c | 63 +++
 drivers/s390/char/sclp_pci.c | 2 +-
 drivers/s390/cio/idset.c | 2 +-
 drivers/s390/crypto/ap_bus.c | 974 +++++++++++++++++++-----------------
 drivers/s390/crypto/ap_bus.h | 139 +++--
 drivers/s390/crypto/ap_card.c | 98 ++--
 drivers/s390/crypto/ap_debug.h | 8 +
 drivers/s390/crypto/ap_queue.c | 513 +++++++++++--------
 drivers/s390/crypto/pkey_api.c | 20 +-
 drivers/s390/crypto/zcrypt_api.c | 574 ++++++++++++++-------
 drivers/s390/crypto/zcrypt_api.h | 49 +-
 drivers/s390/crypto/zcrypt_card.c | 30 +-
 drivers/s390/crypto/zcrypt_ccamisc.c | 320 ++++++------
 drivers/s390/crypto/zcrypt_ccamisc.h | 32 +-
 drivers/s390/crypto/zcrypt_cex2a.c | 8 +-
 drivers/s390/crypto/zcrypt_cex2c.c | 164 +++++-
 drivers/s390/crypto/zcrypt_cex4.c | 197 ++++----
 drivers/s390/crypto/zcrypt_debug.h | 8 +
 drivers/s390/crypto/zcrypt_ep11misc.c | 41 +-
 drivers/s390/crypto/zcrypt_error.h | 92 ++--
 drivers/s390/crypto/zcrypt_msgtype50.c | 179 +++----
 drivers/s390/crypto/zcrypt_msgtype6.c | 374 +++++++-------
 drivers/s390/crypto/zcrypt_msgtype6.h | 8 +-
 drivers/s390/crypto/zcrypt_queue.c | 26 +-

Revision history for this message
Frank Heimes (fheimes) wrote :

The patched focal master-next kernel sources are available as PR from here:
https://code.launchpad.net/~fheimes/+git/lp1939618

Frank Heimes (fheimes)
description: updated
Changed in linux (Ubuntu):
status: Incomplete → Triaged
Changed in ubuntu-z-systems:
status: Incomplete → Triaged
Revision history for this message
Frank Heimes (fheimes) wrote :

Pull request submitted to kernel team's mailing list:
https://lists.ubuntu.com/archives/kernel-team/2021-August/thread.html#123711
changing status to 'In Progress'.

Changed in linux (Ubuntu):
status: Triaged → In Progress
Changed in ubuntu-z-systems:
status: Triaged → In Progress
Changed in linux (Ubuntu Focal):
status: New → In Progress
Changed in linux (Ubuntu Focal):
status: In Progress → Fix Committed
Frank Heimes (fheimes)
Changed in linux (Ubuntu):
status: In Progress → Fix Committed
Changed in ubuntu-z-systems:
status: In Progress → Fix Committed
Frank Heimes (fheimes)
no longer affects: linux (Ubuntu Hirsute)
no longer affects: linux (Ubuntu Impish)
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-focal
Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2021-09-08 06:10 EDT-------
I cloned the kernel as indicated with comment #18 and did
- check out the only branch master-next
- a code verification for the subdir drivers/s390/crypto
I compared this with an upstream kernel and checked that all the patches
are applied correctly and that the code level is showing what I would expect
and nothing more. Looks good to me.
- kernel build and install and ran my internal ap/zcrypt test suite.
No issues found - looks good to me.

So I assume this is verified now.

Revision history for this message
Frank Heimes (fheimes) wrote :

Thx Harald for the verification (adjusting the tags accordingly ...)

Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
tags: added: verification-done-focal
removed: verification-needed-focal
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (34.1 KiB)

This bug was fixed in the package linux - 5.4.0-88.99

---------------
linux (5.4.0-88.99) focal; urgency=medium

  * focal/linux: 5.4.0-88.99 -proposed tracker (LP: #1944747)

  * Packaging resync (LP: #1786013)
    - debian/dkms-versions -- update from kernel-versions (main/2021.09.06)

  * please drop virtualbox-guest-dkms virtualbox-guest-source (LP: #1933248)
    - Revert "UBUNTU: [Config] Disable virtualbox dkms build"

linux (5.4.0-87.98) focal; urgency=medium

  * please drop virtualbox-guest-dkms virtualbox-guest-source (LP: #1933248)
    - [Config] Disable virtualbox dkms build

  * Packaging resync (LP: #1786013)
    - debian/dkms-versions -- update from kernel-versions (main/2021.09.06)

  * LRMv5: switch primary version handling to kernel-versions data set
    (LP: #1928921)
    - [Packaging] switch to kernel-versions

  * disable “CONFIG_HISI_DMA” config for ubuntu version (LP: #1936771)
    - Disable CONFIG_HISI_DMA
    - [Config] Record hisi_dma no longer built for arm64

  * memory leaking when removing a profile (LP: #1939915)
    - apparmor: Fix memory leak of profile proxy

  * CryptoExpress EP11 cards are going offline (LP: #1939618)
    - s390/zcrypt: Support for CCA protected key block version 2
    - s390: Replace zero-length array with flexible-array member
    - s390/zcrypt: Use scnprintf() for avoiding potential buffer overflow
    - s390/zcrypt: replace snprintf/sprintf with scnprintf
    - s390/ap: Remove ap device suspend and resume callbacks
    - s390/zcrypt: use fallthrough;
    - s390/zcrypt: use kvmalloc instead of kmalloc for 256k alloc
    - s390/ap: remove power management code from ap bus and drivers
    - s390/ap: introduce new ap function ap_get_qdev()
    - s390/zcrypt: use kzalloc
    - s390/zcrypt: fix smatch warnings
    - s390/zcrypt: code beautification and struct field renames
    - s390/zcrypt: split ioctl function into smaller code units
    - s390/ap: rename and clarify ap state machine related stuff
    - s390/zcrypt: provide cex4 cca sysfs attributes for cex3
    - s390/ap: rework crypto config info and default domain code
    - s390/zcrypt: simplify cca_findcard2 loop code
    - s390/zcrypt: remove set_fs() invocation in zcrypt device driver
    - s390/ap: remove unnecessary spin_lock_init()
    - s390/zcrypt: Support for CCA APKA master keys
    - s390/zcrypt: introduce msg tracking in zcrypt functions
    - s390/ap: split ap queue state machine state from device state
    - s390/ap: add error response code field for ap queue devices
    - s390/ap: add card/queue deconfig state
    - s390/sclp: Add support for SCLP AP adapter config/deconfig
    - s390/ap: Support AP card SCLP config and deconfig operations
    - s390/ap/zcrypt: revisit ap and zcrypt error handling
    - s390/zcrypt: move ap_msg param one level up the call chain
    - s390/zcrypt: Introduce Failure Injection feature
    - s390/zcrypt: fix wrong format specifications
    - s390/ap: fix ap devices reference counting
    - s390/zcrypt: return EIO when msg retry limit reached
    - s390/zcrypt: fix zcard and zqueue hot-unplug memleak
    - s390/ap: Fix hanging ioctl caused by wrong msg counter

  * memfd from ubuntu_kernel_s...

Changed in linux (Ubuntu Focal):
status: Fix Committed → Fix Released
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.