Surelock GA2 SP1: capiredp01: cxl_init_adapter fails for CAPI devices 0000:01:00.0 and 0005:01:00.0 after upgrading to 840.10 Platform firmware build fips840/b1208b_1604.840

Bug #1532914 reported by bugproxy
This bug affects 1 person
Affects         Status        Importance  Assigned to  Milestone
linux (Ubuntu)  Fix Released  High        Tim Gardner
Wily            Fix Released  Undecided   Tim Gardner
Xenial          Fix Released  High        Tim Gardner

Bug Description

Problem Description
++++++++++++++++++++
I upgraded the Platform firmware to the 840.10 Platform firmware build (b1208b_1604.840) to prepare for Surelock GA2 SP1 testing. After the upgrade, I used the ipmitool to power on capiredfsp.aus.stglabs.ibm.com and boot the Ubuntu 15.10 partition (capiredp01.aus.stglabs.ibm.com) in OPAL firmware mode. In petitboot, I saw messages for "cxl-pci 0000:01:00.0: cxl_init_adapter failed: -5" and "cxl-pci 0005:01:00.0: cxl_init_adapter failed: -5." After the partition started running, I didn't see any AFU devices in /dev/cxl/ or /sys/class/cxl/ although I was able to see PCI devices for the hardware accelerators (0000:01:00.0 and 0005:01:00.0) with the lspci command.

ubuntu@capiredp01:~$ ls -l /dev/cxl/
ls: cannot access /dev/cxl/: No such file or directory
ubuntu@capiredp01:~$ ls -l /sys/class/cxl/
total 0
ubuntu@capiredp01:~$ sudo lscfg | grep -i afu
ubuntu@capiredp01:~$ sudo lspci|egrep -i "04cf|0477"
0000:01:00.0 Processing accelerators: IBM Device 04cf (rev 01)
0005:01:00.0 Processing accelerators: IBM Device 04cf (rev 01)
ubuntu@capiredp01:~$ lsscsi -g
[0:0:0:0] enclosu IBM VSBPD12M1 6GSAS 03 - /dev/sg1
[0:0:1:0] cd/dvd IBM. RMBO0140512 RA65 /dev/sr0 /dev/sg2
[0:3:0:0] no dev IBM 57D7001SISIOA 0150 - /dev/sg0
[1:0:0:0] enclosu IBM VSBPD12M1 6GSAS 03 - /dev/sg4
[1:0:1:0] disk IBM HUC109030CSS600 E5C6 /dev/sda /dev/sg5
[1:0:2:0] disk IBM HUC101212CSS600 A5AA /dev/sdb /dev/sg6
[1:0:3:0] disk IBM HUC101212CSS600 A5AA /dev/sdc /dev/sg7
[1:0:4:0] disk IBM HUC101212CSS600 A5AA /dev/sdd /dev/sg8
[1:0:5:0] disk IBM ST1200MM0007 BF04 /dev/sde /dev/sg9
[1:0:6:0] disk IBM ST1200MM0007 BF04 /dev/sdf /dev/sg10
[1:3:0:0] no dev IBM 57D7001SISIOA 0150 - /dev/sg3
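
The symptom shown above (no AFU nodes even though lspci lists the adapters) can be checked with a small script. This is a minimal sketch and not part of the original report: the function name `count_afu_nodes` is hypothetical, and it assumes only that AFU nodes appear as `afu*` entries under /dev/cxl, as in the listings in this bug.

```shell
#!/bin/sh
# Count afu* entries under a cxl device directory.
# Usage: count_afu_nodes [dir]   (dir defaults to /dev/cxl)
# NOTE: count_afu_nodes is an illustrative helper, not an existing tool.
count_afu_nodes() {
    dir="${1:-/dev/cxl}"
    count=0
    # A missing directory means no cxl device nodes were created at all.
    if [ -d "$dir" ]; then
        for node in "$dir"/afu*; do
            # The glob stays unexpanded when nothing matches; skip that case.
            [ -e "$node" ] || continue
            count=$((count + 1))
        done
    fi
    echo "$count"
}
```

On the affected partition, `count_afu_nodes` would be expected to print 0, since /dev/cxl/ does not exist there.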

This is a regression: the Linux kernel has failed to synchronize the PSL timebase.
The corresponding error message is in the dmesg log attached in comment #4:

[ 1.687586] PSL: Timebase sync: giving up!

CAPI devices are not enabled, because of this failure.
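
To confirm that a given boot hit this failure, the kernel log can be searched for the exact message quoted above. This is a rough sketch: the helper name `psl_timebase_failed` is made up for illustration, and the only string it relies on is the one from the attached dmesg log.

```shell
#!/bin/sh
# Succeed (exit 0) if the log contains the PSL timebase sync failure.
# Usage: psl_timebase_failed <logfile>
#        dmesg | psl_timebase_failed
# NOTE: psl_timebase_failed is a hypothetical helper name.
psl_timebase_failed() {
    if [ $# -gt 0 ]; then
        # -F matches the message literally; -q suppresses output.
        grep -qF 'PSL: Timebase sync: giving up!' "$1"
    else
        # No file given: read the log from standard input.
        grep -qF 'PSL: Timebase sync: giving up!'
    fi
}
```

For example, `dmesg | psl_timebase_failed && echo "PSL timebase sync failed"` would report the failure on a running system.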

PSL Timebase sync should not be a requirement for CAPI initialization, nor should it make an initialized card become unavailable. Currently, timebase is an unused function of CAPI with hopes of adoption in the future. Support of this feature should be considered optional at this time.

I'm not sure what the fastest way to fix this is, but it needs to be fixed as quickly as possible. CAPI is broken in Ubuntu 15.10.

I can reproduce the bug, regardless of the skiboot level, with recent kernels.
Older kernels behave as expected, regardless of the skiboot level.

Firmware is not the cause of the regression; the kernel probably is.
I sent this out to the capi-linux distro too, but I'll comment here as well. I'm not sure what is being looked at to determine that the PSL timebase sync failed. As far as I know, all PSL versions should support timebase. The only timebase error the PSL logs is when CAPP returns a status that says timebase has an error. If that is the issue, I'd think that timebase has not been enabled or sequenced correctly in the host CAPP. The PSL can't be enabled for timebase until the CAPP unit in the host has been enabled.

I have installed a recent mainline Linux kernel (4.4.0-rc8) on capiredp01. I have rebooted this kernel and verified that the PSL timebase syncs without problem.

I will now compare the source code of Ubuntu kernel 4.2.0-19 (that hits the bug) with the source of mainline kernel 4.4.0-rc8 (that operates as expected).

I have updated the Ubuntu kernel and modules with:

$ sudo apt-get install linux-image-4.2.0-23-generic
$ sudo apt-get install linux-image-extra-4.2.0-23-generic

I have rebooted Ubuntu kernel linux-image-4.2.0-23-generic, and found that the cxl driver hits the bug.
I have also downloaded the source for this Ubuntu kernel (and modules) with:

$ sudo apt-get source linux-image-4.2.0-23-generic

I have recompiled and installed, and noticed that the resulting kernel bears the version 4.2.6 (??). I have rebooted this Ubuntu kernel 4.2.6 built from the Ubuntu source for 4.2.0-23-generic, and found that the timebase sync occurs normally.

In short, the kernels linux-4.2.6 and linux-4.4.0-rc8 (which I built from source provided by Ubuntu and Linus, respectively) operate normally, whereas all kernels compiled by, and downloaded from, Ubuntu hit the timebase sync bug.

I will try to investigate possible differences between kernel config files or toolchain and build procedures.

I have found that the bug can be activated or prevented via the Linux kernel config file.
I have compiled the Ubuntu kernel source downloaded with

$ sudo apt-get source linux-image-4.2.0-23-generic

1. with my own config file => PSL timebase sync works fine
2. with the config file supplied by Ubuntu => PSL timebase sync fails

I will now diff the config files, and try to identify the set of config parameters that change the kernel behavior regarding timebase sync.

Got it. Here is the difference between config-4.2.0-23-generic (that hits the bug) and .config (that operates normally):

$ diff config-4.2.0-23-generic .config
130,131c130,132
< CONFIG_TICK_CPU_ACCOUNTING=y
< # CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set
---
> CONFIG_VIRT_CPU_ACCOUNTING=y
> # CONFIG_TICK_CPU_ACCOUNTING is not set
> CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y

For some reason, setting CONFIG_TICK_CPU_ACCOUNTING breaks PSL Timebase sync on ppc64le. Investigating further.

Canonical, can you please replace

CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set

by

CONFIG_VIRT_CPU_ACCOUNTING=y
# CONFIG_TICK_CPU_ACCOUNTING is not set
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y

in the default ppc64le Linux kernel configuration file?
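
To see which CPU-accounting mode a particular kernel config selects, the relevant lines can be pulled out of the config file. This is a sketch under stated assumptions: `cpu_accounting_opts` is a hypothetical helper, and it looks only at the three option names that appear in the diff above.

```shell
#!/bin/sh
# Print the CPU-accounting options from a kernel config file, including
# the "# ... is not set" lines, so two configs can be compared.
# Usage: cpu_accounting_opts <config-file>
# NOTE: cpu_accounting_opts is an illustrative name, not an existing tool.
cpu_accounting_opts() {
    # Matches CONFIG_TICK_CPU_ACCOUNTING, CONFIG_VIRT_CPU_ACCOUNTING and
    # CONFIG_VIRT_CPU_ACCOUNTING_NATIVE, whether set or "is not set".
    grep -E '^(# )?CONFIG_(TICK|VIRT)_CPU_ACCOUNTING(_NATIVE)?[= ]' "$1"
}
```

Running it on config-4.2.0-23-generic and on the locally built .config should show the same lines as the diff, without the unrelated options.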

bugproxy (bugproxy) wrote : capiredp01_dmesg

Default Comment by Bridge

tags: added: architecture-ppc64 bugnameltc-133987 severity-critical targetmilestone-inin---
bugproxy (bugproxy) wrote : capiredp01_kern.log

Default Comment by Bridge

bugproxy (bugproxy) wrote : capiredp01_msglog

Default Comment by Bridge

bugproxy (bugproxy) wrote : capiredp01_syslog

Default Comment by Bridge

bugproxy (bugproxy) wrote : Temporary workaround patch

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
Luciano Chavez (lnx1138)
affects: ubuntu → linux (Ubuntu)
bugproxy (bugproxy)
tags: added: architecture-ppc64le
removed: architecture-ppc64
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-01-12 17:17 EDT-------
(In reply to comment #26)
> Canonical, can you please replace
>
> CONFIG_TICK_CPU_ACCOUNTING=y
> # CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set
>
> by
>
> CONFIG_VIRT_CPU_ACCOUNTING=y
> # CONFIG_TICK_CPU_ACCOUNTING is not set
> CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y
>
> in the default ppc64le Linux kernel configuration file?

What's the actual bug? Why does this config option cause it to not sync?

I object to just flipping this config option. We need to root cause the issue rather than papering over it with this. This config option improves context switching time (for every process in the system), so I really don't want to turn it off.

penalvch (penalvch)
Changed in linux (Ubuntu):
importance: Undecided → Medium
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team)
status: New → Triaged
Changed in linux (Ubuntu):
status: Triaged → Incomplete
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-01-25 12:16 EDT-------
==== State: Assigned by: sshrisat on 25 January 2016 11:13:19 ====

I have opened another defect:

SW332831 : PSL_FIR_SLICE_An error triggered while running htx on Surelock system with 840 driver

This problem happened when I ran htx on a system with kernel 4.4.0.

Please refer to defect SW332831.

tags: added: architecture-ppc64
removed: architecture-ppc64le
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-02-18 00:30 EDT-------
*** Bug 137313 has been marked as a duplicate of this bug. ***

bugproxy (bugproxy)
tags: added: targetmilestone-inin1510
removed: targetmilestone-inin---
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-02-25 02:19 EDT-------
A patch for this has been sent to the linuxppc-dev mailing list and is being reviewed:
http://patchwork.ozlabs.org/patch/587545

Breno Leitão (breno-leitao) wrote :

Kernel team,

This is a high priority problem for IBM. Can you mark this bug's importance as High?

This also affects both Ubuntu 15.10 (kernel 4.2) and 16.04 (kernel 4.4). If you could cherry-pick the fix for both versions, it would be ideal. :-)

Thank you!

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
importance: Medium → High
Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Xenial):
assignee: Canonical Kernel Team (canonical-kernel-team) → Tim Gardner (timg-tpi)
status: Confirmed → Fix Committed
Changed in linux (Ubuntu Wily):
assignee: nobody → Tim Gardner (timg-tpi)
status: New → In Progress
Breno Leitão (breno-leitao) wrote :

Thanks Tim!

The patch was already integrated in the powerpc branch.

https://git.kernel.org/powerpc/c/923adb1646d5ba739d2a1e63ee

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-9.24

---------------
linux (4.4.0-9.24) xenial; urgency=low

  [ Tim Gardner ]

  * Release Tracking Bug
    - LP: #1551319

  * AppArmor logs denial for when the device path is ENOENT (LP: #1482943)
    - SAUCE: apparmor: fix log of apparmor audit message when kern_path() fails

  * BUG: unable to handle kernel NULL pointer dereference (aa_label_merge) (LP:
    #1448912)
    - SAUCE: apparmor: Fix: insert race between label_update and label_merge
    - SAUCE: apparmor: Fix: ensure aa_get_newest will trip debugging if the
      replacedby is not setup
    - SAUCE: apparmor: Fix: label merge handling of marking unconfined and stale
    - SAUCE: apparmor: Fix: refcount race between locating in labelset and get
    - SAUCE: apparmor: Fix: ensure new labels resulting from merge have a
      replacedby
    - SAUCE: apparmor: Fix: label_vec_merge insertion
    - SAUCE: apparmor: Fix: deadlock in aa_put_label() call chain
    - SAUCE: apparmor: Fix: add required locking of __aa_update_replacedby on
      merge path
    - SAUCE: apparmor: Fix: convert replacedby update to be protected by the
      labelset lock
    - SAUCE: apparmor: Fix: update replacedby allocation to take a gfp parameter

  * apparmor kernel BUG kills firefox (LP: #1430546)
    - SAUCE: apparmor: Disallow update of cred when then subjective != the
      objective cred
    - SAUCE: apparmor: rework retrieval of the current label in the profile update
      case

  * sleep from invalid context in aa_move_mount (LP: #1539349)
    - SAUCE: apparmor: fix sleep from invalid context

  * s390x: correct restore of high gprs on signal return (LP: #1550468)
    - s390/compat: correct restore of high gprs on signal return

  * missing SMAP support (LP: #1550517)
    - x86/entry/compat: Add missing CLAC to entry_INT80_32

  * Floating-point exception handler receives empty Data-Exception Code in
    Floating Point Control register (LP: #1548414)
    - s390/fpu: signals vs. floating point control register

  * kvm fails to boot GNU Hurd kernels with 4.4 Xenial kernel (LP: #1550596)
    - KVM: x86: fix conversion of addresses to linear in 32-bit protected mode

  * Surelock GA2 SP1: capiredp01: cxl_init_adapter fails for CAPI devices
    0000:01:00.0 and 0005:01:00.0 after upgrading to 840.10 Platform firmware
    build fips840/b1208b_1604.840 (LP: #1532914)
    - cxl: Fix PSL timebase synchronization detection

  * [Feature]EDAC support for Knights Landing (LP: #1519631)
    - EDAC, sb_edac: Set fixed DIMM width on Xeon Knights Landing

  * Various failures of kernel_security suite on Xenial kernel on s390x arch
    (LP: #1531327)
    - [config] s390x -- CONFIG_DEFAULT_MMAP_MIN_ADDR=65536

  * Unable to install VirtualBox Guest Service in 15.04 (LP: #1434579)
    - [Config] Provides: virtualbox-guest-modules when appropriate

  * linux is missing provides for virtualbox-guest-modules [i386 amd64 x32] (LP:
    #1507588)
    - [Config] Provides: virtualbox-guest-modules when appropriate

  * Backport more recent driver for SKL, KBL and BXT graphics (LP: #1540390)
    - SAUCE: i915_bpo: Provide a backport driver for SKL, KBL & BXT graphics
    - SA...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu Wily):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.2.0-34.39

---------------
linux (4.2.0-34.39) wily; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #1555821

  [ Florian Westphal ]

  * SAUCE: [nf] netfilter: x_tables: check for size overflow
    - LP: #1555353
  * SAUCE: [nf,v2] netfilter: x_tables: don't rely on well-behaving
    userspace
    - LP: #1555338

linux (4.2.0-33.38) wily; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #1554649

  [ Upstream Kernel Changes ]

  * Revert "drm/radeon: call hpd_irq_event on resume"
    - LP: #1554608
  * cxl: Fix PSL timebase synchronization detection
    - LP: #1532914

linux (4.2.0-32.37) wily; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1550045

  [ Kamal Mostafa ]

  * Merged back Ubuntu-4.2.0-31.36

linux (4.2.0-31.36) wily; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #1548579

  [ Andy Whitcroft ]

  * [Debian] hv: hv_set_ifconfig -- convert to python3
    - LP: #1506521
  * [Debian] hv: hv_set_ifconfig -- switch to approved indentation
    - LP: #1540586
  * [Debian] hv: hv_set_ifconfig -- fix numerous parameter handling issues
    - LP: #1540586

  [ Carol L Soto ]

  * SAUCE: IB/IPoIB: Do not set skb truesize since using one linearskb
    - LP: #1541326

  [ Dan Streetman ]

  * SAUCE: nbd: ratelimit error msgs after socket close
    - LP: #1505564

  [ Tim Gardner ]

  * Revert "SAUCE: (noup) cxlflash: Fix to avoid virtual LUN failover
    failure"
    - LP: #1541635
  * Revert "SAUCE: (noup) cxlflash: Fix to escalate LINK_RESET also on port
    1"
    - LP: #1541635
  * [Config] ARMV8_DEPRECATED=y
    - LP: #1545542

  [ Upstream Kernel Changes ]

  * x86/xen/p2m: hint at the last populated P2M entry
    - LP: #1542941
  * mm: add dma_pool_zalloc() call to DMA API
    - LP: #1543737
  * sctp: Prevent soft lockup when sctp_accept() is called during a timeout
    event
    - LP: #1543737
  * xen-netback: respect user provided max_queues
    - LP: #1543737
  * xen-netfront: respect user provided max_queues
    - LP: #1543737
  * xen-netfront: update num_queues to real created
    - LP: #1543737
  * iio: adis_buffer: Fix out-of-bounds memory access
    - LP: #1543737
  * KVM: PPC: Fix emulation of H_SET_DABR/X on POWER8
    - LP: #1543737
  * KVM: PPC: Fix ONE_REG AltiVec support
    - LP: #1543737
  * x86/irq: Call chip->irq_set_affinity in proper context
    - LP: #1543737
  * drm/amdgpu: fix tonga smu resume
    - LP: #1543737
  * perf kvm record/report: 'unprocessable sample' error while
    recording/reporting guest data
    - LP: #1543737
  * hrtimer: Handle remaining time proper for TIME_LOW_RES
    - LP: #1543737
  * timerfd: Handle relative timers with CONFIG_TIME_LOW_RES proper
    - LP: #1543737
  * posix-timers: Handle relative timers with CONFIG_TIME_LOW_RES proper
    - LP: #1543737
  * itimers: Handle relative timers with CONFIG_TIME_LOW_RES proper
    - LP: #1543737
  * drm/amdgpu: Use drm_calloc_large for VM page_tables array
    - LP: #1543737
  * drm/amdgpu: fix amdgpu_bo_pin_restricted VRAM placing v2
    - LP: #1543737
  * drm/radeon: properly byte swap vce firmware setup
    - LP: #1543737
  ...

Changed in linux (Ubuntu Wily):
status: Fix Committed → Fix Released
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-03-15 13:16 EDT-------
Marking comment as external

(In reply to comment #48)
> I can see CAPI device on Ubuntu 15.10 with kernel 4.2.0-34-generic
> #39-Ubuntu. I also didn't see timeout messages for PSL timebase.
>
> $ pwd
> /home/ubuntu
>
> $ lsb_release -a && uname -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description: Ubuntu 15.10
> Release: 15.10
> Codename: wily
> Linux capiredp01 4.2.0-34-generic #39-Ubuntu SMP Thu Mar 10 22:11:28 UTC
> 2016 ppc64le ppc64le ppc64le GNU/Linux
>
> $ ls -l /dev/cxl/
> total 0
> crw------- 1 cxl cxl 246, 2 Mar 15 12:10 afu0.0m
> crw------- 1 cxl cxl 246, 3 Mar 15 12:10 afu0.0s
> crw------- 1 cxl cxl 246, 15 Mar 15 12:10 afu1.0m
> crw------- 1 cxl cxl 246, 16 Mar 15 12:10 afu1.0s

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-03-15 13:24 EDT-------
Fix verified on Xenial level

root@z1391:~# uname -a
Linux z1391 4.4.0-12-generic #28-Ubuntu SMP Wed Mar 9 00:40:38 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux
root@z1391:~# cat /proc/version
Linux version 4.4.0-12-generic (buildd@bos01-ppc64el-001) (gcc version 5.3.1 20160225 (Ubuntu/IBM 5.3.1-10ubuntu2) ) #28-Ubuntu SMP Wed Mar 9 00:40:38 UTC 2016

$ cupdcmd -f

Machine Type:............8247-22L
Card Type:...............FSP2_P8LE
Current Boot Side:.......T
Next Boot Side:..........T
PT_Swap:.................0
Current Side Driver:.....fips840/b0307b_1611.840
Non-Current Side Driver:.fips840/b0307b_1611.840
P Side FipS Valid Flag:..1
T Side FipS Valid Flag:..1
Update Policy:...........Out of Band Update
Current FSP Position.....A
Current FSP Role:........Primary
Sibling FSP:.............Not Present
Sibling FSP IP...........Unknown
Image Type...............ship
