rcu_sched detected stalls on CPUs/tasks

Bug #1967130 reported by Heinrich Schuchardt
This bug affects 2 people
Affects: linux-riscv (Ubuntu)
Status: Fix Released
Importance: Undecided
Assigned to: Dimitri John Ledkov

Bug Description

When running the riscv64 live installer (https://cdimage.ubuntu.com/ubuntu-server/daily-live/20220330/jammy-live-server-riscv64.iso) on the SiFive Unmatched board we reproducibly see rcu_sched stalls related to accessing the UEFI runtime:

Mar 30 11:50:47 ubuntu-server kernel: [ 1313.451647] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Mar 30 11:50:47 ubuntu-server kernel: [ 1313.451673] rcu: 3-...0: (3 GPs behind) idle=d03/1/0x4000000000000000 softirq=85648/85648 fqs=7326
Mar 30 11:50:47 ubuntu-server kernel: [ 1313.451698] (detected by 0, t=15002 jiffies, g=155853, q=1286)
Mar 30 11:50:47 ubuntu-server kernel: [ 1313.451712] Task dump for CPU 3:
Mar 30 11:50:47 ubuntu-server kernel: [ 1313.451720] task:kworker/u8:1 state:R running task stack: 0 pid: 3715 ppid: 2 flags:0x00000008
Mar 30 11:50:47 ubuntu-server kernel: [ 1313.451750] Workqueue: efi_rts_wq efi_call_rts
Mar 30 11:50:47 ubuntu-server kernel: [ 1313.451777] Call Trace:
Mar 30 11:50:47 ubuntu-server kernel: [ 1313.451782] [<ffffffff809be1b6>] __schedule+0x226/0x644

uname -a:
Linux ubuntu-server 5.15.0-1004-generic #4-Ubuntu SMP Wed Feb 9 18:17:33 UTC 2022 riscv64 riscv64 riscv64 GNU/Linux
---
ProblemType: Bug
Architecture: riscv64
DistroRelease: Ubuntu 22.04
Package: linux-image-generic 5.15.0.1004.4
PackageArchitecture: riscv64
ProcEnviron:
 TERM=vt220
 PATH=(custom, no user)
 LANG=C.UTF-8
Uname: Linux 5.15.0-1004-generic riscv64
UserGroups: N/A
_MarkForUpload: True

Heinrich Schuchardt (xypron) wrote : Dependencies.txt

apport information

tags: added: apport-collected
description: updated
Heinrich Schuchardt (xypron) wrote :

efi: Make efi_rts_work accessible to efi page fault handler
https://<email address hidden>/T/#mcbd236122ec0ae61f1ee9c22b8d7910ffc2276a0
addressed a similar issue on x86.

Alexandre Ghiti (alexghiti) wrote :

I have a workaround for this:

    UBUNTU: SAUCE: riscv: Disable VMAP_STACK since it fails with efi

    When VMAP_STACK is enabled, kernel threads have their stacks in the vmalloc
    region.

    So when the kworker that handles the efi work queue (efi_call_rts) calls
    efi_virtmap_load and then switch_mm, the worker's stack may lie in a vmalloc
    area not yet synchronized with efi_mm (RISC-V populates the vmalloc area
    lazily). Any access to that stack then triggers a fault that cannot be
    resolved: saving the context in the trap handler touches the same stack,
    which triggers a new trap, and so on.

    So disable VMAP_STACK for now until we figure out the correct fix.

And I'm working on the proper fix, which consists of synchronizing the efi page table with the page table of the calling thread before switching to efi_mm.
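
For readers who want to see the failure mode and the proposed fix side by side, here is a minimal user-space sketch. It is a toy model, not kernel code: page tables are reduced to flat lists, and every name in it (`Mm`, `sync_kernel_entries`, `caller_mm`, the slot numbers) is a hypothetical stand-in chosen for illustration. It only shows that an entry added to the caller's table after efi_mm was set up is missing in efi_mm unless the kernel entries are copied over before the switch.

```python
# Toy model only: page tables are reduced to flat lists, and all names
# (Mm, sync_kernel_entries, the slot numbers) are hypothetical.

USER_SLOTS = 4    # assumed split: slots 0-3 stand for user mappings
TOTAL_SLOTS = 8   # slots 4-7 stand for kernel mappings, vmalloc included


class Mm:
    """A top-level page table reduced to a list of slots."""

    def __init__(self, name):
        self.name = name
        self.pgd = [None] * TOTAL_SLOTS

    def access(self, slot):
        """Return the mapping, or raise to stand in for a page fault."""
        if self.pgd[slot] is None:
            raise LookupError(f"fault: slot {slot} not mapped in {self.name}")
        return self.pgd[slot]


def sync_kernel_entries(dst, src):
    """Copy the kernel slots of src's top-level table into dst."""
    dst.pgd[USER_SLOTS:] = src.pgd[USER_SLOTS:]


def touch_stack_after_switch(sync_first):
    """Switch to efi_mm and touch the worker's stack, from a fresh state."""
    caller_mm, efi_mm = Mm("caller_mm"), Mm("efi_mm")

    # efi_mm is set up early and sees the kernel mappings that exist then.
    caller_mm.pgd[4] = "kernel image"
    sync_kernel_entries(efi_mm, caller_mm)

    # Later the worker's stack is allocated in vmalloc space; the new
    # top-level entry is not propagated to efi_mm (lazy population).
    stack_slot = 6
    caller_mm.pgd[stack_slot] = "kworker stack"

    if sync_first:
        sync_kernel_entries(efi_mm, caller_mm)  # the proposed fix

    try:
        return efi_mm.access(stack_slot)  # first stack access after the switch
    except LookupError as fault:
        return str(fault)


print(touch_stack_after_switch(sync_first=False))
print(touch_stack_after_switch(sync_first=True))
```

In the model the missing entry is simply reported. On real hardware the fault handler itself needs the unmapped stack, which is why the CPU never makes progress and the RCU stall is reported.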

Alexandre Ghiti (alexghiti) wrote :

The workaround built successfully in my PPA: https://launchpad.net/~alexghiti/+archive/ubuntu/riscv/+sourcepub/13413205/+listing-archive-extra

I tested it successfully with the RISC-V live installer (where this problem was seen).

I have not yet had time to work on the proper fix, so I think we should go with this workaround. @xnox, what do you think?

Thanks

Dimitri John Ledkov (xnox) wrote :

VMAP_STACK has only been available on riscv since the v5.14 kernel, meaning we haven't shipped any releases on riscv with VMAP_STACK enabled.

When VMAP_STACK was enabled in the kernel, nobody had EFI on riscv yet either.

So it makes sense that we are only hitting this issue now that all the pieces have landed together.

I would be in favour of disabling VMAP_STACK, on riscv64 only, until we have more fixes to make it work better.
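
To check whether a given kernel build has VMAP_STACK enabled, something like the following could be used. This is a minimal sketch: the helper name `kconfig_value` and the sample text are made up for illustration, and the sample is not copied from a real Ubuntu config file.

```python
import re


def kconfig_value(config_text, option):
    """Return the value assigned to option, or None if unset or absent."""
    pattern = re.compile(rf"^{re.escape(option)}=(.*)$")
    for line in config_text.splitlines():
        match = pattern.match(line.strip())
        if match:
            return match.group(1)
    return None


# Illustrative input, not copied from a real Ubuntu config file.
SAMPLE_CONFIG = """\
CONFIG_HAVE_ARCH_VMAP_STACK=y
# CONFIG_VMAP_STACK is not set
CONFIG_EFI=y
"""

print(kconfig_value(SAMPLE_CONFIG, "CONFIG_VMAP_STACK"))
print(kconfig_value(SAMPLE_CONFIG, "CONFIG_EFI"))
```

On an installed system the text would typically be read from the config file shipped next to the kernel image under /boot (assuming that file is present); the pattern is anchored at the start of the line so that CONFIG_HAVE_ARCH_VMAP_STACK is not mistaken for CONFIG_VMAP_STACK.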

Changed in linux-riscv (Ubuntu):
milestone: none → ubuntu-22.04
status: New → Triaged
Changed in linux-riscv (Ubuntu):
status: Triaged → Fix Committed
assignee: nobody → Dimitri John Ledkov (xnox)
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-riscv - 5.15.0-1007.7

---------------
linux-riscv (5.15.0-1007.7) jammy; urgency=medium

  * jammy/linux-riscv: 5.15.0-1007.7 -proposed tracker (LP: #1968740)

  * rcu_sched detected stalls on CPUs/tasks (LP: #1967130)
    - [Config] Disable VMAP_STACK due to CPU stalls on EFI

  * Fix unmatched ASMedia ASM2824 PCIe link training (LP: #1964796)
    - PCI: fu740: Force 2.5GT/s for initial device probe

 -- Dimitri John Ledkov <email address hidden> Tue, 12 Apr 2022 15:44:58 +0100

Changed in linux-riscv (Ubuntu):
status: Fix Committed → Fix Released
Emil Renner Berthing (esmil) wrote :

There is more information in LP: #1992458

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-hwe-5.19/5.19.0-24.25~22.04.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-hwe-5.19 verification-needed-jammy
Emil Renner Berthing (esmil) wrote :

This should now be fixed by the upstream commit:
3f105a742725 ("riscv: Sync efi page table's kernel mappings before switching")

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure-6.5/6.5.0-1007.7~22.04.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-azure-6.5' to 'verification-done-jammy-linux-azure-6.5'. If the problem still exists, change the tag 'verification-needed-jammy-linux-azure-6.5' to 'verification-failed-jammy-linux-azure-6.5'.

tags: added: kernel-spammed-jammy-linux-azure-6.5-v2 verification-needed-jammy-linux-azure-6.5
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-aws-6.5/6.5.0-1008.8~22.04.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-aws-6.5' to 'verification-done-jammy-linux-aws-6.5'. If the problem still exists, change the tag 'verification-needed-jammy-linux-aws-6.5' to 'verification-failed-jammy-linux-aws-6.5'.

tags: added: kernel-spammed-jammy-linux-aws-6.5-v2 verification-needed-jammy-linux-aws-6.5