Crashing with CPU soft lock on GA kernel 5.15.0.79.76 and HWE kernel 5.19.0-46.47-22.04.1
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Fix Released | Undecided | Unassigned |
Jammy | Fix Released | High | Stefan Bader |
Bug Description
Impact:
We had reports of VM setups which would show intermittent crashes and then lock up completely. This could be reproduced with large-memory setups.
The problem seems to be that fixes to performance regressions caused more problems in 5.15 kernels and the full fixes are too intrusive to be backported.
Fix:
The following patch was recently sent to the upstream stable mailing list and looks to be making its way into linux-5.15.y. This changes the default value of kvm.tdp_mmu to off (if anyone is willing to take the risks, this can be changed back in config).
Regression potential:
VM hosts with many large-memory tenants might see a performance impact, which is what the TDP MMU approach tried to solve. If those hosts did not see other problems, they might turn this on again.
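As a minimal sketch of how a host's current setting could be checked: when the `kvm` module is loaded and exposes this parameter, its value can be read from `/sys/module/kvm/parameters/tdp_mmu`. The helper name `tdp_mmu_enabled` is hypothetical and not part of the fix.

```python
from pathlib import Path
from typing import Optional

# sysfs location of the kvm.tdp_mmu module parameter (present only when the
# kvm module is loaded and the running kernel exposes this parameter).
TDP_MMU_PARAM = Path("/sys/module/kvm/parameters/tdp_mmu")


def tdp_mmu_enabled(param_path: Path = TDP_MMU_PARAM) -> Optional[bool]:
    """Return True/False for the current setting, or None if it cannot be read."""
    try:
        value = param_path.read_text().strip()
    except OSError:
        # Module not loaded, parameter not present, or no permission.
        return None
    # Boolean module parameters are normally reported as "Y" or "N".
    if value in ("Y", "y", "1"):
        return True
    if value in ("N", "n", "0"):
        return False
    return None


if __name__ == "__main__":
    labels = {True: "on", False: "off", None: "unknown"}
    print(f"kvm.tdp_mmu is {labels[tdp_mmu_enabled()]}")
```

Hosts that want the previous behaviour back would typically set the parameter when the module loads, for example with an `options kvm tdp_mmu=Y` line in a file under `/etc/modprobe.d/` or with `kvm.tdp_mmu=Y` on the kernel command line. Treat both as the generic module-parameter mechanisms rather than something this report prescribes.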
Testcase:
Large openstack instance (64GB memory, AMD CPU (using SVM)) with a large second-level guest (32GB memory). Repeatedly start and stop the second-level guest.
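The start/stop loop from the test case could be scripted roughly as follows. This is only a sketch: it assumes the second-level guest is a libvirt domain managed with `virsh`, which the report does not state, and the domain name, iteration count, and settle time are placeholders.

```python
import subprocess
import time
from typing import Callable, Sequence


def _run_checked(cmd: Sequence[str]) -> None:
    """Run a command and raise if it exits non-zero."""
    subprocess.run(list(cmd), check=True)


def cycle_guest(
    domain: str,
    iterations: int,
    run: Callable[[Sequence[str]], object] = _run_checked,
    settle_seconds: float = 30.0,
) -> int:
    """Start and hard-stop a libvirt domain `iterations` times.

    `run` is injectable so the loop can be exercised without libvirt.
    Returns the number of completed cycles.
    """
    completed = 0
    for _ in range(iterations):
        run(["virsh", "start", domain])
        time.sleep(settle_seconds)  # give the guest time to boot and touch memory
        # "destroy" is an immediate power-off; it does not delete the domain.
        run(["virsh", "destroy", domain])
        time.sleep(settle_seconds)
        completed += 1
    return completed


if __name__ == "__main__":
    # "nested-guest" is a placeholder domain name.
    print(cycle_guest("nested-guest", iterations=50))
```

While the loop runs, the console log of the first-level instance is the place to watch for the soft lockup messages.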
--- original description ---
The crash occurred on a juju machine, and the juju agent was lost.
The juju machine is on an openstack instance provisioned by juju.
The openstack console log indicates that it is related to spin_lock and the KVM MMU:
[418200.348830] ? _raw_spin_
[418200.349588] _raw_write_
[418200.350196] kvm_tdp_
[418200.351014] kvm_mmu_
[418200.351796] direct_
[418200.352667] __mmu_notifier_
[418200.353624] kvm_tdp_
[418200.354496] try_to_
[418200.355436] kvm_mmu_
openstack console log: https:/
syslog: https:/
The syslog was rotated after the crash occurred, so the syslog at the time of the initial crash was lost.
Other juju machines with the 5.15.0.79.76 kernel seem to have the same issue.
We previously had a similar issue with 5.15.0-73. The juju machine crashed with raw_spin_lock and kvm mmu in the logs as well: https:/
ProblemType: Bug
DistroRelease: Ubuntu 22.04
Package: linux-image-
ProcVersionSign
Uname: Linux 5.19.0-46-generic x86_64
NonfreeKernelMo
ApportVersion: 2.20.11-0ubuntu82.5
Architecture: amd64
CasperMD5CheckR
CloudArchitecture: x86_64
CloudID: openstack
CloudName: openstack
CloudPlatform: openstack
CloudSubPlatform: metadata (http://
Date: Mon Aug 21 08:59:46 2023
Ec2AMI: ami-00000c61
Ec2AMIManifest: FIXME
Ec2Availability
Ec2InstanceType: builder-
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
ProcEnviron:
TERM=xterm-
PATH=(custom, no user)
LANG=C.UTF-8
SHELL=/bin/bash
SourcePackage: linux-signed-
UpgradeStatus: No upgrade log present (probably fresh install)
---
ProblemType: Bug
AlsaDevices:
total 0
crw-rw---- 1 root audio 116, 1 Aug 23 03:23 seq
crw-rw---- 1 root audio 116, 33 Aug 23 03:23 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.11-0ubuntu82.5
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: N/A
CasperMD5CheckR
CloudArchitecture: x86_64
CloudID: openstack
CloudName: openstack
CloudPlatform: openstack
CloudSubPlatform: metadata (http://
DistroRelease: Ubuntu 22.04
Ec2AMI: ami-00000fbb
Ec2AMIManifest: FIXME
Ec2Availability
Ec2InstanceType: builder-
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lsusb: Bus 001 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Lsusb-t: /: Bus 01.Port 1: Dev 1, Class=root_hub, Driver=uhci_hcd/2p, 12M
MachineType: OpenStack Foundation OpenStack Nova
NonfreeKernelMo
Package: linux (not installed)
PciMultimedia:
ProcEnviron:
TERM=xterm-
PATH=(custom, no user)
LANG=C.UTF-8
SHELL=/bin/bash
ProcFB: 0 qxldrmfb
ProcKernelCmdLine: BOOT_IMAGE=
ProcVersionSign
RelatedPackageV
linux-
linux-
linux-firmware 20220329.
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
Tags: jammy ec2-images
Uname: Linux 5.15.0-83-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: N/A
_MarkForUpload: True
dmi.bios.date: 04/01/2014
dmi.bios.release: 0.0
dmi.bios.vendor: SeaBIOS
dmi.bios.version: 1.13.0-1ubuntu1.1
dmi.chassis.type: 1
dmi.chassis.vendor: QEMU
dmi.chassis.
dmi.modalias: dmi:bvnSeaBIOS:
dmi.product.family: Virtual Machine
dmi.product.name: OpenStack Nova
dmi.product.
dmi.sys.vendor: OpenStack Foundation
description: updated
description: updated
affects: linux-signed-hwe-5.19 (Ubuntu) → linux (Ubuntu)
Changed in linux (Ubuntu Jammy):
status: New → Triaged
importance: Undecided → High
summary:
- Crashing with CPU soft lock on HWE kernel 5.19.0-46.47-22.04.1
+ Crashing with CPU soft lock on GA kernel 5.15.0.79.76 and HWE kernel 5.19.0-46.47-22.04.1
Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Changed in linux (Ubuntu Jammy):
status: Triaged → In Progress
assignee: nobody → Stefan Bader (smb)
description: updated
Changed in linux (Ubuntu Jammy):
status: In Progress → Fix Committed
The problem is that the start of the issue is missing. What is there indicates that the VM ran into a page fault situation and is stuck on a lock. Judging from other reports, the console output might be part of a "soft CPU lockup detected" message. This is often the result of a previous crash (which leaves locks in a locked state). Would it be possible to reset a stuck instance and retrieve the rotated journal?
The other problem is that the virtualization stack cannot really be deduced from the given info. There is at least one level of VM: the instance in openstack. Question: is the juju "machine" a secondary VM on the openstack instance? The available trace suggests that the lockup occurred on an AMD CPU. This is relevant for VMs because Intel and AMD use different virtualization extensions (VMX vs. SVM).
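To help answer the Intel-versus-AMD question on an affected machine, the CPU flags in `/proc/cpuinfo` can be inspected: `svm` indicates AMD's extension and `vmx` Intel's. The following is a small sketch with a hypothetical helper name. Inside a VM, the flag only shows up if the hypervisor exposes nested virtualization to the guest.

```python
from typing import Optional


def virt_extension(cpuinfo_text: str) -> Optional[str]:
    """Return "SVM (AMD)", "VMX (Intel)", or None based on the first flags line."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Line looks like: "flags\t\t: fpu vme de ... svm ..."
            _, _, rest = line.partition(":")
            flags = set(rest.split())
            if "svm" in flags:
                return "SVM (AMD)"
            if "vmx" in flags:
                return "VMX (Intel)"
            return None
    return None


if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        print(virt_extension(f.read()) or "no hardware virtualization flag visible")
```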
Another note on 5.19: the HWE kernel rolled to 6.2 with 22.04.3. I know this could result in other issues, but it may be one possible option for the current situation.