cpu soft lockup running kvm

Bug #1268906 reported by James Hunt
This bug affects 3 people
Affects             Status        Importance  Assigned to   Milestone
linux (Ubuntu)      Fix Released  Medium      Stefan Bader
  Trusty            Fix Released  Medium      Stefan Bader
qemu-kvm (Ubuntu)   Invalid       Medium      Unassigned
  Trusty            Invalid       Medium      Unassigned

Bug Description

Ran twice - first time running kvm killed my system. The second time, I just got lots of kernel oops messages in dmesg.

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: linux-image-3.13.0-2-generic 3.13.0-2.17
ProcVersionSignature: Ubuntu 3.13.0-2.17-generic 3.13.0-rc7
Uname: Linux 3.13.0-2-generic i686
NonfreeKernelModules: nvidia
ApportVersion: 2.13.1-0ubuntu1
Architecture: i386
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC1: james 3250 F.... pulseaudio
 /dev/snd/controlC0: james 3250 F.... pulseaudio
 /dev/snd/pcmC0D0p: james 3250 F...m pulseaudio
CurrentDesktop: Unity
Date: Tue Jan 14 10:29:20 2014
HibernationDevice: RESUME=UUID=67e3cd44-242b-4bbf-918b-28fff81e0312
InstallationDate: Installed on 2010-10-21 (1181 days ago)
InstallationMedia: Ubuntu 10.10 "Maverick Meerkat" - Release i386 (20101007)
MachineType: LENOVO 2516CTO
ProcFB:

ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.13.0-2-generic root=UUID=7ad192e9-7b26-49d1-8e1c-fefc7dc495cb ro quiet splash
RelatedPackageVersions:
 linux-restricted-modules-3.13.0-2-generic N/A
 linux-backports-modules-3.13.0-2-generic N/A
 linux-firmware 1.121
SourcePackage: linux
UpgradeStatus: Upgraded to trusty on 2013-11-01 (73 days ago)
dmi.bios.date: 08/27/2010
dmi.bios.vendor: LENOVO
dmi.bios.version: 6IET72WW (1.32 )
dmi.board.name: 2516CTO
dmi.board.vendor: LENOVO
dmi.board.version: Not Available
dmi.chassis.asset.tag: No Asset Information
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: Not Available
dmi.modalias: dmi:bvnLENOVO:bvr6IET72WW(1.32):bd08/27/2010:svnLENOVO:pn2516CTO:pvrThinkPadT410:rvnLENOVO:rn2516CTO:rvrNotAvailable:cvnLENOVO:ct10:cvrNotAvailable:
dmi.product.name: 2516CTO
dmi.product.version: ThinkPad T410
dmi.sys.vendor: LENOVO

James Hunt (jamesodhunt) wrote :

See CurrentDmesg.txt for lots of errors such as:

[ 1688.474887] Call Trace:
[ 1688.474914] [<f8bae0f3>] kvm_vcpu_ioctl+0x433/0x4d0 [kvm]
[ 1688.474924] [<c108064f>] ? wake_up_state+0xf/0x20
[ 1688.474931] [<c10b9795>] ? wake_futex+0x65/0x90
[ 1688.474937] [<c10babed>] ? futex_wake+0x13d/0x160
[ 1688.474942] [<c10bbe4b>] ? do_futex+0xeb/0x660
[ 1688.474966] [<f8badcc0>] ? vcpu_put+0x30/0x30 [kvm]
[ 1688.474972] [<c1186012>] do_vfs_ioctl+0x2e2/0x4d0
[ 1688.474977] [<c10b7819>] ? tick_program_event+0x29/0x30
[ 1688.474983] [<c1075eb2>] ? hrtimer_interrupt+0x142/0x2b0
[ 1688.474988] [<c10bc44c>] ? SyS_futex+0x8c/0x140
[ 1688.474993] [<c1186260>] SyS_ioctl+0x60/0x80
[ 1688.475002] [<c1646147>] syscall_call+0x7/0xb
[ 1688.475009] [<c1640000>] ? __ticket_unlock_slowpath+0x13/0x31
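
When reading a trace like the one above, it helps to separate the frames the kernel is sure about from the ones marked with "?", which are addresses found on the stack that the unwinder could not confirm as part of the call chain. The following is a minimal sketch of that filtering; the function names `parse_frames` and `reliable_chain` are made up for this illustration and are not part of any existing tool.

```python
import re

# Matches frames such as:
#   [ 1688.474914] [<f8bae0f3>] kvm_vcpu_ioctl+0x433/0x4d0 [kvm]
#   [ 1688.474924] [<c108064f>] ? wake_up_state+0xf/0x20
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(text):
    """Return one dict per stack frame found in a dmesg excerpt."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.search(line)
        if m:
            frames.append({
                "symbol": m.group("symbol"),
                "module": m.group("module"),  # None for built-in code
                "reliable": m.group("unreliable") is None,
            })
    return frames


def reliable_chain(text):
    """Symbols of the frames not marked with '?', innermost first."""
    return [f["symbol"] for f in parse_frames(text) if f["reliable"]]


# A few lines copied from the trace above.
SAMPLE = """\
[ 1688.474914] [<f8bae0f3>] kvm_vcpu_ioctl+0x433/0x4d0 [kvm]
[ 1688.474924] [<c108064f>] ? wake_up_state+0xf/0x20
[ 1688.474972] [<c1186012>] do_vfs_ioctl+0x2e2/0x4d0
[ 1688.474993] [<c1186260>] SyS_ioctl+0x60/0x80
[ 1688.475002] [<c1646147>] syscall_call+0x7/0xb
"""

if __name__ == "__main__":
    print(" <- ".join(reliable_chain(SAMPLE)))
```

With the "?" frames dropped, the remaining chain is the ioctl system call path into the kvm module.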

Brad Figg (brad-figg) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.13 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.13-rc8-trusty/

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
James Hunt (jamesodhunt) wrote :

Alas, same issue using kernel 3.13.0-031300rc8-generic.

Changed in linux (Ubuntu):
status: Incomplete → New
Brad Figg (brad-figg) wrote :

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
James Page (james-page)
Changed in qemu-kvm (Ubuntu):
importance: Undecided → Medium
James Hunt (jamesodhunt) wrote :

Same problem with 3.13.0-4-generic.

James Hunt (jamesodhunt) wrote :

Same problem with 3.13.0-4-generic. Any progress on this? kvm is unusable due to this issue.

James Hunt (jamesodhunt) wrote :

Still a problem on 3.13.0-5-generic.

James Hunt (jamesodhunt) wrote :

dmesg from 3.13.0-5-generic kernel.

Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a kernel version where you were not having this particular problem? This will help determine if the problem you are seeing is the result of the introduction of a regression, and when this regression was introduced. If this is a regression, we can perform a kernel bisect to identify the commit that introduced the problem.
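
For readers unfamiliar with bisection: given an ordered list of builds where the oldest is known good and the newest is known bad, each test halves the remaining range. The sketch below only illustrates that idea; the build names are placeholders, and `is_bad` stands in for "boot this kernel and try to reproduce the lockup". It is not the actual bisect tooling.

```python
def bisect_first_bad(versions, is_bad):
    """Find the first bad entry in `versions` (ordered oldest to newest).

    Assumes versions[0] is known good and versions[-1] is known bad.
    Returns (first_bad_version, number_of_tests_run).
    """
    lo, hi = 0, len(versions) - 1  # lo: last known good, hi: first known bad
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1
        if is_bad(versions[mid]):
            hi = mid
        else:
            lo = mid
    return versions[hi], tests


if __name__ == "__main__":
    # Placeholder builds; in this made-up example build-11 introduced the problem.
    builds = [f"build-{i:02d}" for i in range(16)]
    first_bad, tests = bisect_first_bad(builds, lambda b: b >= "build-11")
    print(first_bad, "found after", tests, "tests")
```

The number of kernels to test grows only logarithmically with the number of candidates, which is why knowing a last good version is so useful.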

tags: added: kernel-da-key
James Hunt (jamesodhunt) wrote :

Serge has just identified that the trigger for this bug is running kvm with '-net user': without this option, no issues.

Changed in qemu-kvm (Ubuntu):
status: New → Confirmed
James Hunt (jamesodhunt) wrote :

Latest series of oopses running on a fully updated trusty system.

James Hunt (jamesodhunt) wrote :

Same problem running with -vnc :7

James Hunt (jamesodhunt) wrote :

To be clear, my host system is a 64-bit capable first-gen i7, but is actually running fully 32-bit.

James Hunt (jamesodhunt) wrote :

Latest trusty desktop image (dated 24 Jan) results in the oops when running with kvm as:

$ kvm --enable-kvm -cdrom ubuntu_trusty-desktop-i386.iso -boot d -m 1024

Serge Hallyn (serge-hallyn) wrote :

On my AMD laptop (x130e thinkpad) with an up-to-date trusty amd64 install, I cannot reproduce this. After installing a 32-bit trusty image, I can in fact reproduce it:

Jan 27 13:07:05 serge-ThinkPad-X130e kernel: [ 725.593411] kvm: zapping shadow pages for mmio generation wraparound
Jan 27 13:07:19 serge-ThinkPad-X130e kernel: [ 740.100465] kvm [3887]: vcpu0 unhandled rdmsr: 0xc0010001
Jan 27 13:07:21 serge-ThinkPad-X130e kernel: [ 742.187275] usb 1-1: USB disconnect, device number 3
Jan 27 13:07:48 serge-ThinkPad-X130e kernel: [ 768.524715] BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-i38:3889]
Jan 27 13:07:48 serge-ThinkPad-X130e kernel: [ 768.524729] Modules linked in: usb_storage parport_pc ppdev joydev rfcomm bnep arc4 uvcvideo rtl8192ce videobuf2_vmalloc videobuf2_memops videobuf2_core videodev rtl_pci kvm_amd rtlwifi kvm rtl8192c_common mac80211 microcode snd_hda_codec_conexant btusb snd_hda_codec_hdmi radeon cfg80211 bluetooth thinkpad_acpi snd_hda_intel snd_hda_codec nvram snd_hwdep snd_seq_midi snd_pcm snd_seq_midi_event snd_rawmidi snd_seq ttm psmouse drm_kms_helper snd_seq_device sp5100_tco snd_page_alloc rtsx_pci_ms serio_raw memstick drm snd_timer i2c_piix4 k10temp wmi i2c_algo_bit snd soundcore video mac_hid lp parport rtsx_pci_sdmmc rtsx_pci atl1c ahci libahci
Jan 27 13:07:48 serge-ThinkPad-X130e kernel: [ 768.524815] CPU: 0 PID: 3889 Comm: qemu-system-i38 Not tainted 3.13.0-5-generic #20-Ubuntu
Jan 27 13:07:48 serge-ThinkPad-X130e kernel: [ 768.524820] Hardware name: LENOVO 062223U/062223U, BIOS 8RET52WW (1.15 ) 11/15/2011
Jan 27 13:07:48 serge-ThinkPad-X130e kernel: [ 768.524825] task: edc78d00 ti: f63e6000 task.ti: f63e6000
Jan 27 13:07:48 serge-ThinkPad-X130e kernel: [ 768.524830] EIP: 0060:[<c10a9cac>] EFLAGS: 00000202 CPU: 0
Jan 27 13:07:48 serge-ThinkPad-X130e kernel: [ 768.524840] EIP is at __srcu_read_lock+0x2c/0x50
Jan 27 13:07:48 serge-ThinkPad-X130e kernel: [ 768.524845] EAX: f14e4024 EBX: 00000000 ECX: f7bdddf0 EDX: 00000000
Jan 27 13:07:48 serge-ThinkPad-X130e kernel: [ 768.524848] ESI: 00000001 EDI: 00000001 EBP: f63e7dec ESP: f63e7de4
Jan 27 13:07:48 serge-ThinkPad-X130e kernel: [ 768.524852] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Jan 27 13:07:48 serge-ThinkPad-X130e kernel: [ 768.524856] CR0: 8005003b CR2: ffffffff CR3: 35e58000 CR4: 000007f0
Jan 27 13:07:48 serge-ThinkPad-X130e kernel: [ 768.524860] Stack:
Jan 27 13:07:48 serge-ThinkPad-X130e kernel: [ 768.524863] f626b550 00000000 f63e7e6c f91da2f6 000000b2 f7bd3400 57c83df4 134d36c2
Jan 27 13:07:48 serge-ThinkPad-X130e kernel: [ 768.524873] f7bd3140 edc78d00 00000046 00000000 edc78d00 f626b550 003e7e24 c1058fb7
Jan 27 13:07:48 serge-ThinkPad-X130e kernel: [ 768.524882] f63e7e30 c164c598 f626b570 f63e7e94 c1644fac f626b570 360df2c8 f14e2008
Jan 27 13:07:48 serge-ThinkPad-X130e kernel: [ 768.524892] Call Trace:
Jan 27 13:07:48 serge-ThinkPad-X130e kernel: [ 768.524933] [<f91da2f6>] vcpu_enter_guest+0x636/0xc90 [kvm]
Jan 27 13:07:48 serge-ThinkPad-X130e kernel: [ 768.524941] [<c1058fb7>] ? irq_exit+0x47/0xa0
Jan 27 13:07:48 serge-ThinkPad-X130e kernel: [ 768.524950] [<c164c598>] ? smp_apic_timer_interrupt+0x38/0x50
Jan...


Stefan Bader (smb) wrote :

I wonder whether this could be remotely related to bug #1265841. I don't have any proof there. I just stumbled over a used_math check (followed by a potential init_fpu) right at the beginning of kvm_arch_vcpu_ioctl_run. There was a longer thread on LKML about that ("[PATCH] Make math_state_restore() save and restore the interrupt flag").
Though in this case nothing is restored, and I don't see interrupts being disabled. Also, from the stack it seems we are usually later in that function. So this is more thinking out loud and putting down notes so that I do not forget.

Changed in linux (Ubuntu):
assignee: nobody → Stefan Bader (smb)
Stefan Bader (smb) wrote :

OK, it's not related to the eager fpu stuff. I just reproduced the same lockup with a test kernel that contained some potential fixes from upstream. The interesting part of the backtraces is that the EIP is changing and is often not in a function that should be able to block.
So something seems to cause the __vcpu_run (called by kvm_arch_vcpu_ioctl_run) to spin in a relatively busy loop without ever exiting. The failed RDMSR most likely is harmless. Checking next what that "zapping shadow pages" might tell us.
Does anybody recall when (which kernel) this worked last on i386?
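
One way to check the "EIP is changing" observation across many lockup reports is to tally the symbol named on each "EIP is at" line. This is a small sketch under that assumption; `eip_histogram` is a name invented here, and only the first sample line is taken from the log in this report, while the other two are placeholders.

```python
import re
from collections import Counter

# Matches e.g. "EIP is at __srcu_read_lock+0x2c/0x50"
EIP_RE = re.compile(r"EIP is at (?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def eip_histogram(log_text):
    """Count how often each symbol shows up as the EIP location."""
    return Counter(m.group("symbol") for m in EIP_RE.finditer(log_text))


SAMPLE = (
    # Real line from the syslog excerpt earlier in this report.
    "[ 768.524840] EIP is at __srcu_read_lock+0x2c/0x50\n"
    # Placeholder lines, invented purely to exercise the counter.
    "[ 796.000000] EIP is at placeholder_symbol_a+0x10/0x40\n"
    "[ 824.000000] EIP is at __srcu_read_lock+0x2c/0x50\n"
)

if __name__ == "__main__":
    for symbol, count in eip_histogram(SAMPLE).most_common():
        print(f"{count:3d}  {symbol}")
```

Many distinct symbols, none of which can sleep, would be consistent with a busy loop rather than a thread blocked in one place.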

James Hunt (jamesodhunt) wrote :

I think the switch to 3.13 may have borked kvm for me. I'll try to confirm that tomorrow...

Stefan Bader (smb) wrote :

This appears to be narrowing down to a problem that, for some reason, only affects i386. The kvm/qemu process gets into a state where its thread_info says "you have to reschedule" but the per-cpu counterpart prevents that. The process then spins: it never runs the guest vcpu because the thread side says a reschedule is pending, and cond_resched() does nothing because the per-cpu state tells it otherwise.

I now have a test kernel on which kvm runs again, but it relies on a rather vile work-around. I have sent an email upstream in the hope that somebody there has a better solution.
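
The disagreement described above can be illustrated with a toy model. This is not kernel code: `ToyCpu`, `fold` and `run_vcpu` are hypothetical names, and the two booleans merely stand in for the thread_info flag and its per-cpu counterpart. The `fold` step is one plausible reading of "force preempt folding" (the wording used in the changelog entry for this bug); whether the shipped patch works exactly this way is not stated in this report.

```python
from dataclasses import dataclass


@dataclass
class ToyCpu:
    # Toy stand-ins for the two views that disagree in this bug.
    thread_wants_resched: bool = False   # thread_info: "you have to reschedule"
    percpu_allows_resched: bool = False  # per-cpu counterpart
    guest_entries: int = 0
    reschedules: int = 0


def cond_resched(cpu):
    """Reschedule only if the per-cpu view agrees; otherwise do nothing."""
    if not cpu.percpu_allows_resched:
        return False
    cpu.thread_wants_resched = False
    cpu.percpu_allows_resched = False
    cpu.reschedules += 1
    return True


def fold(cpu):
    """Hypothetical work-around: copy the thread-side request to the per-cpu side."""
    if cpu.thread_wants_resched:
        cpu.percpu_allows_resched = True


def run_vcpu(cpu, max_iterations, force_fold):
    """Return the iteration on which the guest was entered, or None if never."""
    for iteration in range(1, max_iterations + 1):
        if cpu.thread_wants_resched:
            # Guest entry is refused while a reschedule is pending.
            if force_fold:
                fold(cpu)
            cond_resched(cpu)
            continue
        cpu.guest_entries += 1
        return iteration
    return None


if __name__ == "__main__":
    stuck = ToyCpu(thread_wants_resched=True)
    print("without fold:", run_vcpu(stuck, 1000, force_fold=False))
    fixed = ToyCpu(thread_wants_resched=True)
    print("with fold:   ", run_vcpu(fixed, 1000, force_fold=True))
```

Without the fold, the loop burns through every iteration without entering the guest or rescheduling, which matches the soft lockup symptom; with it, one reschedule happens and the guest is entered on the next pass.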

Gareth Woolridge (moon127) wrote :

I've had a number of lockups in the last 2 days since upgrading to Trusty and trying to run a VM under kvm which look to match this issue.

The system becomes unresponsive and hangs with what looks like some kind of I/O lockup: I cannot even run ls in a terminal, and eventually the display freezes. After that point, syslog shows kernel spew related to a soft lockup.

The issue appeared to coincide with a period of high disk I/O within the guest on both occasions. I'll look to reproduce on a less active guest outside of working day.

This occurred on a 64-bit system:

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu Trusty Tahr (development branch)
Release: 14.04
Codename: trusty

uname -a
Linux moon127-pcsubu 3.13.0-12-generic #32-Ubuntu SMP Fri Feb 21 17:45:10 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Serge Hallyn (serge-hallyn) wrote :

@Gareth,

I suspect yours is a different bug, so it would be best if you file a new bug (so that people trying to debug *this* bug don't mix up symptoms in their minds, which can wrongly make them rule out a proper cause for the other bug) against both qemu and linux.

In that bug, please let us know if a simple 'apt-get dist-upgrade' or a kernel compile suffices to reproduce the bug. Also, the full kvm command line on both the host and in the guest, the distro+release in the guest, and the disk type (qcow, raw, qed) at both levels.

Thank you.

James Hunt (jamesodhunt) wrote :

Still a problem on 3.13.0-17-generic.

Stefan Bader (smb)
Changed in qemu-kvm (Ubuntu Trusty):
status: Confirmed → Invalid
Andy Whitcroft (apw)
Changed in linux (Ubuntu Trusty):
status: Confirmed → In Progress
Stefan Bader (smb)
Changed in linux (Ubuntu Trusty):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.13.0-22.44

---------------
linux (3.13.0-22.44) trusty; urgency=low

  [ Tim Gardner ]

  * Release Tracking Bug
    - LP: #1301562

  [ dann frazier ]

  * [Config] enable linux-tools on arm64
    https://lists.ubuntu.com/archives/kernel-team/2014-April/041332.html

  [ Greg Kurz ]

  * SAUCE: powerpc/le: Big endian arguments for ppc_rtas()
    - LP: #1289518

  [ Mahesh Salgaonkar ]

  * SAUCE: powerpc/book3s: Fix CFAR clobbering issue in machine check
    handler.
    - LP: #1301424
  * SAUCE: powerpc/book3s: Recover from MC in sapphire on SCOM read via
    MMIO.
    - LP: #1301424
  * SAUCE: powerpc/book3s: Fix mc_recoverable_range buffer overrun issue.
    - LP: #1301424

  [ Paolo Pisati ]

  * [Config] armhf: USB_STORAGE=y
    https://lists.ubuntu.com/archives/kernel-team/2014-April/041349.html

  [ Stefan Bader ]

  * SAUCE: kvm: Force preempt folding in kvm on i386
    - LP: #1268906

  [ Tim Gardner ]

  * SAUCE: Drop lttng in favor of lttng-modules
    The kernel version was down rev on an rc release.

  [ Tomas Winkler ]

  * SAUCE: (no-up) mei: me: do not load the driver if the FW doesn't
    support MEI interface
    - LP: #1301118

  [ Upstream Kernel Changes ]

  * drm/i915: Deprecated UMS support
    - LP: #1284816
  * powerpc/book3s: Split the common exception prolog logic into two
    section.
    - LP: #1301424
  * powerpc/book3s: Introduce exclusive emergency stack for machine check
    exception.
    - LP: #1301424
  * powerpc/book3s: handle machine check in Linux host.
    - LP: #1301424
  * powerpc/book3s: Return from interrupt if coming from evil context.
    - LP: #1301424
  * powerpc/book3s: Introduce a early machine check hook in cpu_spec.
    - LP: #1301424
  * powerpc/book3s: Add flush_tlb operation in cpu_spec.
    - LP: #1301424
  * powerpc/book3s: Flush SLB/TLBs if we get SLB/TLB machine check errors
    on power7.
    - LP: #1301424
  * powerpc/book3s: Flush SLB/TLBs if we get SLB/TLB machine check errors
    on power8.
    - LP: #1301424
  * powerpc/book3s: Decode and save machine check event.
    - LP: #1301424
  * powerpc/book3s: Queue up and process delayed MCE events.
    - LP: #1301424
  * powerpc/powernv: Remove machine check handling in OPAL.
    - LP: #1301424
  * powerpc/powernv: Machine check exception handling.
    - LP: #1301424
  * powerpc: Fix "attempt to move .org backwards" error
    - LP: #1301424
  * powerpc: Fix endian issues in power7/8 machine check handler
    - LP: #1301424
  * Move precessing of MCE queued event out from syscall exit path.
    - LP: #1301424
 -- Andy Whitcroft <email address hidden> Wed, 02 Apr 2014 15:58:48 +0100

Changed in linux (Ubuntu Trusty):
status: Fix Committed → Fix Released