xen virtual Machines and Dom0 crashes BUG: soft lockup - CPU#0 stuck for 11s! [savelog:]; EIP is at _spin_lock+0x7/0x10

Bug #259487 reported by Maiquel
This bug affects 10 people
Affects              Status        Importance  Assigned to
linux (Ubuntu)       Incomplete    Undecided   Unassigned
linux-meta (Debian)  Fix Released  Unknown

Bug Description

Ubuntu 8.04
uname -r
2.6.24-19-xen

syslog:
[...]
kernel: BUG: soft lockup - CPU#0 stuck for 11s! [savelog:]
kernel: BUG: soft lockup - CPU#1 stuck for 11s! [postgres:]
kernel: BUG: soft lockup - CPU#2 stuck for 11s! [mysql:]
kernel: BUG: soft lockup - CPU#3 stuck for 11s! [syslog:]
Pid: 11194, comm: savelog Tainted: G B D (2.6.24-19-xen #2)
EIP: 0061:[dm_mod:_spin_lock+0x7/0x10] EFLAGS: 00000282 CPU: 0
EIP is at _spin_lock+0x7/0x10
EAX: c1daf2ec EBX: 00000000 ECX: 17097000 EDX: 00000000

# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 Quad CPU @ 2.40GHz
stepping : 7
cpu MHz : 2400.029
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips : 4803.37
clflush size : 64

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 Quad CPU @ 2.40GHz
stepping : 7
cpu MHz : 2400.029
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 4
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips : 4800.11
clflush size : 64

processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 Quad CPU @ 2.40GHz
stepping : 7
cpu MHz : 2400.029
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 2
cpu cores : 4
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips : 4800.11
clflush size : 64

processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 Quad CPU @ 2.40GHz
stepping : 7
cpu MHz : 2400.029
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 3
cpu cores : 4
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips : 4800.12
clflush size : 64

# lspci
00:00.0 Host bridge: nVidia Corporation C55 Host Bridge (rev a2)
00:00.1 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:00.2 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:00.3 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:00.4 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:00.5 RAM memory: nVidia Corporation C55 Memory Controller (rev a2)
00:00.6 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:00.7 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:01.0 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:01.1 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:01.2 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:01.3 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:01.4 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:01.5 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:01.6 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:02.0 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:02.1 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:02.2 RAM memory: nVidia Corporation C55 Memory Controller (rev a1)
00:03.0 PCI bridge: nVidia Corporation C55 PCI Express bridge (rev a1)
00:07.0 PCI bridge: nVidia Corporation C55 PCI Express bridge (rev a1)
00:09.0 RAM memory: nVidia Corporation MCP51 Host Bridge (rev a2)
00:0a.0 ISA bridge: nVidia Corporation MCP51 LPC Bridge (rev a3)
00:0a.1 SMBus: nVidia Corporation MCP51 SMBus (rev a3)
00:0a.2 RAM memory: nVidia Corporation MCP51 Memory Controller 0 (rev a3)
00:0b.0 USB Controller: nVidia Corporation MCP51 USB Controller (rev a3)
00:0b.1 USB Controller: nVidia Corporation MCP51 USB Controller (rev a3)
00:0d.0 IDE interface: nVidia Corporation MCP51 IDE (rev a1)
00:0e.0 IDE interface: nVidia Corporation MCP51 Serial ATA Controller (rev a1)
00:0f.0 IDE interface: nVidia Corporation MCP51 Serial ATA Controller (rev a1)
00:10.0 PCI bridge: nVidia Corporation MCP51 PCI Bridge (rev a2)
00:10.1 Audio device: nVidia Corporation MCP51 High Definition Audio (rev a2)
00:14.0 Bridge: nVidia Corporation MCP51 Ethernet Controller (rev a3)
01:00.0 VGA compatible controller: ATI Technologies Inc RV380 0x3e50 [Radeon X600]
01:00.1 Display controller: ATI Technologies Inc RV380 [Radeon X600] (Secondary)
03:08.0 FireWire (IEEE 1394): VIA Technologies, Inc. IEEE 1394 Host Controller (rev c0)

The system freezes under high I/O load: the virtual machines lock up first, and Dom0 follows.

Revision history for this message
Bastian Mäuser (mephisto-mephis) wrote :

I have the same problem with Xen/Hardy/i386:

dom0: Linux dom0-1 2.6.24-19-xen #1 SMP Thu Aug 21 03:09:02 UTC 2008 i686 GNU/Linux
domU: Linux nmail.XXXX 2.6.24-19-xen #1 SMP Thu Aug 21 03:09:02 UTC 2008 i686 GNU/Linux

It crashes every night.

kern.log:Aug 28 11:33:56 nmail kernel: [55657.356419] BUG: soft lockup - CPU#1 stuck for 11s! [courierpop3logi:29319]
kern.log:Aug 28 11:34:08 nmail kernel: [55668.995642] BUG: soft lockup - CPU#1 stuck for 11s! [courierpop3logi:29319]
kern.log:Aug 28 11:34:19 nmail kernel: [55680.677142] BUG: soft lockup - CPU#1 stuck for 11s! [courierpop3logi:29319]

Meanwhile I have reinstalled Xen several times, reinstalled the domUs, and used three different HP servers (one brand new), so it must be a problem with Hardy.

Obviously the Xen kernel shipped with Hardy is unusable for production.

I have plenty of other Xen systems running reliably, but none on Hardy.

Revision history for this message
Bastian Mäuser (mephisto-mephis) wrote :

Additional Crash Info (domU):

Aug 28 12:14:08 nmail kernel: [ 1109.970322] smtpd invoked oom-killer: gfp_mask=0x1201d2, order=0, oomkilladj=0
Aug 28 12:14:09 nmail kernel: [ 1109.970328] Pid: 9500, comm: smtpd Not tainted 2.6.24-19-xen #1
Aug 28 12:14:09 nmail kernel: [ 1109.970335] [<c01606ca>] oom_kill_process+0x10a/0x120
Aug 28 12:14:09 nmail kernel: [ 1109.970344] [<c0160ac7>] out_of_memory+0x167/0x1a0
Aug 28 12:14:09 nmail kernel: [ 1109.970348] [<c016313c>] __alloc_pages+0x35c/0x390
Aug 28 12:14:09 nmail kernel: [ 1109.970352] [<c016528d>] __do_page_cache_readahead+0x11d/0x250
Aug 28 12:14:09 nmail kernel: [ 1109.970355] [<c015d370>] sync_page+0x0/0x40
Aug 28 12:14:09 nmail kernel: [ 1109.970359] [<c01657cc>] do_page_cache_readahead+0x4c/0x70
Aug 28 12:14:09 nmail kernel: [ 1109.970362] [<c015fbc4>] filemap_fault+0x2f4/0x420
Aug 28 12:14:09 nmail kernel: [ 1109.970366] [<c016b9cf>] __do_fault+0x6f/0x6b0
Aug 28 12:14:09 nmail kernel: [ 1109.970372] [<c0170c69>] handle_mm_fault+0x249/0x1350
Aug 28 12:14:09 nmail kernel: [ 1109.970377] [<c0162456>] __pagevec_free+0x26/0x30
Aug 28 12:14:09 nmail kernel: [ 1109.970381] [<c0329346>] do_page_fault+0x366/0xe90
Aug 28 12:14:09 nmail kernel: [ 1109.970387] [<c01165fb>] check_pgt_cache+0x1b/0x20
Aug 28 12:14:09 nmail kernel: [ 1109.970391] [<c0173667>] unmap_region+0x107/0x120
Aug 28 12:14:09 nmail kernel: [ 1109.970395] [<c0174250>] do_munmap+0x180/0x1f0
Aug 28 12:14:09 nmail kernel: [ 1109.970398] [<c0328fe0>] do_page_fault+0x0/0xe90
Aug 28 12:14:09 nmail kernel: [ 1109.970401] [<c0327c85>] error_code+0x35/0x40
Aug 28 12:14:09 nmail kernel: [ 1109.970405] [<c0320000>] vcc_getsockopt+0xc0/0x170
Aug 28 12:14:09 nmail kernel: [ 1109.970409] =======================
Aug 28 12:14:09 nmail kernel: [ 1109.970410] Mem-info:
Aug 28 12:14:09 nmail kernel: [ 1109.970412] DMA per-cpu:
Aug 28 12:14:09 nmail kernel: [ 1109.970414] CPU 0: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
Aug 28 12:14:09 nmail kernel: [ 1109.970416] CPU 1: Hot: hi: 0, btch: 1 usd: 0 Cold: hi: 0, btch: 1 usd: 0
Aug 28 12:14:09 nmail kernel: [ 1109.970418] Normal per-cpu:
Aug 28 12:14:09 nmail kernel: [ 1109.970420] CPU 0: Hot: hi: 186, btch: 31 usd: 96 Cold: hi: 62, btch: 15 usd: 57
Aug 28 12:14:09 nmail kernel: [ 1109.970423] CPU 1: Hot: hi: 186, btch: 31 usd: 130 Cold: hi: 62, btch: 15 usd: 50
Aug 28 12:14:09 nmail kernel: [ 1109.970424] HighMem per-cpu:
Aug 28 12:14:09 nmail kernel: [ 1109.970426] CPU 0: Hot: hi: 90, btch: 15 usd: 16 Cold: hi: 30, btch: 7 usd: 23
Aug 28 12:14:09 nmail kernel: [ 1109.970428] CPU 1: Hot: hi: 90, btch: 15 usd: 82 Cold: hi: 30, btch: 7 usd: 9
Aug 28 12:14:09 nmail kernel: [ 1109.970432] Active:141460 inactive:105254 dirty:0 writeback:2 unstable:0
Aug 28 12:14:09 nmail kernel: [ 1109.970433] free:4339 slab:2328 mapped:10 pagetables:1966 bounce:0
Aug 28 12:14:09 nmail kernel: [ 1109.970436] DMA free:4088kB min:72kB low:88kB high:108kB active:4056kB inactive:3180kB present:16256kB pages_scanned:12705 all_unreclaimable? yes
Aug 28 12:14:09 nmail kernel: [ 1109.970438] lowmem_reserve[]: 0 ...

Revision history for this message
forall (forall-stalowka) wrote :

Hi

I looked at http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=478765, which says this bug is fixed in kernel 2.6.25. But when will the Xen-capable kernel be fixed? Only 2.6.24-X with Xen support is currently available in Ubuntu.

Changed in linux-meta:
status: Unknown → Fix Released
Revision history for this message
forall (forall-stalowka) wrote :

Hi

To everybody having problems with the 2.6.24 Xen kernel: I suggest installing the kernel from the Debian Lenny repository:
http://packages.debian.org/lenny/xen-linux-system-2.6.26-1-xen-686

Today I installed this kernel from the Debian repository and so far I have had no problems; the system did not crash when I upgraded the installed packages. I will see, after a longer period of use and load, whether it stays up.
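
A minimal sketch of the steps involved (the mirror hostname is an assumption; substitute your usual Debian mirror):

echo 'deb http://ftp.debian.org/debian lenny main' >> /etc/apt/sources.list
apt-get update
apt-get install xen-linux-system-2.6.26-1-xen-686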

Revision history for this message
AlexKent (lbf-dragonrising) wrote :

Hi Forall,

This is one annoying bug!

Just wondering how you installed the package from lenny into etch?

I added the repository to my etch sources.list and that generated 'merge' errors.

I've downloaded the .deb file itself from the URL you gave, but when I went to install it (with dpkg -i) it produced a lot of dependency errors. Am I meant to just keep downloading .deb files and working through dependencies until they eventually resolve, or is there a better way?

Ta,

Alex
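
A sketch of the usual answer here (assuming the Lenny repository line is present in sources.list, so apt can see the dependencies):

dpkg -i xen-linux-system-2.6.26-1-xen-686_*.deb   # reports the missing dependencies
apt-get -f install                                # then let apt fetch and install them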

Revision history for this message
whs (wolfram-heinen) wrote :

Hi,

I have been seeing this bug since the 2.6.24-16-xen kernel, mostly in some of my running domUs. Today one domU with the 2.6.24-22-xen kernel stopped running while executing 'apt-get update'. This bug is not CPU specific: my systems run on ML110 Xeon DualCore, ML110 P4, ML115 Opteron and DL160 QuadCore systems.
I noticed that increasing the assigned memory size reduces the chance of running into this bug.
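
For reference, the assigned memory is the memory= line in the guest's Xen config file; an illustrative fragment (file name and value are examples only):

# /etc/xen/mydomu.cfg (illustrative)
memory = 1024   # MiB given to the domU; more headroom reportedly lowers the chance of hitting this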

Revision history for this message
Lily (starlily) wrote :

I have a Dell 6650 running whatever the latest xen server image is (and I run update/upgrade/dist-upgrade frequently). One DomU locks up pretty regularly with "BUG: soft lockup - CPU#1 stuck for 11s!", usually during large file transfers. DomU Kernel version is 2.6.24-19. The bug *requires* destroying the DomU and restarting it.
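
Concretely, the recovery cycle with the classic xm toolstack looks like this (the guest name and config path are placeholders):

xm destroy mydomu                # hard-stop the unresponsive guest
xm create /etc/xen/mydomu.cfg    # boot it again from its config file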

It's pretty clear, after reading many bug reports about this, that the problem is in the kernel somewhere (and the kernel team has responded by changing their policy about bug reporting). It is clearly NOT hardware or application specific, as this is reported on many platforms and appears to have no consistent trigger.

Potentially this is related to SMP or PAE, although listing those as the problem area feels like an easy scapegoat, even if it may be true.

I'd really like someone who KNOWS what causes this bug to provide a definitive answer somewhere visible to the public, and if possible a workaround or a target date for the release of a fix.

Thanks!
Lily

Revision history for this message
John Leach (johnleach) wrote :

I managed to reproduce this quite reliably so did some trials to find out how to improve things.

This is a 64-bit dom0 on Xen 3.3.0 (on CentOS) on Dell 2940s. The domU is a 32-bit Hardy box with 1 GB of RAM. With all the available Hardy Xen kernels, this soft lockup kept happening. I also tried "clocksource=jiffies". Then I tried the latest Intrepid kernels and the problem was solved.

As I understand it, the Intrepid kernel has the proper kernel.org upstream Xen support (rather than the forward-ported 2.6.18 patch set that Hardy uses, IIRC). So whilst this solves the problem (for me), it's a pretty big change and isn't something I'd expect to see "backported" to Hardy.

The new upstream Xen support changes the way block devices and the console are handled, so you can't switch without some tweaks to your Xen configs (and guest OS config), but other than that it seemed to work fine with Hardy.
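
For example, the guest config tweaks amount to switching the disk entries to the xvd* naming and pointing the console at hvc0; an illustrative fragment with made-up volume and path names:

disk  = ['phy:vg0/guest-root,xvda,w']   # xvd* naming instead of the old sda/hda
root  = '/dev/xvda ro'
extra = 'console=hvc0'                  # pvops kernels expose the PV console as hvc0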

Incidentally, I replaced this domU with a 64-bit Hardy install, with the standard 64-bit Hardy kernel, and that also solved the soft lockups.

Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

[This is an automated message. Apologies if it has reached you inappropriately.]

This bug was reported against the linux-meta package when it likely should have been reported against the linux package instead. We are automatically transitioning this to the linux kernel package so that the appropriate teams are notified and made aware of this issue. Thanks.

affects: linux-meta (Ubuntu) → linux (Ubuntu)
tags: added: xen
Revision history for this message
Vikram Dhillon (dhillon-v10) wrote :

Unfortunately it seems this bug is still an issue. Can you confirm it still exists with the most recent Lucid Lynx 10.04 release (http://cdimage.ubuntu.com/releases/lucid/alpha-2/)? If the issue remains in Lucid, please test the latest 2.6.32 upstream kernel build (https://wiki.ubuntu.com/KernelMainlineBuilds). Let us know your results. Thanks.
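
For reference, testing a mainline build boils down to downloading the .deb files for your architecture from the build directory and installing them with dpkg; a sketch (the filename is illustrative, take the real names from the page):

wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.32/linux-image-2.6.32-020632-generic_2.6.32-020632_i386.deb
sudo dpkg -i linux-image-2.6.32-020632-generic_2.6.32-020632_i386.deb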

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Emmanuel Kasper (emmanuel-kasper) wrote :

Got hit by this bug as well on an 8.04 server:

root@zimbra:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 8.04.4 LTS
Release: 8.04
Codename: hardy

root@zimbra:~# uname -a
Linux zimbra 2.6.24-29-xen #1 SMP Tue Oct 11 15:58:37 UTC 2011 i686 GNU/Linux

As a workaround I disabled SMP in the domU config as suggested here: https://bugs.launchpad.net/ubuntu/hardy/+source/linux/+bug/240071/comments/14

Now it seems to run stably.
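
For the record, that workaround is a one-line change in the domU config file, e.g.:

vcpus = 1   # single virtual CPU; per the linked comment this avoids the SMP-related lockups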

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
penalvch (penalvch)
tags: added: hardy needs-upstream-testing
removed: xen
tags: added: xen
Revision history for this message
penalvch (penalvch) wrote :

Maiquel, thank you for reporting this bug and helping make Ubuntu better. Please execute the following command, as it will automatically gather debugging information, in a terminal:
apport-collect -p linux 259487

As well, could you please capture the oops following https://wiki.ubuntu.com/KernelTeam/KernelTeamBugPolicies#Capturing_OOPs ? In addition, according to this report, you are not using the most recent version of this package for your Ubuntu release. Please upgrade to the most recent version as per https://launchpad.net/ubuntu/hardy/+source/linux and let us know if you are still having this issue.
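
If the machine locks up before anything reaches the disk, netconsole is one common way to capture the oops on a second machine; a sketch (all IP addresses and the MAC are placeholders):

modprobe netconsole netconsole=6666@192.168.0.10/eth0,6666@192.168.0.2/00:16:3e:aa:bb:cc
# then, on the receiving host (192.168.0.2):
nc -u -l 6666    # some netcat variants want: nc -u -l -p 6666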

Thanks!

summary: - xen virtual Machines and Dom0 crashes with "BUG: soft lockup -
- CPU#0,CPU#1,CPU#2,CPU#3 stuck for 11s!"
+ xen virtual Machines and Dom0 crashes BUG: soft lockup - CPU#0 stuck for
+ 11s! [savelog:]; EIP is at _spin_lock+0x7/0x10
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
tags: added: kernel-bug