[EC2:cg1.4xlarge] CPU#0 stuck for 23s! [migration/0:6] __do_softirq+0x60/0x210

Bug #944923 reported by Stefan Bader
This bug affects 2 people
Affects: linux (Ubuntu)
Status: Triaged
Importance: Low
Assigned to: Stefan Bader

Bug Description

Scott noticed some warnings when starting a cg1.4xlarge (HVM) Amazon instance. Right after activating all the VCPUs, the following error is repeated a few times (3-4); then boot continues, and I could not see any problems resulting from it. I repeated boot and reboot tests with spot instances and normal ones. The same delay happens on every boot.

[ 28.165770] BUG: soft lockup - CPU#0 stuck for 23s! [migration/0:6]
[ 28.169759] Modules linked in:
[ 28.169759] CPU 0
[ 28.169759] Modules linked in:
[ 28.169759]
[ 28.169759] Pid: 6, comm: migration/0 Not tainted 3.2.0-17-virtual #27-Ubuntu Xen HVM domU
[ 28.169759] RIP: 0010:[<ffffffff8106d3e0>] [<ffffffff8106d3e0>] __do_softirq+0x60/0x210
[ 28.169759] RSP: 0018:ffff8805dd403ee0 EFLAGS: 00000206
[ 28.169759] RAX: 0000000000000000 RBX: ffffffff8101ad19 RCX: 0000000000000001
[ 28.169759] RDX: 0000000000000002 RSI: ffffffff81c0d020 RDI: 0000000000000001
[ 28.169759] RBP: ffff8805dd403f40 R08: 0000000000000000 R09: 0000000000000020
[ 28.169759] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8805dd403e58
[ 28.169759] R13: ffffffff8165ab1e R14: ffff8805dd403f40 R15: ffff8805b7933fd8
[ 28.169759] FS: 0000000000000000(0000) GS:ffff8805dd400000(0000) knlGS:0000000000000000
[ 28.169759] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 28.169759] CR2: 0000000000000000 CR3: 0000000001c05000 CR4: 00000000000006f0
[ 28.169759] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 28.169759] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 28.169759] Process migration/0 (pid: 6, threadinfo ffff8805b7932000, task ffff8805b7938000)
[ 28.169759] Stack:
[ 28.169759] ffffffff8105fc23 ffff8805b7933fd8 000000000000000a ffff8805b7933fd8
[ 28.169759] 0000000000000000 ffffffff810bed01 ffff8805b7905df8 0000000000000046
[ 28.169759] ffff8805b7933fd8 0000000000000004 ffffffff810bed01 ffff8805b7905df8
[ 28.169759] Call Trace:
[ 28.169759] <IRQ>
[ 28.169759] [<ffffffff8105fc23>] ? scheduler_tick+0xe3/0x300
[ 28.169759] [<ffffffff810bed01>] ? queue_stop_cpus_work+0xa1/0xd0
[ 28.169759] [<ffffffff810bed01>] ? queue_stop_cpus_work+0xa1/0xd0
[ 28.169759] [<ffffffff8165c2ac>] call_softirq+0x1c/0x30
[ 28.169759] [<ffffffff81015215>] do_softirq+0x65/0xa0
[ 28.169759] [<ffffffff8106d80e>] irq_exit+0x8e/0xb0
[ 28.169759] [<ffffffff8165cc4e>] smp_apic_timer_interrupt+0x6e/0x99
[ 28.169759] [<ffffffff8165ab1e>] apic_timer_interrupt+0x6e/0x80
[ 28.169759] <EOI>
[ 28.169759] [<ffffffff810bede5>] ? stop_machine_cpu_stop+0xb5/0xf0
[ 28.169759] [<ffffffff810bed30>] ? queue_stop_cpus_work+0xd0/0xd0
[ 28.169759] [<ffffffff810bef03>] cpu_stopper_thread+0xe3/0x1b0
[ 28.169759] [<ffffffff810546ca>] ? finish_task_switch+0x4a/0xf0
[ 28.169759] [<ffffffff8164f7dc>] ? __schedule+0x3cc/0x6f0
[ 28.169759] [<ffffffff8104b1e8>] ? __wake_up_common+0x58/0x90
[ 28.169759] [<ffffffff810bee20>] ? stop_machine_cpu_stop+0xf0/0xf0
[ 28.169759] [<ffffffff8108881c>] kthread+0x8c/0xa0
[ 28.169759] [<ffffffff8165c1b4>] kernel_thread_helper+0x4/0x10
[ 28.169759] [<ffffffff81088790>] ? flush_kthread_worker+0xa0/0xa0
[ 28.169759] [<ffffffff8165c1b0>] ? gs_change+0x13/0x13
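
For reference, the "stuck for 23s" number comes from the kernel's softlockup watchdog: a high-priority per-CPU watchdog thread periodically refreshes a timestamp, and a timer callback complains when that timestamp has not moved for longer than the threshold (2 * watchdog_thresh, i.e. 20s by default). The task named in the message (migration/0 here) is simply whatever happened to be running on the CPU when the check fired. A minimal sketch of that check, assuming the usual kernel/watchdog.c structure (names and details simplified, not the literal 3.2 source):

        /* Simplified sketch of the softlockup check, not the literal kernel code. */
        static void check_softlockup(int cpu)
        {
                unsigned long touch_ts = per_cpu(watchdog_touch_ts, cpu); /* refreshed by watchdog/N */
                unsigned long now = get_timestamp(cpu);                   /* seconds of uptime */

                /* If the high-priority watchdog thread could not run for more than
                   roughly 2 * watchdog_thresh seconds, nothing else on this CPU was
                   able to schedule either: report a soft lockup. */
                if (time_after(now, touch_ts + 2 * watchdog_thresh))
                        printk(KERN_EMERG "BUG: soft lockup - CPU#%d stuck for %lus!\n",
                               cpu, now - touch_ts);
        }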

Tags: precise
Stefan Bader (smb) wrote :

Repeating the boot after replacing the Precise kernel with an Oneiric (3.0.0) one, there actually is a rather long delay at about the same point in boot. The difference is that it seems shorter (60s vs. 112s) and does not cause the softlockup warnings. But something is delaying that boot as well.

Stefan Bader (smb) wrote :

Looking at the disassembly of the __do_softirq function (because every softlockup warning does show exactly the same instruction pointer address):

0xffffffff8106d3cd <__do_softirq+77>: nopl (%rax)
0xffffffff8106d3d0 <__do_softirq+80>: movl $0x0,%gs:0x13540
0xffffffff8106d3dc <__do_softirq+92>: sti
0xffffffff8106d3dd <__do_softirq+93>: xchg %ax,%ax
0xffffffff8106d3e0 <__do_softirq+96>: xchg %ax,%ax
0xffffffff8106d3e3 <__do_softirq+99>: mov $0xffffffff81c04080,%rbx
0xffffffff8106d3ea <__do_softirq+106>: nopw 0x0(%rax,%rax,1)

this should match the following in the kernel C code:

restart:
        /* Reset the pending bitmask before enabling irqs */
        set_softirq_pending(0);

        local_irq_enable(); <-- sti + 2x xchg as nops

        h = softirq_vec;

So VCPU#0 seems to be stuck between the two xchg instructions, which are used as padding nops (as these 6 bytes get replaced after boot). This is not a place where I would expect a CPU to get stuck. And even if it is not really stuck, it is still odd that it is always exactly the same place in every warning.
At the moment only two explanations sound reasonable: either the hypervisor for some odd reason stops servicing the VCPU for some time, or there is some progress, but every time interrupts are enabled on VCPU#0 it gets held up for long enough that this is the only place we see in the warnings.
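
For context, the call trace above shows migration/0 taking the timer interrupt while inside stop_machine_cpu_stop(), which on 3.2-era kernels is essentially a cpu_relax() polling loop in which every CPU waits for all the others to step through the stop_machine state machine. A rough sketch of its shape (heavily condensed, not the exact Ubuntu source):

        /* Heavily simplified sketch of stop_machine_cpu_stop() (kernel/stop_machine.c). */
        static int stop_machine_cpu_stop(void *data)
        {
                struct stop_machine_data *smdata = data;
                enum stopmachine_state curstate = STOPMACHINE_NONE;
                int err = 0;

                do {
                        cpu_relax();            /* spin; irqs stay enabled until DISABLE_IRQ */
                        if (smdata->state != curstate) {
                                curstate = smdata->state;
                                switch (curstate) {
                                case STOPMACHINE_DISABLE_IRQ:
                                        local_irq_disable();
                                        break;
                                case STOPMACHINE_RUN:
                                        err = smdata->fn(smdata->data);
                                        break;
                                default:
                                        break;
                                }
                                ack_state(smdata); /* advances only when all CPUs reach this state */
                        }
                } while (curstate != STOPMACHINE_EXIT);

                return err;
        }

If one VCPU stops making progress here, the others keep spinning in this loop waiting for it, which would be consistent with the first explanation above.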

My problem is that I cannot reproduce this on my 8-core test system, neither with the CentOS/Xen 3.4.3 nor with the Precise/Xen 4.1.2 combination. Both boot just fine (I even tried overcommitting to 16 VCPUs).

Is there anything special about the HW these instances are running on, like the real number of CPUs/cores/HT? Is there anything that the dom0 logs could tell us?

Stefan Bader (smb) wrote :

Ok, going even further back to 2.6.38, boot is delayed by roughly the same amount of time I saw with 3.2, but still without the softlockup warnings. Maybe the 60s in 3.0 was just luck?

tags: added: precise