[EC2:cg1.4xlarge] CPU#0 stuck for 23s! [migration/0:6] __do_softirq+0x60/0x210

Bug #944923 reported by Stefan Bader
This bug affects 2 people
Affects: linux (Ubuntu)
Status: Triaged
Importance: Low
Assigned to: Stefan Bader

Bug Description

Scott noticed some warnings when starting a cg1.4xlarge (HVM) Amazon instance. Right after activating all the VCPUs, the following error is repeated a few times (3-4); then boot continues, and I could not see any problems resulting from it. I repeated boot and reboot tests with spot instances and normal ones. The same delay happens on every boot.

[ 28.165770] BUG: soft lockup - CPU#0 stuck for 23s! [migration/0:6]
[ 28.169759] Modules linked in:
[ 28.169759] CPU 0
[ 28.169759] Modules linked in:
[ 28.169759]
[ 28.169759] Pid: 6, comm: migration/0 Not tainted 3.2.0-17-virtual #27-Ubuntu Xen HVM domU
[ 28.169759] RIP: 0010:[<ffffffff8106d3e0>] [<ffffffff8106d3e0>] __do_softirq+0x60/0x210
[ 28.169759] RSP: 0018:ffff8805dd403ee0 EFLAGS: 00000206
[ 28.169759] RAX: 0000000000000000 RBX: ffffffff8101ad19 RCX: 0000000000000001
[ 28.169759] RDX: 0000000000000002 RSI: ffffffff81c0d020 RDI: 0000000000000001
[ 28.169759] RBP: ffff8805dd403f40 R08: 0000000000000000 R09: 0000000000000020
[ 28.169759] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8805dd403e58
[ 28.169759] R13: ffffffff8165ab1e R14: ffff8805dd403f40 R15: ffff8805b7933fd8
[ 28.169759] FS: 0000000000000000(0000) GS:ffff8805dd400000(0000) knlGS:0000000000000000
[ 28.169759] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 28.169759] CR2: 0000000000000000 CR3: 0000000001c05000 CR4: 00000000000006f0
[ 28.169759] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 28.169759] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 28.169759] Process migration/0 (pid: 6, threadinfo ffff8805b7932000, task ffff8805b7938000)
[ 28.169759] Stack:
[ 28.169759] ffffffff8105fc23 ffff8805b7933fd8 000000000000000a ffff8805b7933fd8
[ 28.169759] 0000000000000000 ffffffff810bed01 ffff8805b7905df8 0000000000000046
[ 28.169759] ffff8805b7933fd8 0000000000000004 ffffffff810bed01 ffff8805b7905df8
[ 28.169759] Call Trace:
[ 28.169759] <IRQ>
[ 28.169759] [<ffffffff8105fc23>] ? scheduler_tick+0xe3/0x300
[ 28.169759] [<ffffffff810bed01>] ? queue_stop_cpus_work+0xa1/0xd0
[ 28.169759] [<ffffffff810bed01>] ? queue_stop_cpus_work+0xa1/0xd0
[ 28.169759] [<ffffffff8165c2ac>] call_softirq+0x1c/0x30
[ 28.169759] [<ffffffff81015215>] do_softirq+0x65/0xa0
[ 28.169759] [<ffffffff8106d80e>] irq_exit+0x8e/0xb0
[ 28.169759] [<ffffffff8165cc4e>] smp_apic_timer_interrupt+0x6e/0x99
[ 28.169759] [<ffffffff8165ab1e>] apic_timer_interrupt+0x6e/0x80
[ 28.169759] <EOI>
[ 28.169759] [<ffffffff810bede5>] ? stop_machine_cpu_stop+0xb5/0xf0
[ 28.169759] [<ffffffff810bed30>] ? queue_stop_cpus_work+0xd0/0xd0
[ 28.169759] [<ffffffff810bef03>] cpu_stopper_thread+0xe3/0x1b0
[ 28.169759] [<ffffffff810546ca>] ? finish_task_switch+0x4a/0xf0
[ 28.169759] [<ffffffff8164f7dc>] ? __schedule+0x3cc/0x6f0
[ 28.169759] [<ffffffff8104b1e8>] ? __wake_up_common+0x58/0x90
[ 28.169759] [<ffffffff810bee20>] ? stop_machine_cpu_stop+0xf0/0xf0
[ 28.169759] [<ffffffff8108881c>] kthread+0x8c/0xa0
[ 28.169759] [<ffffffff8165c1b4>] kernel_thread_helper+0x4/0x10
[ 28.169759] [<ffffffff81088790>] ? flush_kthread_worker+0xa0/0xa0
[ 28.169759] [<ffffffff8165c1b0>] ? gs_change+0x13/0x13
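
For reference, the "stuck for 23s" number comes from the kernel's softlockup watchdog: a high-priority per-CPU watchdog thread periodically refreshes a timestamp, and a timer callback complains when that timestamp has not moved for longer than the threshold (2 * watchdog_thresh, i.e. 20s by default). The task named in the message (migration/0 here) is simply whatever happened to be running on the CPU when the check fired. A minimal sketch of that check, assuming the usual kernel/watchdog.c structure (names and details simplified, not the literal 3.2 source):

        /* Simplified sketch of the softlockup check, not the literal kernel code. */
        static void check_softlockup(int cpu)
        {
                unsigned long touch_ts = per_cpu(watchdog_touch_ts, cpu); /* refreshed by watchdog/N */
                unsigned long now = get_timestamp(cpu);                   /* seconds of uptime */

                /* If the high-priority watchdog thread could not run for more than
                   roughly 2 * watchdog_thresh seconds, nothing else on this CPU was
                   able to schedule either: report a soft lockup. */
                if (time_after(now, touch_ts + 2 * watchdog_thresh))
                        printk(KERN_EMERG "BUG: soft lockup - CPU#%d stuck for %lus!\n",
                               cpu, now - touch_ts);
        }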

Tags: precise
Stefan Bader (smb) wrote :

Repeating the boot after replacing the Precise kernel with an Oneiric (3.0.0) one, there actually is a rather long delay at about the same point in boot. The difference is that it seems shorter (60s vs. 112s) and does not cause the softlockup warnings. But something is delaying that boot as well.

Stefan Bader (smb) wrote :

Looking at the disassembly of the __do_softirq function (because every softlockup warning does show exactly the same instruction pointer address):

0xffffffff8106d3cd <__do_softirq+77>: nopl (%rax)
0xffffffff8106d3d0 <__do_softirq+80>: movl $0x0,%gs:0x13540
0xffffffff8106d3dc <__do_softirq+92>: sti
0xffffffff8106d3dd <__do_softirq+93>: xchg %ax,%ax
0xffffffff8106d3e0 <__do_softirq+96>: xchg %ax,%ax
0xffffffff8106d3e3 <__do_softirq+99>: mov $0xffffffff81c04080,%rbx
0xffffffff8106d3ea <__do_softirq+106>: nopw 0x0(%rax,%rax,1)

this should match the following in the kernel C code:

restart:
        /* Reset the pending bitmask before enabling irqs */
        set_softirq_pending(0);

        local_irq_enable(); <-- sti + 2x xchg as nops

        h = softirq_vec;

So VCPU#0 seems to be stuck between the two xchg instructions, which are used as padding nops (as these 6 bytes get replaced after boot). This is not a place where I would expect a CPU to get stuck. And even if it is not really stuck, it is still odd that it is always exactly the same place in every warning.
At the moment only two explanations sound reasonable: either the hypervisor for some odd reason stops servicing the VCPU for some time, or there is some progress, but every time interrupts are enabled on VCPU#0 it gets held up for long enough that this is the only place we see in the warnings.
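
For context, the call trace above shows migration/0 taking the timer interrupt while inside stop_machine_cpu_stop(), which on 3.2-era kernels is essentially a cpu_relax() polling loop in which every CPU waits for all the others to step through the stop_machine state machine. A rough sketch of its shape (heavily condensed, not the exact Ubuntu source):

        /* Heavily simplified sketch of stop_machine_cpu_stop() (kernel/stop_machine.c). */
        static int stop_machine_cpu_stop(void *data)
        {
                struct stop_machine_data *smdata = data;
                enum stopmachine_state curstate = STOPMACHINE_NONE;
                int err = 0;

                do {
                        cpu_relax();            /* spin; irqs stay enabled until DISABLE_IRQ */
                        if (smdata->state != curstate) {
                                curstate = smdata->state;
                                switch (curstate) {
                                case STOPMACHINE_DISABLE_IRQ:
                                        local_irq_disable();
                                        break;
                                case STOPMACHINE_RUN:
                                        err = smdata->fn(smdata->data);
                                        break;
                                default:
                                        break;
                                }
                                ack_state(smdata); /* advances only when all CPUs reach this state */
                        }
                } while (curstate != STOPMACHINE_EXIT);

                return err;
        }

If one VCPU stops making progress here, the others keep spinning in this loop waiting for it, which would be consistent with the first explanation above.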

My problem is that I cannot reproduce this on my 8-core test system, neither with the CentOS/Xen 3.4.3 nor with the Precise/Xen 4.1.2 combination. Both boot just fine (I even tried overcommitting to 16 VCPUs).

Is there anything special about the HW these instances are running on, like the real number of CPUs/cores/HT? Is there anything that the dom0 logs could tell us?

Stefan Bader (smb) wrote :

Ok, going even further back to 2.6.38, boot is delayed by roughly the same amount of time I saw with 3.2, but still without the softlockup warnings. Maybe the 60s in 3.0 was just luck?

tags: added: precise