Comment 7 for bug 929941

Revision history for this message
Matt Wilson (msw-amazon) wrote : Re: Kernel deadlock in scheduler on m2.2xlarge EC2 instance

I also suspect something going sideways in the PV spinlock code, but nothing has changed in the underlying hardware or hypervisor in this area. There have been bugs in the PV spinlock code in the past, including using mb() instead of barrier() in the unlock path, which could cause the VCPU holding a lock to trigger a kick on the VCPU waiting before the memory write is complete. I looked at the 10.04 kernel, and this particular bug is already addressed in the PV spinlock code.

These instances are under load when they hang. Here's the uptime and /proc/interrupts output from one instance before it hung, but after it was operational:

Linux ip-10-94-81-231 2.6.32-341-ec2 #42-Ubuntu SMP Tue Dec 6 14:56:13 UTC 2011 x86_64 GNU/Linux
16:10:54 up 16 days, 19:52, 0 users, load average: 9.86, 5.01, 3.41"
           CPU0 CPU1 CPU2 CPU3
 16: 186872780 170473347 170447163 170493692 Dynamic-percpu timer
 17: 191775644 350788322 357828130 357481319 Dynamic-percpu resched
 18: 67019 74008 66602 66485 Dynamic-percpu callfunc
 19: 189590 193987 188670 181119 Dynamic-percpu call1func
 20: 0 0 0 0 Dynamic-percpu reboot
 21: 165290618 177938588 177538577 177157514 Dynamic-percpu spinlock
 22: 410 0 0 0 Dynamic-level xenbus
 23: 0 0 0 0 Dynamic-level suspend
 24: 341 0 74 180 Dynamic-level xencons
 25: 392339 664199 899350 700455 Dynamic-level blkif
 26: 19953668 46164431 58214738 57029478 Dynamic-level blkif
 27: 1483445834 0 0 0 Dynamic-level eth0
NMI: 0 0 0 0 Non-maskable interrupts
RES: 191775644 350788323 357828131 357481320 Rescheduling interrupts
CAL: 256609 267995 255272 247604 Function call interrupts

Over the weekend, m2.4xlarge instances hung as well. I'll work on getting dmesg output.