I also suspect that something is going wrong in the PV spinlock code, although nothing has changed in the underlying hardware or hypervisor in this area. There have been bugs in the PV spinlock code in the past, including one where the unlock path used barrier() where mb() was needed, which could cause the VCPU holding a lock to trigger a kick on the waiting VCPU before the memory write is complete. I looked at the 10.04 kernel, and this particular bug is already addressed in its PV spinlock code.
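To make the ordering concern concrete, here is a minimal userspace sketch of an unlock path. It uses C11 atomics rather than the kernel's primitives, and `struct pv_lock` and `pv_unlock()` are hypothetical names for illustration, not the kernel's code. The point is that the store releasing the lock has to be ordered before the check for waiters. As far as I understand it, that needs a full fence (the analogue of mb()), because a compiler-only barrier does not stop the CPU from reordering a store followed by a load.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the PV spinlock state; not the kernel's struct. */
struct pv_lock {
    atomic_uchar locked;    /* 1 while the lock is held */
    atomic_uint  spinners;  /* number of VCPUs blocked waiting for a kick */
};

/* Release the lock. Returns true if a waiting VCPU would need a kick. */
static bool pv_unlock(struct pv_lock *l)
{
    /* Release the lock; earlier writes may not move past this store. */
    atomic_store_explicit(&l->locked, 0, memory_order_release);

    /*
     * Full fence: the store above must be globally visible before the
     * load of spinners below. A compiler-only barrier would not prevent
     * the CPU from performing the load first.
     */
    atomic_thread_fence(memory_order_seq_cst);

    return atomic_load_explicit(&l->spinners, memory_order_relaxed) != 0;
}
```

In this sketch the caller would send the kick (an event channel notification in the Xen case) only when pv_unlock() returns true.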
These instances are under load when they hang. Here's the uname, uptime and /proc/interrupts output from one instance, captured while it was still operational, before it hung:
Linux ip-10-94-81-231 2.6.32-341-ec2 #42-Ubuntu SMP Tue Dec 6 14:56:13 UTC 2011 x86_64 GNU/Linux
16:10:54 up 16 days, 19:52, 0 users, load average: 9.86, 5.01, 3.41
           CPU0       CPU1       CPU2       CPU3
 16:  186872780  170473347  170447163  170493692  Dynamic-percpu timer
 17:  191775644  350788322  357828130  357481319  Dynamic-percpu resched
 18:      67019      74008      66602      66485  Dynamic-percpu callfunc
 19:     189590     193987     188670     181119  Dynamic-percpu call1func
 20:          0          0          0          0  Dynamic-percpu reboot
 21:  165290618  177938588  177538577  177157514  Dynamic-percpu spinlock
 22:        410          0          0          0  Dynamic-level xenbus
 23:          0          0          0          0  Dynamic-level suspend
 24:        341          0         74        180  Dynamic-level xencons
 25:     392339     664199     899350     700455  Dynamic-level blkif
 26:   19953668   46164431   58214738   57029478  Dynamic-level blkif
 27: 1483445834          0          0          0  Dynamic-level eth0
NMI:          0          0          0          0  Non-maskable interrupts
RES:  191775644  350788323  357828131  357481320  Rescheduling interrupts
CAL:     256609     267995     255272     247604  Function call interrupts
Over the weekend, m2.4xlarge instances hung as well. I'll work on getting dmesg output.