Comment 10 for bug 929941

Stefan Bader (smb) wrote : Re: Kernel deadlock in scheduler on m2.{2,4}xlarge EC2 instance

Looking at the Xen code used for the EC2 guest kernels, it does not overload the generic spinlock struct with Xen data, so at least that cannot overflow. That said, the whole Xen spinlock code there is a snapshot from quite a while ago. I had been working on importing a number of changes to it, but the result was so different from the currently released code that moving forward with it seems rather scary.
My first thought on seeing all these CPUs in the hypercall was that the callback/wakeup from there was failing. But it is also possible that the notification about the lock being released is somehow never sent. The code uses some sort of stacking list, and maybe the workload you found has a better chance of getting that messed up...
I am not sure what the best way forward would be. One option is to isolate the spinlock-related changes from the big update and try just those; the other is to take a recent build of the whole big update and try that. The first option takes more time and probably several iterations, while the latter may bring other problems.