Comment 11 for bug 929941

Stefan Bader (smb) wrote : Re: Kernel deadlock in scheduler on m2.{2,4}xlarge EC2 instance

This gives me some headaches. I tried to figure out what would make sense to pick from the newer spinlock-related code. The current code (our ec2 topic branch) seems to have at least one potentially dangerous place, in xen_spin_kick: when looking for other CPUs spinning on the same lock, it only checks the top-level entry. If I understand the code right, there is a chain of such entries, so that it can handle a CPU that spins on one lock with interrupts enabled and then, on the same CPU, gets to spin on another lock inside an interrupt section (which runs with interrupts disabled).
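To make the concern concrete, here is a minimal sketch of the difference between checking only the top-level entry and walking the whole chain. This is not the kernel code: struct spinning and both helper functions are hypothetical names made up for illustration, and the real per-CPU data layout may well differ.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of one lock a CPU is currently spinning on. */
struct spinning {
    const void *lock;       /* the lock being waited for */
    struct spinning *prev;  /* outer (interrupted) spin on the same CPU, or NULL */
};

/* Looks at the innermost entry only. A CPU whose outer, interrupted spin
 * is on `lock` is missed, so it would never be kicked. */
static bool spins_on_top_only(const struct spinning *top, const void *lock)
{
    return top != NULL && top->lock == lock;
}

/* Walks the whole chain, so interrupted outer spins are found as well. */
static bool spins_on_chain(const struct spinning *top, const void *lock)
{
    for (const struct spinning *s = top; s != NULL; s = s->prev) {
        if (s->lock == lock)
            return true;
    }
    return false;
}
```

With an outer spin on lock A and a nested spin on lock B, the top-level check reports that the CPU does not wait for A, while the chain walk finds it.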

However, while trying to understand that whole thing, I realized that the new code also defines a different raw_spinlock_t, and in there is the following comment:

/*
 * Xen versions prior to 3.2.x have a race condition with HYPERVISOR_poll().
 */

Depending on whether XEN_COMPAT is greater than or equal to 3.2, there is some #define magic which basically turns off *all* the ticket spinlock code in order to stay compatible with earlier hypervisors. And we compile with 3.0.2 compatibility. So if we used that new code, spinlocks would be done as real spinlocks again (meaning no tickets and no hypervisor / unlock interrupt optimization).
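Roughly, the kind of compile-time switch meant here could look like the sketch below. All SKETCH_* names and the hex encoding of the version are assumptions for illustration only, not the macros actually used in the newer code.

```c
/* Hypothetical stand-ins, not the real kernel macros. */
#ifndef SKETCH_XEN_COMPAT
#define SKETCH_XEN_COMPAT 0x030002  /* assumed encoding of "3.0.2 compatibility" */
#endif

#if SKETCH_XEN_COMPAT >= 0x030200
#define SKETCH_TICKET_LOCKS 1  /* ticket locks plus hypervisor poll/kick path */
#else
#define SKETCH_TICKET_LOCKS 0  /* plain spinning only, poll path compiled out */
#endif

/* Reports which variant this build ended up with. */
static int sketch_uses_ticket_locks(void)
{
    return SKETCH_TICKET_LOCKS;
}
```

Built with the 3.0.2 default, the ticket variant is compiled out entirely; only a build with a compat level of 3.2 or later would keep it.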

Now, if that is true, then the observed hangs should all have happened on hosts running Xen older than 3.2. If not, well, the amount of changes in there still leaves room for the bug to be somewhere in that code. Either way, this opens up several paths:

a) Try to figure out the minimal change, which will likely require a few iterations to get right.
b) Take the complete new code related to spinlocks. However, that will drop the use of ticket locks for as long as we need to be compatible with Xen < 3.2, and I have at least seen instances running on such hosts in the past. So we could as well
c) just pick the non-ticket implementation. Of course, that could cause some performance regressions.
d) Make sure no AWS host is running Xen < 3.2 anymore and pick a compat level of 3.2 (or at least pick the spinlock code in a way that uses ticket locking, because I am not really confident that changing the compat level overall would not have side effects).
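For reference, the difference between the two lock flavours in options b) to d) can be sketched in user space with C11 atomics. This is only an illustration of the general technique, not the kernel implementation; the struct and function names are made up, and the hypervisor poll/kick part is left out completely.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Plain ("real") spinlock: whoever wins the test-and-set gets the lock,
 * so waiters are served in no particular order. */
struct plain_lock { atomic_flag held; };

static void plain_lock_acquire(struct plain_lock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;  /* busy-wait */
}

static void plain_lock_release(struct plain_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Ticket lock: each waiter draws a ticket and is served in ticket order. */
struct ticket_lock { atomic_uint next; atomic_uint owner; };

static unsigned ticket_lock_take(struct ticket_lock *l)
{
    return atomic_fetch_add_explicit(&l->next, 1, memory_order_relaxed);
}

static bool ticket_lock_is_turn(struct ticket_lock *l, unsigned ticket)
{
    return atomic_load_explicit(&l->owner, memory_order_acquire) == ticket;
}

static void ticket_lock_acquire(struct ticket_lock *l)
{
    unsigned ticket = ticket_lock_take(l);
    while (!ticket_lock_is_turn(l, ticket))
        ;  /* busy-wait until our ticket comes up */
}

static void ticket_lock_release(struct ticket_lock *l)
{
    atomic_fetch_add_explicit(&l->owner, 1, memory_order_release);
}
```

The ticket variant hands the lock over strictly in arrival order, which is the property that would be lost when falling back to the non-ticket implementation.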

But anyway I'd be quite interested in finding out whether the hangs are on Xen before 3.2 or not.
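
One way to check this from inside an affected instance might be the small helper sketched below. It assumes the guest kernel exposes the hypervisor version under /sys/hypervisor/version/major and /sys/hypervisor/version/minor, which may not be the case on every kernel; the function names are made up for this example.

```c
#include <stdbool.h>
#include <stdio.h>

/* True if major.minor is strictly older than ref_major.ref_minor. */
static bool version_older_than(int major, int minor, int ref_major, int ref_minor)
{
    return major < ref_major || (major == ref_major && minor < ref_minor);
}

/* Reads a single non-negative integer from a file; returns -1 on failure. */
static int read_int_file(const char *path)
{
    FILE *f = fopen(path, "r");
    int value = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Returns 1 if the host reports Xen older than 3.2, 0 if not,
 * and -1 if the version could not be read (assumed sysfs paths). */
static int host_xen_older_than_3_2(void)
{
    int major = read_int_file("/sys/hypervisor/version/major");
    int minor = read_int_file("/sys/hypervisor/version/minor");

    if (major < 0 || minor < 0)
        return -1;
    return version_older_than(major, minor, 3, 2) ? 1 : 0;
}
```

Running something like this on the instances that hung, and recording the result next to each report, would tell whether the hangs cluster on hosts older than 3.2.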