Kernel deadlock in scheduler on multiple EC2 instance types

Bug #929941 reported by Matt Wilson
This bug affects 3 people
Affects: linux-ec2 (Ubuntu)
Status: Fix Released
Importance: High
Assigned to: Stefan Bader

Bug Description

SRU Justification:

Impact: The version of Xen patches we currently use for the ec2 kernel have a serious flaw in the handling of nested spinlocks. This can result in a complete deadlock under certain workloads.

Fix: The spinlock handling code has been substantially restructured in later versions of the patchset. The changes backport this but also enable the use of ticket-spinlocks (as we do now) when compiling with the compatibility level we use.
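
For readers unfamiliar with the term, below is a minimal, generic ticket-spinlock sketch in userspace C11. It only illustrates the locking scheme the fix enables; it is not code from the Xen patchset, and all names in it are made up.

#include <stdatomic.h>

struct ticket_lock {
    atomic_uint next;    /* next ticket to hand out */
    atomic_uint owner;   /* ticket currently allowed to hold the lock */
};

static void ticket_lock_acquire(struct ticket_lock *l)
{
    /* Take a ticket, then wait until it is our turn (FIFO fairness). */
    unsigned int me = atomic_fetch_add(&l->next, 1);
    while (atomic_load(&l->owner) != me)
        ;   /* the PV variant parks the VCPU in the hypervisor here */
}

static void ticket_lock_release(struct ticket_lock *l)
{
    /* Let the next waiter in; the PV variant also kicks that waiter. */
    atomic_fetch_add(&l->owner, 1);
}

The queueing via tickets is what gives fairness under contention; the PV variant replaces the busy loop with a hypercall wait plus a wakeup event from the unlocker.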

Testcase: Not easy to reproduce, but feedback with the patchset applied (see comment #32) looks good.

--

After running for some indeterminate period of time, the 2.6.32-341-ec2 and 2.6.32-342-ec2 kernels stop responding when running on m2.2xlarge EC2 instances. No console output is emitted. Stack dumps gathered by examining CPU context information show that all VCPUs are stuck waiting on spinlocks. This could be a deadlock in the scheduling code.

ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: linux-image-2.6.32-341-ec2 2.6.32-341.42
ProcVersionSignature: User Name 2.6.32-341.42-ec2 2.6.32.49+drm33.21
Uname: Linux 2.6.32-341-ec2 x86_64
Architecture: amd64
Date: Fri Feb 10 01:56:17 2012
Ec2AMI: ami-55dc0b3c
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: us-east-1c
Ec2InstanceType: m1.xlarge
Ec2Kernel: aki-427d952b
Ec2Ramdisk: unavailable
ProcEnviron:
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: linux-ec2

visibility: public → private
Scott Moser (smoser)
visibility: private → public
Revision history for this message
Matt Wilson (msw-amazon) wrote :

Overnight an instance running 2.6.32-316 locked up. The stack traces are attached.

Revision history for this message
Stefan Bader (smb) wrote :

At first glance it looks like all the hangs occur because the hypercall used to poll for the PV interrupts never returns. Has there been any change to the hypervisor on the affected instances? Or is there maybe some correlation with a specific type of hardware running those?
Can we get the full dmesg of one of the hanging guests after a reboot (which I assume keeps it on the same host)? Were the instances just idling, or were they used for some task that you can tell us about? Thanks.

Changed in linux-ec2 (Ubuntu):
assignee: nobody → Stefan Bader (stefan-bader-canonical)
importance: Undecided → High
status: New → Incomplete
Revision history for this message
Matt Wilson (msw-amazon) wrote :

I also suspect something going sideways in the PV spinlock code, but nothing has changed in the underlying hardware or hypervisor in this area. There have been bugs in the PV spinlock code in the past, including using mb() instead of barrier() in the unlock path, which could cause the VCPU holding a lock to trigger a kick on the VCPU waiting before the memory write is complete. I looked at the 10.04 kernel, and this particular bug is already addressed in the PV spinlock code.
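
To illustrate the kind of ordering hazard being described, here is a generic "unlock, then kick the waiter" sketch in C11. The names are made up and this is not the actual PV spinlock code: the point is only that the release of the lock word has to be visible before the waiter is notified, otherwise the woken VCPU can observe the lock still held and park itself again with no further wakeup coming.

#include <stdatomic.h>

struct pv_lock {
    atomic_int locked;     /* 1 = held, 0 = free */
    atomic_int spinners;   /* waiters currently parked in the hypervisor */
};

/* Stand-in for the event-channel notification that wakes a parked VCPU. */
static void kick_waiter(struct pv_lock *l) { (void)l; }

static void pv_unlock(struct pv_lock *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
    /* Full barrier: make the release store visible before deciding to kick. */
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load(&l->spinners) > 0)
        kick_waiter(l);
}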

These instances are under load when they hang. Here's the uptime and /proc/interrupts output from one instance before it hung, but after it was operational:

Linux ip-10-94-81-231 2.6.32-341-ec2 #42-Ubuntu SMP Tue Dec 6 14:56:13 UTC 2011 x86_64 GNU/Linux
16:10:54 up 16 days, 19:52, 0 users, load average: 9.86, 5.01, 3.41
             CPU0         CPU1         CPU2         CPU3
 16:    186872780    170473347    170447163    170493692   Dynamic-percpu  timer
 17:    191775644    350788322    357828130    357481319   Dynamic-percpu  resched
 18:        67019        74008        66602        66485   Dynamic-percpu  callfunc
 19:       189590       193987       188670       181119   Dynamic-percpu  call1func
 20:            0            0            0            0   Dynamic-percpu  reboot
 21:    165290618    177938588    177538577    177157514   Dynamic-percpu  spinlock
 22:          410            0            0            0   Dynamic-level   xenbus
 23:            0            0            0            0   Dynamic-level   suspend
 24:          341            0           74          180   Dynamic-level   xencons
 25:       392339       664199       899350       700455   Dynamic-level   blkif
 26:     19953668     46164431     58214738     57029478   Dynamic-level   blkif
 27:   1483445834            0            0            0   Dynamic-level   eth0
NMI:            0            0            0            0   Non-maskable interrupts
RES:    191775644    350788323    357828131    357481320   Rescheduling interrupts
CAL:       256609       267995       255272       247604   Function call interrupts

Over the weekend, m2.4xlarge instances hung as well. I'll work on getting dmesg output.

summary: - Kernel deadlock in scheduler on m2.2xlarge EC2 instance
+ Kernel deadlock in scheduler on m2.{2,4}xlarge EC2 instance
Revision history for this message
Matt Wilson (msw-amazon) wrote : Re: Kernel deadlock in scheduler on m2.{2,4}xlarge EC2 instance
Revision history for this message
Stefan Bader (smb) wrote :

The most interesting part of the dmesg for me is that it gives a rough idea of which Xen version the host is running. It also usually helps to confirm whether that correlates across all the cases where the hang happens. It looks like some interaction problem, but the only code I can look at is the guest.
Recently there has been a fix in upstream and 3.2.y for a spinlock problem, but it blamed a commit in 3.2 for having caused that regression. The ec2 kernels don't have that commit, and I guess the dom0 kernel doesn't either. Just for reference, those would have been:

commit 84eb950db13ca40a0572ce9957e14723500943d6
  x86, ticketlock: Clean up types and accessors

for breaking and

commit 7a7546b377bdaa25ac77f33d9433c59f259b9688
  x86: xen: size struct xen_spinlock to always fit in arch_spinlock_t

But maybe we should make sure it is not something similar. I'll check the ec2 kernel code and post the numbers here.
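
For context on the second commit above, the sizing concern is the kind of invariant a compile-time check makes explicit. The sketch below is generic and hypothetical (the struct layouts are made up and are not the kernel's); it only shows the shape of the check.

#include <assert.h>

/* Hypothetical layouts, for illustration only. */
struct arch_spinlock_example { unsigned int slock; };
struct xen_spinlock_example  { unsigned char lock; unsigned char spinners; };

/* If PV lock data is overlaid on the generic lock, it must never be larger. */
static_assert(sizeof(struct xen_spinlock_example) <= sizeof(struct arch_spinlock_example),
              "xen spinlock data must fit inside arch_spinlock_t");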

Revision history for this message
Stefan Bader (smb) wrote :

Looking at the Xen code used for the ec2 guest kernels, it does not overload the generic spinlock struct with Xen data, so at least that cannot overflow. That said, the whole Xen spinlock code there is a snapshot from quite a while ago. I had been working on importing a number of changes to it, but the result was so different from the currently released code that moving forward seems rather scary.
My first thought on all these CPUs sitting in the hypercall was that the callback/wakeup from there was failing. But there is also the possibility that the notification about releasing the lock is somehow not sent. The code uses some sort of stacking list, and maybe the workload you found has a better chance of getting that messed up...
I am not sure what the best way forward would be: trying to isolate the spinlock-related changes from the big update and then test those, or just doing a recent build of the big update and testing that. The first option takes more time and probably several iterations, while the latter may bring other problems.

Revision history for this message
Stefan Bader (smb) wrote :

This gives me some headaches. I tried to figure out what would make sense to pick from the newer spinlock-related code. The current code (our ec2 topic branch) seems at least to have a potentially dangerous place in xen_spin_kick: it only checks whether another CPU is spinning on the same lock at the top level. If I understand the code right, there is a chain of locks each CPU is spinning on, so a CPU can spin on one lock with interrupts enabled and then end up spinning on another lock on the same CPU in an interrupt section (which has interrupts disabled).
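
A simplified, hypothetical sketch of such a chain, and of why checking only the top entry is dangerous (names and layout are made up, not the patchset's):

#define MAX_SPIN_NESTING 4

/* Per-CPU record of the locks this CPU is currently spinning on.
 * Nesting happens when an interrupt handler spins on a second lock
 * while the interrupted context was already spinning on another. */
struct spinning_chain {
    const void *lock[MAX_SPIN_NESTING];
    int depth;
};

/* Should this CPU be woken because @lock is being released? */
static int should_kick(const struct spinning_chain *c, const void *lock)
{
    /* Checking only c->lock[c->depth - 1] would miss the nested case;
     * walking the whole chain does not. */
    for (int i = 0; i < c->depth; i++)
        if (c->lock[i] == lock)
            return 1;
    return 0;
}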

However, while trying to understand the whole thing, I realized that the new code also defines a different raw_spinlock_t, and in there is the following comment:

/*
 * Xen versions prior to 3.2.x have a race condition with HYPERVISOR_poll().
 */

There is some #define magic that checks for a XEN_COMPAT greater than or equal to 3.2 and otherwise turns off *all* of the ticket spinlock code, to stay compatible with earlier hypervisors. And we compile with 3.0.2 compatibility. So if we used that new code, spinlocks would become real spinning locks again (meaning no tickets and no hypervisor/unlock interrupt optimization).
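
As a rough, hypothetical sketch of what such a compatibility switch looks like (the macro names and version encoding here are assumptions, not the patchset's literal source):

#include <stdio.h>

#ifndef CONFIG_XEN_COMPAT
#define CONFIG_XEN_COMPAT 0x030002   /* assumed encoding for 3.0.2, our compat level */
#endif

#if CONFIG_XEN_COMPAT >= 0x030200    /* assumed encoding for 3.2 */
#define USE_PV_TICKET_LOCKS 1        /* ticket locks, waiters park via hypercall */
#else
#define USE_PV_TICKET_LOCKS 0        /* plain spinning locks, no ticket code at all */
#endif

int main(void)
{
    printf("ticket spinlock code is %s\n",
           USE_PV_TICKET_LOCKS ? "compiled in" : "compiled out");
    return 0;
}

Building with, for example, -DCONFIG_XEN_COMPAT=0x030200 would flip the switch in this sketch.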

Now, if that is true, then the observed hangs should all have happened on hosts running a Xen version older than 3.2. If not, well, given the amount of changes that are there, it still leaves room for the bug to be in that code. And this also opens up several paths:

a) Try to figure out the minimal change, which will likely require a few iterations to get right.
b) Take the complete new code related to spinlocks. However, that will result in dropping ticket locks as long as we need to be compatible with Xen < 3.2, and I have at least seen instances running on such hosts in the past. In which case we could as well
c) just pick the non-ticket implementation. Of course that could cause some performance regressions.
d) Make sure no AWS host is running Xen < 3.2 anymore and pick a compat level of 3.2 (or at least pick the spinlock code in a way that uses ticket locking, because I am not really confident that changing the compat level overall would not have side effects).

But anyway I'd be quite interested in finding out whether the hangs are on Xen before 3.2 or not.

Revision history for this message
Matt Wilson (msw-amazon) wrote :

Stefan,

Which commit has the race condition comment? I'm aware of a problem with SUSE's kernel with regard to PV ticketlocks and HYPERVISOR_poll(), but I don't see any mention in upstream 3.2.x or XenLinux 2.6.18.

Your 10.04 2.6.32-era kernel doesn't have ticketlocks, so the underlying hypervisor version should not be a factor. But for the sake of argument, the lockups are observed on Xen hypervisors newer than 3.2.

What are you using for upstream Xen components for 2.6.32? Is it the SUSE tree?

Revision history for this message
Stefan Bader (smb) wrote :

Matt,

which commit is a bit complicated to say. Basically yes, the code is a merge between the 2.6.32 kernel code we have for 10.04 and the Xen patches SUSE had at that point in time. The "new" tree I am talking about was an effort to pick the patches from a newer release and work out what is missing / has changed. Which is not that simple, because they rebase their tree onto something (which Xen source, I was never able to find out) and then refresh their patchset.

If you want to see for yourself, you can find the current code at:
git://kernel.ubuntu.com/ubuntu/ubuntu-lucid.git
(check out the ec2 branch) and I have pushed the results of reworking the newer patchset to
git://kernel.ubuntu.com/smb/ubuntu-lucid.git
into the ec2-next branch there.

And IMO we do have ticket locks. See drivers/xen/core/spinlocks.c in the current ec2 branch; also note that you actually see interrupt counts for the spinlock IRQ. If you compile the ec2-next branch (maybe a bit of an optimistic name) and run it, you will notice that spinlocks are now directly an event channel but the counts do not get incremented (because compiling with compat set to 3.0.2 disables the ticket lock code).

Ok, so at least that rules out the hypervisor poll call as the problem, and we can go forward from there. And to repeat the answer to your last question: yes, based on SUSE. Be careful when reading code in the ec2 tree. It is a bit of a pain because it still contains all of the 2.6.32 upstream Xen components, plus the SUSE ones (whatever Xen version those are based on). So arch/x86/xen is not used for the ec2 kernel, but arch/x86/include/mach-xen/asm is, as are copies of x86 files with -xen appended to them and some parts in drivers/xen (those pulled in by CONFIG_XEN).

Revision history for this message
Stefan Bader (smb) wrote :

Oh, I completely forgot to say: the comment I was talking about shows up in ec2-next in arch/x86/include/mach-xen/asm/spinlock_types.h.

Revision history for this message
Matt Wilson (msw-amazon) wrote :

$ git clone git://kernel.ubuntu.com/smb/ubuntu-lucid.git
Cloning into ubuntu-lucid...
remote: error: Could not read b43f7c4d8d293aa9f47a7094852ebd5355e4f38f
remote: fatal: Failed to traverse parents of commit 3becab1d2df01d54a4e889cf2d69ccb902cd43c3
remote: aborting due to possible repository corruption on the remote side.
fatal: early EOF
fatal: index-pack failed

Revision history for this message
Stefan Bader (smb) wrote :

Oops, sorry about that. The push there did not really indicate that the repo went into such an utter state of disaster. :( It is fixed up now.

Revision history for this message
Stefan Bader (smb) wrote :

Started to look into backporting the spinlock changes from the newer patchset. Without changing the XEN_COMPAT this would result in a non-ticket lock implementation (as mentioned before). Not sure how this behaves, but maybe you want to try. I uploaded kernel packages in that state to http://people.canonical.com/~smb/lp929941/.

Next need to find out whether it would be possible to ignore the possible hypervisor race and enable the modified ticket code regardless of the compat setting. But that will take a bit more time.

Revision history for this message
Stefan Bader (smb) wrote :

I have now added v2 builds which include the newer spinlock code (also pulling in some other changes to allow it to compile) and change XEN_COMPAT to 3.2 and later. The question is whether it is a valid assumption that there won't be a Xen version older than 3.2 on EC2.

Revision history for this message
Matt Wilson (msw-amazon) wrote :

The required CONFIG_XEN_COMPAT value for ec2 is documented here: http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/AdvancedUsers.html

Revision history for this message
Stefan Bader (smb) wrote :

Yes, that value is what is used right now. The question was whether it could be moved up by now (depending on the AWS rollout status). But anyway, I changed the patch to activate ticket spinlocks even when compiling for 3.0.2 or later, which would be the same situation we have right now, just with the code fixes.
Please give those v3 kernels some testing and let me know how those are behaving. Thanks.

Revision history for this message
Matt Wilson (msw-amazon) wrote :

This has also been observed on c1.xlarge; adjusting the summary.

summary: - Kernel deadlock in scheduler on m2.{2,4}xlarge EC2 instance
+ Kernel deadlock in scheduler on multiple EC2 instance types
Revision history for this message
Stefan Bader (smb) wrote :

Matt, any progress in testing the latest (v3) kernels that I provided?

Revision history for this message
Matt Wilson (msw-amazon) wrote :

I've never been able to reproduce the problem with synthetic workloads. I've asked customers that experience the lockup regularly to test the v3 builds in an environment that won't cause production problems, but haven't received results.

Revision history for this message
Stefan Bader (smb) wrote :

Ah, ok. Thanks. We'll have to wait then.

Revision history for this message
Noah Zoschke (noav) wrote :

Stefan,

We've collected enough instance hours on the v1 kernel to feel confident that it is not suffering the deadlock issue. We are continuing to roll over our affected production instances to it.

We have done basic testing on v3 but we haven't collected enough production data on it yet to report anything.

Can you help me understand the trajectory of these patches for our long term planning?

Is there any indication of when v1 or v3 would land in an official linux-ec2 release?

What can we do to help the most here? Collect significant instance hours on v3?

Revision history for this message
Stefan Bader (smb) wrote :

Noah,

thanks for testing and reporting the results. The first thing to do now is to decide whether v1 or v3 should be the goal. v1 could be considered well tested by now. The downside I see with it is that, to avoid problems with certain older hypervisor code, it uses real spinning spinlocks. This means that while waiting for a lock the virtual CPU will busy-wait (which could have some impact on the cloud host's CPU usage). It also gives no queuing, which means that acquiring the lock can be unfair in contended situations.
The v3 kernel would in principle use the same implementation as we have right now (ticket locks), which could theoretically be the wrong thing on older hypervisor versions (though the chance of having an instance launched on such an old host version likely gets smaller every day). At least it is the same risk as we have now, and the lockups happened on newer hypervisor versions. So I would tend towards the v3 solution, but for that it would be good to have more hours of testing with v3, to see that it is not showing other problems that might be related to this change.
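
For contrast with the ticket lock sketched earlier in this report, a plain test-and-set spinlock (again a generic C11 illustration, not the kernel code) looks roughly like the following: every waiter busy-spins and there is no queue, so a lucky CPU can re-acquire the lock repeatedly while others starve.

#include <stdatomic.h>

/* Initialise with: struct byte_lock l = { ATOMIC_FLAG_INIT }; */
struct byte_lock { atomic_flag held; };

static void byte_lock_acquire(struct byte_lock *l)
{
    /* Busy-wait until the flag is clear; no ordering among waiters. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;
}

static void byte_lock_release(struct byte_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}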

Normally, the process to get a change into an official kernel means proposing it as an SRU (stable release update): I will propose the patches for inclusion, and once accepted they go into a proposed kernel. Normally those are prepared and made available, and then verification has to be done within a week, which does not work for a bug like this. But if there is reasonable confidence that a test kernel has been running on your busy instances without the original issue and without new stability problems, that should be a good argument.

Since the time I built the current v3 kernels there have been other updates, too, so I will go ahead and prepare a new set of those. I will post here when they are ready. If you could then start migrating your instances to them and report back here when you feel confident about the stability, I would start the steps required to integrate the changes into the official kernels.

Revision history for this message
Stefan Bader (smb) wrote :

Newer kernels have been uploaded to the same place as before.

Revision history for this message
Noah Zoschke (noav) wrote :

Thank you for the information. We will begin limited testing of the latest kernels provided.

Revision history for this message
Ilan (ilan) wrote :

We believe we are experiencing this bug as well. The most frequently impacted instance type in our environment seems to be c1.xlarge. However, it appears to strike almost entirely at random and rarely hits the same instance twice.

We're currently testing the new kernels on a very limited set of machines. We will report back with our experiences.

Revision history for this message
Matt Wilson (msw-amazon) wrote :

We've had a customer report a very similar looking lockup on 3.0.0-20-virtual. Full version info, "3.0.0-20-virtual (buildd@yellow) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5.1) ) #34~lucid1-Ubuntu SMP Wed May 2 17:24:41 UTC 2012 (Ubuntu 3.0.0-20.34~lucid1-virtual 3.0.30)"

Revision history for this message
Ilan (ilan) wrote :

We've had a few instances running on 2.6.32-345-ec2 #47+lp929941v3, linked from this ticket, since 2012-05-09. So far those instances have been stable, but it is not possible for us to determine whether the crash has been resolved or whether the subset of instances we upgraded was just lucky enough not to trigger this bug.

As mentioned before, the crashes do not appear to be consistently reproducible and hit our instances at random.

Revision history for this message
Ilan (ilan) wrote :

Still seeing the crash with the most recent kernel update in lucid: 2.6.32-345-ec2 #49-Ubuntu SMP

Haven't yet seen a crash on 2.6.32-345-ec2_2.6.32-345.47+lp929941v3_amd64.deb.

Stefan Bader (smb)
Changed in linux-ec2 (Ubuntu):
status: Incomplete → In Progress
Stefan Bader (smb)
description: updated
Stefan Bader (smb)
Changed in linux-ec2 (Ubuntu):
status: In Progress → Fix Committed
Revision history for this message
Luis Henriques (henrix) wrote :

This bug is awaiting verification that the kernel for lucid in -proposed solves the problem (2.6.32-346.51). Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-lucid' to 'verification-done-lucid'.

If verification is not done by one week from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-lucid
Revision history for this message
Ilan (ilan) wrote :

Given the sporadic nature of this bug, it would take at least two weeks of testing before we could say with even a slight bit of confidence that a given kernel has stopped the crashes we were experiencing.

Revision history for this message
Stefan Bader (smb) wrote :

I would mark this as verified, since the intended change has already been running in test kernels for some time and, as Ilan said, it would take longer than the verification period to hit the bug.

tags: added: verification-done-lucid
removed: verification-needed-lucid
Revision history for this message
Fabio Kung (fabiokung) wrote :

We are very confident that this bug is not present in the v1 kernel, as we have been running only instances with that kernel for some months now and we have not seen these issues anymore. noav and I were among the original reporters of this bug.

We can help test the kernel currently in -proposed, but as others have already commented, one week would not be enough to collect sufficient instance hours. Two weeks would give us much more confidence.

Revision history for this message
Fabio Kung (fabiokung) wrote :

We started with 30 instances running with the -proposed ec2 kernel, in our production environment. We plan to gradually boot more and update this ticket once we collect enough instance hours to be confident that this bug is not present in that version.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-ec2 - 2.6.32-346.51

---------------
linux-ec2 (2.6.32-346.51) lucid-proposed; urgency=low

  [ Stefan Bader ]

  * SAUCE: Update spinlock handling code
    - LP: #929941
  * SAUCE: Use ticket locks for Xen 3.0.2+
    - LP: #929941
  * Rebased to Ubuntu-2.6.32-41.93
  * Release Tracking Bug
    - LP: #1021084

  [ Ubuntu: 2.6.32-41.93 ]

  * No change upload to fix .ddeb generation in the PPA.

  [ Ubuntu: 2.6.32-41.92 ]

  * drm/i915: Move Pineview CxSR and watermark code into update_wm hook.
    - LP: #1004707
  * drm/i915: Add CxSR support on Pineview DDR3
    - LP: #1004707
 -- Stefan Bader <email address hidden> Mon, 25 Jun 2012 11:20:40 +0200

Changed in linux-ec2 (Ubuntu):
status: Fix Committed → Fix Released