Strange 'fork/clone' blocking behavior under high cpu usage on EC2

Bug #708920 reported by Matt Wilson
This bug affects 16 people
Affects: linux-ec2 (Ubuntu)
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

There have been reports of fork() hangs on Lucid when running on EC2. See this YouTube video for an example: http://www.youtube.com/watch?v=rbURfuAmtXw

ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: linux-image-2.6.32-305-ec2 2.6.32-305.9
ProcVersionSignature: User Name 2.6.32-305.9-ec2 2.6.32.11+drm33.2
Uname: Linux 2.6.32-305-ec2 x86_64
Architecture: amd64
Date: Thu Jan 27 22:25:15 2011
Ec2AMI: ami-fd4aa494
Ec2AMIManifest: ubuntu-images-us/ubuntu-lucid-10.04-amd64-server-20100427.1.manifest.xml
Ec2AvailabilityZone: us-east-1c
Ec2InstanceType: m1.xlarge
Ec2Kernel: aki-0b4aa462
Ec2Ramdisk: unavailable
ProcEnviron:
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: linux-ec2

Revision history for this message
Matt Wilson (msw-amazon) wrote :

This is a transcription of the test program from the YouTube video:

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/wait.h>

int main(int argc, char **argv) {
  int children = 0;
  int status;
  int i = 0;

  if (argc < 2) {
    printf("Usage: %s <children>\n", argv[0]);
    return 1;
  }

  children = atoi(argv[1]);
  for (i = 0; i < children; i++) {
    printf("Starting child %d\n", i);
    if (fork() == 0) {
      fprintf(stderr, "child%d up %d\n", i, getpid());
      while (1) {
        /* Nothing, just spin */
      }
    } else {
      /* parent continues */
    }
  }

  wait(&status);
  return 0;
}

visibility: public → private
Revision history for this message
Matt Wilson (msw-amazon) wrote :

On a system in this condition, sometimes hung task traces are seen:

kernel: [65098.694112] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: [65098.694117] cron D ffff880001885380 0 21248 569 0x00000000
kernel: [65098.694121] ffff880772e25d20 0000000000000282 0000000000000001 ffff880772e25ca0
kernel: [65098.694124] 0000000000000002 ffff880772e25ce8 ffff8802e100c678 ffff880772e25fd8
kernel: [65098.694126] ffff8802e100c2c0 ffff8802e100c2c0 ffff8802e100c2c0 ffff880772e25fd8
kernel: [65098.694128] Call Trace:
kernel: [65098.694137] [<ffffffff810e312a>] ? cache_alloc_refill+0x6a/0x270
kernel: [65098.694141] [<ffffffff814a199d>] schedule_timeout+0x1dd/0x2c0
kernel: [65098.694145] [<ffffffff814a0b02>] wait_for_common+0xf2/0x1d0
kernel: [65098.694150] [<ffffffff81030260>] ? default_wake_function+0x0/0x10
kernel: [65098.694152] [<ffffffff814a0c78>] wait_for_completion+0x18/0x20
kernel: [65098.694156] [<ffffffff8103ba19>] do_fork+0x149/0x430
kernel: [65098.694160] [<ffffffff81109f04>] ? mntput_no_expire+0x24/0x110
kernel: [65098.694164] [<ffffffff810534e0>] ? set_one_prio+0x70/0xd0
kernel: [65098.694168] [<ffffffff8100fe60>] sys_vfork+0x20/0x30
kernel: [65098.694171] [<ffffffff81009f73>] stub_vfork+0x13/0x20
kernel: [65098.694174] [<ffffffff81009ba8>] ? system_call_fastpath+0x16/0x1b
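
For context, the trace above shows cron blocked in sys_vfork -> do_fork -> wait_for_completion: vfork() by design suspends the parent until the child execs or exits, so if the child never gets CPU time the parent sits in that completion wait indefinitely. A minimal illustration of normal vfork() semantics (my own sketch, not code from the report):

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
  pid_t pid = vfork();             /* parent is suspended until the child execs or exits */
  if (pid == 0) {
    execlp("true", "true", (char *)NULL);
    _exit(127);                    /* exec failed; exit without flushing stdio */
  }
  /* only reached once the child has released the parent */
  printf("parent resumed after child %d\n", (int)pid);
  waitpid(pid, NULL, 0);
  return 0;
}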

Revision history for this message
Matt Wilson (msw-amazon) wrote :

It seems that this reproduction case only happens after the system has been used for some unknown amount of time. At that point, fork() hangs can be triggered at will. If the instance is rebooted, the test case no longer causes hangs.

visibility: private → public
Revision history for this message
Matt Wilson (msw-amazon) wrote :

Attaching /proc/slabinfo from a system that can be used to cause fork() hangs.

Revision history for this message
Mikhail P (m-mihasya) wrote :

Quick note: it is not just Lucid; seeing similar behavior in Karmic. In fact, Matt, the node we looked at today was a Karmic node.

Revision history for this message
Mikhail P (m-mihasya) wrote :

Argh never mind, had my alphabet backwards. It was a 10.04 LTS aka Lucid node.

Revision history for this message
Matt Wilson (msw-amazon) wrote :

The following kernel stack was captured on a system in "fork() hangs" state via "echo t > /proc/sysrq-trigger". The code for libctest is here: https://gist.github.com/2d2b78987ea451c2edd6

<6>[853486.204130] libctest R running task 0 13658 1417 0x00000000
<4>[853486.204132] ffffffff00000000 0000000000000000 ffff88088b03a980 ffff8800e3489ea8
<4>[853486.204134] ffffffff810909f7 ffff88088b03aa48 ffff88088b03a980 ffff8800e3489ec8
<4>[853486.204137] ffffffff81109f04 ffff880889a27c80 0000000000008002 ffff8804d4470680
<4>[853486.204139] Call Trace:
<4>[853486.204141] [<ffffffff810909f7>] ? __call_rcu+0x77/0x1c0
<4>[853486.204143] [<ffffffff81109f04>] ? mntput_no_expire+0x24/0x110
<4>[853486.204145] [<ffffffff811078ab>] ? alloc_fd+0x4b/0x160
<4>[853486.204148] [<ffffffff810f79f0>] ? putname+0x30/0x50
<4>[853486.204150] [<ffffffff810eb526>] ? do_sys_open+0x106/0x160
<4>[853486.204152] [<ffffffff810eb5ab>] ? sys_open+0x1b/0x20
<4>[853486.204154] [<ffffffff81009ba8>] ? system_call_fastpath+0x16/0x1b
<4>[853486.204157] [<ffffffff81009b40>] ? system_call+0x0/0x52
<4>[853486.204159] [<ffffffff81009b40>] ? system_call+0x0/0x52

Revision history for this message
Gavin (gavin-mcquillan) wrote :

I work at a company that has had very similar issues.

One theory we suspect (because this strikes us even inside the JVM, where we aren't forking) is that it's a problem between the hypervisor and the hardware.

What kind of hardware were you running when you found this problem? We found that there are at least three varieties of m1.xlarge CPU architectures (cat /proc/cpuinfo). The easiest way to distinguish them is by checking the CPU cache size. We've seen 1M, 4M, and 6M caches. The problem you're detailing has been consistently reproduced for us on the 1M cache variety of instances.

Is this true for you, too?

Revision history for this message
Gavin (gavin-mcquillan) wrote :

Correction, we see the problem with the 4M cache CPUs.

Revision history for this message
Gavin (gavin-mcquillan) wrote :

Bad architecture:

vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU E5507 @ 2.27GHz
stepping : 5
cpu MHz : 2266.746
cache size : 4096 KB

Good architecture:

vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5430 @ 2.66GHz
stepping : 10
cpu MHz : 2660.000
cache size : 6144 KB

Revision history for this message
Matt Wilson (msw-amazon) wrote :

Gavin,

Can you reproduce the issue at will? I'm struggling to find a way to reproduce the issue on a freshly booted instance.

Revision history for this message
Mike Malone (mmalone) wrote :

The node we were working on this morning was:

vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5550 @ 2.67GHz
stepping : 5
cpu MHz : 2666.760
cache size : 8192 KB

Revision history for this message
Matt Wilson (msw-amazon) wrote :

Mike, can you click on the "affects me" for this bug?

Revision history for this message
Stefan Bader (smb) wrote :

Can someone confirm whether this is an issue with a recent version of the kernel? I unsuccessfully tried the sample program with 2.6.32-311-ec2.

Revision history for this message
Scott Moser (smoser) wrote :

ami-fd4aa494 is one of our official images: us-east-1 ubuntu-lucid-10.04-amd64-server-20100427.1

That said, it's very old. That build (unless changed) is running kernel linux-image-2.6.32-305-ec2 2.6.32-305.9.
The most recent Lucid amd64 image is:
  ami-f8f40591 ubuntu-lucid-10.04-amd64-server-20101228
running kernel linux-image-2.6.32-311-ec2 2.6.32-311.23.

You can see the extensive list of changes between the two kernel versions at
http://changelogs.ubuntu.com/changelogs/pool/main/l/linux-ec2/linux-ec2_2.6.32-311.23/changelog

You should be able to try the new kernel by:
a.) specifying it at launch time:
    ec2-run-instances --kernel aki-74f4051d ami-fd4aa494 ...
   Then apt-get update && apt-get upgrade to get the kernel modules matching the running kernel.
   This should work, as the Lucid kernel has networking and disk built in.
b.) trying the newer AMI 'ami-f8f40591'
c.) installing grub-legacy-ec2, stopping the instance, and starting it with the pv-grub AKI aki-427d952b. From then on, you can reboot into a different kernel.

Determining whether we can reproduce this on a recent kernel will give us much more information to go on, and will keep us from chasing ghosts.

Revision history for this message
Mike Malone (mmalone) wrote :

We are running ami-fd4aa494 with 2.6.32-305-ec2 in us-east. I'll see what I can do about setting up a couple nodes with the more recent 2.6.32 kernel build and report back.

We've already started running a few Maverick instances with 2.6.35-24-virtual, and so far they appear to be more stable. Unfortunately, the issue is not easy to reproduce initially (as evidenced by the efforts in this thread). Recently restarted instances appear to be more stable than those that have been running for a while under load. It looks like some heisenbug gets the system into a sideways state, and once that happens you can lock things up pretty deterministically with something as trivial as a tight loop. So it's possible that Maverick will go sideways too at some point and we simply haven't seen it yet. Hard to say for sure without knowing what the trigger is.

Revision history for this message
Paul Lathrop (paul-simplegeo) wrote :

We've booted up a couple instances using the new kernel and are attempting to break them. As Mike pointed out, it is hard for us to prove a negative; this is something that we can reliably reproduce once we get the machine to do it once, but before that it is inconsistent.

Revision history for this message
Mike Malone (mmalone) wrote :

We're still working on a way to repro this on a new instance. Meanwhile, we're moving forward with testing on the newer 2.6.32 kernel and on Maverick's 2.6.35 kernel.

One observation we have made is that if we run the libctest in a loop (`while :; do ./libctest; done`) on 2.6.32 it will eventually hang the process (apparently forever, I've left it as long as 4 hours). This behavior is reproducible on a fresh instance and happens on both 2.6.32-305 and 2.6.32-311. We are NOT able to repro on 2.6.35. Could be unrelated, or expected behavior, but from my review of the libctest code what we're seeing does appear pathological.

Revision history for this message
Mike Malone (mmalone) wrote :

I am able to reproduce this behavior on an instance running kernel version 2.3.32-311 using ami-f8f40591. As with the older kernel, new instances don't immediately exhibit symptoms.

Revision history for this message
Mike Malone (mmalone) wrote :

Fat fingered that kernel version. Should be 2.6.32-311-ec2.

Revision history for this message
Mike Malone (mmalone) wrote :

Also, potentially related, the 2.6.32 kernels seem to sometimes drop timer signals for CLOCK_PROCESS_CPUTIME_ID and continue to report unusual process CPU times in system monitoring tools like top and via the proc filesystem. I can reproduce the signal behavior with this program: https://gist.github.com/14585329e013d1bf5134 -- some runs are uneventful and proceed as expected, others hang forever without ever receiving a signal.
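
For anyone who wants to try something similar, here is a rough stand-alone equivalent of what is described (my own sketch, not the code in the gist): it arms a CLOCK_PROCESS_CPUTIME_ID timer for one second of CPU time, spins, and reports whether the signal ever arrives. Build with gcc and -lrt.

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

static volatile sig_atomic_t fired = 0;

static void on_alarm(int sig) { (void)sig; fired = 1; }

int main(void) {
  struct sigaction sa;
  struct sigevent sev;
  struct itimerspec its;
  timer_t timer;

  memset(&sa, 0, sizeof(sa));
  sa.sa_handler = on_alarm;
  sigemptyset(&sa.sa_mask);
  sigaction(SIGALRM, &sa, NULL);

  /* deliver SIGALRM after this process has consumed 1 second of CPU time */
  memset(&sev, 0, sizeof(sev));
  sev.sigev_notify = SIGEV_SIGNAL;
  sev.sigev_signo = SIGALRM;
  if (timer_create(CLOCK_PROCESS_CPUTIME_ID, &sev, &timer) != 0) {
    perror("timer_create");
    return 1;
  }

  memset(&its, 0, sizeof(its));
  its.it_value.tv_sec = 1;
  timer_settime(timer, 0, &its, NULL);

  while (!fired) {
    /* burn CPU; on an affected kernel the signal reportedly sometimes never arrives */
  }
  printf("timer fired after ~1s of process CPU time\n");
  return 0;
}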

Revision history for this message
Matt Wilson (msw-amazon) wrote :

Hi Mike,

Let's focus on the fork() hangs in this bug. It's true that the two could be related, but the symptoms don't quite line up.

You say you can reproduce the behavior on 2.3.32-311. Do you have a procedure for getting an instance into the broken state, so you can then cause fork() hangs with spinning CPUs?

Also, is irqbalance running on your instances? If not, does the fork() hanging behavior change if irqbalance is running on the instance (started at boot)?

Revision history for this message
Mike Malone (mmalone) wrote :

Matt,

Still no way to reproduce deterministically. I've just been running variations of the test I posted above, writing to /dev/null and setting timer signals. At some point the tests/system start hanging.

irqbalance is not running on any of the test instances that are hanging (it appears to be started automatically on the Maverick AMI, but that node is happy). I believe we've run it in the past on some nodes. I'll check whether there are any that are hanging that also have irqbalance started at boot. I know that I personally started irqbalance on some nodes after boot to see if that would fix things (to no avail), but it wasn't started at boot.

If we don't have any running nodes with irqbalance I'll have to reboot a node and start over, which may take some time.

Revision history for this message
Matt Wilson (msw-amazon) wrote :

If you have an instance in a state where fork() will hang if you spin a CPU, it would be a good experiment to see if irqbalance helps at all.

Revision history for this message
Alec Thomas (alec-swapoff) wrote :

I am running Eclipse on EC2 over NX and this bug surfaces quite quickly, in the form of total system hangs for .5 to 5 minutes at a time.

Revision history for this message
Matt Wilson (msw-amazon) wrote :

Alec,

Do any hung task kernel stack traces get emitted during your hangs?

Revision history for this message
Mike Malone (mmalone) wrote :

Starting irqbalance (without rebooting) on a node in a state where fork() will hang does not help.

Revision history for this message
Jordan Sissel (jls-semicomplete) wrote :

I see this in a particularly annoying way.

When apt-get is run from another tool (like puppet) it gets /dev/null for stdin. apt-get foolishly select(2)s on stdin, which results in 100% CPU usage (stdin is always ready when it is /dev/null). That 100% CPU is enough to trigger the fork blocking, and since apt-get forks dpkg (which again forks dpkg) during installation, dpkg never actually forks, so apt-get stalls the entire system for hours, if it ever comes back at all.
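
The select(2)-on-/dev/null behaviour is easy to demonstrate in isolation. A small stand-alone sketch of my own (not apt-get's actual code) showing how a "wait for input" loop degenerates into a 100% CPU spin when stdin is /dev/null:

#include <fcntl.h>
#include <stdio.h>
#include <sys/select.h>

int main(void) {
  int fd = open("/dev/null", O_RDONLY);   /* stand-in for the redirected stdin */
  unsigned long wakeups = 0;
  fd_set rfds;

  for (;;) {
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    /* /dev/null is always readable, so this never blocks... */
    if (select(fd + 1, &rfds, NULL, NULL, NULL) > 0)
      wakeups++;
    /* ...and the loop burns a full CPU */
    if (wakeups % 50000000UL == 0)
      fprintf(stderr, "still spinning: %lu wakeups\n", wakeups);
  }
}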

I found a workaround for *only* the apt-get-like situations: running the process through nice(1) seems to allow fork to unwedge. This is obviously impractical for low-latency services like databases, etc., but it can help for others.

Revision history for this message
Matt Wilson (msw-amazon) wrote :

Jordan,

Do you see this behavior at boot, or only after your instance has been up and running for a while?

Revision history for this message
Alec Thomas (alec-swapoff) wrote :

@Matt,

No I don't see anything particularly unusual in the logs at all :\

Also, as suggested by Gavin, I rebooted my EC2 instance until I got a machine with CPU model 23 and have not had a repeat of the issue.

Revision history for this message
Matt Wilson (msw-amazon) wrote :

Alec,

Do you have instance IDs from your hanging instances?

Revision history for this message
Alec Thomas (alec-swapoff) wrote :

The first one I rebuilt, as I thought it might have been related to a dud EBS volume, but the current one is i-55a92d39. Mail me at <email address hidden> if you want to take this offline.

Revision history for this message
Matt Wilson (msw-amazon) wrote :

If anyone has a machine that they can get into the hanging state (with fork() blocking), can you run "echo w > /proc/sysrq-trigger" as root and post the results?

Revision history for this message
Mike Malone (mmalone) wrote :

Hey Matt,

I ran it once while the system was idle: https://gist.github.com/fb35566354afc442bf2d

And then again in a tight loop with a 50ms sleep while I locked the system up, in the hopes of catching something interesting just before or during the period when the system was locked: https://gist.github.com/ce0887dcdc125afcca94

Revision history for this message
Matt Wilson (msw-amazon) wrote :

I've done a lot of looking at this today. It feels like the problem may lie in the process scheduler. When I pin the CPU burning process to CPU0 (through "taskset -pc 0 $pid_printed_by_a_out"), and pin a bash shell also to CPU0, I see failure of the bash process to wake after sleeping (i.e., it's runnable, but CFS isn't giving it time). I've seen the bash process start to be scheduled after around 3 minutes, and I've also seen it just sit there.

Every time I've seen a scheduler debug trace (triggered via "echo w > /proc/sysrq-trigger"), there have been other runnable processes on the spinning CPU that don't seem to be getting scheduled at all.

I've not been able to reproduce this problem on the kernel used in the Amazon Linux AMI (currently 2.6.34.7). This is in line with other users' observations (http://twitter.com/#!/synack/status/30415380321140737).

I think that Canonical might need to look into what (if any) changes they've made to CFS in the 10.04 kernel tree. It's also possible that improvements have been made in CFS between 2.6.32 and 2.6.34 that account for better performance.

Revision history for this message
Mike Malone (mmalone) wrote :

Matt,

That's definitely in line with what we're seeing. Not to beat a dead horse, and I am not a kernel hacker (so I apologize for any naivety), but that's basically what led me to believe that the CLOCK_PROCESS_CPUTIME_ID timers, the top issues, and the other CPU time accounting oddities may be relevant here.

Based on my rather high level understanding of CFS -- the big idea being that processes are scheduled based on their run time -- it seems plausible that a process that appears to be using no CPU time would continue to be prioritized by the scheduler. I'll leave any further analysis to someone who knows how these systems work.
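
If that theory holds, the effect is easy to model. The toy simulation below (entirely my own sketch, not kernel code) picks the task with the smallest accumulated runtime on every tick, the way a vruntime-style scheduler would; a task whose CPU time is never charged ends up monopolizing the CPU while the other runnable tasks starve:

#include <stdio.h>

#define NTASKS 3
#define TICKS  1000

int main(void) {
  double vruntime[NTASKS] = { 0.0, 0.0, 0.0 };
  int charged[NTASKS]     = { 0, 1, 1 };   /* task 0: CPU time never accounted */
  int picks[NTASKS]       = { 0, 0, 0 };
  int tick, i, next;

  for (tick = 0; tick < TICKS; tick++) {
    next = 0;
    for (i = 1; i < NTASKS; i++)           /* pick the task with the smallest vruntime */
      if (vruntime[i] < vruntime[next])
        next = i;
    picks[next]++;
    if (charged[next])
      vruntime[next] += 1.0;               /* normal accounting */
    /* else: its vruntime never advances, so it wins every tick */
  }
  for (i = 0; i < NTASKS; i++)
    printf("task %d scheduled %d times out of %d\n", i, picks[i], TICKS);
  return 0;
}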

We have done a bit more testing on the 2.6.35 series kernels, with Ubuntu 10.10 (Maverick). So far we have not been able to reproduce this issue there. If we do I'll report back, but for now we are moving forward with migrating affected systems to Maverick. We're happy to assist with efforts to track this further, but I think we've documented most of what we've experienced.

Revision history for this message
Matt Wilson (msw-amazon) wrote :

Mike,

You bring up a good point about CFS' need for good process time accounting. I think that this upstream patch may fix a lot of problems: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=8a22b9996b001c88f2bfb54c6de6a05fc39e177a

This patch is in 2.6.34.7, but may not be in the 10.04 kernel.

Revision history for this message
Andy Whitcroft (apw) wrote :

The commit mentioned in comment #39 is included in the Ubuntu-2.6.32-311.23 EC2 kernel.

Revision history for this message
Stefan Bader (smb) wrote :

I tried the approach from comment #30 and also comment #37 with ami-fa01f193 in us-east. This runs a 2.6.32-312.24 kernel currently in -proposed and was running on hardware that showed 6M cache in cpuinfo. Neither method triggered the scheduler bug for me.

I will have a look at the source code and see whether something can be found there. This unfortunately is a bit bulky.

Revision history for this message
Stefan Bader (smb) wrote :

While there is not yet a single patch that I would suspect of fixing this issue, there were a large number of changes done upstream that missed the Lucid ec2 kernel because it uses Xen-specific copies of x86 files. Some of them will have no effect, as compile options cause that code not to be included, but some were in areas which may have an effect on this issue.

I have prepared some test kernel packages (and also some test AKIs) which contain the sync with all current stable updates. It would be awesome if someone could check that kernel and see whether it removes the problem (and at least does not introduce new ones). I booted an i386 and an amd64 instance with it and saw no immediate problems.

The Debian packages (for pv-grub) are at http://people.canonical.com/~smb/lp708920/; the AKIs are:

us-west-1 aki-4faefe0a x86_64
us-east-1 aki-9ab546f3 x86_64
eu-west-1 aki-bdf2c6c9 x86_64
ap-southeast-1 aki-8689f7d4 x86_64

us-west-1 aki-49aefe0c i386
us-east-1 aki-6cb44705 i386
eu-west-1 aki-a7f2c6d3 i386
ap-southeast-1 aki-8c89f7de i386

Revision history for this message
Jeremy Deininger (jeremydei) wrote :

Hi guys,

I've been following this bug very closely, and I just had it happen to me on an instance running Maverick. The kernel version was 2.6.35-22-virtual; I realize a newer -25- is out and will see if I can reproduce with that as well. This was a classic example of the bug: it happened during a large apt-get install sequence and stopped allowing new processes to fork. Has anyone else seen this on Maverick (10.10)?

Revision history for this message
Alec Thomas (alec-swapoff) wrote :

We are now seeing this on a whole bunch of production machines all running 10.04, 2.6.32-312-ec2. Has anyone confirmed a kernel (old or new) that doesn't have this bug?

We've just upgraded one machine to 10.10 to see if it fixes the issue.

We haven't tried Stefan's kernels, but may if 10.10 doesn't show improvement.

Revision history for this message
Andrew Frolov (andrunah) wrote :

Same as Alec - we were having this issue on our production servers with the 2.6.32-309-ec2 kernel.
After upgrading to Maverick (2.6.35-24-virtual) the issue is gone; so far (5 days) we haven't hit this bug.
I would also like to know about a kernel version that no longer has this issue.

Revision history for this message
Stefan Bader (smb) wrote :

Well, maybe the kernel I posted. Kind of hard to say without anybody testing it. ;-) The problem is that the Lucid and Maverick kernels for EC2 take very different approaches. Maverick already has enough Xen code in the mainline kernel that only the configuration differs from the normal kernel. Lucid required a substantial number of patches on top of what was upstream, and the approach used there risks a skew between changes to the generic kernel code and the Xen code. In the test kernels at least that skew should be fixed, and maybe it was causing this issue.

Revision history for this message
Jeremy Deininger (jeremydei) wrote :

Stefan,
I have some boxes that seem to reproduce this behaviour rather frequently. This is great news since this bug is so hard to reproduce.

I just launched with the new testing kernel you provided (aki-9ab546f3, x86_64) on a server with a 4 MB (4096 KB) cache, the badly behaving size. Unfortunately I'm already seeing signs that this kernel has the CPU reporting bug where CPU times are in the thousands of days; usually this means the box will be crashing in the next 1-2 days. I will post back when I can confirm whether the fork bug still exists in these kernels. The CPU time bug definitely still exists (it was speculated that the two were unrelated): this machine was just launched and within minutes it displayed wrong CPU times.

process list snip from top:

26614 root 20 0 570m 448m 4972 S 0 2.9 0:39.41 ruby
26671 root 20 0 539m 419m 4948 S 0 2.7 0:35.79 ruby
28441 root 20 0 528m 409m 4212 S 0 2.7 24207187w ruby
26310 root 20 0 523m 405m 5040 S 3 2.6 0:51.46 ruby
29320 root 20 0 523m 405m 4948 S 0 2.6 91970855d ruby

Revision history for this message
Stefan Bader (smb) wrote :

Hi Jeremy, I have not heard back, so I am wondering: did the hang/crashing occur after all?

Revision history for this message
Jeremy Deininger (jeremydei) wrote :

Stefan,
The test machine did not exhibit the fork/clone behavior from this bug. The wrong CPU times were pretty bad, though, and are perhaps unrelated to this bug. The experiment lasted 3 days on a loaded large instance with the 4 MB cache CPU. That *should* have been long enough, but with this bug it's tough to tell.

Revision history for this message
Jordan Sissel (jls-semicomplete) wrote :

This ticket has been idle for a month; anyone have a resolution yet? :(

Revision history for this message
Stefan Bader (smb) wrote :

The main problem here is the question of whether there still is a problem. A first round of cleanup patches went into the ec2 kernels with 2.6.32-314.27. The last update I got was from Jeremy, and he at least did not seem to see the blocking behavior with those. So as long as there is no confirmation that this particular problem is still observed with kernels after 314.27, I would tend to close this ticket as resolved.

Revision history for this message
Mike Malone (mmalone) wrote :

Since no one developed a reproducible test case it is, unfortunately, difficult to say whether this bug is resolved. We moved to a 2.6.35 series kernel and stopped seeing this particular problem (although we have seen other problems that are eerily similar). All of the information to reproduce the behavior is in the comments above, but it does take a lot of time and patience to trigger it. Either that, or hundreds of instances running with production load.

Revision history for this message
Stefan Bader (smb) wrote :

This is a general request for testing help for Ubuntu 10.04 (Lucid) on EC2. If there are other problem reports about strange process CPU times, or about hangs without any message, it would be good to pass this on.
Trying to pull in a wide range of upstream fixes, I have prepared linux-ec2 packages that include all the fixes to the Xen code that could be found elsewhere. This is quite a big change, so while I have successfully booted the test kernels on various EC2 instance types, that does not really provide much insight into what happens under more or less real workloads. So anybody who is willing and able to help with testing would be highly welcome.

The kernel packages are at: http://people.canonical.com/~smb/lucid-ec2-ng/
To use them: http://uec-images.ubuntu.com/lucid/current/ has current AMIs, and using --kernel with a matching AKI from https://lists.ubuntu.com/archives/ubuntu-cloud/2010-December/000466.html will boot via pv-grub. The new kernel can then simply be installed via dpkg and activated by rebooting.

Please report back anything that seems to be worse or better than before. Thanks!

Revision history for this message
Robert Newson (robert-newson) wrote :

Stefan,

I've been testing your kernels and stability seems much improved. It's only been a few days, but the server has been under quite a lot of load without an issue; with the stock ec2 kernel this was definitely not the case.

Cautious optimism, but I'd love to isolate which of the patches was the fix. Do you have a list of the patches you incorporated?

Revision history for this message
Stefan Bader (smb) wrote :

Robert, that really sounds like good news. Answering what exactly improved things could be a bit difficult. It also depends on which stock ec2 kernel you are comparing against. I had done a first round of changes which Jeremy has been testing (from what he wrote I assumed the fork/clone issue was gone, but some process time accounting was still going wrong).
With this second, bigger round of changes there were changes to the interrupt handling and also some timer and accounting changes which could have an effect, but also quite a bit more. If you want to look at the changes yourself, I try to keep the delta up to date, as long as it is not in the stock kernel, at "git://kernel.ubuntu.com/smb/ubuntu-lucid.git ec2-next". The reason it is quite big is a bit of a long story. In short, the ec2 code is based on a patch set that gets rebased, and we don't know exactly against what, so individual patches can change without a proper explanation why. I tried to document things as much as I was able, but it is still not always nice.

Revision history for this message
Robert Newson (robert-newson) wrote :

Hi Stefan,

Just an update. That same node has run flawlessly under high load since my last post, no lockups. What are your thoughts on the claim that this bug https://bugs.launchpad.net/ubuntu/+source/linux/+bug/727459 is the cause of the problem?

B.

Revision history for this message
Stefan Bader (smb) wrote :

Hi Robert,

Thanks for the update, and it is really good to hear. The bug you mentioned might be one of the possible reasons. There have also been changes to make the timer and IPI interrupts per-cpu and to switch the rest to fasteoi instead of level handling (moving around masking code as well). Having IRQs use the level-based handler was claimed to potentially cause missed interrupts, which may or may not have an impact in this case. The other changes, to process time accounting, probably only affect the runtime output which was showing impossible values (larger than uptime), but who knows. So yeah, likely one of the above. :)

Revision history for this message
Chetan Sarva (csarva) wrote :

Robert,

Can you confirm what CPU is running on that node? The Librato guys have apparently narrowed it down to only happening on the Intel E5507. See this blog post for more:

https://silverline.librato.com/blog/main/EC2_Users_Should_be_Cautious_When_Booting_Ubuntu_10_04_AMIs

I checked one of our problem nodes that's still up and, sure enough, it's running on an E5507. We have another node using the same CPU but running the test kernel Stefan posted above. So far it looks OK, but we haven't put too much load on it yet and it's only been about 4 days.

Revision history for this message
Stefan Bader (smb) wrote :

Which would be exactly the bug Robert was pointing to. Hrm, it sounds more and more sensible to pick that fix up sooner rather than later, as the full changeset is harder to get approved.

Revision history for this message
Eduardo Ferreira (eduardo-ferreira) wrote :

Hey, Stefan,

We're having instances hanging throughout the day (https://forums.aws.amazon.com/thread.jspa?threadID=68692) and it looks like it may be related to this.

Looking at your comment (#53) I wanted to apply the changes you mentioned, but I need some help on how to install them. I'm familiar with apt-get install, etc, but I'm not sure what I need to do to use the packages you prepared.

Would you let me know how I can do that?

Thank you.

Revision history for this message
Eduardo Ferreira (eduardo-ferreira) wrote :

Never mind, I found out how to do it: launch new instances using the AMI (ami-40db2229) with the matching AKI (aki-427d952b).
It's all running now... hopeful that it won't hang anymore.
Thank you!

Revision history for this message
Scott Moser (smoser) wrote :

Eduardo,
  Just fyi, any recent (this year) lucid image will already use the pv-grub kernels by default.

Revision history for this message
Jan Gerle (jan-u) wrote :

Hi Stefan,

we experienced the kernel lockups under heavy load on Ubuntu 10.04 running the standard 2.6.32-314-ec2 kernel from ami-3202f25b (us-east-1). The underlying hardware is an Intel(R) Xeon(R) CPU E5507 @ 2.27GHz with a 4 MB cache.

I installed linux-image-2.6.32-317-ec2_2.6.32-317.31~preview2_amd64 from the .deb and now the lockups are gone.

Cheers,
Jan

Revision history for this message
Stefan Bader (smb) wrote :

Jan, hopefully the fix for this will be in the next official update. It is currently waiting to clear -proposed but was delayed by findings in some of the changes shared by all kernels.
Thanks for testing the preview kernel. It is built from the bigger set of changes which I would like to merge in the future, and positive feedback helps to argue for that change.

Revision history for this message
Eduardo Ferreira (eduardo-ferreira) wrote :

Ah, good to know (#62). Next time we'll use the new instances.
By the way, it's been running non-stop since last week.
Great job!

Revision history for this message
Don Spidell (493poc-admin-d18eca) wrote :

FWIW, I upgraded to 2.6.32-316-ec2 #31 on Ubuntu 10.04 running on an E5507 and I have not run into this issue on any of the boxes I'm running since the upgrade.

Revision history for this message
Stefan Bader (smb) wrote :

Based on the latest comments (unfortunately this bug was not referenced when committing the change), I am closing this bug as fixed. If anybody thinks it is still a problem, feel free to re-open it (or file a new report). Thanks.

Changed in linux-ec2 (Ubuntu):
status: New → Fix Released
Revision history for this message
kng ops (paolo-negri) wrote :

I want to report that the CPU time bug is still present in the following kernel

2.6.32-317-ec2 on AWS instance m1.large

with CPU

model name : Intel(R) Xeon(R) CPU E5507 @ 2.27GHz

I logged the output of ps every minute and can now show the bug happening. As you can see, the CPU time, originally at 35790:43, gets corrupted: the next minute it jumps to 17179869:11 and reaches 29322904:44 one minute later.

www-data 8269 31.3 9.2 853516 725496 ? Sl Aug23 35789:36 /usr/local/lib/erlang/erts-5.8.4/bin/beam.smp -P ...
www-data 8269 31.3 9.2 866128 731640 ? Sl Aug23 35790:09 /usr/local/lib/erlang/erts-5.8.4/bin/beam.smp -P ...
www-data 8269 31.3 9.3 868704 738800 ? Sl Aug23 35790:43 /usr/local/lib/erlang/erts-5.8.4/bin/beam.smp -P ...
www-data 8269 31.3 9.4 872964 742660 ? Sl Aug23 35791:17 /usr/local/lib/erlang/erts-5.8.4/bin/beam.smp -P ...
www-data 8269 120093018 9.4 885060 748036 ? Sl Aug23 17179869:11 /usr/local/lib/erlang/erts-5.8.4/bin/beam.smp -P ...
www-data 8269 10652 9.3 868748 739144 ? Sl Aug23 29322904:44 /usr/local/lib/erlang/erts-5.8.4/bin/beam.smp -P ...

It took us around 3 months of running this process at around 30% load to reach this condition.

Because of this problem the machine currently reports a load of 0, when we know it should report between 20 and 40% depending on the time of day.

Please let me know if I need to open a specific ticket for this issue.

Today we're going to upgrade to 2.6.32-319-ec2

thanks,

Paolo Negri

Revision history for this message
Stefan Bader (smb) wrote :

Reading back through the previous comments, I think this should be handled as a separate issue. The original issue caused processes not to be forked at all, while as I understand it this one is more about load/system time accounting and is not as fatal as the main issue of this report. Could you please open a new ticket for this and let me know the bug number? Thanks.
