race condition causing lxc to not detect container init process exit

Bug #1168526 reported by Pavel Bennett
16
This bug affects 3 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Confirmed
Undecided
Unassigned
lxc (Ubuntu)
Fix Released
Medium
Unassigned

Bug Description

For the purpose of the repro, my lxc init process is node.js v0.11.0 (built from source) with a single line:

process.exit(0);

When running it in lxc, sometimes lxc doesn't exit. lxc-start remains a parent of a defunct node process without reaping it or exiting.

I've made a custom build of lxc 0.9.0 to extract more information about this, adding only an INFO line, as follows:

start.c:

        if (ret != sizeof(siginfo)) {
                ERROR("unexpected siginfo size");
                return -1;
        }
+ INFO("got signal %d from pid %d while expecting SIGCHLD(17) from pid %d | uid = %d, status = %d", siginfo.ssi_signo, siginfo.ssi_pid, *pid, siginfo.ssi_uid, siginfo.ssi_status);

        if (siginfo.ssi_signo != SIGCHLD) {
                kill(*pid, siginfo.ssi_signo);
                INFO("forwarded signal %d to pid %d", siginfo.ssi_signo, *pid);
                return 0;
        }

I've tried this with a 3 official kernels. There is one difference in output.

Kernels 3.7.9, 3.8.6:

Successful case:

      lxc-start 1365724008.446 NOTICE lxc_start - '/usr/local/bin/node' started with pid '19458'
      lxc-start 1365724008.446 INFO lxc_console - no console will be used
      lxc-start 1365724008.446 INFO lxc_start - got signal 17 from pid 18165 while expecting SIGCHLD(17) from pid 19458 | uid = 0, status = 1
      lxc-start 1365724008.446 WARN lxc_start - invalid pid for SIGCHLD
      lxc-start 1365724038.306 INFO lxc_start - got signal 17 from pid 19458 while expecting SIGCHLD(17) from pid 19458 | uid = 0, status = 0
      lxc-start 1365724038.306 DEBUG lxc_start - container init process exited

Hanging case:

      lxc-start 1365795195.358 NOTICE lxc_start - '/usr/local/bin/node' started with pid '8650'
      lxc-start 1365795195.358 INFO lxc_console - no console will be used
      lxc-start 1365795195.358 INFO lxc_start - got signal 17 from pid 8626 while expecting SIGCHLD(17) from pid 8650 | uid = 0, status = 1
      lxc-start 1365795195.358 WARN lxc_start - invalid pid for SIGCHLD
      lxc-start 1365795333.347 INFO lxc_start - got signal 2 from pid 0 while expecting SIGCHLD(17) from pid 8650 | uid = 0, status = 0
      lxc-start 1365795333.347 INFO lxc_start - forwarded signal 2 to pid 8650

Kernel 3.9.0-rc6:

Successful case is the same, but the hanging case changes to just:

      lxc-start 1365794343.870 NOTICE lxc_start - '/usr/local/bin/node' started with pid '3432'
      lxc-start 1365794343.870 INFO lxc_console - no console will be used
      lxc-start 1365794343.870 INFO lxc_start - got signal 17 from pid 2851 while expecting SIGCHLD(17) from pid 3432 | uid = 0, status = 1
      lxc-start 1365794343.870 WARN lxc_start - invalid pid for SIGCHLD

... without forwarding signal 2 (SIGINT).

Notes:
- I'm on Mint 14 Nadia with raring packages, if that helps.
- In all cases, there is signal 17 (SIGCHLD) coming in to lxc-start, but it comes from a different pid and is ignored by lxc. Any idea what this could be? This process seems to have been cleaned up and no longer appears in ps aux.
- The lxc-start process should be getting notified with a SIGCHLD from the child's pid when the child (init process) exits.
- This could be a kernel bug, but it's probably something unique that lxc is doing to trigger it.
- I've tried other init processes (node.js without the process.exit and a custom c++ app with a stdout write and exit 0), which greatly reduce the frequency of this happening.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote : Re: [Bug 1168526] [NEW] race condition causing lxc to not detect container init process exit

Quoting Pavel Bennett (<email address hidden>):
> Public bug reported:
>
> For the purpose of the repro, my lxc init process is node.js v0.11.0
> (built from source) with a single line:
>
> process.exit(0);

Thanks. This is a race condition we've seen before in lxc-execute.

Precisely which version of lxc were you using? The patch

0001-fix-race-with-fast-init

in ubuntu raring's recent lxc packages is meant to work around this,
but we don't have a clean fix yet.

> Notes:
> - I'm on Mint 14 Nadia with raring packages, if that helps.
> - In all cases, there is signal 17 (SIGCHLD) coming in to lxc-start, but it comes from a different pid and is ignored by lxc. Any idea what this could be? This process seems to have been cleaned up and no longer appears in ps aux.

That is the test for kernel support for reboot in a pidns. It's not
related to this.

Revision history for this message
Pavel Bennett (pavben) wrote :

> Precisely which version of lxc were you using?

I just put back version 0.9.0-0ubuntu2 (as opposed to the 0.9.0 I built from source) while on kernel 3.7.9-030709-generic and haven't yet run into this issue (I assume that's the patch you mentioned). However, when I update to kernel 3.8.6-030806-generic (leaving the same version of lxc), 12 out of 100 containers experienced what looks like this exact problem:

      lxc-start 1365805078.586 NOTICE lxc_start - '/usr/local/bin/node' started with pid '4107'
      lxc-start 1365805078.586 INFO lxc_console - no console will be used
      lxc-start 1365805078.586 WARN lxc_start - invalid pid for SIGCHLD
      lxc-start 1365805115.998 INFO lxc_start - forwarded signal 2 to pid 4107

Revision history for this message
Pavel Bennett (pavben) wrote :

I should add that these "forwarded signal 2" lines are due to me pressing Ctrl+C and are not actually relevant.

Have you been able to repro this bug on kernel 3.8.6?

I'm thinking how to fix this as lxc_spawn is what gets the pid which is needed by lxc_poll to listen for SIGCHLD from the correct pid, but lxc_poll should logically go before lxc_spawn to avoid this race.

How about starting lxc_poll first in a queueing mode, so it just accumulates the signals but doesn't process them yet? When "handler->pid = lxc_clone(do_start, handler, handler->clone_flags);" returns, notify lxc_poll of the pid, thus it should now have all the info it needs, switching it out of queueing mode and initiating a loop to process the queued events now that the pid is known.

Revision history for this message
Pavel Bennett (pavben) wrote :

Btw, that "queueing mode" would simply mean not calling epoll_wait until the pid is available. This shouldn't require managing a queue ourselves. Can you think of anything that this would break?

Or we could go with the patch you've written, although I haven't looked into why the problem appears to be back in 3.8.6 with the patch.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote : Re: [Bug 1168526] Re: race condition causing lxc to not detect container init process exit

Quoting Pavel Bennett (<email address hidden>):
> Btw, that "queueing mode" would simply mean not calling epoll_wait until
> the pid is available. This shouldn't require managing a queue ourselves.
> Can you think of anything that this would break?

No - I had wanted to do this originally, but there was some reason it
seemed not doable - I don't remember why. Do you have a patch try it?

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

The newest kernel I've tests on is 3.8.0-17-generic. I'll need to set up a system with the daily upstream build and re-test.

Changed in lxc (Ubuntu):
importance: Undecided → Medium
assignee: nobody → Serge Hallyn (serge-hallyn)
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

No, I cannot reproduce this with the latest upstream kernel build
(3.9.0-030900rc7-generic #201304171402)

What I did:

sudo lxc-create -t ubuntu- n r1
cat > exit0.c << EOF
#include <stdlib.h>
int main() {
exit(0);
}
EOF
make exit0
sudo cp exit0 /var/lib/lxc/r0/rootfs/bin/
sudo lxc-start -n r1 -- /bin/exit0

Which only shows

lxc-start: No such file or directory - stat(/proc/2620/root/dev/lxc/console)

but exits 0 correctly. If I do

sudo lxc-start -n r1 -l info -o outout -- /bin/exit0

then outout shows:

      lxc-start 1366317913.390 INFO lxc_start - init has already exited

Are you sure when you tested the newer kernel you weren't using a different (unpatched) lxc?

Changed in lxc (Ubuntu):
status: New → Incomplete
assignee: Serge Hallyn (serge-hallyn) → nobody
Revision history for this message
Pavel Bennett (pavben) wrote :

I've also tried it with a C++ app very similar to yours and was unable to repro. There is something about having node.js as the init process running a "process.exit(0);" js. The init process (node v0.11.0) does exit as "ps faux" shows it as a zombie and a child of lxc-start.

I went back to kernel 3.5.0-27 and lxc 0.8.0~rc1-4ubuntu39.12.10.2 for now as it seems to be the one without problems for our use case. The lxc I was using with 3.8.6 was 0.9.0-ubuntu2.

Changed in lxc (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Pavel Bennett (pavben) wrote :

Hey Serge, were you able to get a reliable repro for this? I have a reason to upgrade to Raring, and this seems to be the only blocker. We've reproduced the issue with the stock Linux Mint 15.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Quoting Pavel Bennett (<email address hidden>):
> Hey Serge, were you able to get a reliable repro for this? I have a

No, I wasn't (assuming you mean with our workaround in place in the
ubuntu package).

In comment #8 you said node.js was able to reproduce this. Regarding
that,

1. Does that reproduce the bug in the saucy package?

2. Could you please give me step by step reproductions for setting that
up (and include the distro release, kernel version and lxc package
version)?

> reason to upgrade to Raring, and this seems to be the only blocker.
> We've reproduced the issue with the stock Linux Mint 15.

thanks,
-serge

Revision history for this message
Pavel Bennett (pavben) wrote :

I can't try Saucy right now, but the repro instructions with kernel versions are in the original post and in #2.

We've tried node v0.11.2 as well on Raring and got the repro.

Repro summary:

Install any of the above kernels, such as the one with the Raring installer, then install lxc from apt.
Get node v0.11.0 - v0.11.2 or try any other recent version if that's easier, just not below v0.10.
lxc init process is "/usr/local/bin/node /boot/mio-init.js" which contains only "process.exit(0);" and try running 100 of these, whether in parallel or one at a time. Continue until lxc-start doesn't exit. Run "ps faux" and look for zombie children of lxc-start.

Let me know if that works for you :)

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Quoting Pavel Bennett (<email address hidden>):
> I can't try Saucy right now, but the repro instructions with kernel
> versions are in the original post and in #2.
>
> We've tried node v0.11.2 as well on Raring and got the repro.
>
> Repro summary:
>
> Install any of the above kernels, such as the one with the Raring installer, then install lxc from apt.
> Get node v0.11.0 - v0.11.2 or try any other recent version if that's easier, just not below v0.10.

Sorry, this is where I'm unclear. Can you please point me to exactly
where you grab the source (or which package you are using)?

> lxc init process is "/usr/local/bin/node /boot/mio-init.js" which contains only "process.exit(0);" and try running 100 of these, whether in parallel or one at a time. Continue until lxc-start doesn't exit. Run "ps faux" and look for zombie children of lxc-start.

thanks,
-serge

Revision history for this message
Pavel Bennett (pavben) wrote :

Sure, run these inside the container:

git clone https://github.com/joyent/node.git --depth 1

cd node
./configure
make -j9
sudo make install

Then the binary will be at /usr/local/bin/node

It's v0.11.3-pre, but should still repro.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Quoting Pavel Bennett (<email address hidden>):
> Sure, run these inside the container:
>
> git clone https://github.com/joyent/node.git --depth 1
>
> cd node
> ./configure
> make -j9
> sudo make install
>
> Then the binary will be at /usr/local/bin/node
>
> It's v0.11.3-pre, but should still repro.

Thanks.

I tried this both in a saucy kvm vm, and on a precise bare metal host,
but wasn't able to reproduce.

What's odd is that I can't even reproduce it with the daily ppa build,
which doesn't have the workaround which is in the ubuntu package.

Revision history for this message
Pavel Bennett (pavben) wrote :

I created a VM with Ubuntu Server 13.04 just for this bug. At first, I was able to run the steps outlined above 50 times with no issues. What was I missing? Concurrency! I rebooted the VM after adding 1 more core, and... bingo! Zombies on the 3rd try.

The VM disk image I have here should be compatible with KVM, but give it a try on your Saucy VM with 2 or more cores. If you can't repro it then, I'll send you my repro VM :)

Revision history for this message
Pavel Bennett (pavben) wrote :

Hey Serge, let me know if that repro worked for you or when you're planning to give it a try. I'm keeping the VM image around in case you need it.

> What's odd is that I can't even reproduce it with the daily ppa build,
> which doesn't have the workaround which is in the ubuntu package.

Did you try it on a single-core machine/VM? Looks like you need 2 or more.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Ok - thanks - I was able to reproduce this in a raring VM with 4 cores.

I thought this might be a tiny race window between us blocking signals and creating the signalfd, so I reversed those (a patch which I may yet send upstream) - but that didn't solve the issue.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

It does seem like this must be a kernel bug in epoll+signalfd (or a hard to spot misuse thereof in lxc).

When I instrument the signal_handler which is executed when epoll_wait returns a signalfd event, I do get a sigchld for the very first task which is spawned (a test to see if kernel supports pidns reboot), but do not get one for the container init - which becomes <defunct> and therefore clearly did in fact send a sigchld.

I'll send my tiny patch upstream, and test on other kernels for comparison.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

There is at least one fundamental bug in start.c's signal_handler, as should
be fixed by the below. However, this alone did not fix it for me, so more is wrong. There is a minimal testcase at http://people.canonical.com/~serge/signalfd.c which originally reproduced this bug, then was fixed by using the waitpid loop as below.

From 7333b77759794f5420ea6898494073a28cac445f Mon Sep 17 00:00:00 2001
From: Serge Hallyn <email address hidden>
Date: Wed, 3 Jul 2013 14:13:08 -0500
Subject: [PATCH 1/1] start.c:signal_handler: fix wrong assumption about
 sigchld

The monitor process adds a signalfd to the set of fds it watches
with epoll. It calls signal_handler() when a signal is found in
the sigfd. That returns 1 if the container init was found to
be the exiting process.

The flaw in reasoning (pointed out by Andy Whitcroft - thanks!)
here is that if two children have exited, we assume we will get
two sigchilds - in fact we may only get one. So when we get one,
we need to reap all the children we have and check if any is the
container init.

Signed-off-by: Serge Hallyn <email address hidden>
---
 src/lxc/start.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/src/lxc/start.c b/src/lxc/start.c
index 8c8af9c..3fa50ad 100644
--- a/src/lxc/start.c
+++ b/src/lxc/start.c
@@ -162,7 +162,7 @@ static int signal_handler(int fd, void *data,
       struct lxc_epoll_descr *descr)
 {
  struct signalfd_siginfo siginfo;
- int ret;
+ int ret, status, retval = 0;
  pid_t *pid = data;

  ret = read(fd, &siginfo, sizeof(siginfo));
@@ -188,16 +188,16 @@ static int signal_handler(int fd, void *data,
   return 0;
  }

- /* more robustness, protect ourself from a SIGCHLD sent
- * by a process different from the container init
+ /*
+ * wait for any and all children which are ours which are reapable,
+ * since if >1 children have exited, we'll only get one sigchld.
   */
- if (siginfo.ssi_pid != *pid) {
- WARN("invalid pid for SIGCHLD");
- return 0;
- }
+ while ((ret = waitpid(-1, &status, WNOHANG)) > 0)
+ if (ret == *pid)
+ retval = 1; // we reaped the container init

  DEBUG("container init process exited");
- return 1;
+ return retval;
 }

 int lxc_set_state(const char *name, struct lxc_handler *handler, lxc_state_t state)
--
1.8.1.2

Changed in lxc (Ubuntu):
status: Confirmed → Triaged
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Ok, for the record I cannot reproduce this no precise - I've tried with both an 8 core fast precise host, and a 4 core precise kvm guest - but using the same lxc version from the daily ubuntu-lxc ppa.

So it appears to be a kernel regression between 3.2 and 3.8.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

This bug is introduced between v3.7 and v3.8, by commit:

af4b8a83add95ef40716401395b44a1b579965f4 pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1

tags: added: bot-stop-nagging
Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1168526

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

The following kernel patches fixes it for me, will send to lkml:

diff --git a/debian.master/changelog b/debian.master/changelog
index f8f7a35a..081e666 100644
--- a/debian.master/changelog
+++ b/debian.master/changelog
@@ -1,3 +1,9 @@
+linux (3.11.0-4.9debug1) saucy; urgency=low
+
+ * debug 1
+
+ -- Serge Hallyn <email address hidden> Thu, 29 Aug 2013 13:34:43 +0000
+
 linux (3.11.0-4.9) saucy; urgency=low

   [ Tim Gardner ]
diff --git a/debian/rules b/debian/rules
index 2d3358b..f87f26c 100755
--- a/debian/rules
+++ b/debian/rules
@@ -13,6 +13,8 @@ DEBIAN=$(shell awk -F= '($$1 == "DEBIAN") { print $$2 }' <debian/debian.env)
 # with the kernel build.
 unexport CFLAGS
 unexport LDFLAGS
+export skipmodules=true
+export skipabi=true

 export LC_ALL=C
 export SHELL=/bin/bash -e
diff --git a/debian/scripts/module-check b/debian/scripts/module-check
index c754ea3..280b6e9 100755
--- a/debian/scripts/module-check
+++ b/debian/scripts/module-check
@@ -4,6 +4,7 @@ $flavour = shift;
 $prev_abidir = shift;
 $abidir = shift;
 $skipmodule = shift;
+$skipmodule = 1;

 print "II: Checking modules for $flavour...";

diff --git a/kernel/pid.c b/kernel/pid.c
index 66505c1..3cccab3 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -275,6 +275,10 @@ void free_pid(struct pid *pid)
   case 0:
    schedule_work(&ns->proc_work);
    break;
+ default:
+ if (ns->child_reaper->flags & PF_EXITING)
+ wake_up_process(ns->child_reaper);
+ break;
   }
  }
  spin_unlock_irqrestore(&pidmap_lock, flags);

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Separately, a patch has been committed to upstream lxc to eliminate any chance of a race stopping the lxc monitor from seeing the container init exit. Note that this doesn't stop the kernel bug from happening.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in lxc (Ubuntu):
status: Triaged → Fix Committed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package lxc - 1.0.0~alpha1-0ubuntu2

---------------
lxc (1.0.0~alpha1-0ubuntu2) saucy; urgency=low

  * Add allow-stderr to autopkgtst restrictions as the Ubuntu template
    uses policy-rc.d to disable some daemons and that causes a message to
    be printed on stderr when the service tries to start.
 -- Stephane Graber <email address hidden> Thu, 12 Sep 2013 13:57:17 -0400

Changed in lxc (Ubuntu):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.