Comment 11 for bug 717345

William Grant (wgrant) wrote :

Digging more into the original production incident, I have further insight and also compelling evidence that all 47 forking service processes killed well after the incident were hung the same way as tellurium's.

At 11:01, right before the file handle exhaustion began, 117 forking services were running. Assuming that bzr-sftp needs 3 handles to talk to each child plus 1 for the SSH connection, that should only be around 500 handles (117 x 4 = 468), well below the limit.
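As a rough check on that estimate, here is a minimal sketch of the arithmetic. The function name and parameters are made up for illustration, and the 3 + 1 handles per child figure is only the assumption stated above, not a measured value.

```python
def estimated_handles(children, per_child=3, per_ssh=1):
    """Estimate handles held by bzr-sftp for a number of forked children.

    per_child and per_ssh are the assumed figures from the comment above
    (3 handles to talk to each child, 1 for the SSH connection).
    """
    return children * (per_child + per_ssh)


# 117 children were running at 11:01.
print(estimated_handles(117))  # 468
```

Even if the per-child figure is off by a handle or two, the total grows only linearly with the number of children.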

At 11:23 there were 140 forking services. 64 children remained from before file handle exhaustion, the oldest from a minute after codehosting was started post-rollout. 48 children were uninitialised -- they all still showed the original command line in ps, not yet having had the branch path appended.

The master forking service was stopped a few seconds after those counts were taken, at which point it began a 300s timeout waiting for all its children to die. At 11:24, 136 remained, but by 11:25 the master had been kill -9'd and this was down to 47. Those same 47 remained for days until a LOSA killed them all, and all were uninitialised, so it looks like every initialised process died in a timely manner.

Even though some were more than an hour old, the 117 children alive at the time of the major incident should not have been enough to do any damage. The forked children don't seem to be *too* badly behaved, as we often have multi-hour processes with the old service too.