Batch jobs intermittently fail to leave "=" queue when complete

Bug #126204 reported by N7DR
Affects: at (Ubuntu)
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned

Bug Description

This is a bit complicated to describe, but I will do my best. I estimate that the bug manifests itself more than 0.1% of the time, but less than 0.5% of the time (so I suspect some kind of race condition). This is almost certainly an upstream bug.

So here is what I am doing:

1. I have a series of several thousand batch jobs to be run.
2. Once a minute, with a cron job, I check the number of lines of output from the "atq" command.
3. If the result of step #2 is 2 or greater, I do nothing.
4. If the result of step #2 is 1 or 0, I submit either 1 or 2 batch jobs, to bring the total up to 2. (This is so that I can insert additional jobs into the queue if I want to, without having several thousand other jobs in front of the new job).
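
A compressed sketch of steps 2-4 (the book-keeping that removes submitted lines from the pending list is omitted here; the full script actually used is given in a later comment):

import os

# Step 2: count the lines that "atq" prints (one per queued or running job).
n_jobs = len(os.popen('atq').readlines())

# Steps 3-4: do nothing if two or more jobs are already present; otherwise
# submit the next pending "batch < <some-file>" command(s) to bring the
# total back up to two ("jobs_to_run" is the pending-job list described above).
if n_jobs < 2:
    pending = open('jobs_to_run').readlines()
    for job_line in pending[:2 - n_jobs]:
        os.system(job_line)              # each line is a "batch < ..." command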

atd is running with the parameter "-l 2.2", on an SMP dual-core system.

Approximately once every couple of days, I discover that the CPU is not busy. Upon investigation I discover that the "atq" output states that two jobs are running, both on the "=" queue. However, the two jobs in question have completed and are no longer requesting OS resources. (The output has been created and "ps" confirms that the processes no longer exist on the system.)

The system remains in this state until I manually use "atrm" to remove the jobs, at which point the next time that the cron job executes two new jobs are (correctly) started.

So the problem is that "atq" doesn't seem to realise that the jobs have completed.

This is on 64-bit dapper with all updates applied.
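
The manual "atrm" clean-up can, in principle, be scripted as an interim watchdog. The sketch below is only an illustration: it assumes the atq output format shown in the excerpts further down, and uses 'model2' as a placeholder for the name of whatever program the batch jobs actually run:

#! /usr/bin/python
# Sketch of a watchdog for the stuck state: if atq still reports jobs on
# the "=" (running) queue but no worker process is alive, remove the
# stale entries with atrm.  'model2' is a placeholder program name.

import os

# Job ids that atq reports as currently running (queue letter "=").
running_ids = [line.split()[0] for line in os.popen('atq').readlines()
               if ' = ' in line]

# pgrep exits 0 when at least one matching process exists.
worker_alive = (os.system('pgrep -x model2 > /dev/null') == 0)

if running_ids and not worker_alive:
    os.system('atrm ' + ' '.join(running_ids))   # clear the stale entries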

Scott Kitterman (kitterman) wrote :

You are going to need to provide a usable test case to replicate this. Preferably a minimal test case, but otherwise the code that leads to this problem. Otherwise, there is little a developer can do.

N7DR (doc-evans) wrote : Re: [Bug 126204] Re: Batch jobs intermittently fail to leave "=" queue when complete

On 15/07/07, Scott Kitterman <email address hidden> wrote:
> You are going to need to provide a usable test case to replicate this.
> Preferably a minimal test case, but otherwise the code that leads to
> this problem. Otherwise, there is little a developer can do.

OK, here is the simplest procedure that I am sure will work. Probably several
items can be changed/removed and the problem will still show up, but
since I am not completely certain of that, I would rather give
instructions that I know will cause the problem to appear.

On an SMP 64-bit system:

1. Create a program that takes several minutes to execute. Call this
program X. It doesn't matter at all what this program does; we just
want something that will occupy the CPU for about five minutes (a
minimal sketch of such a program is given after step 4).

2. Create a file called jobs_to_run, of length at least several
hundred lines, where each line creates a batch job to run X. That is,
each line looks like this:
  batch < <some-file>

3. Create a cron job that, once per minute, executes the following
python script:

#! /usr/bin/python

import popen2

# Read the list of pending jobs (Q1); each line is a "batch < <some-file>"
# command.  popen2.popen3() returns (child_stdout, child_stdin, child_stderr).
r, w, e = popen2.popen3('cat jobs_to_run')
jobs_in_Q1 = r.readlines()              # the list of jobs
n_jobs_in_Q1 = len(jobs_in_Q1)
r.close()
w.close()
e.close()

if n_jobs_in_Q1 > 0:

    # find the number of jobs in Q2 (i.e., known to at)
    r, w, e = popen2.popen3('atq | wc -l')
    n_jobs_in_Q2 = int(r.readline().strip())
    r.close()
    w.close()
    e.close()

    if n_jobs_in_Q2 < 2:

        # add a job to Q2: submit the first pending command via batch
        command = jobs_in_Q1[0].rstrip('\n')
        r, w, e = popen2.popen3(command)
        r.close()
        w.close()
        e.close()

        # remove this job from Q1 by rewriting the file without it
        jobs_in_Q1 = jobs_in_Q1[1:]
        jobs_to_run = open("jobs_to_run", "w")
        for line in jobs_in_Q1:
            jobs_to_run.write(line)
        jobs_to_run.close()

4. Let it run.
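
As a concrete stand-in for program X in step 1 (any CPU-bound program of comparable duration should do; nothing below is specific to the bug):

#! /usr/bin/python
# Stand-in for "program X": busy-loop for roughly five minutes so that
# each batch job keeps one CPU occupied, then write a token output file.

import time

deadline = time.time() + 5 * 60        # run for about five minutes
x = 0
while time.time() < deadline:
    x = (x + 1) % 1000003              # meaningless arithmetic, just burns CPU

open('X.out', 'w').write('done %d\n' % x)

The jobs_to_run file from step 2 would then contain several hundred lines of the form "batch < <some-file>", where each <some-file> simply invokes X.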

Approximately every day or two, you will end up in the buggy state,
where "atq" says that you have two jobs running, but the jobs have in
fact completed. Once this happens, the system will stay in that state
until you notice and manually remove the jobs from the "=" job queue.

As far as I can see, the bug lies in whatever method is used to tell
the queue manager that jobs have completed -- occasionally, that
mechanism fails so that "atq" thinks that the jobs are still on the
"=" (i.e., the running) queue, even though the jobs have actually
completed processing.

It *may* be that one can cause the bug to appear more often by one of:
  a) making the jobs run for a shorter amount of time than the several
minutes that my jobs take
  b) queuing more than two jobs at a time
Since this is a work system currently running through approximately
10,000 jobs, I am not in a position to try either of these methods to
attempt to make the problem appear more frequently than once every day
or two.

I am attempting to instrument my machine a bit more, to produce a bit
more information about the bug, or at least some output that shows it
occurring.

N7DR (doc-evans) wrote :

OK, I can at least show an example of this bug happening now.

I instrumented the machine to print the following information once per minute:

the time
the output of "atq"
the start of the output of "top"

So here is an example of three minutes of normal operation (i.e., when
everything is working OK):

----

Wed Jul 18 07:33:01 MDT 2007

4232 Wed Jul 18 07:29:00 2007 = n7dr
4233 Wed Jul 18 07:30:00 2007 = n7dr

top - 07:33:01 up 14 days, 1:12, 2 users, load average: 2.37, 2.11, 2.04
Tasks: 243 total, 3 running, 238 sleeping, 0 stopped, 2 zombie
Cpu(s): 3.8% us, 1.2% sy, 80.0% ni, 14.1% id, 0.4% wa, 0.2% hi, 0.3% si
Mem: 4046572k total, 4014428k used, 32144k free, 564304k buffers
Swap: 1020116k total, 1005376k used, 14740k free, 548508k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27683 n7dr 39 14 28428 15m 1740 R 50 0.4 2:54.83 model2
27594 n7dr 39 14 32152 19m 1740 R 49 0.5 3:53.68 model2
27906 n7dr 15 0 10696 1316 880 R 1 0.0 0:00.01 top
    1 root 18 0 2640 504 468 S 0 0.0 0:04.61 init
    2 root RT 0 0 0 0 S 0 0.0 0:03.38 migration/0
    3 root 34 19 0 0 0 S 0 0.0 0:00.21 ksoftirqd/0
    4 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/0
    5 root RT 0 0 0 0 S 0 0.0 0:03.18 migration/1

Wed Jul 18 07:34:01 MDT 2007

4232 Wed Jul 18 07:29:00 2007 = n7dr
4234 Wed Jul 18 07:34:00 2007 = n7dr

top - 07:34:02 up 14 days, 1:14, 2 users, load average: 1.91, 2.04, 2.02
Tasks: 246 total, 3 running, 241 sleeping, 0 stopped, 2 zombie
Cpu(s): 3.8% us, 1.2% sy, 80.0% ni, 14.1% id, 0.4% wa, 0.2% hi, 0.3% si
Mem: 4046572k total, 4006700k used, 39872k free, 564616k buffers
Swap: 1020116k total, 1005372k used, 14744k free, 548916k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27594 n7dr 39 14 36128 22m 1740 R 48 0.6 4:53.11 model2
28014 n7dr 37 14 16476 3864 1740 R 45 0.1 0:00.47 model2
    1 root 16 0 2640 504 468 S 1 0.0 0:04.62 init
28018 n7dr 15 0 10692 1312 880 R 1 0.0 0:00.01 top
    2 root RT 0 0 0 0 S 0 0.0 0:03.38 migration/0
    3 root 34 19 0 0 0 S 0 0.0 0:00.21 ksoftirqd/0
    4 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/0
    5 root RT 0 0 0 0 S 0 0.0 0:03.18 migration/1

Wed Jul 18 07:35:01 MDT 2007

4235 Wed Jul 18 07:35:00 2007 = n7dr
4234 Wed Jul 18 07:34:00 2007 = n7dr

top - 07:35:02 up 14 days, 1:15, 2 users, load average: 1.91, 2.02, 2.01
Tasks: 246 total, 5 running, 239 sleeping, 0 stopped, 2 zombie
Cpu(s): 3.8% us, 1.2% sy, 80.0% ni, 14.1% id, 0.4% wa, 0.2% hi, 0.3% si
Mem: 4046572k total, 4002512k used, 44060k free, 564988k buffers
Swap: 1020116k total, 1005372k used, 14744k free, 549564k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28014 n7dr 39 14 28824 15m 1740 R 45 0.4 0:58.55 model2
28149 n7dr 39 14 16204 3604 1740 R 42 0.1 0:00.44 model2
28144 n...


N7DR (doc-evans) wrote :

This has something to do with the daily messages that appear in the syslog at around 7:44 a.m.:

----

Aug 18 07:44:33 homebrew exiting on signal 15
Aug 18 07:44:35 homebrew syslogd 1.4.1#17ubuntu7.1: restart.

----

I have no idea what these messages mean, but I have noticed that this bug almost always occurs at around this time of the morning.

Scott Kitterman (kitterman) wrote :

There is a standard cron job that restarts syslogd on a daily basis. That appears to be what this is. I'm not sure how or if that might be relevant.

(Signal 15, mentioned in those syslog messages, is SIGTERM, i.e., termination.)

Looking at the man page for at, I find this:

"At and batch as presently implemented are not suitable when users are competing for resources. If this is the case for your site, you might want to consider another batch system, such as nqs."

It looks to me like you are running into this known limitation of at. What I would suggest is you either follow the recommendations in the man page and use a different batch system (that will be most robust I would guess) or adjust your cron job to make sure it doesn't run at the same time other cron jobs are running.

N7DR (doc-evans) wrote :

On 18/08/07, Scott Kitterman <email address hidden> wrote:
> There is a standard cron job that restarts syslogd on a daily basis.
> That appears to be what this is. I'm not sure how or if that might be
> relevant.
>

No, me neither; but it sure seems to be strongly correlated.

> SIGTERM = Signal("TERM", 15, "Termination")
>
> Looking at the man page for at, I find this:
>
> "At and batch as presently implemented are not suitable when users are
> competing for resources. If this is the case for your site, you
> might want to consider another batch system, such as nqs."
>

That's incredibly vague... "competing for resources" is meaningless.
My jobs aren't doing anything that would be competing for anything
remotely unusual: they are mostly CPU-bound, with occasional writes to
a file descriptor (for an ordinary file in the current working
directory of the job). If that's "competing for resources" then
*anything* could be so called, and it would be unsafe to run anything
at all through the batch system.

> It looks to me like you are running into this known limitation of at.
> What I would suggest is you either follow the recommendations in the man
> page and use a different batch system (that will be most robust I would
> guess) or adjust your cron job to make sure it doesn't run at the same
> time other cron jobs are running.

As far as I can tell, there is no mention of "nqs" in the dapper
64-bit repositories. Where would I get it?

There's also no way to follow your latter suggestion: I have no way of
knowing how long each individual job will take to run, so I can't
simply (for example) stop executing new jobs at (say) 7:15. Some of
the jobs run for two minutes, some run for over 200, so it's simply
impractical to try to not schedule one to be running 7:44 or
thereabouts (which is when the daily job appears to run).

Is the daily system job important? Maybe I could simply stop that from
executing?

N7DR (doc-evans) wrote :

How would I contact the authors of the queue system used in kubuntu,
to try to talk to them directly about this? Having batched jobs fail
to leave the queue is definitely not good behaviour, and I would think
that they would be concerned and want to fix this.

(And it's really pretty weird that the actual jobs complete just fine;
it's just that the queue manager fails to remove them from the queue
after they've completed. That smells of something that should be
fixable.)

Scott Kitterman (kitterman) wrote : Re: [Bug 126204] Re: Batch jobs intermittently fail to leave "=" queue when complete

Perhaps just skipping the minute when the daily syslog restart is scheduled
would do it. I agree the man page is vague, but that's what it says.
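
A minimal sketch of that kind of guard, placed at the top of the submission script; the 07:40-07:50 window is only a guess based on the 07:44 timestamps in the syslog excerpt above (skipping a single minute may be enough):

import sys
import time

# Skip this submission pass if we are inside the daily-cron window
# (around 07:44 in the syslog excerpt above); the bounds are guesses.
now = time.localtime()
if now.tm_hour == 7 and 40 <= now.tm_min <= 50:
    sys.exit(0)   # try again on the next cron minute

# ... normal atq check and batch submission continue below ...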
