Batch jobs intermittently fail to leave "=" queue when complete

Bug #126204 reported by N7DR
Affects: at (Ubuntu)
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned

Bug Description

This is a bit complicated to describe, but I will do my best. I estimate that the bug manifests itself more than 0.1% of the time, but less than 0.5% of the time (so I suspect some kind of race condition). This is almost certainly an upstream bug.

So here is what I am doing:

1. I have a series of several thousand batch jobs to be run.
2. Once a minute, with a cron job, I check the number of lines of output from the "atq" command.
3. If the result of step #2 is 2 or greater, I do nothing.
4. If the result of step #2 is 1 or 0, I submit either 1 or 2 batch jobs, to bring the total up to 2. (This is so that I can insert additional jobs into the queue if I want to, without having several thousand other jobs in front of the new job).
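
A compressed sketch of steps 2-4 (the book-keeping that removes submitted lines from the pending list is omitted here; the full script actually used is given in a later comment):

import os

# Step 2: count the lines that "atq" prints (one per queued or running job).
n_jobs = len(os.popen('atq').readlines())

# Steps 3-4: do nothing if two or more jobs are already present; otherwise
# submit the next pending "batch < <some-file>" command(s) to bring the
# total back up to two ("jobs_to_run" is the pending-job list described above).
if n_jobs < 2:
    pending = open('jobs_to_run').readlines()
    for job_line in pending[:2 - n_jobs]:
        os.system(job_line)              # each line is a "batch < ..." command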

atd is running with the parameter "-l 2.2", on an SMP dual-core system.

Approximately once every couple of days, I discover that the CPU is not busy. Upon investigation I discover that the "atq" output states that two jobs are running, both on the "=" queue. However, the two jobs in question have completed and are no longer requesting OS resources. (The output has been created and "ps" confirms that the processes no longer exist on the system.)

The system remains in this state until I manually use "atrm" to remove the jobs, at which point the next time that the cron job executes two new jobs are (correctly) started.

So the problem is that "atq" doesn't seem to realise that the jobs have completed.

This is on 64-bit dapper with all updates applied.
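
The manual "atrm" clean-up can, in principle, be scripted as an interim watchdog. The sketch below is only an illustration: it assumes the atq output format shown in the excerpts further down, and uses 'model2' as a placeholder for the name of whatever program the batch jobs actually run:

#! /usr/bin/python
# Sketch of a watchdog for the stuck state: if atq still reports jobs on
# the "=" (running) queue but no worker process is alive, remove the
# stale entries with atrm.  'model2' is a placeholder program name.

import os

# Job ids that atq reports as currently running (queue letter "=").
running_ids = [line.split()[0] for line in os.popen('atq').readlines()
               if ' = ' in line]

# pgrep exits 0 when at least one matching process exists.
worker_alive = (os.system('pgrep -x model2 > /dev/null') == 0)

if running_ids and not worker_alive:
    os.system('atrm ' + ' '.join(running_ids))   # clear the stale entries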

Scott Kitterman (kitterman) wrote :

You are going to need to provide a usable test case to replicate this. Preferably a minimal test case, but otherwise the code that leads to this problem. Otherwise, there is little a developer can do.

N7DR (doc-evans) wrote : Re: [Bug 126204] Re: Batch jobs intermittently fail to leave "=" queue when complete

On 15/07/07, Scott Kitterman <email address hidden> wrote:
> You are going to need to provide a usable test case to replicate this.
> Preferably a minimal test case, but otherwise the code that leads to
> this problem. Otherwise, there is little a developer can do.

OK, here is the simplest procedure that I am sure will work. Probably several
items can be changed/removed and the problem will still show up, but
since I am not completely certain of that, I would rather give
instructions that I know will cause the problem to appear.

On an SMP 64-bit system:

1. Create a program that takes several minutes to execute. Call this
program X. It doesn't matter at all what this program does; we just
want something that will occupy the CPU for about five minutes (a
minimal sketch of such a program is given after step 4).

2. Create a file called jobs_to_run, of length at least several
hundred lines, where each line creates a batch job to run X. That is,
each line looks like this:
  batch < <some-file>

3. Create a cron job that, once per minute, executes the following
python script:

#! /usr/bin/python

import popen2

# Read the list of pending jobs (Q1); each line is a "batch < <some-file>"
# command.  popen2.popen3() returns (child_stdout, child_stdin, child_stderr).
r, w, e = popen2.popen3('cat jobs_to_run')
jobs_in_Q1 = r.readlines()              # the list of jobs
n_jobs_in_Q1 = len(jobs_in_Q1)
r.close()
w.close()
e.close()

if n_jobs_in_Q1 > 0:

    # find the number of jobs in Q2 (i.e., known to at)
    r, w, e = popen2.popen3('atq | wc -l')
    n_jobs_in_Q2 = int(r.readline().strip())
    r.close()
    w.close()
    e.close()

    if n_jobs_in_Q2 < 2:

        # add a job to Q2: submit the first pending command via batch
        command = jobs_in_Q1[0].rstrip('\n')
        r, w, e = popen2.popen3(command)
        r.close()
        w.close()
        e.close()

        # remove this job from Q1 by rewriting the file without it
        jobs_in_Q1 = jobs_in_Q1[1:]
        jobs_to_run = open("jobs_to_run", "w")
        for line in jobs_in_Q1:
            jobs_to_run.write(line)
        jobs_to_run.close()

4. Let it run.
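
As a concrete stand-in for program X in step 1 (any CPU-bound program of comparable duration should do; nothing below is specific to the bug):

#! /usr/bin/python
# Stand-in for "program X": busy-loop for roughly five minutes so that
# each batch job keeps one CPU occupied, then write a token output file.

import time

deadline = time.time() + 5 * 60        # run for about five minutes
x = 0
while time.time() < deadline:
    x = (x + 1) % 1000003              # meaningless arithmetic, just burns CPU

open('X.out', 'w').write('done %d\n' % x)

The jobs_to_run file from step 2 would then contain several hundred lines of the form "batch < <some-file>", where each <some-file> simply invokes X.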

Approximately every day or two, you will end up in the buggy state,
where "atq" says that you have two jobs running, but the jobs have in
fact completed. Once this happens, the system will stay in that state
until you notice and manually remove the jobs from the "=" job queue.

As far as I can see, the bug lies in whatever method is used to tell
the queue manager that jobs have completed -- occasionally, that
mechanism fails so that "atq" thinks that the jobs are still on the
"=" (i.e., the running) queue, even though the jobs have actually
completed processing.

It *may* be that one can cause the bug to appear more often by one of:
  a) making the jobs run for a shorter amount of time than the several
minutes that my jobs take
  b) queuing more than two jobs at a time
Since this is a work system currently running through approximately
10,000 jobs, I am not in a position to try either of these methods to
attempt to make the problem appear more frequently than once every day
or two.

I am attempting to instrument my machine a bit more, to produce a bit
more information about the bug, or at least some output that shows it
occurring.

N7DR (doc-evans) wrote :

OK, I can at least show an example of this bug happening now.

I instrumented the machine to print the following information once per minute:

the time
the output of "atq"
the start of the output of "top"

So here is an example of three minutes of normal operation (i.e., when
everything is working OK):

----

Wed Jul 18 07:33:01 MDT 2007

4232 Wed Jul 18 07:29:00 2007 = n7dr
4233 Wed Jul 18 07:30:00 2007 = n7dr

top - 07:33:01 up 14 days, 1:12, 2 users, load average: 2.37, 2.11, 2.04
Tasks: 243 total, 3 running, 238 sleeping, 0 stopped, 2 zombie
Cpu(s): 3.8% us, 1.2% sy, 80.0% ni, 14.1% id, 0.4% wa, 0.2% hi, 0.3% si
Mem: 4046572k total, 4014428k used, 32144k free, 564304k buffers
Swap: 1020116k total, 1005376k used, 14740k free, 548508k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27683 n7dr 39 14 28428 15m 1740 R 50 0.4 2:54.83 model2
27594 n7dr 39 14 32152 19m 1740 R 49 0.5 3:53.68 model2
27906 n7dr 15 0 10696 1316 880 R 1 0.0 0:00.01 top
    1 root 18 0 2640 504 468 S 0 0.0 0:04.61 init
    2 root RT 0 0 0 0 S 0 0.0 0:03.38 migration/0
    3 root 34 19 0 0 0 S 0 0.0 0:00.21 ksoftirqd/0
    4 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/0
    5 root RT 0 0 0 0 S 0 0.0 0:03.18 migration/1

Wed Jul 18 07:34:01 MDT 2007

4232 Wed Jul 18 07:29:00 2007 = n7dr
4234 Wed Jul 18 07:34:00 2007 = n7dr

top - 07:34:02 up 14 days, 1:14, 2 users, load average: 1.91, 2.04, 2.02
Tasks: 246 total, 3 running, 241 sleeping, 0 stopped, 2 zombie
Cpu(s): 3.8% us, 1.2% sy, 80.0% ni, 14.1% id, 0.4% wa, 0.2% hi, 0.3% si
Mem: 4046572k total, 4006700k used, 39872k free, 564616k buffers
Swap: 1020116k total, 1005372k used, 14744k free, 548916k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27594 n7dr 39 14 36128 22m 1740 R 48 0.6 4:53.11 model2
28014 n7dr 37 14 16476 3864 1740 R 45 0.1 0:00.47 model2
    1 root 16 0 2640 504 468 S 1 0.0 0:04.62 init
28018 n7dr 15 0 10692 1312 880 R 1 0.0 0:00.01 top
    2 root RT 0 0 0 0 S 0 0.0 0:03.38 migration/0
    3 root 34 19 0 0 0 S 0 0.0 0:00.21 ksoftirqd/0
    4 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/0
    5 root RT 0 0 0 0 S 0 0.0 0:03.18 migration/1

Wed Jul 18 07:35:01 MDT 2007

4235 Wed Jul 18 07:35:00 2007 = n7dr
4234 Wed Jul 18 07:34:00 2007 = n7dr

top - 07:35:02 up 14 days, 1:15, 2 users, load average: 1.91, 2.02, 2.01
Tasks: 246 total, 5 running, 239 sleeping, 0 stopped, 2 zombie
Cpu(s): 3.8% us, 1.2% sy, 80.0% ni, 14.1% id, 0.4% wa, 0.2% hi, 0.3% si
Mem: 4046572k total, 4002512k used, 44060k free, 564988k buffers
Swap: 1020116k total, 1005372k used, 14744k free, 549564k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28014 n7dr 39 14 28824 15m 1740 R 45 0.4 0:58.55 model2
28149 n7dr 39 14 16204 3604 1740 R 42 0.1 0:00.44 model2
28144 n...


N7DR (doc-evans) wrote :

This has something to do with the daily messages that appear in the syslog at around 7:44 a.m.:

----

Aug 18 07:44:33 homebrew exiting on signal 15
Aug 18 07:44:35 homebrew syslogd 1.4.1#17ubuntu7.1: restart.

----

I have no idea what these messages mean, but I have noticed that this bug almost always occurs at around this time of the morning.

Scott Kitterman (kitterman) wrote :

There is a standard cron job that restarts syslogd on a daily basis. That appears to be what this is. I'm not sure how or if that might be relevant.

(Signal 15, mentioned in those syslog messages, is SIGTERM, i.e., termination.)

Looking at the man page for at, I find this:

"At and batch as presently implemented are not suitable when users are competing for resources. If this is the case for your site, you might want to consider another batch system, such as nqs."

It looks to me like you are running into this known limitation of at. What I would suggest is you either follow the recommendations in the man page and use a different batch system (that will be most robust I would guess) or adjust your cron job to make sure it doesn't run at the same time other cron jobs are running.

N7DR (doc-evans) wrote :

On 18/08/07, Scott Kitterman <email address hidden> wrote:
> There is a standard cron job that restarts syslogd on a daily basis.
> That appears to be what this is. I'm not sure how or if that might be
> relevant.
>

No, me neither; but it sure seems to be strongly correlated.

> SIGTERM = Signal("TERM", 15, "Termination")
>
> Looking at the man page for at, I find this:
>
> "At and batch as presently implemented are not suitable when users are
> competing for resources. If this is the case for your site, you
> might want to consider another batch system, such as nqs."
>

That's incredibly vague... "competing for resources" is meaningless.
My jobs aren't doing anything that would be competing for anything
remotely unusual: they are mostly CPU-bound, with occasional writes to
a file descriptor (for an ordinary file in the current working
directory of the job). If that's "competing for resources" then
*anything* could be so called, and it would be unsafe to run anything
at all through the batch system.

> It looks to me like you are running into this known limitation of at.
> What I would suggest is you either follow the recommendations in the man
> page and use a different batch system (that will be most robust I would
> guess) or adjust your cron job to make sure it doesn't run at the same
> time other cron jobs are running.

As far as I can tell, there is no mention of "nqs" in the dapper
64-bit repositories. Where would I get it?

There's also no way to follow your latter suggestion: I have no way of
knowing how long each individual job will take to run, so I can't
simply (for example) stop executing new jobs at (say) 7:15. Some of
the jobs run for two minutes, some run for over 200, so it's simply
impractical to try to not schedule one to be running 7:44 or
thereabouts (which is when the daily job appears to run).

Is the daily system job important? Maybe I could simply stop that from
executing?

N7DR (doc-evans) wrote :

How would I contact the authors of the queue system used in kubuntu,
to try to talk to them directly about this? Having batched jobs fail
to leave the queue is definitely not good behaviour, and I would think
that they would be concerned and want to fix this.

(And it's really pretty weird that the actual jobs complete just fine;
it's just that the queue manager fails to remove them from the queue
after they've completed. That smells of something that should be
fixable.)

Scott Kitterman (kitterman) wrote : Re: [Bug 126204] Re: Batch jobs intermittently fail to leave "=" queue when complete

Perhaps just skipping the minute when the daily syslog restart is scheduled
would do it. I agree the man page is vague, but that's what it says.
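
A minimal sketch of that kind of guard, placed at the top of the submission script; the 07:40-07:50 window is only a guess based on the 07:44 timestamps in the syslog excerpt above (skipping a single minute may be enough):

import sys
import time

# Skip this submission pass if we are inside the daily-cron window
# (around 07:44 in the syslog excerpt above); the bounds are guesses.
now = time.localtime()
if now.tm_hour == 7 and 40 <= now.tm_min <= 50:
    sys.exit(0)   # try again on the next cron minute

# ... normal atq check and batch submission continue below ...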
