Bug #505913 “TestTwistedJobRunner.test_timeout fails intermitten...” : Bugs : Launchpad itself

Guilherme Salgado (salgado) on 2010-01-11

tags:

added: spurious-test-failure

Curtis Hovey (sinzui) on 2010-01-11

affects:	launchpad → launchpad-foundations
Changed in launchpad-foundations:
importance:	Undecided → High
status:	New → Triaged

Revision history for this message

Michael Hudson-Doyle (mwhudson) wrote on 2010-01-11:

#1

For the record, the error is:

Failure in test lp.services.job.tests.test_runner.TestTwistedJobRunner.test_timeout
Traceback (most recent call last):
  File "/srv/buildbot/slaves/launchpad/db-devel/build/lib/lp/testing/__init__.py", line 449, in run
    testMethod()
  File "/srv/buildbot/slaves/launchpad/db-devel/build/lib/lp/services/job/tests/test_runner.py", line 314, in test_timeout
    self.assertEqual('TimeoutError', oops.type)
  File "/srv/buildbot/slaves/launchpad/db-devel/build/lib/lp/testing/__init__.py", line 338, in assertEqual
    % (message, pformat(a), pformat(b)))
  File "/usr/lib/python2.5/unittest.py", line 301, in fail
    raise self.failureException, msg
AssertionError: not equal:
a = 'TimeoutError'
b = 'ProcessTerminated'

Revision history for this message

Jonathan Lange (jml) wrote on 2011-02-20:

#2

The test just failed with this error, different from the quoted traceback, but I'd imagine part of the same error.

------------
Text attachment: traceback
------------
Traceback (most recent call last):
  File "/srv/buildbot/slaves/launchpad/lucid-db-devel/build/orig_sourcecode/eggs/testtools-0.9.8-py2.6.egg/testtools/runtest.py", line 169, in _run_user
    return fn(*args, **kwargs)
  File "/srv/buildbot/slaves/launchpad/lucid-db-devel/build/orig_sourcecode/eggs/testtools-0.9.8-py2.6.egg/testtools/testcase.py", line 499, in _run_test_method
    return self._get_test_method()()
  File "/srv/buildbot/slaves/launchpad/lucid-db-devel/build/lib/lp/services/job/tests/test_runner.py", line 379, in test_timeout
    self.assertEqual(1, len(runner.completed_jobs))
  File "/srv/buildbot/slaves/launchpad/lucid-db-devel/build/orig_sourcecode/eggs/testtools-0.9.8-py2.6.egg/testtools/testcase.py", line 268, in assertEqual
    self.assertThat(observed, matcher)
  File "/srv/buildbot/slaves/launchpad/lucid-db-devel/build/orig_sourcecode/eggs/testtools-0.9.8-py2.6.egg/testtools/testcase.py", line 345, in assertThat
    % (matchee, matcher, mismatch.describe()))
AssertionError: Match failed. Matchee: "2"
Matcher: Equals(1)
Difference: 1 != 2

------------

Changed in launchpad:
importance:	High → Critical

Revision history for this message

William Grant (wgrant) wrote on 2011-03-09:

#3

r12555 makes it no longer intermittent. Running just that test with -m lp.services.job.tests.test_runner in r12554 also shows the failure.

Changed in launchpad:
assignee:	nobody → William Grant (wgrant)
status:	Triaged → In Progress

Revision history for this message

William Grant (wgrant) wrote on 2011-03-09:

#4

I couldn't make any significant progress on this, so I'm just fixing the immediate issue: importing canonical.testing.layers in lp.testing causes the subprocess to always fail. Presumably a circular import issue.

Changed in launchpad:
status:	In Progress → Invalid
status:	Invalid → Triaged
assignee:	William Grant (wgrant) → nobody

Revision history for this message

Jonathan Lange (jml) wrote on 2011-03-17:

#5

Download full text (8.1 KiB)

Note that this test consistently fails when run locally by itself:

jml@truth:~/src/launchpad/stable$ ./bin/test -cvv lp.services.job.tests.test_runner test_timeout
Running tests at level 1
Running canonical.testing.layers.LaunchpadZopelessLayer tests:
  Set up canonical.testing.layers.BaseLayer in 0.065 seconds.
  Set up canonical.testing.layers.ZopelessLayer in 6.982 seconds.
  Set up canonical.testing.layers.DatabaseLayer in 0.989 seconds.
  Set up canonical.testing.layers.LibrarianLayer in 9.038 seconds.
  Set up canonical.testing.layers.MemcachedLayer in 0.122 seconds.
  Set up canonical.testing.layers.LaunchpadLayer in 0.000 seconds.
  Set up canonical.testing.layers.LaunchpadScriptLayer in 0.015 seconds.
  Set up canonical.testing.layers.LaunchpadZopelessLayer in 0.000 seconds.
  Running:
lp.services.job.tests.test_runner.TestTwistedJobRunner.test_timeoutTraceback (most recent call last):
  File "/home/jml/src/launchpad/devel/eggs/Twisted-10.2.0_4395fix_1_4909-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 286, in addCallbacks
    self._runCallbacks()
  File "/home/jml/src/launchpad/devel/eggs/Twisted-10.2.0_4395fix_1_4909-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 542, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/jml/src/launchpad/stable/lib/lp/services/twistedsupport/task.py", line 172, in got_task
    task_consumer.taskStarted(task)
  File "/home/jml/src/launchpad/stable/lib/lp/services/twistedsupport/task.py", line 306, in taskStarted
    d = defer.maybeDeferred(task)
--- <exception caught here> ---
  File "/home/jml/src/launchpad/devel/eggs/Twisted-10.2.0_4395fix_1_4909-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 133, in maybeDeferred
    result = f(*args, **kw)
  File "/home/jml/src/launchpad/stable/lib/lp/services/job/runner.py", line 405, in <lambda>
    yield lambda: self.runJobInSubprocess(job)
  File "/home/jml/src/launchpad/stable/lib/lp/services/job/runner.py", line 376, in runJobInSubprocess
    RunJobCommand, job_id = job_id, _deadline=deadline)
  File "/home/jml/src/launchpad/devel/eggs/ampoule-0.2.0-py2.6.egg/ampoule/pool.py", line 310, in doWork
    self.startAWorker()
  File "/home/jml/src/launchpad/devel/eggs/ampoule-0.2.0-py2.6.egg/ampoule/pool.py", line 196, in startAWorker
    ampChildArgs=self.ampChildArgs)
  File "/home/jml/src/launchpad/devel/eggs/ampoule-0.2.0-py2.6.egg/ampoule/main.py", line 243, in startAMPProcess
    return self.startPythonProcess(prot, *args)
  File "/home/jml/src/launchpad/devel/eggs/ampoule-0.2.0-py2.6.egg/ampoule/main.py", line 259, in startPythonProcess
    usePTY=self.usePTY, packages=self.packages)
  File "/home/jml/src/launchpad/devel/eggs/ampoule-0.2.0-py2.6.egg/ampoule/main.py", line 275, in spawnProcess
    p = os.path.split(imp.find_module(pkg)[1])[0]
exceptions.ImportError: No module named _pythonpath
Traceback (most recent call last):
  File "/home/jml/src/launchpad/devel/eggs/Twisted-10.2.0_4395fix_1_4909-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 286, in addCallbacks
    self._runCallbacks()
  File "/home/jml/src/launchpad/devel/eggs/Twisted-10.2.0_4395fix_1_4909-py2.6-linux-x86_64...

Note that this test consistently fails when run locally by itself:

jml@truth:~/src/launchpad/stable$ ./bin/test -cvv lp.services.job.tests.test_runner test_timeout
Running tests at level 1
Running canonical.testing.layers.LaunchpadZopelessLayer tests:
  Set up canonical.testing.layers.BaseLayer in 0.065 seconds.
  Set up canonical.testing.layers.ZopelessLayer in 6.982 seconds.
  Set up canonical.testing.layers.DatabaseLayer in 0.989 seconds.
  Set up canonical.testing.layers.LibrarianLayer in 9.038 seconds.
  Set up canonical.testing.layers.MemcachedLayer in 0.122 seconds.
  Set up canonical.testing.layers.LaunchpadLayer in 0.000 seconds.
  Set up canonical.testing.layers.LaunchpadScriptLayer in 0.015 seconds.
  Set up canonical.testing.layers.LaunchpadZopelessLayer in 0.000 seconds.
  Running:
 lp.services.job.tests.test_runner.TestTwistedJobRunner.test_timeoutTraceback (most recent call last):
  File "/home/jml/src/launchpad/devel/eggs/Twisted-10.2.0_4395fix_1_4909-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 286, in addCallbacks
    self._runCallbacks()
  File "/home/jml/src/launchpad/devel/eggs/Twisted-10.2.0_4395fix_1_4909-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 542, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/jml/src/launchpad/stable/lib/lp/services/twistedsupport/task.py", line 172, in got_task
    task_consumer.taskStarted(task)
  File "/home/jml/src/launchpad/stable/lib/lp/services/twistedsupport/task.py", line 306, in taskStarted
    d = defer.maybeDeferred(task)
--- <exception caught here> ---
  File "/home/jml/src/launchpad/devel/eggs/Twisted-10.2.0_4395fix_1_4909-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 133, in maybeDeferred
    result = f(*args, **kw)
  File "/home/jml/src/launchpad/stable/lib/lp/services/job/runner.py", line 405, in <lambda>
    yield lambda: self.runJobInSubprocess(job)
  File "/home/jml/src/launchpad/stable/lib/lp/services/job/runner.py", line 376, in runJobInSubprocess
    RunJobCommand, job_id = job_id, _deadline=deadline)
  File "/home/jml/src/launchpad/devel/eggs/ampoule-0.2.0-py2.6.egg/ampoule/pool.py", line 310, in doWork
    self.startAWorker()
  File "/home/jml/src/launchpad/devel/eggs/ampoule-0.2.0-py2.6.egg/ampoule/pool.py", line 196, in startAWorker
    ampChildArgs=self.ampChildArgs)
  File "/home/jml/src/launchpad/devel/eggs/ampoule-0.2.0-py2.6.egg/ampoule/main.py", line 243, in startAMPProcess
    return self.startPythonProcess(prot, *args)
  File "/home/jml/src/launchpad/devel/eggs/ampoule-0.2.0-py2.6.egg/ampoule/main.py", line 259, in startPythonProcess
    usePTY=self.usePTY, packages=self.packages)
  File "/home/jml/src/launchpad/devel/eggs/ampoule-0.2.0-py2.6.egg/ampoule/main.py", line 275, in spawnProcess
    p = os.path.split(imp.find_module(pkg)[1])[0]
exceptions.ImportError: No module named _pythonpath
Traceback (most recent call last):
  File "/home/jml/src/launchpad/devel/eggs/Twisted-10.2.0_4395fix_1_4909-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 286, in addCallbacks
    self._runCallbacks()
  File "/home/jml/src/launchpad/devel/eggs/Twisted-10.2.0_4395fix_1_4909-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 542, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/jml/src/launchpad/stable/lib/lp/services/twistedsupport/task.py", line 172, in got_task
    task_consumer.taskStarted(task)
  File "/home/jml/src/launchpad/stable/lib/lp/services/twistedsupport/task.py", line 306, in taskStarted
    d = defer.maybeDeferred(task)
--- <exception caught here> ---
  File "/home/jml/src/launchpad/devel/eggs/Twisted-10.2.0_4395fix_1_4909-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 133, in maybeDeferred
    result = f(*args, **kw)
  File "/home/jml/src/launchpad/stable/lib/lp/services/job/runner.py", line 405, in <lambda>
    yield lambda: self.runJobInSubprocess(job)
  File "/home/jml/src/launchpad/stable/lib/lp/services/job/runner.py", line 376, in runJobInSubprocess
    RunJobCommand, job_id = job_id, _deadline=deadline)
  File "/home/jml/src/launchpad/devel/eggs/ampoule-0.2.0-py2.6.egg/ampoule/pool.py", line 310, in doWork
    self.startAWorker()
  File "/home/jml/src/launchpad/devel/eggs/ampoule-0.2.0-py2.6.egg/ampoule/pool.py", line 196, in startAWorker
    ampChildArgs=self.ampChildArgs)
  File "/home/jml/src/launchpad/devel/eggs/ampoule-0.2.0-py2.6.egg/ampoule/main.py", line 243, in startAMPProcess
    return self.startPythonProcess(prot, *args)
  File "/home/jml/src/launchpad/devel/eggs/ampoule-0.2.0-py2.6.egg/ampoule/main.py", line 259, in startPythonProcess
    usePTY=self.usePTY, packages=self.packages)
  File "/home/jml/src/launchpad/devel/eggs/ampoule-0.2.0-py2.6.egg/ampoule/main.py", line 275, in spawnProcess
    p = os.path.split(imp.find_module(pkg)[1])[0]
exceptions.ImportError: No module named _pythonpath

Failure in test lp.services.job.tests.test_runner.TestTwistedJobRunner.test_timeout
Traceback (most recent call last):
_StringException: Text attachment: librarian-log
------------
2011-03-17 18:31:05+0530 [-] Log opened.
2011-03-17 18:31:05+0530 [-] twistd 10.2.0 (/usr/bin/python2.6 2.6.6) starting up.
2011-03-17 18:31:05+0530 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
2011-03-17 18:31:05+0530 [-] canonical.librarian.libraryprotocol.FileUploadFactory starting on 54995
2011-03-17 18:31:05+0530 [-] Starting factory <canonical.librarian.libraryprotocol.FileUploadFactory instance at 0x3fd9440>
2011-03-17 18:31:05+0530 [-] twisted.web.server.Site starting on 39415
2011-03-17 18:31:05+0530 [-] Starting factory <twisted.web.server.Site instance at 0x3fd9518>
2011-03-17 18:31:05+0530 [-] canonical.librarian.libraryprotocol.FileUploadFactory starting on 60708
2011-03-17 18:31:05+0530 [-] Starting factory <canonical.librarian.libraryprotocol.FileUploadFactory instance at 0x3fd93f8>
2011-03-17 18:31:05+0530 [-] twisted.web.server.Site starting on 36546
2011-03-17 18:31:05+0530 [-] Starting factory <twisted.web.server.Site instance at 0x3fd9320>
2011-03-17 18:31:05+0530 [-] Not using upstream librarian
2011-03-17 18:31:05+0530 [-] daemon ready!
2011-03-17 18:31:05+0530 [HTTPChannel,0,127.0.0.1] 127.0.0.1 - - [17/Mar/2011:13:01:05 +0000] "GET / HTTP/1.0" 200 340 "-" "Python-urllib/1.17"
2011-03-17 18:31:06+0530 [HTTPChannel,1,127.0.0.1] 127.0.0.1 - - [17/Mar/2011:13:01:05 +0000] "GET / HTTP/1.0" 200 340 "-" "Python-urllib/1.17"
------------
Text attachment: traceback
------------
Traceback (most recent call last):
  File "/home/jml/src/launchpad/devel/eggs/testtools-0.9.8-py2.6.egg/testtools/runtest.py", line 169, in _run_user
    return fn(*args, **kwargs)
  File "/home/jml/src/launchpad/devel/eggs/testtools-0.9.8-py2.6.egg/testtools/testcase.py", line 499, in _run_test_method
    return self._get_test_method()()
  File "/home/jml/src/launchpad/stable/lib/lp/services/job/tests/test_runner.py", line 379, in test_timeout
    self.assertEqual(1, len(runner.completed_jobs))
  File "/home/jml/src/launchpad/devel/eggs/testtools-0.9.8-py2.6.egg/testtools/testcase.py", line 268, in assertEqual
    self.assertThat(observed, matcher)
  File "/home/jml/src/launchpad/devel/eggs/testtools-0.9.8-py2.6.egg/testtools/testcase.py", line 345, in assertThat
    % (matchee, matcher, mismatch.describe()))
AssertionError: Match failed. Matchee: "0"
Matcher: Equals(1)
Difference: 1 != 0

------------

Ran 1 tests with 1 failures and 0 errors in 0.990 seconds.
Tearing down left over layers:
  Tear down canonical.testing.layers.LaunchpadZopelessLayer in 0.000 seconds.
  Tear down canonical.testing.layers.LaunchpadScriptLayer in 0.002 seconds.
  Tear down canonical.testing.layers.ZopelessLayer ... not supported
  Tear down canonical.testing.layers.LaunchpadLayer in 0.000 seconds.
  Tear down canonical.testing.layers.LibrarianLayer in 0.002 seconds.
  Tear down canonical.testing.layers.MemcachedLayer in 0.001 seconds.
  Tear down canonical.testing.layers.DatabaseLayer in 0.286 seconds.
  Tear down canonical.testing.layers.BaseLayer in 0.003 seconds.

Tests with failures:
   lp.services.job.tests.test_runner.TestTwistedJobRunner.test_timeout (subunit.RemotedTestCase)

Confirmed on wgrant's machine & mine.

Jonathan Lange (jml) on 2011-03-18

Changed in launchpad:
status:	Triaged → In Progress
assignee:	nobody → Jonathan Lange (jml)

Revision history for this message

Jonathan Lange (jml) wrote on 2011-03-18:

#6

Filed bug 737497 to track the "can't run standalone" issue.

Revision history for this message

Jonathan Lange (jml) wrote on 2011-03-18:

#7

Got interrupted by real life in the middle of the day. Currently playing around with OOPSes, since the whole way we gather OOPSes here is both confusing and racy. (Because it's using a shared directory for OOPS reports from other tests and other runs).

I eventually got some inspiration though. I *think* this is actually a bug in ampoule. Essentially, it can get so caught up in running the job that it forgets to timeout.

If you apply this patch:

=== modified file 'lib/lp/services/job/tests/test_runner.py'
--- lib/lp/services/job/tests/test_runner.py 2011-03-18 19:14:44 +0000
+++ lib/lp/services/job/tests/test_runner.py 2011-03-18 19:16:20 +0000
@@ -352,7 +352,7 @@ class StuckJob(BaseRunnableJob):
         # and so we soak up the ZCML loading time. For the second job, have a
         # short lease so we hit the timeout.
         if self.id == 2:
- lease_length = 1
+ lease_length = -1
         else:
             lease_length = 10000
         return self.job.acquireLease(lease_length)

You'll get consistent failures. (Although you'll need some of the other stuff in my branch to get going).

Revision history for this message

Jonathan Lange (jml) wrote on 2011-03-18:

#8

Download full text (6.0 KiB)

OK, more information. It's a bug in ampoule, sort of.

What's happening:
* the job "lease" mechanism works by specifying a timeout for the job using the "deadline" mechanism in ampoule
* if the deadline is too early, the deadline can be triggered before AMPChild.__enter__ can be processed (in our case, JobRunnerProcess.__enter__)
* when the deadline is triggered, ampoule kills the child process with SIGHUP
* we're using __enter__ to set a SIGHUP handler that raises an error, our test expects that error to be raised
* when the deadline occurs before our __enter__ is processed, AMP just kills the process and then starts another worker

What I don't understand is why the test considers the job to have completed successfully. Probably something to do with the many layers of indirection between ampoule and the actual job.

Not sure what the next action is. Changing the timeout to 5s instead of 1s would fix the intermittency, at a cost of test performance. Alternatively, we might want to reconsider the testing approach here (e.g. delete this test and add some tests that are more unit-y). Probably better to fix ampoule though.

# With deadline set too early
jml@truth:~/src/launchpad/test-timeout-505913$ PYTHONPATH=.:$PYTHONPATH ./bin/py examplerunner.py
/home/jml/src/launchpad/devel/eggs/ampoule-0.2.0-py2.6.egg/ampoule/main.py:4: DeprecationWarning: the sets module is deprecated
import sets
2011-03-18 20:42:05+0000 [-] Log opened.
2011-03-18 20:42:05+0000 [-] Subprocess 0 started.
2011-03-18 20:42:05+0000 [-] AMP connection established (HOST:('no host',) PEER:('subprocess',))
2011-03-18 20:42:05+0000 [-] Process: 0 ended
2011-03-18 20:42:05+0000 [-] AMP connection lost (HOST:('no host',) PEER:('subprocess',))
2011-03-18 20:42:05+0000 [-] ************* JOB RAISED
2011-03-18 20:42:05+0000 [-] Traceback (most recent call last):
2011-03-18 20:42:05+0000 [-] Failure: twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended by signal 1.
2011-03-18 20:42:05+0000 [-]
2011-03-18 20:42:05+0000 [-] FATAL: Restarting after [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ProcessTerminated'>: A process has ended with a probable error condition: process ended by signal 1.
]
2011-03-18 20:42:05+0000 [-] Subprocess 1 started.
2011-03-18 20:42:05+0000 [-] AMP connection established (HOST:('no host',) PEER:('subprocess',))
2011-03-18 20:42:05+0000 [-] Main loop terminated.

# With deadline set "late enough"
jml@truth:~/src/launchpad/test-timeout-505913$ PYTHONPATH=.:$PYTHONPATH ./bin/py examplerunner.py
/home/jml/src/launchpad/devel/eggs/ampoule-0.2.0-py2.6.egg/ampoule/main.py:4: DeprecationWarning: the sets module is deprecated
import sets
2011-03-18 20:42:21+0000 [-] Log opened.
2011-03-18 20:42:21+0000 [-] Subprocess 0 started.
2011-03-18 20:42:21+0000 [-] AMP connection established (HOST:('no host',) PEER:('subprocess',))
2011-03-18 20:42:22+0000 [-] FROM 0: 2011-03-18 20:42:22+0000 [-] Log opened.
2011-03-18 20:42:22+0000 [-] FROM 0: 2011-03-18 20:42:22+0000 [-] using set_wakeup_fd
2011-03-18 20:42:22+0000 [-] FROM 0: 2011-03-18 20:42:22+0000 [-] /home/jml/src...

OK, more information. It's a bug in ampoule, sort of.

What's happening:
 * the job "lease" mechanism works by specifying a timeout for the job using the "deadline" mechanism in ampoule
 * if the deadline is too early, the deadline can be triggered before AMPChild.__enter__ can be processed (in our case, JobRunnerProcess.__enter__)
 * when the deadline is triggered, ampoule kills the child process with SIGHUP
 * we're using __enter__ to set a SIGHUP handler that raises an error, our test expects that error to be raised
 * when the deadline occurs before our __enter__ is processed, AMP just kills the process and then starts another worker

What I don't understand is why the test considers the job to have completed successfully. Probably something to do with the many layers of indirection between ampoule and the actual job.

Not sure what the next action is. Changing the timeout to 5s instead of 1s would fix the intermittency, at a cost of test performance. Alternatively, we might want to reconsider the testing approach here (e.g. delete this test and add some tests that are more unit-y). Probably better to fix ampoule though.

# With deadline set too early
jml@truth:~/src/launchpad/test-timeout-505913$ PYTHONPATH=.:$PYTHONPATH ./bin/py examplerunner.py 
/home/jml/src/launchpad/devel/eggs/ampoule-0.2.0-py2.6.egg/ampoule/main.py:4: DeprecationWarning: the sets module is deprecated
  import sets
2011-03-18 20:42:05+0000 [-] Log opened.
2011-03-18 20:42:05+0000 [-] Subprocess 0 started.
2011-03-18 20:42:05+0000 [-] AMP connection established (HOST:('no host',) PEER:('subprocess',))
2011-03-18 20:42:05+0000 [-] Process: 0 ended
2011-03-18 20:42:05+0000 [-] AMP connection lost (HOST:('no host',) PEER:('subprocess',))
2011-03-18 20:42:05+0000 [-] ************* JOB RAISED
2011-03-18 20:42:05+0000 [-] Traceback (most recent call last):
2011-03-18 20:42:05+0000 [-] Failure: twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended by signal 1.
2011-03-18 20:42:05+0000 [-] 
2011-03-18 20:42:05+0000 [-] FATAL: Restarting after [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ProcessTerminated'>: A process has ended with a probable error condition: process ended by signal 1.
	]
2011-03-18 20:42:05+0000 [-] Subprocess 1 started.
2011-03-18 20:42:05+0000 [-] AMP connection established (HOST:('no host',) PEER:('subprocess',))
2011-03-18 20:42:05+0000 [-] Main loop terminated.

# With deadline set "late enough"
jml@truth:~/src/launchpad/test-timeout-505913$ PYTHONPATH=.:$PYTHONPATH ./bin/py examplerunner.py 
/home/jml/src/launchpad/devel/eggs/ampoule-0.2.0-py2.6.egg/ampoule/main.py:4: DeprecationWarning: the sets module is deprecated
  import sets
2011-03-18 20:42:21+0000 [-] Log opened.
2011-03-18 20:42:21+0000 [-] Subprocess 0 started.
2011-03-18 20:42:21+0000 [-] AMP connection established (HOST:('no host',) PEER:('subprocess',))
2011-03-18 20:42:22+0000 [-] FROM 0: 2011-03-18 20:42:22+0000 [-] Log opened.
2011-03-18 20:42:22+0000 [-] FROM 0: 2011-03-18 20:42:22+0000 [-] using set_wakeup_fd
2011-03-18 20:42:22+0000 [-] FROM 0: 2011-03-18 20:42:22+0000 [-] /home/jml/src/launchpad/devel/eggs/ampoule-0.2.0-py2.6.egg/ampoule/main.py:4: exceptions.DeprecationWarning: the sets module is deprecated
2011-03-18 20:42:22+0000 [-] FROM 0: 2011-03-18 20:42:22+0000 [-] JobRunnerProcess connection established (HOST:<twisted.internet._posixstdio.PipeAddress object at 0x1a6a990> PEER:<twisted.internet._posixstdio.PipeAddress object at 0x1a6a950>)
2011-03-18 20:42:22+0000 [-] FROM 0: 2011-03-18 20:42:22+0000 [-] Hello
2011-03-18 20:42:22+0000 [-] FROM 0: 2011-03-18 20:42:22+0000 [-] Got command
2011-03-18 20:42:24+0000 [-] FROM 0: 2011-03-18 20:42:24+0000 [-] Unhandled Error
2011-03-18 20:42:24+0000 [-] FROM 0: 	Traceback (most recent call last):
2011-03-18 20:42:24+0000 [-] FROM 0: 	  File "/home/jml/src/launchpad/devel/eggs/Twisted-10.2.0_4395fix_1_4909-py2.6-linux-x86_64.egg/twisted/protocols/amp.py", line 884, in _commandReceived
2011-03-18 20:42:24+0000 [-] FROM 0: 	    deferred = self.dispatchCommand(box)
2011-03-18 20:42:24+0000 [-] FROM 0: 	  File "/home/jml/src/launchpad/devel/eggs/Twisted-10.2.0_4395fix_1_4909-py2.6-linux-x86_64.egg/twisted/protocols/amp.py", line 941, in dispatchCommand
2011-03-18 20:42:24+0000 [-] FROM 0: 	    return maybeDeferred(responder, box)
2011-03-18 20:42:24+0000 [-] FROM 0: 	  File "/home/jml/src/launchpad/devel/eggs/Twisted-10.2.0_4395fix_1_4909-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 133, in maybeDeferred
2011-03-18 20:42:24+0000 [-] FROM 0: 	    result = f(*args, **kw)
2011-03-18 20:42:24+0000 [-] FROM 0: 	  File "/home/jml/src/launchpad/devel/eggs/Twisted-10.2.0_4395fix_1_4909-py2.6-linux-x86_64.egg/twisted/protocols/amp.py", line 1030, in doit
2011-03-18 20:42:24+0000 [-] FROM 0: 	    return maybeDeferred(aCallable, **kw).addCallback(
2011-03-18 20:42:24+0000 [-] FROM 0: 	--- <exception caught here> ---
2011-03-18 20:42:24+0000 [-] FROM 0: 	  File "/home/jml/src/launchpad/devel/eggs/Twisted-10.2.0_4395fix_1_4909-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 133, in maybeDeferred
2011-03-18 20:42:24+0000 [-] FROM 0: 	    result = f(*args, **kw)
2011-03-18 20:42:24+0000 [-] FROM 0: 	  File "examplerace.py", line 57, in runJobCommand
2011-03-18 20:42:24+0000 [-] FROM 0: 	    time.sleep(5)
2011-03-18 20:42:24+0000 [-] FROM 0: 	  File "examplerace.py", line 46, in handler
2011-03-18 20:42:24+0000 [-] FROM 0: 	    raise TimeoutError
2011-03-18 20:42:24+0000 [-] FROM 0: 	examplerace.TimeoutError: Wooeuohuateh
2011-03-18 20:42:24+0000 [-] FROM 0: 2011-03-18 20:42:24+0000 [-] JobRunnerProcess connection lost (HOST:<twisted.internet._posixstdio.PipeAddress object at 0x1a6a990> PEER:<twisted.internet._posixstdio.PipeAddress object at 0x1a6a950>)
2011-03-18 20:42:24+0000 [-] ************* JOB RAISED
2011-03-18 20:42:24+0000 [-] Traceback (most recent call last):
2011-03-18 20:42:24+0000 [-] Failure: twisted.protocols.amp.UnknownRemoteError: Code<UNKNOWN>: Unknown Error
2011-03-18 20:42:24+0000 [-] 
2011-03-18 20:42:24+0000 [-] Main loop terminated.

Revision history for this message

Jonathan Lange (jml) wrote on 2011-03-18:

#9

examplerace.py Edit (2.3 KiB, text/x-python)

Revision history for this message

Jonathan Lange (jml) wrote on 2011-03-18:

#10

examplerunner.py Edit (255 bytes, text/x-python)

Revision history for this message

Jonathan Lange (jml) wrote on 2011-03-22:

#11

I was wrong. The minimal example is invalid because the test calls one job after the other, and the minimal example calls only one job. This is relevant, because the first job in the test guarantees that the signal handler is installed.

What's _really_ happening is that when the timeout is really small, the signal is being raised before the message is sent to the child process to run the job. The signal handler in the child process is triggered, and raises an exception, but nothing cares. The reactor treats it as an "unhandled error" and moves on its merry way. When the message actually arrives, it is processed and the job is run.

This difference explains "why the test considers the job to have completed successfully", which was confusing me in comment 8.

Talking w/ abentley about options to fix this.

Revision history for this message

Jonathan Lange (jml) wrote on 2011-03-22:

#12

Probably the best strategy here is to change 'handler' in __enter__ to kill the process and do something to indicate to parent processes that it was a timeout that did it (like logging an OOPS, or using a particular return code or both).

Some work will be required to do the translation in a neat way. We probably want to preserve the property that only one oops gets logged and that oops is a timeout error. We can probably suffer a little noise in the log.

Jonathan Lange (jml) on 2011-03-24

Changed in launchpad:
status:	In Progress → Fix Released

Launchpad itself

TestTwistedJobRunner.test_timeout fails intermittently

Bug Description

Related branches

Duplicates of this bug

Other bug subscribers

Bug attachments

Remote bug watches