TestTwistedJobRunner.test_timeout fails intermittently

Bug #505913 reported by Guilherme Salgado
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Launchpad itself
Fix Released
Critical
Jonathan Lange

Bug Description

As seen on https://lpbuildbot.canonical.com/builders/db_lp/builds/407/steps/shell_7/logs/summary, that test may fail in some circumstances.

Locally, I managed to get it to fail just by running 'make build' on a separate LP branch while running that test.

Related branches

tags: added: spurious-test-failure
Curtis Hovey (sinzui)
affects: launchpad → launchpad-foundations
Changed in launchpad-foundations:
importance: Undecided → High
status: New → Triaged
Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

For the record, the error is:

Failure in test lp.services.job.tests.test_runner.TestTwistedJobRunner.test_timeout
Traceback (most recent call last):
  File "/srv/buildbot/slaves/launchpad/db-devel/build/lib/lp/testing/__init__.py", line 449, in run
    testMethod()
  File "/srv/buildbot/slaves/launchpad/db-devel/build/lib/lp/services/job/tests/test_runner.py", line 314, in test_timeout
    self.assertEqual('TimeoutError', oops.type)
  File "/srv/buildbot/slaves/launchpad/db-devel/build/lib/lp/testing/__init__.py", line 338, in assertEqual
    % (message, pformat(a), pformat(b)))
  File "/usr/lib/python2.5/unittest.py", line 301, in fail
    raise self.failureException, msg
AssertionError: not equal:
a = 'TimeoutError'
b = 'ProcessTerminated'

Revision history for this message
Jonathan Lange (jml) wrote :

The test just failed with this error, different from the quoted traceback, but I'd imagine part of the same error.

------------
Text attachment: traceback
------------
Traceback (most recent call last):
  File "/srv/buildbot/slaves/launchpad/lucid-db-devel/build/orig_sourcecode/eggs/testtools-0.9.8-py2.6.egg/testtools/runtest.py", line 169, in _run_user
    return fn(*args, **kwargs)
  File "/srv/buildbot/slaves/launchpad/lucid-db-devel/build/orig_sourcecode/eggs/testtools-0.9.8-py2.6.egg/testtools/testcase.py", line 499, in _run_test_method
    return self._get_test_method()()
  File "/srv/buildbot/slaves/launchpad/lucid-db-devel/build/lib/lp/services/job/tests/test_runner.py", line 379, in test_timeout
    self.assertEqual(1, len(runner.completed_jobs))
  File "/srv/buildbot/slaves/launchpad/lucid-db-devel/build/orig_sourcecode/eggs/testtools-0.9.8-py2.6.egg/testtools/testcase.py", line 268, in assertEqual
    self.assertThat(observed, matcher)
  File "/srv/buildbot/slaves/launchpad/lucid-db-devel/build/orig_sourcecode/eggs/testtools-0.9.8-py2.6.egg/testtools/testcase.py", line 345, in assertThat
    % (matchee, matcher, mismatch.describe()))
AssertionError: Match failed. Matchee: "2"
Matcher: Equals(1)
Difference: 1 != 2

------------

Changed in launchpad:
importance: High → Critical
Revision history for this message
William Grant (wgrant) wrote :

r12555 makes it no longer intermittent. Running just that test with -m lp.services.job.tests.test_runner in r12554 also shows the failure.

Changed in launchpad:
assignee: nobody → William Grant (wgrant)
status: Triaged → In Progress
Revision history for this message
William Grant (wgrant) wrote :

I couldn't make any significant progress on this, so I'm just fixing the immediate issue: importing canonical.testing.layers in lp.testing causes the subprocess to always fail. Presumably a circular import issue.

Changed in launchpad:
status: In Progress → Invalid
status: Invalid → Triaged
assignee: William Grant (wgrant) → nobody
Revision history for this message
Jonathan Lange (jml) wrote :
Download full text (8.1 KiB)

Note that this test consistently fails when run locally by itself:

jml@truth:~/src/launchpad/stable$ ./bin/test -cvv lp.services.job.tests.test_runner test_timeout
Running tests at level 1
Running canonical.testing.layers.LaunchpadZopelessLayer tests:
  Set up canonical.testing.layers.BaseLayer in 0.065 seconds.
  Set up canonical.testing.layers.ZopelessLayer in 6.982 seconds.
  Set up canonical.testing.layers.DatabaseLayer in 0.989 seconds.
  Set up canonical.testing.layers.LibrarianLayer in 9.038 seconds.
  Set up canonical.testing.layers.MemcachedLayer in 0.122 seconds.
  Set up canonical.testing.layers.LaunchpadLayer in 0.000 seconds.
  Set up canonical.testing.layers.LaunchpadScriptLayer in 0.015 seconds.
  Set up canonical.testing.layers.LaunchpadZopelessLayer in 0.000 seconds.
  Running:
 lp.services.job.tests.test_runner.TestTwistedJobRunner.test_timeoutTraceback (most recent call last):
  File "/home/jml/src/launchpad/devel/eggs/Twisted-10.2.0_4395fix_1_4909-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 286, in addCallbacks
    self._runCallbacks()
  File "/home/jml/src/launchpad/devel/eggs/Twisted-10.2.0_4395fix_1_4909-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 542, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/jml/src/launchpad/stable/lib/lp/services/twistedsupport/task.py", line 172, in got_task
    task_consumer.taskStarted(task)
  File "/home/jml/src/launchpad/stable/lib/lp/services/twistedsupport/task.py", line 306, in taskStarted
    d = defer.maybeDeferred(task)
--- <exception caught here> ---
  File "/home/jml/src/launchpad/devel/eggs/Twisted-10.2.0_4395fix_1_4909-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 133, in maybeDeferred
    result = f(*args, **kw)
  File "/home/jml/src/launchpad/stable/lib/lp/services/job/runner.py", line 405, in <lambda>
    yield lambda: self.runJobInSubprocess(job)
  File "/home/jml/src/launchpad/stable/lib/lp/services/job/runner.py", line 376, in runJobInSubprocess
    RunJobCommand, job_id = job_id, _deadline=deadline)
  File "/home/jml/src/launchpad/devel/eggs/ampoule-0.2.0-py2.6.egg/ampoule/pool.py", line 310, in doWork
    self.startAWorker()
  File "/home/jml/src/launchpad/devel/eggs/ampoule-0.2.0-py2.6.egg/ampoule/pool.py", line 196, in startAWorker
    ampChildArgs=self.ampChildArgs)
  File "/home/jml/src/launchpad/devel/eggs/ampoule-0.2.0-py2.6.egg/ampoule/main.py", line 243, in startAMPProcess
    return self.startPythonProcess(prot, *args)
  File "/home/jml/src/launchpad/devel/eggs/ampoule-0.2.0-py2.6.egg/ampoule/main.py", line 259, in startPythonProcess
    usePTY=self.usePTY, packages=self.packages)
  File "/home/jml/src/launchpad/devel/eggs/ampoule-0.2.0-py2.6.egg/ampoule/main.py", line 275, in spawnProcess
    p = os.path.split(imp.find_module(pkg)[1])[0]
exceptions.ImportError: No module named _pythonpath
Traceback (most recent call last):
  File "/home/jml/src/launchpad/devel/eggs/Twisted-10.2.0_4395fix_1_4909-py2.6-linux-x86_64.egg/twisted/internet/defer.py", line 286, in addCallbacks
    self._runCallbacks()
  File "/home/jml/src/launchpad/devel/eggs/Twisted-10.2.0_4395fix_1_4909-py2.6-linux-x86_64...

Read more...

Jonathan Lange (jml)
Changed in launchpad:
status: Triaged → In Progress
assignee: nobody → Jonathan Lange (jml)
Revision history for this message
Jonathan Lange (jml) wrote :

Filed bug 737497 to track the "can't run standalone" issue.

Revision history for this message
Jonathan Lange (jml) wrote :

Got interrupted by real life in the middle of the day. Currently playing around with OOPSes, since the whole way we gather OOPSes here is both confusing and racy. (Because it's using a shared directory for OOPS reports from other tests and other runs).

I eventually got some inspiration though. I *think* this is actually a bug in ampoule. Essentially, it can get so caught up in running the job that it forgets to timeout.

If you apply this patch:

=== modified file 'lib/lp/services/job/tests/test_runner.py'
--- lib/lp/services/job/tests/test_runner.py 2011-03-18 19:14:44 +0000
+++ lib/lp/services/job/tests/test_runner.py 2011-03-18 19:16:20 +0000
@@ -352,7 +352,7 @@ class StuckJob(BaseRunnableJob):
         # and so we soak up the ZCML loading time. For the second job, have a
         # short lease so we hit the timeout.
         if self.id == 2:
- lease_length = 1
+ lease_length = -1
         else:
             lease_length = 10000
         return self.job.acquireLease(lease_length)

You'll get consistent failures. (Although you'll need some of the other stuff in my branch to get going).

Revision history for this message
Jonathan Lange (jml) wrote :
Download full text (6.0 KiB)

OK, more information. It's a bug in ampoule, sort of.

What's happening:
 * the job "lease" mechanism works by specifying a timeout for the job using the "deadline" mechanism in ampoule
 * if the deadline is too early, the deadline can be triggered before AMPChild.__enter__ can be processed (in our case, JobRunnerProcess.__enter__)
 * when the deadline is triggered, ampoule kills the child process with SIGHUP
 * we're using __enter__ to set a SIGHUP handler that raises an error, our test expects that error to be raised
 * when the deadline occurs before our __enter__ is processed, AMP just kills the process and then starts another worker

What I don't understand is why the test considers the job to have completed successfully. Probably something to do with the many layers of indirection between ampoule and the actual job.

Not sure what the next action is. Changing the timeout to 5s instead of 1s would fix the intermittency, at a cost of test performance. Alternatively, we might want to reconsider the testing approach here (e.g. delete this test and add some tests that are more unit-y). Probably better to fix ampoule though.

# With deadline set too early
jml@truth:~/src/launchpad/test-timeout-505913$ PYTHONPATH=.:$PYTHONPATH ./bin/py examplerunner.py
/home/jml/src/launchpad/devel/eggs/ampoule-0.2.0-py2.6.egg/ampoule/main.py:4: DeprecationWarning: the sets module is deprecated
  import sets
2011-03-18 20:42:05+0000 [-] Log opened.
2011-03-18 20:42:05+0000 [-] Subprocess 0 started.
2011-03-18 20:42:05+0000 [-] AMP connection established (HOST:('no host',) PEER:('subprocess',))
2011-03-18 20:42:05+0000 [-] Process: 0 ended
2011-03-18 20:42:05+0000 [-] AMP connection lost (HOST:('no host',) PEER:('subprocess',))
2011-03-18 20:42:05+0000 [-] ************* JOB RAISED
2011-03-18 20:42:05+0000 [-] Traceback (most recent call last):
2011-03-18 20:42:05+0000 [-] Failure: twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended by signal 1.
2011-03-18 20:42:05+0000 [-]
2011-03-18 20:42:05+0000 [-] FATAL: Restarting after [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ProcessTerminated'>: A process has ended with a probable error condition: process ended by signal 1.
 ]
2011-03-18 20:42:05+0000 [-] Subprocess 1 started.
2011-03-18 20:42:05+0000 [-] AMP connection established (HOST:('no host',) PEER:('subprocess',))
2011-03-18 20:42:05+0000 [-] Main loop terminated.

# With deadline set "late enough"
jml@truth:~/src/launchpad/test-timeout-505913$ PYTHONPATH=.:$PYTHONPATH ./bin/py examplerunner.py
/home/jml/src/launchpad/devel/eggs/ampoule-0.2.0-py2.6.egg/ampoule/main.py:4: DeprecationWarning: the sets module is deprecated
  import sets
2011-03-18 20:42:21+0000 [-] Log opened.
2011-03-18 20:42:21+0000 [-] Subprocess 0 started.
2011-03-18 20:42:21+0000 [-] AMP connection established (HOST:('no host',) PEER:('subprocess',))
2011-03-18 20:42:22+0000 [-] FROM 0: 2011-03-18 20:42:22+0000 [-] Log opened.
2011-03-18 20:42:22+0000 [-] FROM 0: 2011-03-18 20:42:22+0000 [-] using set_wakeup_fd
2011-03-18 20:42:22+0000 [-] FROM 0: 2011-03-18 20:42:22+0000 [-] /home/jml/src...

Read more...

Revision history for this message
Jonathan Lange (jml) wrote :
Revision history for this message
Jonathan Lange (jml) wrote :
Revision history for this message
Jonathan Lange (jml) wrote :

I was wrong. The minimal example is invalid because the test calls one job after the other, and the minimal example calls only one job. This is relevant, because the first job in the test guarantees that the signal handler is installed.

What's _really_ happening is that when the timeout is really small, the signal is being raised before the message is sent to the child process to run the job. The signal handler in the child process is triggered, and raises an exception, but nothing cares. The reactor treats it as an "unhandled error" and moves on its merry way. When the message actually arrives, it is processed and the job is run.

This difference explains "why the test considers the job to have completed successfully", which was confusing me in comment 8.

Talking w/ abentley about options to fix this.

Revision history for this message
Jonathan Lange (jml) wrote :

Probably the best strategy here is to change 'handler' in __enter__ to kill the process and do something to indicate to parent processes that it was a timeout that did it (like logging an OOPS, or using a particular return code or both).

Some work will be required to do the translation in a neat way. We probably want to preserve the property that only one oops gets logged and that oops is a timeout error. We can probably suffer a little noise in the log.

Jonathan Lange (jml)
Changed in launchpad:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.