test_operationalerror_view_integration fails intermittently in parallel tests

Bug #974617 reported by Gary Poster
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Launchpad itself
Fix Released
High
Benji York
Python PGBouncer
Triaged
High
Unassigned

Bug Description

This is the traceback. See the attachment for the full test information

======================================================================
ERROR: lp.services.webapp.tests.test_error.TestDatabaseErrorViews.test_operationalerror_view_integration
----------------------------------------------------------------------

[...]

Traceback (most recent call last):
  File "/var/lib/buildbot/slaves/slave/lucid-devel/build/lib/lp/services/webapp/tests/test_error.py", line 116, in test_operationalerror_view_integration
    urllib2.urlopen(url)
  File "/usr/lib/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.6/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/usr/lib/python2.6/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.6/urllib2.py", line 435, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.6/urllib2.py", line 518, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 503: Service Unavailable

Related branches

Revision history for this message
Gary Poster (gary) wrote :
Revision history for this message
Gary Poster (gary) wrote :

I'm not able to dupe this after repeated attempts in a non-parallel lxc lucid container. Intermittent issues often signal timing issues, and the possibility that jumps out to me is that the bouncer isn't quite started yet. In pgbouncer.fixture.PGBouncerFixture.start, the code waits to see the bouncer's pid. Perhaps the presence of the bouncer's pid is not sufficient to indicate that the bouncer is ready. In that case, a proper fix might be to change bouncer code so that the pid is in fact a reliable indicator; or to change the fixture to look for some other more reliable signal. A more temporary fix (and/or an easier indication that this hypothesis has merit) might be to retry the assertion in this test a few time with a time.sleep(0.5) between each attempt. Working with a pgbouncer dev might be helpful.

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 974617] Re: test_operational_view_integration fails intermittently in parallel tests

Starting with a 0.5 sec sleep after pgbouncerFixture.setUp() sounds
like a sensible diagnostic.

Changed in python-pgbouncer:
status: New → Triaged
importance: Undecided → High
Brad Crittenden (bac)
Changed in launchpad:
assignee: nobody → Brad Crittenden (bac)
status: Triaged → In Progress
Revision history for this message
Launchpad QA Bot (lpqabot) wrote :

Fixed in stable r15086 (http://bazaar.launchpad.net/~launchpad-pqm/launchpad/stable/revision/15086) by a commit, but not testable.

tags: added: qa-untestable
Changed in launchpad:
status: In Progress → Fix Committed
William Grant (wgrant)
Changed in launchpad:
status: Fix Committed → Fix Released
Gary Poster (gary)
Changed in launchpad:
status: Fix Released → Triaged
assignee: Brad Crittenden (bac) → nobody
Revision history for this message
Gary Poster (gary) wrote : Re: test_operational_view_integration fails intermittently in parallel tests

The error happened again, even with 5 tries spread across .5 seconds between each. See line 7419 and previous of http://pastebin.ubuntu.com/934203/ .

(Additionally the format string uses a Python 2.7 feature, leading to the error seen in the traceback. "bouncer did not come up after {} attempts.".format(retries) needs to be something like "bouncer did not come up after {count} attempts.".format(count=retries) in 2.6.)

Brad Crittenden (bac)
summary: - test_operational_view_integration fails intermittently in parallel tests
+ test_operationalerror_view_integration fails intermittently in parallel
+ tests
Revision history for this message
Launchpad QA Bot (lpqabot) wrote :

Fixed in stable r15157 (http://bazaar.launchpad.net/~launchpad-pqm/launchpad/stable/revision/15157) by a commit, but not testable.

Changed in launchpad:
assignee: nobody → Brad Crittenden (bac)
status: Triaged → Fix Committed
William Grant (wgrant)
Changed in launchpad:
status: Fix Committed → Fix Released
Gary Poster (gary)
Changed in launchpad:
status: Fix Released → In Progress
status: In Progress → Triaged
Revision history for this message
Gary Poster (gary) wrote :

This still isn't quite addressed. I got another instance today (breaking our 12-run green streak!) http://pastebin.ubuntu.com/975858/ .

Right now, in the pgbouncer fixture, we are saying that we have finished with the "start" method as soon as there is a pgbouncer subprocess and we can poll it. There's no guarantee that this will mean that pgbouncer is actually "open for business". There might or might not be a guarantee that Launchpad is ready to recognize that pgbouncer has returned.

I think we should use the method Brad did initially (retrying the url) but just give it some more time. Let's keep the code in the bouncer's start method too--it's good. But let's re-add this (see https://code.launchpad.net/~bac/launchpad/bug-974617-2/+merge/103339), and increase the retry count to 10.

26 - # bouncer.start() can sometimes return before the service is actually
27 - # available for use. To be defensive, let's retry a few times. See
28 - # bug 974617.
29 - retries = 5
30 - for i in xrange(retries):
31 - try:
32 - urllib2.urlopen(url)
33 - except urllib2.HTTPError as e:
34 - if e.code != httplib.SERVICE_UNAVAILABLE:
35 - raise
36 - else:
37 - break
38 - time.sleep(0.5)
39 - else:
40 - raise TimeoutException(
41 - "bouncer did not come up after {} attempts.".format(retries))

Benji York (benji)
Changed in launchpad:
status: Triaged → In Progress
assignee: Brad Crittenden (bac) → Benji York (benji)
Revision history for this message
Launchpad QA Bot (lpqabot) wrote :
tags: added: qa-needstesting
removed: qa-untestable
Changed in launchpad:
status: In Progress → Fix Committed
Gary Poster (gary)
tags: added: qa-untestable
removed: qa-needstesting
Revision history for this message
Gary Poster (gary) wrote :

Still not resolved. Failed again today with Benji's patch. Benji's patch shows that pgbouncer is ready, but not that LP is ready. I still think we should re-add explicitly what is in comment 7. If that fails, after five seconds of retries and the checks Benji added, I'd say we need to look elsewhere for why we keep getting this failure.

Changed in launchpad:
status: Fix Committed → Triaged
Gary Poster (gary)
Changed in launchpad:
status: Triaged → Fix Committed
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Bug attachments

Remote bug watches

Bug watches keep track of this bug in other bug trackers.