server errors accessing attachments of private bug reports

Bug #1000805 reported by Brian Murray
68
This bug affects 14 people
Affects Status Importance Assigned to Milestone
Launchpad itself
Fix Released
Critical
Unassigned

Bug Description

The librarian is erroring intermittently and is receiving socket errors in its backend.

-ops are restarting it when this happens, which appears to fix the symptoms.

 I regularly receive errors when trying to access the attachments of bug reports (all private afaict) using the API.

Here is some sample code:

for attachment in bug.attachments:
    f = attachment.data.open()
    print(f.filename)

I'll receive errors like:

ServerError Traceback (most recent call last)
/home/theman/documents/canonical/launchpadlib/<ipython-input-8-649cbaee9cc1> in <module>()
      1 for attachment in bug.attachments:
----> 2 f = attachment.data.open()
      3 print(f.filename)
...
status: 500
transfer-encoding: chunked
vary: Accept-Encoding
via: 1.0 banana.canonical.com:3128 (squid/2.7.STABLE7), 1.1 wildcard.restricted.launchpadlibrarian.net
x-cache: MISS from banana.canonical.com
x-cache-lookup: MISS from banana.canonical.com:3128
---
Response body:
---
<html><head><title>Processing Failed</title></head><body><b>Processing Failed</b></body></html>

Changed in launchpad:
status: New → Triaged
importance: Undecided → Critical
Revision history for this message
William Grant (wgrant) wrote :
Download full text (3.9 KiB)

2012-05-18 02:41:44+0000 [-] Unhandled error in Deferred:
2012-05-18 02:41:44+0000 [-] Unhandled Error
        Traceback (most recent call last):
          File "/usr/lib/python2.6/threading.py", line 504, in __bootstrap
            self.__bootstrap_inner()
          File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
            self.run()
          File "/usr/lib/python2.6/threading.py", line 484, in run
            self.__target(*self.__args, **self.__kwargs)
        --- <exception caught here> ---
          File "/srv/launchpadlibrarian.net/production/launchpad2-rev-15135/eggs/Twisted-11.1.0-py2.6-linux-x86_64.egg/twisted/python/threadpool.py", line 207, in _worker
            result = context.call(ctx, function, *args, **kwargs)
          File "/srv/launchpadlibrarian.net/production/launchpad2-rev-15135/eggs/Twisted-11.1.0-py2.6-linux-x86_64.egg/twisted/python/context.py", line 118, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "/srv/launchpadlibrarian.net/production/launchpad2-rev-15135/eggs/Twisted-11.1.0-py2.6-linux-x86_64.egg/twisted/python/context.py", line 81, in callWithContext
            return func(*args,**kw)
          File "/srv/launchpadlibrarian.net/production/launchpad2-rev-15135/lib/lp/services/database/__init__.py", line 37, in retry_transaction_decorator
            return func(*args, **kwargs)
          File "/srv/launchpadlibrarian.net/production/launchpad2-rev-15135/lib/lp/services/database/sqlbase.py", line 555, in reset_store_decorator
            return func(*args, **kwargs)
          File "/srv/launchpadlibrarian.net/production/launchpad2-rev-15135/lib/lp/services/database/__init__.py", line 73, in write_transaction_decorator
            ret = func(*args, **kwargs)
          File "/srv/launchpadlibrarian.net/production/launchpad2-rev-15135/lib/lp/services/librarianserver/web.py", line 129, in _getFileAlias
            alias = self.storage.getFileAlias(aliasID, token, path)
          File "/srv/launchpadlibrarian.net/production/launchpad2-rev-15135/lib/lp/services/librarianserver/storage.py", line 82, in getFileAlias
            return self.library.getAlias(aliasid, token, path)
          File "/srv/launchpadlibrarian.net/production/launchpad2-rev-15135/lib/lp/services/librarianserver/db.py", line 60, in getAlias
            TimeLimitedToken.path==path).is_empty()
          File "/srv/launchpadlibrarian.net/production/launchpad2-rev-15135/eggs/storm-0.19.0.99_lpwithnodatetime_r406-py2.6-linux-x86_64.egg/storm/store.py", line 1077, in is_empty
            result = self._store._connection.execute(select)
          File "/srv/launchpadlibrarian.net/production/launchpad2-rev-15135/eggs/storm-0.19.0.99_lpwithnodatetime_r406-py2.6-linux-x86_64.egg/storm/databases/postgres.py", line 266, in execute
            return Connection.execute(self, statement, params, noresult)
          File "/srv/launchpadlibrarian.net/production/launchpad2-rev-15135/eggs/storm-0.19.0.99_lpwithnodatetime_r406-py2.6-linux-x86_64.egg/storm/database.py", line 238, in execute
            raw_cursor = self.raw_execute(statement, params)
          File "/srv/laun...

Read more...

tags: added: oops
tags: added: oem-services
Revision history for this message
Francis J. Lacoste (flacoste) wrote :

My understanding is that such errors should be retried transparently. Can the PG 9.1 upgrade changed slightly the error messages so that our retry filters are now broken?

Revision history for this message
Francis J. Lacoste (flacoste) wrote :

Yes, retry_transaction only catches:

             except (DisconnectionError, IntegrityError,
                    TransactionRollbackError):
So OperationalError isn't in that list.

Revision history for this message
Brian Murray (brian-murray) wrote :

I believe I am encountering this in the web interface now where I receive 'Processing failed' messages when trying to open attachments.

Revision history for this message
JuanJo Ciarlante (jjo) wrote :

This is what I'm finding at backend, and pgbouncer:
- backend : 27 tcp connections to pgbouncer:
  $ /sbin/ss -p state established dport = :5433 | egrep twistd | wc -l
  27

- pgbouncer: 20 PG conns (still 27 TCPs, obviously):
   $ psql .. -c 'SHOW POOLS;'|sed -n '1p;/librarian /p'
       database | user | cl_active | cl_waiting | sv_active | sv_idle | sv_used | sv_tested | sv_login | maxwait
      <...> | librarian | 20 | 0 | 20 | 0 | 0 | 0 | 0 | 0
   $ /sbin/ss state established dst ${backend_ip} sport = :5433|wc -l
   28 ## (one for the headerline)

Revision history for this message
Stuart Bishop (stub) wrote :

Storm needs to catch this (and also the other OperationalError in Bug #986148):

psycopg2.OperationalError: could not send data to server: Connection timed out

psycopg2 2.4 or libpq could be the source of the new exception being seen in Launchpad.

On the Launchpad side, sockets failing like this is worrying. We are using an antique version of pg_bouncer so upgrading this might be worth a punt.

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 1000805] Re: server errors accessing attachments of private bug reports

On Fri, Jun 1, 2012 at 8:21 AM, Stuart Bishop
<email address hidden> wrote:

> On the Launchpad side, sockets failing like this is worrying.
+1000000

Revision history for this message
William Grant (wgrant) wrote :

Francis, it's unrelated to the retry logic. This sort of error should not be retried.

Revision history for this message
Robert Collins (lifeless) wrote :

this error is basically the same as any timeout error: with LP appserves, socket timeout is > request timeout - we'll kill the request anyway.

With the Librarian we don't have that strict timing protocol in place, which is why this becomes visible.

This needs urgent investigation, but I see no reason to change storm (and lots of reasons not too).

Revision history for this message
Robert Collins (lifeless) wrote :

Do we know if the error is originating from librarian<->pgbouncer, or pgbouncer<->DB ?

no longer affects: storm
Revision history for this message
Robert Collins (lifeless) wrote :

The other worrying aspect is that we're not seeing an oops id in the headers; but that may just not be glued up atm.

Revision history for this message
Chris Van Hoof (vanhoof) wrote :

Seeing this again regularily across a large amount of bugs, ping me if you'd like specific bugs where I've seen this happen.

Revision history for this message
JuanJo Ciarlante (jjo) wrote :

Generated attached PNG for Connection.timed.out, with:

===========================
set xdata time
set timefmt "%Y-%m-%d %H:"
set format x "%b %d"
# generated with:
# egrep -h -B40 Connection.timed.out /srv/launchpadlibrarian.net/production-logs/librarian.log*|egrep Unhandled.Error|cut -c1-14|sort |uniq -c|sed -r 's/[0-9]+/&\t/'|tee /tmp/librarian.timeout.dat
set term png size 800,600; set out '/tmp/out.png'
plot '/tmp/librarian.timeout.dat' using 2:1 with linespoints
===========================

-> http://people.canonical.com/~jjo/librarian.timeout.20120611.png

Note that we had restarted librarians on May/31st, did it again about 1hr ago

Revision history for this message
Robert Collins (lifeless) wrote :

What motivated the restarts?

Revision history for this message
Tom Haddon (mthaddon) wrote :

Current theory is that this is related to firewall updates, since we reset connections before applying changes, and the librarian app servers are in a different DC to the database server. Need to try and correlate the problems we've seen with firewall updates, to confirm both positively and negatively that this is the cause.

description: updated
Haw Loeung (hloeung)
tags: added: canonical-webops-lp
Revision history for this message
Liam Young (gnuoy) wrote :

This doesn't seem to be related to firewall reloads. The firewall logs have been checked the last few times the issue was reported and there was no corresponding firewall update

Revision history for this message
Stuart Bishop (stub) wrote :

I've reopened the Storm fix as Bug #1025264 , as this bug seems to have become about diagnosing Launchpad production network issues. The Storm fix would hide the underlying issue.

Revision history for this message
Stuart Bishop (stub) wrote :

@lifeless, per comment #11: From the traceback we can't tell if there was a network issue between client and pg_bouncer or between pg_bouncer and postgresql, just that the client was unable to send data to its socket.

We should probably check the pgbouncer logs for interesting messages when we see new instances. I believe I checked an earlier report and saw nothing, but that is just anecdata.

Revision history for this message
Stuart Bishop (stub) wrote :

We found a connection limit in an unexpected place (PostgreSQL database role connection limit, probably set several years ago). This has been removed, and hopefully the problem will disappear.

Haw Loeung (hloeung)
Changed in launchpad:
status: Triaged → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.