Bug #1000805 “server errors accessing attachments of private bug reports” : Bugs : Launchpad itself

Laura Czajkowski (czajkowski) on 2012-05-17

Changed in launchpad:
status:	New → Triaged
importance:	Undecided → Critical

William Grant (wgrant) wrote on 2012-05-18:

#1

2012-05-18 02:41:44+0000 [-] Unhandled error in Deferred:
2012-05-18 02:41:44+0000 [-] Unhandled Error
        Traceback (most recent call last):
          File "/usr/lib/python2.6/threading.py", line 504, in __bootstrap
            self.__bootstrap_inner()
          File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
            self.run()
          File "/usr/lib/python2.6/threading.py", line 484, in run
            self.__target(*self.__args, **self.__kwargs)
        --- <exception caught here> ---
          File "/srv/launchpadlibrarian.net/production/launchpad2-rev-15135/eggs/Twisted-11.1.0-py2.6-linux-x86_64.egg/twisted/python/threadpool.py", line 207, in _worker
            result = context.call(ctx, function, *args, **kwargs)
          File "/srv/launchpadlibrarian.net/production/launchpad2-rev-15135/eggs/Twisted-11.1.0-py2.6-linux-x86_64.egg/twisted/python/context.py", line 118, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "/srv/launchpadlibrarian.net/production/launchpad2-rev-15135/eggs/Twisted-11.1.0-py2.6-linux-x86_64.egg/twisted/python/context.py", line 81, in callWithContext
            return func(*args,**kw)
          File "/srv/launchpadlibrarian.net/production/launchpad2-rev-15135/lib/lp/services/database/__init__.py", line 37, in retry_transaction_decorator
            return func(*args, **kwargs)
          File "/srv/launchpadlibrarian.net/production/launchpad2-rev-15135/lib/lp/services/database/sqlbase.py", line 555, in reset_store_decorator
            return func(*args, **kwargs)
          File "/srv/launchpadlibrarian.net/production/launchpad2-rev-15135/lib/lp/services/database/__init__.py", line 73, in write_transaction_decorator
            ret = func(*args, **kwargs)
          File "/srv/launchpadlibrarian.net/production/launchpad2-rev-15135/lib/lp/services/librarianserver/web.py", line 129, in _getFileAlias
            alias = self.storage.getFileAlias(aliasID, token, path)
          File "/srv/launchpadlibrarian.net/production/launchpad2-rev-15135/lib/lp/services/librarianserver/storage.py", line 82, in getFileAlias
            return self.library.getAlias(aliasid, token, path)
          File "/srv/launchpadlibrarian.net/production/launchpad2-rev-15135/lib/lp/services/librarianserver/db.py", line 60, in getAlias
            TimeLimitedToken.path==path).is_empty()
          File "/srv/launchpadlibrarian.net/production/launchpad2-rev-15135/eggs/storm-0.19.0.99_lpwithnodatetime_r406-py2.6-linux-x86_64.egg/storm/store.py", line 1077, in is_empty
            result = self._store._connection.execute(select)
          File "/srv/launchpadlibrarian.net/production/launchpad2-rev-15135/eggs/storm-0.19.0.99_lpwithnodatetime_r406-py2.6-linux-x86_64.egg/storm/databases/postgres.py", line 266, in execute
            return Connection.execute(self, statement, params, noresult)
          File "/srv/launchpadlibrarian.net/production/launchpad2-rev-15135/eggs/storm-0.19.0.99_lpwithnodatetime_r406-py2.6-linux-x86_64.egg/storm/database.py", line 238, in execute
            raw_cursor = self.raw_execute(statement, params)
          File "/srv/launchpadlibrarian.net/production/launchpad2-rev-15135/eggs/storm-0.19.0.99_lpwithnodatetime_r406-py2.6-linux-x86_64.egg/storm/databases/postgres.py", line 276, in raw_execute
            return Connection.raw_execute(self, statement, params)
          File "/srv/launchpadlibrarian.net/production/launchpad2-rev-15135/eggs/storm-0.19.0.99_lpwithnodatetime_r406-py2.6-linux-x86_64.egg/storm/database.py", line 322, in raw_execute
            self._check_disconnect(raw_cursor.execute, *args)
          File "/srv/launchpadlibrarian.net/production/launchpad2-rev-15135/eggs/storm-0.19.0.99_lpwithnodatetime_r406-py2.6-linux-x86_64.egg/storm/database.py", line 371, in _check_disconnect
            return function(*args, **kwargs)
        psycopg2.OperationalError: could not send data to server: Connection timed out

Robert Collins (lifeless) on 2012-05-30

tags:

added: oops

Cody A.W. Somerville (cody-somerville) on 2012-05-30

tags:

added: oem-services

Francis J. Lacoste (flacoste) wrote on 2012-05-31:

#2

My understanding is that such errors should be retried transparently. Could the PG 9.1 upgrade have slightly changed the error messages so that our retry filters are now broken?

Francis J. Lacoste (flacoste) wrote on 2012-05-31:

#3

Yes, retry_transaction only catches:

    except (DisconnectionError, IntegrityError,
            TransactionRollbackError):

So OperationalError isn't in that list.
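
To illustrate the point above, here is a minimal sketch of a retry decorator that only retries a fixed tuple of exception types. It is not Launchpad's actual retry_transaction: the exception classes are hypothetical stand-ins defined locally (their real class hierarchy in Storm and psycopg2 is not modelled), and the `attempts` parameter is an assumption.

```python
import functools


# Hypothetical stand-ins for the exception classes named in this thread,
# defined here so the sketch runs without Storm or psycopg2 installed.
class DisconnectionError(Exception):
    pass


class IntegrityError(Exception):
    pass


class TransactionRollbackError(Exception):
    pass


class OperationalError(Exception):
    pass


# Only these exception types trigger a retry.
RETRYABLE = (DisconnectionError, IntegrityError, TransactionRollbackError)


def retry_transaction(func, attempts=3):
    """Call func again when it raises a retryable error (assumed policy)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return func(*args, **kwargs)
            except RETRYABLE:
                # Give up and re-raise once the last attempt has failed.
                if attempt == attempts - 1:
                    raise
    return wrapper
```

With a filter like this, an OperationalError raised inside the wrapped function is never caught by the `except` clause, so it propagates on the first attempt, which would match the "Unhandled Error" seen in the traceback.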

Brian Murray (brian-murray) wrote on 2012-05-31:

#4

I believe I am encountering this in the web interface now where I receive 'Processing failed' messages when trying to open attachments.

JuanJo Ciarlante (jjo) wrote on 2012-05-31:

#6

This is what I'm finding on the backend and on pgbouncer:
- backend: 27 TCP connections to pgbouncer:
   $ /sbin/ss -p state established dport = :5433 | egrep twistd | wc -l
   27

- pgbouncer: 20 PG connections (still 27 TCP connections, obviously):
   $ psql .. -c 'SHOW POOLS;'|sed -n '1p;/librarian /p'
       database | user | cl_active | cl_waiting | sv_active | sv_idle | sv_used | sv_tested | sv_login | maxwait
      <...> | librarian | 20 | 0 | 20 | 0 | 0 | 0 | 0 | 0
   $ /sbin/ss state established dst ${backend_ip} sport = :5433|wc -l
   28 ## (one for the header line)
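
The comparison above (27 TCP connections against 20 active pool clients) can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, the sample text is copied from the output above, and the sketch computes the gap without explaining what the remaining connections are doing.

```python
def parse_pools(text):
    """Parse pipe-delimited `SHOW POOLS` output into a list of dicts."""
    rows = [
        [cell.strip() for cell in line.split("|")]
        for line in text.splitlines()
        if "|" in line  # skips blank lines and psql's "---+---" separator
    ]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def unaccounted_clients(tcp_connections, pool):
    """TCP connections not counted as active or waiting pool clients."""
    return tcp_connections - int(pool["cl_active"]) - int(pool["cl_waiting"])


SAMPLE = """\
 database | user | cl_active | cl_waiting | sv_active | sv_idle | sv_used | sv_tested | sv_login | maxwait
 <...> | librarian | 20 | 0 | 20 | 0 | 0 | 0 | 0 | 0
"""

pool = parse_pools(SAMPLE)[0]
print(unaccounted_clients(27, pool))  # 27 - 20 - 0 = 7
```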

Stuart Bishop (stub) wrote on 2012-05-31:

#7

Storm needs to catch this (and also the other OperationalError in Bug #986148):

psycopg2.OperationalError: could not send data to server: Connection timed out

psycopg2 2.4 or libpq could be the source of the new exception being seen in Launchpad.

On the Launchpad side, sockets failing like this is worrying. We are using an antique version of pgbouncer, so upgrading it might be worth a punt.

Robert Collins (lifeless) wrote on 2012-05-31: Re: [Bug 1000805] Re: server errors accessing attachments of private bug reports

#8

On Fri, Jun 1, 2012 at 8:21 AM, Stuart Bishop
<email address hidden> wrote:

> On the Launchpad side, sockets failing like this is worrying.
+1000000

William Grant (wgrant) wrote on 2012-06-01:

#9

Francis, it's unrelated to the retry logic. This sort of error should not be retried.

Robert Collins (lifeless) wrote on 2012-06-01:

#10

This error is basically the same as any timeout error: on the LP appservers, the socket timeout is greater than the request timeout, so we'll kill the request anyway.

With the Librarian we don't have that strict timing protocol in place, which is why this becomes visible.

This needs urgent investigation, but I see no reason to change Storm (and lots of reasons not to).

Robert Collins (lifeless) wrote on 2012-06-01:

#11

Do we know whether the error originates between librarian<->pgbouncer or between pgbouncer<->DB?

no longer affects:

storm

Robert Collins (lifeless) wrote on 2012-06-01:

#12

The other worrying aspect is that we're not seeing an oops id in the headers; but that may just not be glued up atm.

Chris Van Hoof (vanhoof) wrote on 2012-06-11:

#13

Seeing this again regularly across a large number of bugs; ping me if you'd like specific bugs where I've seen this happen.

JuanJo Ciarlante (jjo) wrote on 2012-06-11:

#14

Generated attached PNG for Connection.timed.out, with:

===========================
set xdata time
set timefmt "%Y-%m-%d %H:"
set format x "%b %d"
# generated with:
# egrep -h -B40 Connection.timed.out /srv/launchpadlibrarian.net/production-logs/librarian.log*|egrep Unhandled.Error|cut -c1-14|sort |uniq -c|sed -r 's/[0-9]+/&\t/'|tee /tmp/librarian.timeout.dat
set term png size 800,600; set out '/tmp/out.png'
plot '/tmp/librarian.timeout.dat' using 2:1 with linespoints
===========================

-> http://people.canonical.com/~jjo/librarian.timeout.20120611.png

Note that we restarted the librarians on May 31st, and did so again about 1 hour ago.
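
For readers without the shell pipeline to hand, the per-hour counting done by the egrep/cut/uniq chain above can be approximated in Python. This is a rough sketch, not the script that produced the graph: the function name is hypothetical, and instead of a fixed 40-line window it attributes each "Connection timed out" line to the most recent "Unhandled Error" header before it. It also keys on the first 13 characters ("YYYY-MM-DD HH") rather than the 14 that `cut -c1-14` keeps.

```python
from collections import Counter


def timeouts_per_hour(lines):
    """Count 'Connection timed out' tracebacks per hour of their log header."""
    counts = Counter()
    last_error_hour = None
    for line in lines:
        if "Unhandled Error" in line:
            # Header lines start with a timestamp such as "2012-05-18 02:41:44+0000".
            last_error_hour = line[:13]
        elif "Connection timed out" in line and last_error_hour is not None:
            counts[last_error_hour] += 1
            # Reset so one header is never counted twice.
            last_error_hour = None
    return counts
```

Calling `timeouts_per_hour(open(path))` on a log file yields a mapping from hour to count, which could then be written out in the two-column form the gnuplot script reads.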

Robert Collins (lifeless) wrote on 2012-06-11:

#15

What motivated the restarts?

Tom Haddon (mthaddon) wrote on 2012-06-12:

#16

Current theory is that this is related to firewall updates, since we reset connections before applying changes, and the librarian app servers are in a different DC to the database server. Need to try and correlate the problems we've seen with firewall updates, to confirm both positively and negatively that this is the cause.

Robert Collins (lifeless) on 2012-06-12

description:

updated

Haw Loeung (hloeung) on 2012-06-17

tags:

added: canonical-webops-lp

Liam Young (gnuoy) wrote on 2012-07-16:

#17

This doesn't seem to be related to firewall reloads. The firewall logs have been checked the last few times the issue was reported, and there was no corresponding firewall update.

Stuart Bishop (stub) wrote on 2012-07-16:

#18

I've reopened the Storm fix as Bug #1025264, as this bug seems to have become about diagnosing Launchpad production network issues. The Storm fix would hide the underlying issue.

Stuart Bishop (stub) wrote on 2012-07-16:

#19

@lifeless, per comment #11: From the traceback we can't tell if there was a network issue between client and pg_bouncer or between pg_bouncer and postgresql, just that the client was unable to send data to its socket.

We should probably check the pgbouncer logs for interesting messages when we see new instances. I believe I checked an earlier report and saw nothing, but that is just anecdata.

Stuart Bishop (stub) wrote on 2012-07-23:

#20

We found a connection limit in an unexpected place (PostgreSQL database role connection limit, probably set several years ago). This has been removed, and hopefully the problem will disappear.
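
For anyone chasing a similar problem: PostgreSQL exposes per-role connection limits in the `pg_roles` view, where a `rolconnlimit` of -1 means no limit. The sketch below lists roles that do have a limit. It is an illustration under assumptions, not the check that was actually run here: the function name is hypothetical, and `cursor` is assumed to be any DB-API cursor (for example one from psycopg2) connected to the server in question.

```python
def roles_with_connection_limit(cursor):
    """Return (role name, limit) pairs for roles with a connection limit set."""
    cursor.execute(
        "SELECT rolname, rolconnlimit FROM pg_roles "
        "WHERE rolconnlimit >= 0 ORDER BY rolname"
    )
    return cursor.fetchall()
```

A limit found this way can be lifted with `ALTER ROLE <role> CONNECTION LIMIT -1`.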

Haw Loeung (hloeung) on 2012-09-06

Changed in launchpad:
status:	Triaged → Fix Released
