Launchpad itself

Connection reaper killed connection due to a SoftRequestTimeout

Bug #335172 reported by Diogo Matsubara on 2009-02-26

Affects            Status      Importance   Assigned to   Milestone
Launchpad itself   Won't Fix   Critical     Unassigned

Bug Description

As seen in OOPS-1152EA162, a DisconnectionError was logged because a previous query took too long to execute.

OOPS-1152EA161 shows that the request preceding the DisconnectionError spent 174946 ms in SQL time.

https://pastebin.canonical.com/14335/ shows the output of the cronjob that kills idle connections around the time the OOPS was logged.


Diogo Matsubara (matsubara) wrote on 2009-02-26:

From today's meeting:

<matsubara> herb, anything happened to the DB during the time of this OOPS-1152EA162?
<matsubara> or maybe stub might know ^
<herb> matsubara: nothing in the incident log.
<stub> matsubara: That is one of the connection reaper scripts kicking in
<herb> matsubara: I think that's also on the void between LOSAs.
<herb> ah, there we go.
<stub> We kill connections idle in a transaction more than a few hours (and should be more aggressive), and appserver connections that have been in a transaction for more than 2 minutes.
<Ursinha> stub, I see
<matsubara> stub, ok. so if we start seeing too many of those, we have a problem somewhere and a few is kinda normal?
<stub> The notification gets sent to the error-reports list (where we can confirm that this is indeed what happened)
<matsubara> stub, aha. that's better. I'll chase the lp-errors for that one
<matsubara> s/lp-errors/lp-errors list/
<stub> If we see many of them, we have a problem. One is probably a problem - appserver requests taking two minutes on the db means we need to investigate why the normal timeout mechanisms didn't work.
<matsubara> right. thanks for the explanation
<stub> -1 second non-sql time, 0 seconds total time indicates a problem at the appserver? The request never got started?
<matsubara> I'll file a bug about that one and we can discuss there
<stub> hmm... might be a reconnection bug - perhaps the previous request handled by that thread got killed?
<stub> I don't know if we Retry on DisconnectionError exceptions, or if it is a good idea in all cases.
<matsubara> ok
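
The reaper policy stub describes above can be sketched roughly as follows. This is only an illustration, not the actual reaper script: the `Connection` class, its fields, and `should_reap` are hypothetical names, and the sketch assumes the thresholds are measured from the start of the transaction. The 2 minute appserver limit comes from the discussion; the "few hours" limit is not specified there, so it is left as a parameter.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Limit for appserver connections, as stated in the discussion.
APPSERVER_TXN_LIMIT = timedelta(minutes=2)


@dataclass
class Connection:
    """Hypothetical snapshot of one database connection."""
    is_appserver: bool
    in_transaction: bool
    idle: bool
    transaction_start: Optional[datetime] = None


def should_reap(conn, now, idle_txn_limit):
    """Return True if the sketched policy would kill this connection.

    idle_txn_limit stands in for the unspecified "few hours" threshold.
    """
    if not conn.in_transaction or conn.transaction_start is None:
        return False
    age = now - conn.transaction_start
    # Appserver connections get the short limit whether idle or busy.
    if conn.is_appserver and age > APPSERVER_TXN_LIMIT:
        return True
    # Any other connection is only reaped when idle in a transaction.
    return conn.idle and age > idle_txn_limit
```

Under these assumptions, the request in OOPS-1152EA161 (174946 ms of SQL time, roughly 175 seconds) would be over the 2 minute appserver limit, which is consistent with the reaper killing its connection.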

Curtis Hovey (sinzui) on 2010-01-25

Changed in launchpad-foundations:
status:	New → Triaged
importance:	Undecided → Low

Robert Collins (lifeless) on 2011-01-12

Changed in launchpad:
importance:	Low → Critical


Robert Collins (lifeless) wrote on 2011-03-17:

So, I'm 99% sure this is a race condition with the idle killer; it's inherent in the way we're tackling the problem and unsolvable in that model.

Changed in launchpad:
status:	Triaged → Won't Fix
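
The report does not spell out the race, so the following is only one plausible reading, in line with stub's earlier guess that the previous request on the same thread was the one that got killed. It is a deterministic toy model with hypothetical names (`Conn`, `reaper_observe`, `reaper_kill`): an external killer first observes a connection and later kills it, and nothing prevents the connection from moving on to a new request in between.

```python
class Conn:
    """Toy connection that is reused for consecutive requests."""

    def __init__(self):
        self.request_id = 1  # the slow request currently running
        self.alive = True


def reaper_observe(conn):
    # Step 1: the reaper samples the connection and sees a long-running request.
    return conn.request_id


def reaper_kill(conn):
    # Step 2: the reaper kills the connection, whatever it is doing by now.
    conn.alive = False
    return conn.request_id  # the request that is actually interrupted


def simulate():
    conn = Conn()
    observed = reaper_observe(conn)
    # Between the two reaper steps the slow request finishes and the
    # same connection starts serving the next request.
    conn.request_id = 2
    hit = reaper_kill(conn)
    return observed, hit
```

In this model `simulate()` returns `(1, 2)`: the reaper decided to kill because of request 1, but request 2 is the one that loses its connection. Because the check and the kill are separate steps performed from outside the application, narrowing the gap makes the race less likely without removing it.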
