Connection reaper killed connection due to a SoftRequestTimeout

Bug #335172 reported by Diogo Matsubara
Affects           Status     Importance  Assigned to  Milestone
Launchpad itself  Won't Fix  Critical    Unassigned

Bug Description

As seen in OOPS-1152EA162, a DisconnectionError was logged because a previous query took too long to execute.

OOPS-1152EA161 shows that the request preceding the DisconnectionError spent 174946 ms (about 175 seconds) of SQL time.

https://pastebin.canonical.com/14335/ shows the output of the cronjob that kills idle connections around the time the OOPS was logged.

Diogo Matsubara (matsubara) wrote :

From today's meeting:

<matsubara> herb, anything happened to the DB during the time of this OOPS-1152EA162?
<matsubara> or maybe stub might know ^
<herb> matsubara: nothing in the incident log.
<stub> matsubara: That is one of the connection reaper scripts kicking in
<herb> matsubara: I think that's also on the void between LOSAs.
<herb> ah, there we go.
<stub> We kill connections idle in a transaction more than a few hours (and should be more aggressive), and appserver connections that have been in a transaction for more than 2 minutes.
<Ursinha> stub, I see
<matsubara> stub, ok. so if we start seeing too many of those, we have a problem somewhere and a few is kinda normal?
<stub> The notification gets sent to the error-reports list (where we can confirm that this is indeed what happened)
<matsubara> stub, aha. that's better. I'll chase the lp-errors for that one
<matsubara> s/lp-errors/lp-errors list/
<stub> If we see many of them, we have a problem. One is probably a problem - appserver requests taking two minutes on the db means we need to investigate why the normal timeout mechanisms didn't work.
<matsubara> right. thanks for the explanation
<stub> -1 second non-sql time, 0 seconds total time indicates a problem at the appserver? The request never got started?
<matsubara> I'll file a bug about that one and we can discuss there
<stub> hmm... might be a reconnection bug - perhaps the previous request handled by that thread got killed?
<stub> I don't know if we Retry on DisconnectionError exceptions, or if it is a good idea in all cases.
<matsubara> ok
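
As a rough illustration of the reaper rule stub describes, here is a minimal Python sketch. It is not the actual Launchpad reaper script: the `Connection` record, the `should_reap` function and the "appserver" role name are hypothetical. "A few hours" is not quantified in the discussion, so the 3-hour value is a placeholder. Only the 2-minute appserver limit comes from the chat above.

```python
from dataclasses import dataclass
from datetime import timedelta

# The 2-minute appserver limit is taken from the discussion.
APPSERVER_TXN_LIMIT = timedelta(minutes=2)
# ASSUMPTION: "a few hours" is not quantified; 3 hours is a placeholder.
IDLE_IN_TXN_LIMIT = timedelta(hours=3)


@dataclass
class Connection:
    pid: int
    user: str              # hypothetical role name, e.g. "appserver"
    in_transaction: bool
    txn_age: timedelta     # time since the current transaction began
    idle_for: timedelta    # time since the last query finished (0 if busy)


def should_reap(conn: Connection) -> bool:
    """Return True if the policy sketched above would kill this connection."""
    if not conn.in_transaction:
        return False
    # Appserver connections: any transaction older than 2 minutes.
    if conn.user == "appserver" and conn.txn_age > APPSERVER_TXN_LIMIT:
        return True
    # Any connection: idle inside a transaction for longer than the limit.
    return conn.idle_for > IDLE_IN_TXN_LIMIT


# Example: an appserver transaction about as old as the 174946 ms of SQL
# time reported in OOPS-1152EA161 is over the 2-minute limit.
slow = Connection(pid=1, user="appserver", in_transaction=True,
                  txn_age=timedelta(milliseconds=174946),
                  idle_for=timedelta(0))
reap_slow = should_reap(slow)
```

Under this sketch, a request like the one in OOPS-1152EA161 would be selected for reaping, which matches stub's point that even one such kill means the normal timeout mechanisms did not fire first.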

Curtis Hovey (sinzui)
Changed in launchpad-foundations:
status: New → Triaged
importance: Undecided → Low
Changed in launchpad:
importance: Low → Critical
Robert Collins (lifeless) wrote :

So, I'm 99% sure this is a race condition with the idle killer. It's inherent in the way we're tackling the problem and unsolvable in that model.
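
The comment does not spell out the mechanism, so the following is only one possible reading: a check-then-kill race, where the killer decides from a snapshot and the kill lands after the connection has moved on to a new request. The sketch is a deterministic toy model in Python. `Conn`, `reaper_snapshot` and `reaper_kill` are hypothetical names, not Launchpad code.

```python
from dataclasses import dataclass


@dataclass
class Conn:
    request_id: int      # request currently using the connection
    txn_seconds: float   # age of its current transaction
    alive: bool = True


def reaper_snapshot(conn: Conn, limit: float = 120.0) -> bool:
    """Step 1: decide from a point-in-time snapshot whether to kill."""
    return conn.txn_seconds > limit


def reaper_kill(conn: Conn) -> None:
    """Step 2: the kill hits whatever is using the connection by now."""
    conn.alive = False


conn = Conn(request_id=1, txn_seconds=174.9)  # slow request, over the limit
doomed = reaper_snapshot(conn)                # killer decides to kill

# Between the decision and the kill, request 1 finishes and the same
# thread begins request 2 on the same connection.
conn.request_id, conn.txn_seconds = 2, 0.0

if doomed:
    reaper_kill(conn)

victim = conn.request_id  # the disconnection surfaces in request 2
```

In this model the new request is the one that fails, even though it had barely started. Because the decision and the kill are separate steps taken from outside the appserver, the gap between them cannot be closed within that model.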

Changed in launchpad:
status: Triaged → Won't Fix