appear to be failing to record oops for all +translate HTTP 503 errors

Bug #422960 reported by Steve McInerney
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Launchpad itself
Invalid
High
Unassigned

Bug Description

In the last two days we've been getting a large increase in 5xx errors reported against the production LP app servers:
https://lpstats.canonical.com/graphs/AppServer5XXsLpnet/20090827/20090903/

Analysis of the logs on vanadium suggests that the vast majority of these are to URL's with "+translate".

A quick breakdown of type & quantity of 5xx error for the 1st and 2nd of Sept. (at time of writing this was only 3.5 hours of the 2nd, with a full 24 hours from the 1st). This is measured from the apache logs on vanadium.

9912 x 503
78 x 502
2 x 500

A Q&D hunt thru the oops directories over the similar period suggests an order of magnitude difference in the number of OOPS recorded:
spm@devpad:/srv/launchpad.net-logs/production$ fgrep -i '+translate' {gandwana,gangotri,palladium,potassium}/2009-09-0{1,2}/* | fgrep -c PATH_INFO

==> 1015

Thus when we come to look at the oops' reports, this class of failure doesn't appear to be anywhere near as significant as it actually is.

Even HTTP 404 errors are under-reported as we logged on vanadium 1709 404's against the translations.launchpad.net vhost.

affects: oops-tools → launchpad-foundations
Revision history for this message
Diogo Matsubara (matsubara) wrote :

Looks like those 500s are requests failing at the apache level which makes sense for them to not generate an OOPS report. We need to debug why apache is returning the 500. This looks like a dupe of bug 193062

Changed in launchpad-foundations:
status: New → Incomplete
Revision history for this message
Steve McInerney (spm) wrote :

yeah - it does sound similar.

I'd note tho:
* the errors were massively against only *one* vhost. translations.lp.net, not bugs or answers or code

* analysing a section of the logs: of ~ 3200 5xx errors; ~ 3000 were against the translations vhost. and about 2900 of those were due solely to msnbot.
If apache is having issues it's extremely one sided; and once the bot goes away, so do the errors.

* not just 5xx errors - seeing a similar issue around 4xx's as well, just not as severe.

* when analysing the app server trace logs - the numbers from the apache logs line up:

egrep -c "2009-09-01[^ ]+ 5[0-9][0-9] " launchpad-trace1.log.1
1063
trace2: 1090
trace3: 1016
trace4: 1080

These all had the 'A' class in the trace log if that helps as a cross verify?
eg: "A 264213136 2009-09-01T00:01:31 503 4822"
So I take this to mean that the app server is sending back the 5xx response and apache is doing it's thing with what it's been given.

???

Changed in launchpad-foundations:
status: Incomplete → New
Revision history for this message
Stuart Bishop (stub) wrote :

The originally cited graph is generated from data provided by the appservers, so Apache isn't the reason for the error count discrepancy.

My initial thought was request retries caused the discrepancy, but I've just checked the code and we do not count requests that trigger a retry with genuine exceptions.

Revision history for this message
Diogo Matsubara (matsubara) wrote :

I've run tracereport.py on launchpad-trace1.log.4. Results attached. I don't know (yet :-)) how to interpret that data but hopefully it'll be useful to debug this bug.

(Some more background information if anyone is interested in running the tracereport.py script. Use the script from this branch https://code.edge.launchpad.net/~launchpad-pqm/zc.zservertracelog/trunk as the newest version can't cope with the logs generated by Launchpad)

Revision history for this message
Diogo Matsubara (matsubara) wrote :

I had to remove the attachment. tracereport can be found here: https://devpad.canonical.com/~matsubara/launchpad-tracereport.txt.bz2

Curtis Hovey (sinzui)
Changed in launchpad-foundations:
status: New → Triaged
importance: Undecided → Low
tags: added: oops
Tom Haddon (mthaddon)
tags: added: canonical-losa-lp
Changed in launchpad:
importance: Low → Critical
tags: removed: oops
Changed in launchpad:
importance: Critical → High
Revision history for this message
Robert Collins (lifeless) wrote :

Deeply historical and no chance of debugging it at this remove.

Changed in launchpad:
status: Triaged → Invalid
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.