BugNomination:+editstatus timeout for bugs with many tasks

Bug #874250 reported by Marc Deslauriers
This bug affects 3 people
Affects: Launchpad itself
Status: Fix Released
Importance: Critical
Assigned to: Curtis Hovey

Bug Description

Symptoms
========

When trying to approve nominations on bugs that have a large number of projects, Launchpad times out.

For example, bug #706999 or bug #732628.

This is preventing the security team and the kernel team from accurately tracking kernel issues in their workflow.

Analysis
========

Death by SQL: multiple heat lookup calls, each of which is slow (30ms) for a very large aggregate. High SQL time is commonly correlated with poor SQL utilisation, so I would ignore it until the SQL statement count is sane (< 100 queries).

Also see bug 724080 which may show up as soon as this particular code path is fixed.

3. 45 1322 29 1293 SQL-main-master

SELECT MAX(Bug.heat), SUM(Bug.heat), COUNT(Bug.id)
FROM Bug,
     BugTask
WHERE BugTask.bug = Bug.id
  AND BugTask.distribution = $INT
  AND BugTask.sourcepackagename = $INT
  AND Bug.duplicateof IS NULL
  AND BugTask.status IN ($INT ... $INT)
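
One way to cut the statement count is to fetch the aggregates for every package in a single grouped query, rather than issuing the statement above once per task. The sketch below is illustrative only: it runs against an in-memory SQLite database with an assumed, cut-down version of the Bug and BugTask tables, and `heat_by_package` and `make_demo_db` are hypothetical names, not Launchpad code.

```python
import sqlite3


def make_demo_db():
    """Build a tiny stand-in schema (assumed; far simpler than Launchpad's)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Bug (id INTEGER PRIMARY KEY, heat INTEGER, duplicateof INTEGER);
        CREATE TABLE BugTask (bug INTEGER, distribution INTEGER,
                              sourcepackagename INTEGER, status INTEGER);
        INSERT INTO Bug VALUES (1, 10, NULL), (2, 30, NULL), (3, 99, 1);
        INSERT INTO BugTask VALUES (1, 1, 100, 10), (2, 1, 100, 10),
                                   (2, 1, 200, 20), (3, 1, 200, 10);
    """)
    return conn


def heat_by_package(conn, distribution, package_ids, statuses):
    """Return {sourcepackagename: (max_heat, total_heat, bug_count)} in one query."""
    if not package_ids or not statuses:
        return {}
    pkg_marks = ",".join("?" * len(package_ids))
    status_marks = ",".join("?" * len(statuses))
    sql = f"""
        SELECT BugTask.sourcepackagename,
               MAX(Bug.heat), SUM(Bug.heat), COUNT(Bug.id)
        FROM Bug JOIN BugTask ON BugTask.bug = Bug.id
        WHERE BugTask.distribution = ?
          AND BugTask.sourcepackagename IN ({pkg_marks})
          AND Bug.duplicateof IS NULL
          AND BugTask.status IN ({status_marks})
        GROUP BY BugTask.sourcepackagename
    """
    rows = conn.execute(sql, [distribution, *package_ids, *statuses])
    return {pkg: (max_heat, total, count) for pkg, max_heat, total, count in rows}
```

The grouped form keeps the same filters as the original statement, so the cost of each aggregate is unchanged; what it saves is the per-task round trip.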

Jamie Strandboge (jdstrand) wrote :

This should be marked with Importance 'High' as this is affecting Ubuntu's ability and effectiveness in maintaining the kernel cadence.

tags: added: platform-blocker
Gavin Panella (allenap) wrote :

Can you get an OOPS number?

Changed in launchpad:
importance: Undecided → Critical
status: New → Incomplete
Marc Deslauriers (mdeslaur) wrote :

(Error ID: OOPS-2113CG81)

and

(Error ID: OOPS-2113AQ86)

Changed in launchpad:
status: Incomplete → Confirmed
Gavin Panella (allenap)
Changed in launchpad:
status: Confirmed → Triaged
Kate Stewart (kate.stewart) wrote :

Please consider this bug as escalated on behalf of ubuntu engineering team.

tags: added: rls-mgr-p-tracking
tags: added: timeout
description: updated
description: updated
summary: - Nominations stop working when bugs have large number of projects
+ BugNomination:+editstatus timeout for bugs with many tasks
tags: added: escalated
Jamie Strandboge (jdstrand) wrote :

After talking with the security team and the kernel team, this issue is causing a lot of problems with the kernel cadence in Ubuntu Engineering. The kernel cadence is a process outlined here: https://wiki.ubuntu.com/Kernel/kernel-sru-workflow

While kernel workflow bugs seem to be OK for the moment, more and more CVE tracking bugs are broken because Precise and the Oneiric backport kernel were added, resulting in (I'm told) 11 extra tasks on the bugs. Bugs are timing out and can no longer be updated, which ultimately wastes developer resources and reduces parallelism. This means that bug reports are inaccurate and can't be trusted, and one person becomes a blocker for manually fixing them. This makes it more difficult for people to work together on security fixes, which therefore lag (meaning our users aren't getting them as quickly).

I realize this bug is already a distro-escalated Critical bug, but the recent opening of Precise and the newly added backport kernel are aggravating the issue, so I wanted to make sure this was captured.

Gavin Panella (allenap)
Changed in launchpad:
assignee: nobody → Gavin Panella (allenap)
status: Triaged → In Progress
Launchpad QA Bot (lpqabot) wrote :
tags: added: qa-needstesting
Changed in launchpad:
status: In Progress → Fix Committed
Launchpad QA Bot (lpqabot) wrote :
Gavin Panella (allenap) wrote :

The fix in lp:~allenap/launchpad/bugnomination-timeout-bug-874250 that
I have just marked as qa-ok does not completely fix this bug, though
it does improve things. There are two more branches to go.

tags: added: qa-ok
removed: qa-needstesting
Gavin Panella (allenap) wrote :

Marc helped QA this branch. He was not able to get an approval to work, but the OOPS from the final attempt was OOPS-767ea6a157fe7d4ab3d7692ea09a9dda, which shows a large reduction in queries for DistributionSourcePackage (from 270 to 10, iirc). This saves about 0.7s.

Launchpad QA Bot (lpqabot) wrote :
tags: added: qa-needstesting
removed: qa-ok
Gavin Panella (allenap) wrote :

Recent attempts to approve nominations with the lazy-heat branch in
place have resulted in more OOPSes, but I think we're heading in the
right direction.

OOPS-2fe32d4d689d2e12b2858678ef72e916
OOPS-95d6996501ebcbc29e8b714f82e39269
OOPS-34fdbd3321178b447f549836429f7d91
OOPS-31c20c55f255f58540c8f0be0a97fa06

Remaining issues:

- Email addresses are being loaded individually.

- updateHeat gets called a *lot*.
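
As an illustration of the first remaining issue, the sketch below loads the email addresses for a whole set of people with one `IN` query instead of one query per person. It is only a sketch: the `EmailAddress` table here is an assumed two-column stand-in in SQLite, and `load_emails_bulk` and `make_email_db` are hypothetical helpers, not the code in the preload-email branch.

```python
import sqlite3


def make_email_db():
    """Create a toy EmailAddress table (assumed schema, for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE EmailAddress (person INTEGER, email TEXT);
        INSERT INTO EmailAddress VALUES
            (1, 'a@example.com'), (2, 'b@example.com'), (3, 'c@example.com');
    """)
    return conn


def load_emails_bulk(conn, person_ids):
    """Fetch email addresses for many people with a single SELECT."""
    ids = list(dict.fromkeys(person_ids))  # de-duplicate, preserving order
    if not ids:
        return {}
    marks = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT person, email FROM EmailAddress WHERE person IN ({marks})",
        ids,
    )
    # Each row is a (person, email) pair; people without a row are omitted.
    return dict(rows)
```

Callers can then look addresses up in the returned dict while building notifications, so the number of queries no longer grows with the number of subscribers.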

Gavin Panella (allenap)
tags: added: qa-ok
removed: qa-needstesting
William Grant (wgrant)
Changed in launchpad:
status: Fix Committed → In Progress
Gavin Panella (allenap) wrote :

Recent OOPSes:
  OOPS-8be0f33b472a707e0e15394fda5aa328
  OOPS-c7ce6aef19d617d8ac91aaba5b90c911

Launchpad QA Bot (lpqabot) wrote :
tags: added: qa-needstesting
removed: qa-ok
Changed in launchpad:
status: In Progress → Fix Committed
Gavin Panella (allenap) wrote :

lp:~allenap/launchpad/bugnomination-timeout-bug-874250-preload-email is okay, but there are still problems with bug heat calculations, and I now think that lp:~allenap/launchpad/bugnomination-timeout-bug-874250-heat-death will actually make things worse. I suspect we need to move all heat calculations out of the web request and accept that it will lag slightly.

tags: added: qa-ok
removed: qa-needstesting
Robert Collins (lifeless) wrote : Re: [Bug 874250] Re: BugNomination:+editstatus timeout for bugs with many tasks

On Sat, Jan 7, 2012 at 4:14 AM, Gavin Panella
<email address hidden> wrote:
> lp:~allenap/launchpad/bugnomination-timeout-bug-874250-preload-email is
> okay, but there are still problems with bug heat calculations, and I now
> think that lp:~allenap/launchpad/bugnomination-timeout-bug-874250-heat-death
> will actually make things worse. I suspect we need to move all heat
> calculations out of the web request and accept that it will lag
> slightly.

I have initiated a discussion to disable heat updates on the pillars;
AFAICT this is the primary cause of heat death during updates -
contention on the contexts.

So I'd hold off bringing in an offline system when we may have a much
simpler way forward.

-Rob

Gavin Panella (allenap)
Changed in launchpad:
status: Fix Committed → Fix Released
Changed in launchpad:
status: Fix Released → Triaged
Gavin Panella (allenap) wrote :

The remaining problem here is bug heat updates.

One problem is that there are a lot of heat calculations being done
for bugs with lots of bugtasks and nominations.

Another is that these queries aggregate quite a lot of data and can be
a little slow even when tuned.

The third problem is the one that Rob mentions: these aggregate
results are used to update heavily used rows (e.g. Product,
ProductSeries), and any other activity that also updates these rows
will contend for locks.
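
To make the first and third problems concrete, here is a minimal sketch of coalescing heat updates: each change only marks a target as stale, and every stale target is recalculated once at the end of the request (or by a later job). `HeatUpdateQueue` and its `recalculate` callback are hypothetical, and this is not necessarily the approach Launchpad adopted.

```python
class HeatUpdateQueue:
    """Collect targets with stale heat and recalculate each of them once."""

    def __init__(self, recalculate):
        # recalculate: callable taking one target (e.g. a product name).
        self._recalculate = recalculate
        self._dirty = {}  # dict used as an insertion-ordered set

    def mark_dirty(self, target):
        # Cheap and idempotent: marking the same target 73 times costs
        # one recalculation later, not 73.
        self._dirty[target] = None

    def flush(self):
        """Recalculate every stale target once; return how many were done."""
        targets = list(self._dirty)
        self._dirty.clear()
        for target in targets:
            self._recalculate(target)
        return len(targets)
```

Because each heavily used row is written at most once per flush, the window in which other transactions can contend for its lock is also shorter.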

William Grant (wgrant)
Changed in launchpad:
assignee: Gavin Panella (allenap) → William Grant (wgrant)
status: Triaged → In Progress
Launchpad QA Bot (lpqabot) wrote :
tags: added: qa-needstesting
removed: qa-ok
Changed in launchpad:
status: In Progress → Fix Committed
William Grant (wgrant)
tags: added: qa-ok
removed: qa-needstesting
William Grant (wgrant)
Changed in launchpad:
status: Fix Committed → Fix Released
John Johansen (jjohansen) wrote :

This is still not fixed for us.

When trying to approve the oneiric nomination for 706999, the API times out, returning the attached error.

John Johansen (jjohansen) wrote :

Manually approving the nominations for 706999 and 732628 in a web browser results in the following error message:

Timeout error
Sorry, something just went wrong in Launchpad.

We’ve recorded what happened, and we’ll fix it as soon as possible. Apologies for the inconvenience.

Trying again in a couple of minutes might work.

(Error ID: OOPS-6e6b4b29ef635939095468528e92c7ec)

Changed in launchpad:
status: Fix Released → Triaged
William Grant (wgrant) wrote :

It looks like about half the SQL time is going to BugTask inserts, probably due to the BugSummary triggers not handling the bug well -- it has 73 tasks, an order of magnitude more than this was all designed for.

Robert Collins (lifeless) wrote :

On Tue, Feb 28, 2012 at 11:59 AM, William Grant <email address hidden> wrote:
> It looks like about half the SQL time is going to BugTask inserts,
> probably due to the BugSummary triggers not handling the bug well -- it
> has 73 tasks, an order of magnitude more than this was all designed for.

FWIW bugsummary was intended to handle such outlier bugs, because we
know we had them in the system. I'd expect it to be a bug, not a
system issue, but suggest whoever works on it reproduce locally and
gather a profile.

William Grant (wgrant)
Changed in launchpad:
assignee: William Grant (wgrant) → nobody
Changed in launchpad:
assignee: nobody → Richard Harding (rharding)
status: Triaged → In Progress
Launchpad QA Bot (lpqabot) wrote :
tags: added: qa-needstesting
removed: qa-ok
Changed in launchpad:
status: In Progress → Fix Committed
William Grant (wgrant)
tags: added: qa-ok
removed: qa-needstesting
William Grant (wgrant)
Changed in launchpad:
status: Fix Committed → Triaged
Curtis Hovey (sinzui)
Changed in launchpad:
assignee: Richard Harding (rharding) → nobody
Curtis Hovey (sinzui)
tags: added: bug-nomination
Curtis Hovey (sinzui) wrote :

We have not seen this timeout in recent weeks because Lp has a timeout feature flag set for this page. This is the last OOPS I can find, and it is in the old OOPS site, which dates from 2012-06: https://lp-oops.canonical.com/oops.py/?oopsid=OOPS-6e6b4b29ef635939095468528e92c7ec . It shows repeated event subscriber actions:
    * Inserts into bugtask
    * Inserts into bug activity
    * Bug subscription notifications

Curtis Hovey (sinzui) wrote :

The repeat info implies that about 1s, or 1/10 of the SQL time, could be saved by building the bug notification recipient set once. There are two general calls to find recipients, one for the bug, and the other for the structural subscriptions. I think the next optimisation would be to ensure that the notification recipient set is created once for this action/request, because we know that the set cannot mutate after it is first created.
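
A minimal sketch of the "build once per request" idea, assuming the two recipient lookups can be passed in as plain callables. `NominationApproval` and both finder arguments are hypothetical stand-ins, not Launchpad classes or functions; the caching itself uses the standard library's `functools.cached_property`.

```python
from functools import cached_property


class NominationApproval:
    """Hypothetical per-request object; not a Launchpad class."""

    def __init__(self, bug_id, find_bug_subscribers, find_structural_subscribers):
        self.bug_id = bug_id
        self._find_bug_subscribers = find_bug_subscribers
        self._find_structural_subscribers = find_structural_subscribers

    @cached_property
    def recipients(self):
        # Both lookups run on first access only. Later accesses reuse the
        # frozen result, which is safe if the set cannot change mid-request.
        direct = frozenset(self._find_bug_subscribers(self.bug_id))
        structural = frozenset(self._find_structural_subscribers(self.bug_id))
        return direct | structural

    def notify_for_tasks(self, task_names):
        # Every task shares the same recipient set instead of recomputing it.
        return {task: self.recipients for task in task_names}
```

Tying the cache to a per-request object means it is discarded with the request, so there is no invalidation to manage between requests.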

Curtis Hovey (sinzui)
Changed in launchpad:
assignee: nobody → Curtis Hovey (sinzui)
status: Triaged → In Progress
Launchpad QA Bot (lpqabot) wrote :
tags: added: qa-needstesting
removed: qa-ok
Changed in launchpad:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
tags: added: qa-ok
removed: qa-needstesting
Curtis Hovey (sinzui) wrote :

The potential fix is now in production. We are going to watch the OOPSes for a while to be certain this long-standing issue is fixed.

Jamie Strandboge (jdstrand) wrote :

Thanks!

Curtis Hovey (sinzui)
Changed in launchpad:
status: Fix Committed → Fix Released
Curtis Hovey (sinzui) wrote :

We think the majority of issues related to this bug are fixed. Bug #857109 and Bug #110195 overlap with a few issues in this bug, and we believe they are the proper place to track subsequent fixes that relate to why Lp wants to nominate/target so many packages at once.

Marc Deslauriers (mdeslaur) wrote :

This really doesn't appear to be fixed...for example, I'm unable to switch some of the statuses in bug 706999 without hitting a timeout...

For example: OOPS-91d11c991cbabec55e1af4d24a337656

Marc Deslauriers (mdeslaur) wrote :

OK, I've opened bug 1076512 with this new issue.

Curtis Hovey (sinzui) wrote :

The OOPS shows you were editing a bugtask, not a bugnomination.

Marc Deslauriers (mdeslaur) wrote :

Yeah, sorry about that, I noticed the title of this bug after I wrote that, which is why I opened a new one.
