Consistent retry failures on web UI

Bug #560422 reported by Scott Kitterman
This bug affects 1 person
Affects: Launchpad itself
Status: Fix Released
Importance: High
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

I have attempted to retry https://launchpad.net/ubuntu/+source/compiz/1:0.8.4-0ubuntu15/+build/1682401/+retry several times and have gotten an oops each time. Error ID: OOPS-1562O184 is the most recent. This impacts the ability of non-Canonical archive-admins to get the archive in a consistent state prior to release.

Tags: lp-soyuz
Revision history for this message
William Grant (wgrant) wrote :

IntegrityError: duplicate key value violates unique constraint "buildpackagejob__build__key"

The build failed without removing the BPJ, BQ and Job? I am scared.
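
For illustration only, here is a minimal sketch of how a leftover queue row could produce this error on every retry. It uses an in-memory SQLite database in place of Launchpad's PostgreSQL; the `buildpackagejob` table layout and the `queue_build` and `retry` helpers are assumptions made up for the example, not Launchpad's actual schema or code.

```python
import sqlite3

# Hypothetical single-column stand-in for the real table; the UNIQUE
# constraint on "build" mirrors the constraint named in the OOPS.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE buildpackagejob ("
    "id INTEGER PRIMARY KEY, build INTEGER NOT NULL UNIQUE)"
)


def queue_build(conn, build_id):
    """Create the queue row for a build (hypothetical helper)."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO buildpackagejob (build) VALUES (?)", (build_id,)
        )


def retry(conn, build_id):
    """Try to re-queue a build and report what happened."""
    try:
        queue_build(conn, build_id)
        return "queued"
    except sqlite3.IntegrityError as exc:
        return f"IntegrityError: {exc}"


# The build is queued once, then fails without its row being removed.
queue_build(conn, 1682401)

# Every retry now hits the unique constraint.
first_retry = retry(conn, 1682401)

# Deleting the stale row by hand lets the retry go through.
conn.execute("DELETE FROM buildpackagejob WHERE build = ?", (1682401,))
conn.commit()
second_retry = retry(conn, 1682401)

print(first_retry)
print(second_retry)
```

In this sketch the retry only succeeds once the stale row is gone, which is consistent with the failure repeating on each attempt.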

Revision history for this message
Michael Nelson (michael.nelson) wrote :

Indeed, it does sound scary... I'd initially hoped for another explanation: that the records were removed, but the first retry OOPS left the build in an inconsistent state. That doesn't seem possible either, though (build.retry() updates the buildstate before creating the new queue records).

Changed in soyuz:
status: New → Confirmed
importance: Undecided → High
milestone: none → 10.04
Changed in soyuz:
status: Confirmed → Triaged
Revision history for this message
Michael Nelson (michael.nelson) wrote :

I just got a chance to check the logs, and there's nothing of interest there either, other than the repetitive:

2010-04-09 11:42:25+0100 [QueryWithTimeoutProtocol,client] <floe:http://floe.buildd:8221/> marked as done. [4]

As to cause, at the moment all I can see is a number of things that could fail between setting the buildstate and deleting the related queue record etc.:

{{{
        self.buildstate = BuildStatus.FAILEDTOBUILD
        self.storeBuildInfo(librarian, slave_status)
        self.buildqueue_record.builder.cleanSlave()
        self.notify()
        self.buildqueue_record.destroySelf()
}}}

As to a temporary fix, I'll organise to have the offending records deleted first thing tomorrow (unless someone beats me to it).

Revision history for this message
Julian Edwards (julian-edwards) wrote : Re: [Bug 560422] Re: Consistent retry failures on web UI

On Monday 12 April 2010 17:16:30 Michael Nelson wrote:
> {{{
> self.buildstate = BuildStatus.FAILEDTOBUILD
> self.storeBuildInfo(librarian, slave_status)
> self.buildqueue_record.builder.cleanSlave()
> self.notify()
> self.buildqueue_record.destroySelf()
> }}}

One explanation, if indeed this piece of code is getting called, is that
something is doing a commit() in storeBuildInfo(), cleanSlave() or notify()
before failing and aborting the rest of the txn.
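
A rough sketch of that hypothesis, using SQLite rather than Launchpad's real store: the `build` and `buildqueue` tables, the `failing_helper` function and the `stray_commit` flag are all hypothetical stand-ins chosen for the example. It only shows the general effect of a commit hidden in the middle of a transaction, not what Launchpad's code actually does.

```python
import sqlite3


def make_db():
    """Build a tiny database with one in-progress build and its queue row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE build (id INTEGER PRIMARY KEY, state TEXT)")
    conn.execute("CREATE TABLE buildqueue (build INTEGER UNIQUE)")
    conn.execute("INSERT INTO build VALUES (1, 'BUILDING')")
    conn.execute("INSERT INTO buildqueue VALUES (1)")
    conn.commit()
    return conn


def failing_helper(conn, stray_commit):
    """Stand-in for a helper that may commit and then fails."""
    if stray_commit:
        conn.commit()  # the hidden commit: the state change becomes permanent
    raise RuntimeError("helper failed")


def handle_failure(conn, stray_commit):
    """Mark the build failed, call the helper, then drop the queue row."""
    try:
        conn.execute("UPDATE build SET state = 'FAILEDTOBUILD' WHERE id = 1")
        failing_helper(conn, stray_commit)
        conn.execute("DELETE FROM buildqueue WHERE build = 1")
    except RuntimeError:
        conn.rollback()  # abort whatever is still uncommitted


def snapshot(conn):
    """Return (build state, number of queue rows)."""
    state = conn.execute("SELECT state FROM build WHERE id = 1").fetchone()[0]
    queued = conn.execute("SELECT COUNT(*) FROM buildqueue").fetchone()[0]
    return state, queued


# Without a stray commit the rollback undoes everything.
conn_clean = make_db()
handle_failure(conn_clean, stray_commit=False)
clean = snapshot(conn_clean)

# With a stray commit the state change survives but the queue row remains.
conn_stray = make_db()
handle_failure(conn_stray, stray_commit=True)
stray = snapshot(conn_stray)

print(clean)
print(stray)
```

With the stray commit, the build ends up marked as failed while its queue row still exists, which is the kind of inconsistent state a later retry would trip over.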

Revision history for this message
Michael Nelson (michael.nelson) wrote :

Just for the record, the query to gather all the info:
https://pastebin.canonical.com/30473/

The result:
https://pastebin.canonical.com/30474/

And then deleting the offending records:
https://pastebin.canonical.com/30475/

Revision history for this message
Michael Nelson (michael.nelson) wrote :

And the build is now pending.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

Did you get any more of these, Scott?

I've removed the cron job (queue-builder) that kicked off a script to add missing build records; it was conflicting with the build dispatcher in unpredictable ways.

If there were no more problems then I'll close this bug.

Changed in soyuz:
status: Triaged → Incomplete
Revision history for this message
Scott Kitterman (kitterman) wrote :

It's been intermittent, so it's hard to know. I'd say go ahead and close it and I'll file a new bug if it comes up again.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

Ok thanks Scott.

Changed in soyuz:
status: Incomplete → Fix Released