Lost builder detection is insufficiently aggressive

Bug #463041 reported by William Grant
Affects: Launchpad itself
Status: Fix Released
Importance: Low
Assigned to: William Grant

Bug Description

rescueBuilderIfLost needs to verify that buildqueue.builder matches and buildqueue.buildstart is not null, or it will miss some cases where a builder should be rescued.

This was a big problem this morning, causing the build farm to collapse after lots of buildds dropped off the network for a few minutes. buildd-manager marked all the builders as not-OK, and they were manually enabled again once connectivity was restored. LP showed all builders as OK and idle, but was not dispatching to more than a couple of builders.

Inspection afterwards revealed that the slave on artigas (among others) had finished its build and was sitting WAITING, having completed build-buildqueue 1312047-2740484. Since that buildqueue had been unassigned as soon as buildd-manager detected that the builder was not-OK, the builder should have been declared lost and had a rescue attempted.

Unfortunately, rescueBuilderIfLost only verifies the existence and correct linkage of the build and buildqueue. It doesn't confirm that buildqueue.builder is the current builder, or that buildqueue.buildstart is not null. The gap usually goes unnoticed, since PPA builds from not-OK builders are fairly quickly picked up and built by other builders, at which point the buildqueue is deleted and rescueBuilderIfLost kicks in.
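
The stricter check described above could look roughly like the sketch below. It is not the Launchpad code: the BuildQueue class, the is_lost function and the dict used as a buildqueue lookup are hypothetical stand-ins, and reading "1312047-2740484" as a build id followed by a buildqueue id is an assumption. Only the two extra conditions (builder must match, buildstart must not be null) come from this report.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional


@dataclass
class BuildQueue:
    """Hypothetical stand-in for a buildqueue row (not the Launchpad model)."""
    id: int
    build_id: int
    builder: Optional[str] = None          # None once the job has been unassigned
    buildstart: Optional[datetime] = None  # None until the job is dispatched


def is_lost(builder_name: str, slave_build_id: int, slave_queue_id: int,
            queues: Dict[int, BuildQueue]) -> bool:
    """Return True if the job the slave reports is not this builder's current job."""
    queue = queues.get(slave_queue_id)
    # Existing checks: the buildqueue exists and is linked to the reported build.
    if queue is None or queue.build_id != slave_build_id:
        return True
    # Extra check 1: the buildqueue must be assigned to this builder.
    if queue.builder != builder_name:
        return True
    # Extra check 2: the buildqueue must actually have been started.
    if queue.buildstart is None:
        return True
    return False


# The artigas case: the buildqueue still exists but was unassigned
# when the builder was marked not-OK.
queues = {2740484: BuildQueue(id=2740484, build_id=1312047)}
artigas_lost = is_lost("artigas", 1312047, 2740484, queues)
```

With only the existence and linkage checks, the artigas case would not be reported as lost; with the two extra conditions, artigas_lost is True and a rescue would be attempted.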

William Grant (wgrant)
Changed in soyuz:
status: New → In Progress
assignee: nobody → William Grant (wgrant)
Curtis Hovey (sinzui)
Changed in soyuz:
status: In Progress → Triaged
Julian Edwards (julian-edwards) wrote :

Did you do any work on this William?

tags: added: buildd-manager
Changed in soyuz:
status: Triaged → Incomplete
William Grant (wgrant)
Changed in soyuz:
status: Incomplete → New
assignee: William Grant (wgrant) → nobody
Changed in soyuz:
status: New → Triaged
importance: Undecided → Low
William Grant (wgrant)
Changed in soyuz:
status: Triaged → In Progress
assignee: nobody → William Grant (wgrant)
William Grant (wgrant) wrote :

This was fixed with the big slave ID refactor in 10.04.

Changed in soyuz:
status: In Progress → Fix Released