Change stuck in queue when Gearman errors during merge:merge submit

Bug #1358517 reported by Antoine "hashar" Musso
Affects  Status       Importance  Assigned to              Milestone
Zuul     In Progress  Undecided   Antoine "hashar" Musso

Bug Description

Zuul gets stuck processing new changes from time to time. Today the Zuul merge client was unable to connect to Zuul's internal Gearman server:

  2014-08-18 19:25:42,434 ERROR gear.Client.unknown: Connection
   <gear.Connection 0x1a8e710 host: x.y.z.a port: 4730>
   timed out waiting for a response to a submit job request:
  <gear.Job 0x7f195db8ff10 handle: None name: merger:merge unique: 18e85d5f5c9c4079a6f3718bd577fd54>

 2014-08-18 19:25:42,435 ERROR zuul.Scheduler: Exception in run handler:
 Traceback (most recent call last):
   File "/usr/local/lib/python2.7/dist-packages/zuul/scheduler.py", line 784, in run
     while pipeline.manager.processQueue():
   File "/usr/local/lib/python2.7/dist-packages/zuul/scheduler.py", line 1361, in processQueue
     item, nnfi, ready_ahead)
   File "/usr/local/lib/python2.7/dist-packages/zuul/scheduler.py", line 1321, in _processOneItem
     ready = self.prepareRef(item)
   File "/usr/local/lib/python2.7/dist-packages/zuul/scheduler.py", line 1230, in prepareRef
     item.current_build_set)
   File "/usr/local/lib/python2.7/dist-packages/zuul/merger/client.py", line 93, in mergeChanges
     self.submitJob('merger:merge', data, build_set)
   File "/usr/local/lib/python2.7/dist-packages/zuul/merger/client.py", line 89, in submitJob
     self.gearman.submitJob(job)
   File "/usr/lib/python2.7/dist-packages/gear/__init__.py", line 1363, in submitJob
     conn = self.getConnection()
   File "/usr/lib/python2.7/dist-packages/gear/__init__.py", line 1162, in getConnection
     raise NoConnectedServersError("No connected Gearman servers")
 NoConnectedServersError: No connected Gearman servers

That causes all merges to fail and the changes to be stuck in the queue. The Zuul status board shows that changes are being enqueued and none are processed.

I disconnected the Jenkins Gearman Plugin and reconnected it, then noticed a few errors like this one:

 2014-08-18 21:04:02,521 ERROR gear.Client.unknown: Exception in poll loop:
 Traceback (most recent call last):
   File "/usr/lib/python2.7/dist-packages/gear/__init__.py", line 762, in _doPollLoop
     self._pollLoop()
   File "/usr/lib/python2.7/dist-packages/gear/__init__.py", line 802, in _pollLoop
     self.handlePacket(p)
   File "/usr/lib/python2.7/dist-packages/gear/__init__.py", line 842, in handlePacket
     self.handleStatusRes(packet)
   File "/usr/local/lib/python2.7/dist-packages/zuul/launcher/gearman.py", line 104, in handleStatusRes
     if build.__gearman_job.handle == handle:
 AttributeError: 'str' object has no attribute '_ZuulGearmanClient__gearman_job'

I then reconnected the Jenkins Gearman Plugin.
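The odd attribute name in the AttributeError above comes from Python's name mangling: inside a class body, an identifier such as `__gearman_job` is rewritten with the class name as a prefix, and the error message says the object being inspected was a plain `str`. The following is a minimal sketch that reproduces the same kind of message; the class name `ZuulGearmanClientSketch` and the sample values are hypothetical and are not taken from Zuul's code.

```python
class ZuulGearmanClientSketch:
    """Hypothetical stand-in for the class that owns handleStatusRes."""

    def handle_status_res(self, builds, handle):
        for build in builds:
            # Inside a class body, `build.__gearman_job` is compiled as
            # `build._ZuulGearmanClientSketch__gearman_job` (name mangling).
            if build.__gearman_job.handle == handle:
                return build
        return None


message = None
try:
    # A string sneaks into the list where a build object was expected.
    ZuulGearmanClientSketch().handle_status_res(["not-a-build"], "H:example:1")
except AttributeError as exc:
    message = str(exc)

print(message)
```

The printed message names the mangled attribute on a `str`, which matches the shape of the error in the log.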

New changes entering the pipelines were then processed properly, but many changes were still stuck in the queues. The log showed a lot of merge jobs completing like this:

 2014-08-18 21:16:52,029 INFO zuul.MergeClient: Merge <gear.Job 0x1a9a290 handle: H:208.80.154.135:13390 name: merger:merge unique: 2854e208bff54feb9fc2c73afe3995a8> complete, merged: True, updated: False, commit: None
 2014-08-18 21:16:52,029 WARNING zuul.Scheduler: Build set <zuul.model.BuildSet object at 0x1a9ab90> is not current

It seems Zuul attempts to rerun the merge operations, which succeed, but refuses to update the BuildSet because it is outdated.

I worked around it by running the 'zuul promote' client command on some changes. That apparently resets the BuildSet of a change and unsticks the changes in the given pipeline.

I am not sure how helpful this bug report is, or what to do from here.

python-gear: 0.5.5
Zuul: 9a95e71 (Merge "Update gerrit change attributes even if merged" which is very recent).

Antoine "hashar" Musso (hashar) wrote :

We had another occurrence, this time with:

  NoConnectedServersError: No connected Gearman servers

Again it happened when attempting to submit a merger:merge job.

The issue appears to be in zuul.scheduler.BasePipelineManager.prepareRef(): it sets the merge state to PENDING before the job submission has actually completed. Pseudo code:

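  def prepareRef():
      # A merge is believed to be in flight already: do not submit again.
      if build_set.merge_state == build_set.PENDING:
          return False

      # State is flipped before we know the submission succeeded.
      build_set.merge_state = build_set.PENDING
      self.sched.merger.mergeChanges(..)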

If mergeChanges() raises an exception, the build_set is left in the PENDING state, and because of the early return we never attempt to submit a merger:merge job for it again.

prepareRef should thus set the merge state after self.sched.merger.mergeChanges().
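The suggestion above can be sketched as follows. This is only an illustration of the ordering change, not the patch that was actually proposed: `BuildSet`, `FlakyMerger`, `prepare_ref` and the local `NoConnectedServersError` are simplified, hypothetical stand-ins, and the real Zuul classes carry more state than shown here.

```python
class NoConnectedServersError(Exception):
    """Stand-in for the error raised when no Gearman server is reachable."""


class BuildSet:
    """Simplified build set with only the states needed for the sketch."""
    NEW = 1
    PENDING = 2

    def __init__(self):
        self.merge_state = self.NEW


class FlakyMerger:
    """Hypothetical merger client whose first submissions fail."""

    def __init__(self, failures=1):
        self.failures = failures
        self.submitted = 0

    def mergeChanges(self, build_set):
        if self.failures > 0:
            self.failures -= 1
            raise NoConnectedServersError("No connected Gearman servers")
        self.submitted += 1


def prepare_ref(merger, build_set):
    # A merge job is already in flight: nothing to do yet.
    if build_set.merge_state == BuildSet.PENDING:
        return False
    # Submit first; if this raises, the state stays NEW and the next
    # pass through the queue retries the submission.
    merger.mergeChanges(build_set)
    build_set.merge_state = BuildSet.PENDING
    # The ref is not ready until the merge result comes back.
    return False


merger = FlakyMerger(failures=1)
build_set = BuildSet()

try:
    prepare_ref(merger, build_set)
except NoConnectedServersError:
    pass
state_after_failure = build_set.merge_state  # still NEW, so a retry is possible

prepare_ref(merger, build_set)  # second pass: the submission goes through
state_after_retry = build_set.merge_state
```

With the original ordering, the first failed call would have left the state at PENDING and the second call would have returned early without submitting anything.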

python-gear: 0.5.5
Zuul: c9d11ab (Merge "Rename doc environment to docs") from Sept 16th

Antoine "hashar" Musso (hashar) wrote :
summary: - Internal Gearman server timeout for merge operations
+ Change stuck in queue when Gearman errors during merge:merge submit
Antoine "hashar" Musso (hashar) wrote :
Changed in zuul:
assignee: nobody → Antoine "hashar" Musso (hashar)
status: New → In Progress
Antoine "hashar" Musso (hashar) wrote :

I have deployed the proposed patchset on the Wikimedia production infrastructure.

Antoine "hashar" Musso (hashar) wrote :

The patch solved the issue of changes being stuck in the event queue whenever the connection to the Gearman server is lost. That happened yesterday on the Wikimedia setup: once the connection was reestablished, all the pending merger:merge functions were executed and the event queue processed everything :-)
