Comment 1 for bug 1689282

Colin Watson (cjwatson) wrote :

Just copying my explanation here so that it isn't dependent on an external site:

The background for this is that publishing involves multiple steps: we have to do the equivalent of snapcraft push and then the equivalent of snapcraft release. Between those we have to wait for the store to finish scanning the upload, which is an asynchronous job and takes an indeterminate amount of time. Rather than polling, and thus taking up a worker slot in Launchpad for all of that time, we retry the job later with a one-minute delay, up to a maximum of 20 times.
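
As a rough sketch of that retry-later pattern (this is not Launchpad's actual code; UploadJob, its run method and the returned strings are made up for illustration, and only the one-minute delay and the limit of 20 come from the description above):

```python
from dataclasses import dataclass
from datetime import timedelta

RETRY_DELAY = timedelta(minutes=1)  # from the description above
MAX_RETRIES = 20                    # from the description above


@dataclass
class UploadJob:
    """Hypothetical stand-in for the job that runs after the push step."""

    retries_used: int = 0

    def run(self, scan_finished: bool) -> str:
        """Make one attempt and say what the scheduler should do next.

        The job checks the scan status once and returns immediately, so it
        never sits in a worker slot waiting for the store.
        """
        if scan_finished:
            return "release"
        if self.retries_used >= MAX_RETRIES:
            return "fail"
        self.retries_used += 1
        # The scheduler is expected to run the job again after RETRY_DELAY.
        return "retry"


# Longest time spent waiting between attempts before the job gives up.
WORST_CASE_WAIT = RETRY_DELAY * MAX_RETRIES
```

With these numbers the job gives up after at most 20 minutes of scheduled delay, plus whatever time the retries spend waiting to be picked up.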

There are various things that we could look at to reduce the latency of this process (which aren't all mutually exclusive, and we may not know the best strategy until we do some more analysis):

 * Unlike the initial job, retries don't seem to be handled by celery for some reason, but instead are picked up by the fallback cron job some time later. This is the source of most of the unnecessary delay, and is probably just a simple bug somewhere. Assuming that the store scans the upload reasonably promptly, we could cut the typical delay for small snaps down to a little over a minute by getting celery to pick up the retries.
 * We could consider having the job poll for a short time after it pushes the snap, which would cut out almost all the extra latency in the case that the store manages to scan it immediately. This may be a good idea, but probably only if the store typically does in fact manage quick scans; otherwise we'd be tying up workers for longer and degrading overall system performance.
 * We could try some kind of exponential backoff approach, so that the first few retries happen more quickly and later ones are spaced further apart.
 * We could look at having the store tell us when it's done by way of a webhook. This seems like the most elegant approach, but it's also a lot of work in that it requires implementing webhook sending in the store and webhook receiving in Launchpad.
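
To give a feel for the exponential backoff option in the list above, here is a small sketch comparing it with the current fixed one-minute delay. The 5-second starting delay, the doubling factor and the 60-second cap are assumptions picked purely for illustration, not proposed values; only the one-minute delay and the limit of 20 retries come from the current behaviour.

```python
from itertools import accumulate


def fixed_delays(delay=60.0, max_retries=20):
    """Current behaviour: the same delay (in seconds) before every retry."""
    return [delay] * max_retries


def backoff_delays(base=5.0, factor=2.0, cap=60.0, max_retries=20):
    """Exponential backoff: base, base*factor, ... limited to cap seconds.

    base, factor and cap are hypothetical values chosen for this example.
    """
    return [min(cap, base * factor ** n) for n in range(max_retries)]


def retry_times(delays):
    """Seconds after the push at which each retry would be scheduled."""
    return list(accumulate(delays))


fixed = retry_times(fixed_delays())
backoff = retry_times(backoff_delays())
print("fixed:  ", fixed[:4])
print("backoff:", backoff[:4])
```

Under these assumed values the first retry would be scheduled after 5 seconds rather than 60, while the job still keeps trying for a similar total length of time. This only helps if retries are actually picked up promptly, so it doesn't replace fixing the celery problem in the first item.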