need Contents files to be generated

Bug #36830 reported by Colin Watson
Affects: Launchpad itself
Status: Fix Released
Importance: High
Assigned to: Celso Providelo

Bug Description

The dists/dapper/Contents-*.gz files haven't been generated since we switched to Soyuz. Since there are some parts of the distribution that look at Contents files, and since Contents files are useful for quality assurance work on the distribution (e.g. finding file conflicts between packages), we need to get these generated again before the Dapper release.
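
As an illustration of the QA use mentioned above, here is a minimal sketch that lists paths claimed by more than one package. It assumes each data line has the usual `<path> <whitespace> <section/package>[,<section/package>...]` layout; the function name and the sample package names are hypothetical. A shared path is only a candidate conflict, since the packages involved may already declare Conflicts or Replaces.

```python
def find_shared_paths(lines):
    """Return {path: [packages]} for paths listed under more than one package.

    Assumes each data line is '<path><whitespace><pkg>[,<pkg>...]'.
    Lines that do not split into two fields are skipped.
    """
    shared = {}
    for line in lines:
        # Split on the last run of whitespace so paths containing spaces survive.
        parts = line.rstrip("\n").rsplit(None, 1)
        if len(parts) != 2:
            continue
        path, owners = parts
        packages = owners.split(",")
        if len(packages) > 1:
            shared[path] = packages
    return shared


# Hypothetical sample lines, not taken from a real archive.
sample = [
    "usr/bin/foo  utils/foo\n",
    "usr/share/doc/common/README  misc/bar,misc/baz\n",
]
for path, packages in sorted(find_shared_paths(sample).items()):
    print(path, "->", ", ".join(packages))
```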

Colin Watson (cjwatson)
Changed in launchpad-publisher:
assignee: nobody → dsilvers
Revision history for this message
Adam Conrad (adconrad) wrote :

Note that there's no reason these need to be generated from the hourly (or half-hourly, hint, hint) publisher runs, and it's probably computationally impossible (or very difficult) to do so, but having the Contents.gz run be async with the publishing and, say, generating Contents.gz every 24 hours would be perfectly fine.

Revision history for this message
Fabio Massimo Di Nitto (fabbione) wrote :

This was previously tracked as RT #722.

Fabio

Christian Reis (kiko)
Changed in launchpad-publisher:
assignee: dsilvers → cprov
status: Unconfirmed → Confirmed
Revision history for this message
Fabio Massimo Di Nitto (fabbione) wrote :

Increasing severity. This is now blocking my work to fix some bugs and it must be available way before release.

Fabio

Revision history for this message
Celso Providelo (cprov) wrote :

Right, as per IRC discussion, it sounds like we only need to run apt-ftparchive with a special apt.conf to generate Contents once a day. It was removed from our normal config because it takes so long.

So, skipping one cron.daily pass and running it once a day (let's say 0 UTC), maybe in conjunction with britney (wishlist), should sort out the absence of these files in the archive.

Let's try this and fine-tune when and how often we run it as the results come in.

Revision history for this message
James Troup (elmo) wrote :

Be very careful when doing this.

apt-ftparchive has a cache DB without which it would take half a day or more just to generate Packages/Sources for each cron.daily cycle. However, what's less obvious/well known is that the cache DB actually comes in two forms:

 (1) Without data needed for Contents generation
 (2) With data needed for Contents generation

(1) is _massively_ more efficient than (2), to the point that forcing our databases to switch to (1) on the previous dak incarnation was the only way we could run cron.daily every 30 minutes.

The only way to get (1) format databases is to _always_ use the --no-contents switch because once a DB is upgraded to (2), it'll never downgrade to (1) again.

Obviously the implication of this is that you should never ever run apt-ftparchive without --no-contents with an apt.conf that points at your primary (cron.daily) cache DBs.
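
One way to make that rule hard to break is to build the command line in a single place and refuse the dangerous combination. The following is only a sketch: the function name and the `cache_is_primary` flag are hypothetical, and the caller is assumed to know which apt.conf points at the primary cache DBs.

```python
def build_ftparchive_command(apt_conf, cache_is_primary, generate_contents):
    """Build an 'apt-ftparchive generate' command line as a list of arguments.

    Raises ValueError if Contents generation is requested for a config
    that points at the primary (cron.daily) cache DBs.
    """
    if generate_contents and cache_is_primary:
        raise ValueError(
            "refusing to run without --no-contents against primary cache DBs")
    command = ["apt-ftparchive"]
    if not generate_contents:
        command.append("--no-contents")
    command += ["generate", apt_conf]
    return command


# Regular publisher run: always passes --no-contents.
print(build_ftparchive_command("apt.conf", True, False))
# Out-of-band Contents run against a separate config and cache DB.
print(build_ftparchive_command("apt.conf-contents", False, True))
```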

On a semi-unrelated note, you don't want to do Contents generation directly to the visible archive tree because it takes a long time and leaves nasty "hidden" temporary files around while it's working.

So what you probably want to do is create a new apt.conf-contents (or similar) file that both generates to a separate archive tree and uses a separate cache DB directory. Then run that out of band (say once a day), and atomically copy the generated Contents files across to the main/real archive tree in every cron.daily (this is safe to do every 30 mins as apt-ftparchive atomically updates its Contents files).

You can see the 'apt.conf-contents' and 'copycontents' files on jackass for an idea of what I mean.

I suggest you make a backup of your primary apt cache db before starting to work on this, as if you accidentally upgrade them to (2) format, you're going to see performance drop through the floor and the only way to go back to (1) will be to delete them and start again :-(
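
The copy step described above could look roughly like the sketch below. It is not the 'copycontents' script from jackass; the function name and directory arguments are hypothetical. Each file is copied to a temporary name in the destination directory and then renamed into place, so anything reading the live tree never sees a half-written Contents file.

```python
import os
import shutil
import tempfile


def copy_contents_files(staging_dists, live_dists):
    """Copy Contents-* files from a staging tree into the live tree.

    Returns the list of copied paths, relative to the tree root.
    """
    copied = []
    for dirpath, _dirnames, filenames in os.walk(staging_dists):
        rel_dir = os.path.relpath(dirpath, staging_dists)
        for name in sorted(filenames):
            if not name.startswith("Contents-"):
                continue
            src = os.path.join(dirpath, name)
            dest_dir = os.path.normpath(os.path.join(live_dists, rel_dir))
            os.makedirs(dest_dir, exist_ok=True)
            # Temporary file on the same filesystem as the final name,
            # so the rename below is atomic.
            fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".tmp-" + name)
            os.close(fd)
            try:
                shutil.copyfile(src, tmp_path)
                # mkstemp creates the file mode 0600; keep the source's mode.
                shutil.copymode(src, tmp_path)
                os.replace(tmp_path, os.path.join(dest_dir, name))
            except BaseException:
                os.unlink(tmp_path)
                raise
            copied.append(os.path.normpath(os.path.join(rel_dir, name)))
    return copied
```

The temporary file does appear briefly in the live tree, but only for the duration of a local copy rather than for the whole Contents generation run.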

Revision history for this message
Adam Conrad (adconrad) wrote :

For people lacking access to jackass, the files are located at mawson:~adconrad/apt-ftparchive/

Revision history for this message
Celso Providelo (cprov) wrote :

First trial in mawson produced the following files:
{{{
launchpad@mawson:/srv/launchpad.net$ find contents-archive/ -name Con* | xargs ls -lh
-rw-rw-r-- 1 launchpad launchpad 137M May 31 15:33 contents-archive/ubuntu/dists/dapper/Contents-amd64
-rw-rw-r-- 1 launchpad launchpad 134M May 31 15:36 contents-archive/ubuntu/dists/dapper/Contents-hppa
-rw-rw-r-- 1 launchpad launchpad 144M May 31 15:39 contents-archive/ubuntu/dists/dapper/Contents-i386
-rw-rw-r-- 1 launchpad launchpad 135M May 31 15:42 contents-archive/ubuntu/dists/dapper/Contents-ia64
-rw-rw-r-- 1 launchpad launchpad 138M May 31 15:45 contents-archive/ubuntu/dists/dapper/Contents-powerpc
-rw-rw-r-- 1 launchpad launchpad 136M May 31 15:48 contents-archive/ubuntu/dists/dapper/Contents-sparc
}}}
Their contents look as expected. I just want to be sure and have it reviewed by elmo before starting things on drescher.

Revision history for this message
Celso Providelo (cprov) wrote : apt.conf for Contents generation

apt.conf used on mawson.

Revision history for this message
Celso Providelo (cprov) wrote :

Code has landed in production and is pending review on kiko's plate (almost done).

Changed in launchpad-publisher:
status: Confirmed → In Progress
Revision history for this message
Celso Providelo (cprov) wrote :

Code reviewed and in RF for a long time.

Changed in launchpad-publisher:
status: In Progress → Fix Released
Revision history for this message
Celso Providelo (cprov) wrote :

It still requires a small fix to run periodically. It will share the cron.daily lockfile, so it won't mess with archive content during publication or mirroring.
wip in my `archive-tools`.
cjwatson will babysit the changes in production tomorrow.
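
The idea of sharing one lock between the publisher and the Contents run can be sketched as follows. This uses `flock` and a hypothetical helper name; it is not necessarily the locking mechanism cron.daily uses, and it only excludes other processes that take the same kind of lock on the same file.

```python
import fcntl
import os


def run_with_shared_lock(lock_path, job):
    """Run job() only if the lock at lock_path is free.

    Returns True if the job ran, False if another holder had the lock.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # Non-blocking: give up immediately if someone else holds the lock.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False
        try:
            job()
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)
```

With this shape, a Contents run that finds the lock held simply skips its turn instead of touching the archive while publication or mirroring is in progress.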

Changed in launchpad-publisher:
status: Fix Released → In Progress
Revision history for this message
Celso Providelo (cprov) wrote :

Let's demote this last task to only *high*, to avoid misunderstandings

Changed in soyuz:
importance: Critical → High
Revision history for this message
Celso Providelo (cprov) wrote :

RF 4405 & 4410 were cherrypicked in drescher.

Contents should be generated daily via the lp_publish cron job, starting at 3:35 UTC.

Currently it takes until 4:02 UTC to regenerate all Contents files, which is just about all the time we have if we are not to miss the next cron.daily run. This is suboptimal, but losing one cron.daily run at that time isn't the end of the world.

Changed in soyuz:
status: In Progress → Fix Released