I'm uploading an attachment of the debug log with -Dhttp. The overall download took ~60 minutes. I see that in the first minute we are just probing indices (this could be the revision ancestry graph search) before we start downloading any actual .pack content. After that, I see a lot of reads where we grab the first 40 bytes (the pack header) and then ~540 bytes further on in the file. It does look like we are at least reading new data each time.

I don't understand why the data is all discrete. Perhaps each revision, etc. is in its own group, and thus each one averages approximately the same (small) size. I might poke at it a bit, but I'm guessing it is just that we read one group at a time, while we could be buffering a bit more than that (rough sketch below). The trick is figuring out how to make a bigger request without overflowing the LRUCache, etc. I also don't understand how we could get that many groups in a single pack file without them being collapsed together. Perhaps the autopack code isn't properly grouping revision texts? (I'm assuming it is revs because it is the first item being read.)

Note that we spend from approx 75s to 160s just accessing 585-byte chunks from:

  http readv of 1d0252d21c3632af79515b5e8c497c15.pack

It takes until 350s (~6 min) before we start looking at the .iix, until 556s before we start reading the .cix, and until 2063s before we start reading the .tix. The rough timeline (in seconds):

     3.5  finish reading format etc. files and start reading .rix
    73.9  start reading revision texts
   353.3  start reading .iix
   380.2  read inventory .pack data
   555.9  start reading root .cix
   627.9  reading chk .pack data
   907.1  read layer 2 .cix
  1828.5  read layer 3 .cix
  1911.2  .cix
  2046.8  .cix
  2063.4  read .tix
  2247.8  read .pack
  3651.8  finished reading text content, reading branch.conf and tags
  3683.5  finished

I'm also seeing this, which seems a bit disconcerting:

  384.270 creating new compressed block on-the-fly in 0.000s 11139 bytes => 365 bytes
  384.271 stripping trailing bytes from groupcompress block 11046 => 275
  384.274 creating new compressed block on-the-fly in 0.001s 11139 bytes => 2757 bytes
  384.277 creating new compressed block on-the-fly in 0.000s 11046 bytes => 269 bytes
  384.278 creating new compressed block on-the-fly in 0.000s 11139 bytes => 917 bytes
  384.281 creating new compressed block on-the-fly in 0.001s 11046 bytes => 2832 bytes
  384.298 creating new compressed block on-the-fly in 0.003s 11139 bytes => 5331 bytes
  384.306 creating new compressed block on-the-fly in 0.000s 11046 bytes => 782 bytes
  384.309 creating new compressed block on-the-fly in 0.002s 11139 bytes => 2386 bytes
  384.313 http readv of 0ca910916f100de2c571f97d71c6bf9f.pack offsets => 2 collapsed 2
  384.314 > GET /~launchpad-pqm/launchpad/devel/.bzr/repository/packs/0ca910916f100de2c571f97d71c6bf9f.pack

Basically, it seems to be taking two modest-sized groups (11kB) and splitting them back up into fairly small groups. Perhaps this is a 'groupcompress sort order' issue: because we think the optimal order is slightly different from what was previously on disk, we think we need to break up the groups on the fly. Though we should be doing an 'unordered' fetch at this point, which should avoid that. Anyway, it definitely looks like we are thrashing between 2 groups and breaking them up into a smaller stream because of that. (I'm seeing this fairly often.)

Reading the chk .pack data seems much better behaved than the .iix and .rix data reads. The requests are still small (7.2kB), but bigger than the ~500 bytes from before.
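Going back to the one-group-per-request point above: here's a minimal sketch of the kind of buffering I mean, merging adjacent small group reads into bigger readv ranges, capped so the result still fits in a size-bounded cache. (This is not bzrlib's actual fetch code; the function and its parameters are made up for illustration.)

    def coalesce_group_reads(offsets, max_gap=4096, max_buffer=4 * 1024 * 1024):
        """Collapse (offset, length) pairs into fewer, larger ranges.

        offsets    -- iterable of (offset, length) tuples for the groups we want
        max_gap    -- merge two ranges if the hole between them is at most this big
        max_buffer -- never let a merged range exceed this many bytes, so the
                      fetched blocks don't blow out the LRU-style cache
        """
        merged = []
        for offset, length in sorted(offsets):
            if merged:
                start, size = merged[-1]
                gap = offset - (start + size)
                new_size = (offset + length) - start
                if 0 <= gap <= max_gap and new_size <= max_buffer:
                    merged[-1] = (start, new_size)
                    continue
            merged.append((offset, length))
        return merged

    if __name__ == '__main__':
        # ~540-byte groups packed closely together, as in the trace above.
        reads = [(i * 600, 540) for i in range(100)]
        print(len(coalesce_group_reads(reads)))   # 1 range instead of 100

The interesting knob is max_buffer: big enough to amortize round trips, small enough that what we pull down still fits in the cache.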
On the chk .pack reads, I'm *not* seeing the fragmentation effects that we saw earlier. I think the 7kB chk groups are because of page splitting. The fetch code was written assuming groups would average around 1-4MB. The chk fetch was reworked to create smaller groups (by clustering based on search-key prefix). The clustering is definitely beneficial, but we need to revisit whether we want to start a new group each time. The main trade-off is that the offset reference will probably grow from 2 bytes to 3-4 bytes, since the copy source will be in the middle of a group instead of at the beginning. I certainly don't expect better compression from combining unrelated groups, so if we can just make bigger readv requests, we will probably offset the cost better than if we tried to force bigger groups.

This line is bad:

  1956.154 creating new compressed block on-the-fly in 0.000s 3444823 bytes => 36 bytes

I'm certainly hoping that the 3.4MB block is just in the local buffer. (As near as I can tell it is; at least, I don't see a nearby read that is 3.4MB long. Just before that we had a 10.9MB read of the .cix file.)

Note that we are 'thrashing' on reading the .cix pages for 1d0252d21c3632af79515b5e8c497c15.cix. I see it read:

  569.016 http readv of 1d0252d21c3632af79515b5e8c497c15.cix offsets => 3669 collapsed 1
    ... > Range: bytes=98304-15123801
    15MB (read *all* root pages)

  907.135 http readv of 1d0252d21c3632af79515b5e8c497c15.cix offsets => 2814 collapsed 1
    ... > Range: bytes=98304-11624447
    11MB (read a big cluster of pages)

  1828.483 http readv of 1d0252d21c3632af79515b5e8c497c15.cix offsets => 2726 collapsed 86
    ... > Range: bytes=98304-233471,237568-249855,...
    ... < Content-Length: 11171455
    11MB (read a mix of pages)

  1911.204 http readv of 1d0252d21c3632af79515b5e8c497c15.cix offsets => 2652 collapsed 75
    ... > Range: bytes=98304-139263,143360-303103,...
    ... < Content-Length: 10870012
    11MB (read a mix of pages)

  2046.883 http readv of 1d0252d21c3632af79515b5e8c497c15.cix offsets => 702 collapsed 542
    ... > Range: bytes=110592-114687,122880-126975,...
    ... < Content-Length: 1113169
    1.1MB (read a mix of pages)

I believe the issue has been expressed before:

1) .cix pages don't cluster, so when you go to read, you need to read a random selection across the whole index.

2) We have so many chk pages that the index gets larger than our buffer size. With 4kB pages and an average of 100 records per page, we require ~2.8k btree pages to fit all the .cix records. However, our buffer is set to a hard cap of 1k btree pages. So every time we read a layer of chk pages, we have to re-download the whole index again (in this case ~5 times).
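To put rough numbers on point 2), using the figures above (4kB pages, ~100 records per page, 2.8k pages, and the 1k-page cap):

    # Back-of-the-envelope numbers for point 2) above.  The page size, records
    # per page, page count and buffer cap all come from the text; nothing here
    # is measured from bzrlib itself.

    PAGE_SIZE = 4096            # bytes per btree page
    RECORDS_PER_PAGE = 100      # average .cix records per leaf page
    LEAF_PAGES = 2800           # ~2.8k pages needed to hold all .cix records
    BUFFER_CAP_PAGES = 1000     # hard cap on cached btree pages

    index_bytes = LEAF_PAGES * PAGE_SIZE
    buffer_bytes = BUFFER_CAP_PAGES * PAGE_SIZE
    print("records:     ~%d" % (LEAF_PAGES * RECORDS_PER_PAGE))   # ~280000
    print("index size:  %.1f MB" % (index_bytes / 1e6))           # ~11.5 MB
    print("buffer size: %.1f MB" % (buffer_bytes / 1e6))          # ~4.1 MB

    # The index is roughly 3x the buffer, so each pass over a layer of chk
    # pages evicts what the next pass needs, and each layer ends up re-reading
    # more or less the whole index; that matches the ~11MB readv's above.
    layers = 5
    print("redundant index download: ~%.0f MB" % (layers * index_bytes / 1e6))

So the cache misses alone account for something like 50MB of transfer, against an index file that is only ~15MB in total.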
For the .tix, again I see some thrashing like:

  2351.986 stripping trailing bytes from groupcompress block 620801 => 113468
  2352.004 creating new compressed block on-the-fly in 0.000s 2096942 bytes => 2099 bytes
  2352.005 creating new compressed block on-the-fly in 0.000s 1623061 bytes => 1922 bytes
  2352.008 creating new compressed block on-the-fly in 0.003s 620801 bytes => 7111 bytes
  2352.012 creating new compressed block on-the-fly in 0.000s 2096942 bytes => 948 bytes
  2352.013 creating new compressed block on-the-fly in 0.001s 620801 bytes => 1283 bytes
  2352.015 creating new compressed block on-the-fly in 0.000s 2096942 bytes => 2050 bytes
  2352.016 creating new compressed block on-the-fly in 0.001s 620801 bytes => 2129 bytes
  2352.018 creating new compressed block on-the-fly in 0.000s 2096942 bytes => 1930 bytes
  2352.021 creating new compressed block on-the-fly in 0.002s 620801 bytes => 9805 bytes
  2352.024 creating new compressed block on-the-fly in 0.000s 2096942 bytes => 2978 bytes
  2352.028 creating new compressed block on-the-fly in 0.003s 620801 bytes => 4253 bytes
  2352.031 creating new compressed block on-the-fly in 0.000s 2096942 bytes => 3075 bytes
  2352.040 creating new compressed block on-the-fly in 0.007s 620801 bytes => 14191 bytes
  2352.047 creating new compressed block on-the-fly in 0.000s 2096942 bytes => 687 bytes
  2352.049 creating new compressed block on-the-fly in 0.001s 620801 bytes => 1771 bytes
  2352.052 creating new compressed block on-the-fly in 0.000s 2096942 bytes => 3699 bytes
  2352.062 creating new compressed block on-the-fly in 0.009s 620801 bytes => 29093 bytes
  2352.069 creating new compressed block on-the-fly in 0.000s 2096942 bytes => 2095 bytes
  2352.071 creating new compressed block on-the-fly in 0.001s 620801 bytes => 3612 bytes
  2352.073 creating new compressed block on-the-fly in 0.000s 2096942 bytes => 9214 bytes
  2352.075 creating new compressed block on-the-fly in 0.001s 1623061 bytes => 9520 bytes
  2352.082 creating new compressed block on-the-fly in 0.005s 620801 bytes => 15460 bytes
  2352.087 creating new compressed block on-the-fly in 0.001s 1623061 bytes => 17684 bytes
  2352.098 creating new compressed block on-the-fly in 0.008s 620801 bytes => 27862 bytes
  2352.103 creating new compressed block on-the-fly in 0.000s 2096942 bytes => 1875 bytes
  2352.105 creating new compressed block on-the-fly in 0.001s 620801 bytes => 2493 bytes
  2352.107 creating new compressed block on-the-fly in 0.000s 2096942 bytes => 1342 bytes
  2352.109 creating new compressed block on-the-fly in 0.001s 620801 bytes => 3257 bytes
  2352.111 creating new compressed block on-the-fly in 0.000s 2096942 bytes => 950 bytes
  2352.113 creating new compressed block on-the-fly in 0.001s 620801 bytes => 1291 bytes
  2352.114 creating new compressed block on-the-fly in 0.000s 2096942 bytes => 4507 bytes
  2352.122 creating new compressed block on-the-fly in 0.006s 620801 bytes => 12400 bytes

Again, this doesn't seem to cause us to download extra data, as it all seems to be in our local group cache, but it does mean that we are fragmenting what would otherwise be a nice big group. So it doesn't affect *this* fetch, but any fetch from this location to another would then only have small groups to read. (Such as when pqm creates the commit, does the autopack, and then it gets copied to launchpad: the launchpad copy would be more fragmented than expected.)

Some of the text groups are nicely sized, but some are still surprisingly small. (Like 1kB here and there.)
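To make that thrashing pattern concrete, here's a toy model (purely illustrative, not bzrlib code) of what happens when the stream asks for texts in an order that alternates between two big source blocks and the sender starts a new compressed block at every switch:

    def count_output_blocks(record_sources):
        """Count blocks emitted if a new block starts at each source switch.

        record_sources -- sequence of source-block ids, one per record, in
                          the order the stream requests them.  (Toy model;
                          real groupcompress blocks are not modelled here.)
        """
        blocks = 0
        previous = object()          # sentinel that differs from any real id
        for source in record_sources:
            if source != previous:
                blocks += 1
                previous = source
        return blocks

    if __name__ == '__main__':
        grouped = ['A'] * 50 + ['B'] * 50        # records grouped by source block
        alternating = ['A', 'B'] * 50            # alternating, as in the trace
        print(count_output_blocks(grouped))      # 2 blocks
        print(count_output_blocks(alternating))  # 100 tiny blocks

Two ~2MB source blocks stay two blocks if the records come out grouped, but turn into one small output block per run once the order alternates, which looks a lot like the 1-30kB outputs in the trace above.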
So I certainly think we need to teach fetch to buffer small groups together into a single request. However, we *also* need to figure out what is causing our nice big 4MB groups to be fragmented into lots of 1-10kB groups.
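As a first step on that second question, here's a quick script to tally the "creating new compressed block on-the-fly" lines from the attached log (it assumes the .bzr.log line format quoted above), so we can see which source blocks are being rebuilt and how small the pieces end up:

    import re
    import sys

    # Matches the trace lines quoted above, e.g.
    #   creating new compressed block on-the-fly in 0.003s 620801 bytes => 7111 bytes
    BLOCK_RE = re.compile(
        r'creating new compressed block on-the-fly in [\d.]+s '
        r'(\d+) bytes => (\d+) bytes')

    def summarize(log_file):
        sources = {}                 # source block size -> list of output sizes
        for line in log_file:
            match = BLOCK_RE.search(line)
            if match:
                src, out = int(match.group(1)), int(match.group(2))
                sources.setdefault(src, []).append(out)
        for src in sorted(sources):
            outs = sources[src]
            print("source block %9d bytes: rebuilt %4d times, "
                  "outputs %d-%d bytes (avg %d)"
                  % (src, len(outs), min(outs), max(outs),
                     sum(outs) // len(outs)))

    if __name__ == '__main__':
        summarize(open(sys.argv[1]))

If the 620801/2096942/1623061-byte blocks each show up dozens of times with kB-sized outputs, that gives a rough measure of how much of the fragmentation comes from these on-the-fly rebuilds, as opposed to the groups already being small on disk.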