Comment 26 for bug 405251

John A Meinel (jameinel) wrote :

Just from my 10-second quick look, I can see that we are spending 44% of the time in "_btree_serializer_pyx._parse_leaf_lines".

I can also see that part of the issue might be that we now use the StreamSource code path rather than the Packer code path, though I don't have a good feel for why that would be so much slower.

In 1.14, 75% of the time (429k ticks) was spent in "create_pack_from_packs" (aka fetching via Packer)
versus in 1.17, 90% of the time is spent in 'insert_record_stream' (which includes the time to read the source data, since the source is a generator and isn't consumed until you start inserting the data).
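(As an aside, if you want to regenerate one of these profiles against a patched bzr, something along the lines of

 bzr --lsprof-file callgrind.out.fetch branch SOURCE TARGET

should do it. SOURCE and TARGET are just placeholders for whatever operation you were actually profiling, and I believe naming the file with a "callgrind.out" prefix is what makes lsprof emit KCachegrind-readable output rather than a pickle.)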

insert_record_stream is spending a *lot* of time in _has_key_from_parent_map, which I believe is the insert_record_stream code trying to see if the entry it is adding is already present. It turns out to be 52% of the overall time.

For some more side-by-side comparisons:
1) In your bzr-1.14 source, you don't even have the compiled extensions, and we have
 37,837 calls to bzrlib._btree_serializer_py._parse_leaf_lines, consuming 95k ticks (2.5 ticks each)
2) In bzr-1.17 we have
  469,030 calls to bzrlib._btree_serializer_pyx._parse_leaf_lines, consuming 8.8M ticks (18.8 ticks each)

So while we make 12x more calls, they also cost 7.5x more each, for a total increase of ~92x at the index layer while it is trying to read the btree pages.
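(Spelling out that arithmetic from the numbers above:
 469,030 / 37,837 = ~12.4x more calls
 18.8 / 2.5 = ~7.5x more ticks per call
 12.4 * 7.5 = ~93x, which matches the 8.8M / 95k = ~92x total.)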

Now, I don't know how many entries you have in your btree, so I don't have a good idea of how many pages we *should* be reading, or whether it is enough pages to make us overflow (and then thrash) a cache.

It is possible, I suppose, that the compiled parsing code is actually slower than the plain python code. That would be easy to test by just deleting the _btree_serializer_pyx.* files (.so on Linux, .pyd on Windows).
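(If you want to double-check which implementation your bzr is actually loading, here is a quick diagnostic sketch, using only the two module names mentioned above; run it in a python shell inside the bzr tree you are testing:

 try:
     from bzrlib import _btree_serializer_pyx as serializer
 except ImportError:
     from bzrlib import _btree_serializer_py as serializer
 print serializer.__file__

If it prints the _py path, you are already on the pure-python parser.)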

The 18.8 ticks per call, though, could just be that the function is getting called so much more often that the Python garbage collector runs while it is in the middle of processing. (It is a function that allocates a lot of objects, so it is more likely to have the GC run during it.)
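(If you want to rule the GC out, one crude throwaway experiment, at the cost of higher peak memory for that run, is to drop

 import gc
 gc.disable()  # reference counting still runs, only the cyclic collector is skipped

near the top of your test copy and see whether the per-call cost of _parse_leaf_lines comes back down. Definitely not something to leave in permanently.)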

I *am* seeing that it is calling BTreeIndex._iter_smallest, which is a function that is generally only triggered when we have >100k records being written.

And, in fact, I'm seeing an "iter_all_entries" returning 1.4M records, though I don't have a feel for whether that is genuine or just because we are calling it more than once.

I *can* say that if you have 1.4M keys being transferred, then we will have a lot of index thrashing during insert_record_stream, because we try to cap memory consumption by flushing every 100k records to disk, and we then query those flushed records during followup inserts to check whether the thing we are inserting now was inserted earlier.

I suppose one quick check is to give us the output of:
 head -n 5 .bzr/repository/indices/*

That would give us an indication of how many real records you have stored in the repository. (Which helps track down if you really do have ~1M records, or whether we are just running over the same records over and over again.)
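(To give you an idea of what to look for: each index file starts with a small header, something like

 B+Tree Graph Index 2
 node_ref_lists=1
 key_elements=1
 len=123456
 row_lengths=...

where the values here are made up; the "len=" line is the total number of keys stored in that index, which is the number I'm after.)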

Now, I think I'm seeing 7 calls to _spill_mem_keys_to_disk in the 1.14 callgrind, which would hint at 700,000 keys being copied. And *that* would definitely thrash the hell out of the new code path: as it is inserting, it writes out 100k keys at a time into a temp file, and then "insert_record_stream" ends up calling "has_key" to check whether each new record is a duplicate or not. It will do that "has_key" check for each record (of the 700k), and it will search across all 7 of these temporary files.

I'm also seeing 680k calls to "_has_key", so that could also be a reasonable bound on the number of entries being inserted.

If I'm seeing the code correctly, it would be slightly more than 700k entries, and 680k of them are "delta" records, where we are checking to make sure that the compression parent already exists in the new repository.
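(Back-of-the-envelope for why that hurts: ~680k delta records, each doing a "has_key" that in the worst case searches all 7 spilled temporary indices, is on the order of 680,000 * 7 = ~4.8M index probes, and every probe that misses the page cache has to read and re-parse a btree leaf page. That would go a long way toward explaining the 469k _parse_leaf_lines calls above.)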

You could try this simple one-line patch:
=== modified file 'bzrlib/knit.py'
--- bzrlib/knit.py 2009-07-28 08:09:13 +0000
+++ bzrlib/knit.py 2009-08-05 20:23:10 +0000
@@ -1628,7 +1628,7 @@
                 access_memo = self._access.add_raw_records(
                     [(record.key, len(bytes))], bytes)[0]
                 index_entry = (record.key, options, access_memo, parents)
-                if 'fulltext' not in options:
+                if False and 'fulltext' not in options:
                     # Not a fulltext, so we need to make sure the compression
                     # parent will also be present.
                     # Note that pack backed knits don't need to buffer here

(sorry that Launchpad will probably show it poorly indented)

Anyway, that will disable the "is the compression parent already present" check, which seems to be where 50% of the time is spent.

Is there a chance you could give that change a shot? It would at least give us more info on what is going on.