chk groups possibly need more clustering

Bug #402662 reported by John A Meinel
Affects   Status     Importance  Assigned to  Milestone
Bazaar    Confirmed  High        Unassigned
Breezy    Triaged    Medium      Unassigned

Bug Description

Split out from bug #402114

The current CHK streaming code creates lots of mini-streams, essentially one for every search-key prefix. This has the nice property that similar CHK pages are put together into groups, and CHK pages that are unlikely to compress well together are not commingled.
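To make the grouping trade-off concrete, here is a minimal sketch of batching pages by search-key prefix. It is not the bzr streaming code: `group_by_prefix`, the `prefix_len` parameter, and the sample keys are all hypothetical, chosen only to show that a shorter prefix yields fewer, larger groups.

```python
from itertools import groupby

def group_by_prefix(pages, prefix_len=1):
    """Yield (prefix, [page, ...]) batches.

    pages is an iterable of (search_key, data) tuples. Both the tuple
    layout and prefix_len are assumptions made for this illustration.
    """
    # Sort first so that all keys sharing a prefix are adjacent.
    ordered = sorted(pages, key=lambda page: page[0])
    for prefix, batch in groupby(ordered, key=lambda page: page[0][:prefix_len]):
        yield prefix, list(batch)

# Hypothetical search keys; the payloads are placeholders.
pages = [("a1f", b"p1"), ("a2c", b"p2"), ("b07", b"p3"), ("a9e", b"p4")]

fine = dict(group_by_prefix(pages, prefix_len=1))    # one group per first character
coarse = dict(group_by_prefix(pages, prefix_len=0))  # everything in a single group
print(len(fine), "groups vs", len(coarse), "group")
```

Shortening the prefix is only one way to get larger groups; the point is that the number of groups, and therefore the number of separate fetches, depends directly on how finely the keys are partitioned.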

The main downside is that something like dumb fetch ends up downloading each CHK group separately, which can add a large overhead.

Note that if we just fix bug #402657 (buffer multiple groups to be read at the same time) this may not be an issue.
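As a rough illustration of why buffering could make the group count matter less, the sketch below merges adjacent byte ranges into fewer read requests. It is an assumption about what such buffering might look like, not the fix proposed in bug #402657; `coalesce`, `max_gap`, and the sample offsets are hypothetical.

```python
def coalesce(ranges, max_gap=0):
    """Merge (offset, length) ranges whose gap is at most max_gap bytes."""
    merged = []
    for start, length in sorted(ranges):
        end = start + length
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead of
            # issuing a separate request.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(start, end - start) for start, end in merged]

# Hypothetical (offset, length) positions of four small groups in a pack file.
group_ranges = [(0, 4000), (4000, 3500), (7500, 5000), (20000, 4200)]

requests = coalesce(group_ranges)                       # only touching ranges merge
wide_requests = coalesce(group_ranges, max_gap=8000)    # also read across the gap
print(len(group_ranges), "groups ->", len(requests), "reads")
```

With a non-zero `max_gap` the reader trades some wasted bytes for fewer round trips, which is the usual compromise on high-latency transports.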

What needs to be evaluated is:

1) If we start grouping more CHK pages into a larger group, what is the effect on overall compression? (Compression is expected to get worse: the number of regions that can be copied will not increase, but the offsets into the group will, causing the variable-width offset field to consume more bytes per reference.) The expected benefit is that something like dumb-transport copying doesn't need to consider as many groups. Also, having fewer groups means better compression of the '.cix' index, since more of the content is the same.

2) What is the effect on text extraction? Initial results from tests I ran a while ago suggested that combining too many CHK pages into a single gc group could cause significant zlib decompression overhead. If what you need is 200 bytes in the middle of a 2MB group, you have to decompress 1MB of zlib data to get at it.

3) Note that I also looked at "pack recent", which moves recently referenced CHK pages into groups separate from 'very old' CHK pages. This would probably further exacerbate the problem, though again fixing bug #402657 may cause it to not matter. (There was a modest win for something like 'bzr ls -r -1' under those conditions, which would affect 'bzr checkout' times as well.)
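Points 1 and 2 can be estimated roughly before changing any grouping. The sketch below is a back-of-the-envelope model, not a measurement of real repositories: it assumes a base-128 varint for the offset field (the actual groupcompress encoding may differ), uses synthetic data in place of real CHK pages, and the helper names `varint_len` and `inflate_until` are hypothetical.

```python
import zlib

def varint_len(n):
    """Bytes needed to store n as a base-128 varint (assumed encoding)."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length

def inflate_until(compressed, offset, length, chunk=4096):
    """Decompress only as far as needed to read `length` bytes at `offset`.

    Returns (data, inflated), where inflated is the number of bytes that
    had to be decompressed to reach the requested span.
    """
    decompressor = zlib.decompressobj()
    out = bytearray()
    for i in range(0, len(compressed), chunk):
        out += decompressor.decompress(compressed[i:i + chunk])
        if len(out) >= offset + length:
            break  # everything after this point stays compressed
    return bytes(out[offset:offset + length]), len(out)

# Synthetic stand-in for one large group: 100,000 twelve-byte "pages".
group = b"".join(b"page %06d\n" % i for i in range(100000))
compressed = zlib.compress(group)

# Point 1: the offset field grows with the size of the group.
print("offset bytes at 100 B, 20 kB, 2 MB:",
      varint_len(100), varint_len(20000), varint_len(2 * 1024 * 1024))

# Point 2: reading 200 bytes from the middle inflates everything before it.
offset = len(group) // 2
data, inflated = inflate_until(compressed, offset, 200)
print("inflated %d bytes to read 200 of %d" % (inflated, len(group)))
```

Running the same two measurements against real pack files at several group sizes would show where the extra offset bytes and the extra decompression start to outweigh the saving from fewer groups.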

Changed in bzr:
importance: Medium → High
Martin Pool (mbp)
Changed in bzr:
status: Triaged → Confirmed
Jelmer Vernooij (jelmer)
tags: added: check-for-breezy
Jelmer Vernooij (jelmer)
tags: added: performance
removed: check-for-breezy
Changed in brz:
status: New → Triaged
importance: Undecided → Medium
Jelmer Vernooij (jelmer)
tags: added: chk