Comment 41 for bug 390563

John A Meinel (jameinel) wrote : Re: [Bug 390563] Re: absent factory exception from smart server when streaming 2a stacked branches


Aaron Bentley wrote:
> John A Meinel wrote:
>> Aaron Bentley wrote:
>>> This is hitting us any time we use bundles to create/update branches.
>>> It's a pretty high-frequency, high-irritation event.
>>> Aaron
>> so... is this happening when:
>
>> 1) applying a bundle to an existing stacked branch?
>
> After applying the bundle to the stacked branch, our branch puller is
> unable to pull the branch.
>

So if you follow the rest of my emails, it turns out that creating
bundles from --2a formats is just 100% broken, regardless of stacking
(chk pages are not being inserted into the bundle, and the raw text
from 'repo.inventories' is not sufficient without them). But there are
still issues to discuss about how "insert_revisions" will interact
with stacking.

>> If I was guessing, I could easily say that the bundle code:
>
>> 1) Doesn't contain the fulltext for inventories of parent revisions,
>
> Bundles aren't meant to contain fulltexts of any inventories.
>
>> nor does it contain the data for all texts that are referenced by the
>> inventories that it *does* have.
>
> It contains the data for the texts that are referenced by the
> inventories it does have, as long as the referenced texts are new. If
> they are present in the basis revision, they will not be included.
>
>> 2) Thus when creating a stacked branch from the bundle, we don't have
>> the parent inventories to *insert* into the new stacked repository.
>
> Why do we need the parent inventories?

Stacking requirements. The specific issues are:

1) When accessing a stacked branch via bzr+ssh, the *server process*
does not have access to the backing repository. This was by design,
not just accidental fallout.

  a) Because of this choice, when *streaming* data from the smart
     server, the stacked branch needs enough information to generate a
     stream for all revisions that it has, and then the client will ask
     the fallback directly for another stream for all the remaining
     data.

2) To stream the data for a set of revisions, we need the text keys that
should be transmitted. This was changed recently from:

  set_of_file_keys_in_new.intersect(revision_ids_being_transmitted)

to

  set_of_file_keys_seen_in_new.difference(
    set_of_file_keys_in_all_available_parents)

This was done so that we could be more resilient in the face of ghosts,
and inventory inconsistencies, etc.
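The two selection strategies can be sketched with plain Python sets.
The function and parameter names here are illustrative, not real
bzrlib identifiers:

```python
def old_selection(new_text_keys, revision_ids_being_sent):
    # Old approach: keep only the (file_id, revision) keys whose
    # last-modified revision is one of the revisions being transmitted.
    return {key for key in new_text_keys
            if key[1] in revision_ids_being_sent}

def new_selection(new_text_keys, parent_text_keys):
    # New approach: send every key seen in the new inventories that is
    # not already named by any available parent inventory.
    return new_text_keys - parent_text_keys
```

The difference-based form no longer assumes that a key's revision part
matches a transmitted revision, which is exactly the invariant the
ghost and inconsistency cases below break.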
There are two invariants we wanted to trust but have found we can't:

  file_key == (file_id, revision)  # always true
  ie = inventory(revision)[file_id]
  ie.revision == file_key[1]  # not always true

This is the inventory inconsistency issue. The other is the ghosts issue:

  if file_key in inventory, then either:
    file_key[1] == inventory.revision  # the text was new in this revision
      or
    file_key[1] in set_of_revisions_being_transmitted

The issue is with round-tripping and bzr-svn. Basically, if you do
work in bzr and push it into svn, you end up with last-modified
revision information stored in svn that points to a bzr revision_id
which was *not* pushed to svn. (You merge a bzr change that modified
foo in revision bar, creating revision dada. You push that to svn,
causing it to record a revision 'dada' that has a ghost merge parent
'bar' and a file_key of (file_id, 'bar'). Someone else pulls your svn
revisions using bzr-svn, and they end up with a file_key mentioning a
revision that they cannot have in their repository, since it was never
in svn.)
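A toy model of that round-trip, with purely illustrative names, shows
how the puller ends up holding a file_key whose revision part is
unsatisfiable:

```python
# Revision 'dada' merged a bzr-only revision 'bar', so its inventory
# records (file_id, 'bar') even though 'bar' itself never reached svn.
repo_revisions = {"dada"}            # what the bzr-svn puller ends up with
inventory_dada = {"foo-id": "bar"}   # file_id -> last-modified revision

file_key = ("foo-id", inventory_dada["foo-id"])

# The key names a revision the puller's repository cannot contain:
assert file_key == ("foo-id", "bar")
assert file_key[1] not in repo_revisions
```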

>> (Without going to the backing repository and pulling them out)
>
> add_mpdiffs reconstructs the fulltexts and calls
> VersionedFile.add_lines, so yes, it would hit the backing repository to
> regenerate the fulltexts. But it usually wouldn't try to insert
> inventories that were already present in the backing repository.
>

Sure. However, because of the stacking + streaming issues I just
mentioned, at some point we do need to insert the parent inventories.
Now, for xml formats we have a "race condition", which is:

1) If you insert a line-delta for the XML inventory into the stacked
branch, it only references newly introduced texts, not all possible
texts. Thus, even though we don't have the parent inventory, a later
"streaming fetch" doesn't try to send more data than we have
available, simply because the missing data was never referenced.

2) However, if you are at the end of a delta chain, etc., and you
insert a "fulltext" for the XML inventory, then a streaming fetch from
that stacked branch will suddenly see all these text keys that it
thinks it needs to transmit but cannot find in the local repository.
[And boom.]

This was *the* major issue with the AbsentContentFactory tracebacks
that we saw with fetch in bzr 1.13. It seems that bundles are yet
another vector for injecting the same problems. (Lightweight checkouts
and commit are another one that I already know about that still isn't
fixed.)
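The delta-vs-fulltext race can be sketched with plain sets of
(file_id, revision) text keys; all names here are illustrative:

```python
parent_inv = {("a", "r1"), ("b", "r1")}   # keys the parent inventory names
child_inv = {("a", "r1"), ("b", "r2")}    # keys the child inventory names

delta_refs = child_inv - parent_inv       # a line-delta only names new keys
fulltext_refs = child_inv                 # a fulltext names every key

local_texts = {("b", "r2")}               # the stacked repo only has new data

# The fulltext case references a key the local repository cannot
# satisfy (the AbsentContentFactory situation); the delta case is safe.
missing = fulltext_refs - local_texts
assert missing == {("a", "r1")}
assert not (delta_refs - local_texts)
```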

...

>> This points the failure at the "insert data from a bundle" code, causing
>> it to generate invalid stacked repositories.
>
>> This would all be simplified quite a bit if we just made a bundle
>> effectively just a way to transfer the data for the stream that you get
>> out of StreamSource.
>
> I started work on that at the sprint.
>
>> There is a small caveat that the bundle *also*
>> needs to think of itself as not containing any data
>
> Pardon?

...

>> However at present we have a habit of setting submit_location to a local
>> mirror, and obviously that will generate the wrong information about
>> "what data do you have that I don't".
>
> If the head revision of the local mirror is accurate or stale, it may
> bundle more data than is strictly needed, but not too little.
>
> Aaron

My point was that the streaming code might not look at
Branch.last_revision() and instead ask "is this revision in
Repository.all_revision_ids?". And if you have a local shared
repository, it *will* have the revisions you are trying to bundle. As
an example:

bzr init-repo shared
cd shared
bzr init base
bzr commit --unchanged base -m 1
bzr commit --unchanged base -m 2
bzr branch base new_work
bzr commit --unchanged new_work -m 3

cd new_work

# outside the repository
bzr push --stacked-on $shared/base ../../stacked

ll ../../stacked/.bzr/repository/packs
total 0

At this point, the new stacked branch has 0 data in it, because all of
the data was found in the repository that was available at 'shared/base'.

If we implemented bundles naively as just a stacked branch on top of
the local mirror branch (which shares a repository with the branch we
are trying to bundle), then we would also get 0 data in the bundle,
just like there is 0 data in the stacked repository.

So we have to implement it as a stacked repository that is stacked on
a backing repository which only has access to data in the ancestry of
the branch we are stacking on. The current bundle code does this
directly (by using set ancestry differences, IIRC).

Now, I *think* that if we use set_ancestry to find the revisions to
send, and then use the generic "get_stream()" for those revisions, the
bundle will contain the right data (the correct revisions, the correct
inventories, etc.).
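The ancestry-difference step can be sketched over a toy parent graph
(this is not real bzrlib API; the graph walk and function names are
assumptions for illustration):

```python
def ancestry_of(graph, tip):
    # Walk a revision -> parents mapping to collect the full ancestry
    # of 'tip', including 'tip' itself.
    seen, todo = set(), [tip]
    while todo:
        rev = todo.pop()
        if rev in seen:
            continue
        seen.add(rev)
        todo.extend(graph.get(rev, ()))
    return seen

def bundle_revisions(graph, tip, submit_tip):
    # Revisions to put in the bundle: everything in our ancestry that
    # is not in the submit branch's ancestry.
    return ancestry_of(graph, tip) - ancestry_of(graph, submit_tip)
```

Feeding that set of revisions to a generic get_stream() is the idea
above: the stream generator, not the bundle code, decides which
inventories and texts those revisions require.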

John
=:->