Content filtering breaks commit w/ merge parents

Bug #415508 reported by John A Meinel
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Bazaar
Fix Released
Critical
Martin Pool
1.18
Fix Released
High
Unassigned
2.0
Fix Released
Critical
Unassigned

Bug Description

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Frits Jalvingh (bug #405251) noticed that the performance of his
repository suddenly died when he upgraded to bzr 1.17. It looked like
all of a sudden he would have thousands of changed files per commit.

I dug into it, and found that there is a specific bug wrt content
filtering and commit that isn't using "record_iter_changes".

This happens whenever you have a merge in a 1.14 repository.

I was able to track it down into this code path:

bzrlib.repository.CommitBuilder.record_entry_contents()

has this statement:
        if kind == 'file':
            if content_summary[2] is None:
                raise ValueError("Files must not have executable = None")
            if not store:
                if (# if the file length changed we have to store:
                    parent_entry.text_size != content_summary[1] or
                    # if the exec bit has changed we have to store:
                    parent_entry.executable != content_summary[2]):
                    store = True
...
            if store:
                # We want to record a new node regardless of the presence or
                # absence of a content change in the file.
                nostore_sha = None

^- The issue is that if you are using content filtering,
'content_summary[1]' is claiming the size-on-disk, while
parent_entry.text_size is the size of the committed form.

Or in other words, it is comparing the canonical size with the
convenient size. And when they differ it forces the code to add a new
text. (by setting nostore_sha = None, so we won't notice that nothing
has changed.)

This also happens because of this code in Workingtree:
  def path_content_summary(...
  ...
        if kind == 'file':
            size = stat_result.st_size
            # try for a stat cache lookup
            executable = self._is_executable_from_path_and_stat(path,
stat_result)
            return (kind, size, executable, self._sha_from_stat(
                path, stat_result))

So path_content_summary always returns the size on disk, even if there
are filters in effect. And this gets passed all the way around to commit
builder.

I believe "CommitBuilder.record_iter_changes" is more immune to this,
because 'iter_changes' was taught to do the right thing wrt content size.

Which means that --1.14 format repositories + WT are probably
vulnerable, but --2a formats are probably not.

John
=:->

  affects bzr
  status triaged
  importance critical
  milestone 2.0

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkqK514ACgkQJdeBCYSNAAPVewCfceiK4xewv4Kh8GetmMkmmQ6h
P2EAoJxfNcFhZU4r3esJWxlPihvxcg6u
=1gFL
-----END PGP SIGNATURE-----

Related branches

Revision history for this message
John A Meinel (jameinel) wrote :

The associated branch has a potential fix.

Basically, stop paying attention to 'content_summary[...]' to be accurate about the committed text size.

This means that for files with content filtering, we will read them off disk, apply the filtering, and then see that nothing has changed.

I expect performance will suck a little bit, as we'll be re-reading all the files that have a content filter. But at least it won't cause bogus data to be inserted into the repository on every merge commit.

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 415508] Re: Content filtering breaks commit w/ merge parents

On Tue, 2009-08-18 at 17:53 +0000, John A Meinel wrote:
>
> I expect performance will suck a little bit, as we'll be re-reading
> all
> the files that have a content filter. But at least it won't cause
> bogus
> data to be inserted into the repository on every merge commit.

This would appear to still leave an api in place that isn't content
filter aware.

I appreciate that we need to fix this promptly, but it doesn't seem
complete without fixing / deprecating the get_content_summary api; and
if we fix it commit doesn't need to change.

-Rob

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> On Tue, 2009-08-18 at 17:53 +0000, John A Meinel wrote:
>> I expect performance will suck a little bit, as we'll be re-reading
>> all
>> the files that have a content filter. But at least it won't cause
>> bogus
>> data to be inserted into the repository on every merge commit.
>
> This would appear to still leave an api in place that isn't content
> filter aware.
>
> I appreciate that we need to fix this promptly, but it doesn't seem
> complete without fixing / deprecating the get_content_summary api; and
> if we fix it commit doesn't need to change.
>
> -Rob
>

So the thing is we can't know the size of the file without either
caching it into the dirstate or running the content filters.

With this patch, I believe it will fall down to the next check, which is
checking if the cached sha1 matches the sha1 of the only parent text,
and thus not require us to read the file at all. (So my earlier
statement about it re-reading all texts would be false.)

If you want to just deprecate the non 'record-iter-changes' path, that
is reasonable. I'm not sure the specific effect on commit, etc.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkqLGAUACgkQJdeBCYSNAAOQ4wCfTeDsFgGSFoxLrWZXAs8zmFHV
S/QAoItwEnletcotCooHmLA3xZm0GjkI
=20vK
-----END PGP SIGNATURE-----

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote : Re: [Bug 415508] [NEW] Content filtering breaks commit w/ merge parents

John A Meinel wrote:

> I dug into it, and found that there is a specific bug wrt content
> filtering and commit that isn't using "record_iter_changes".

Thanks for tracking this down and proposing a patch. I'll make it a
priority today to look over what you've found and assist getting a fix
landed.

Ian C.

Revision history for this message
Robert Collins (lifeless) wrote : Re: [Bug 415508] Re: Content filtering breaks commit w/ merge parents

On Tue, 2009-08-18 at 21:07 +0000, John A Meinel wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Robert Collins wrote:
> > On Tue, 2009-08-18 at 17:53 +0000, John A Meinel wrote:
> >> I expect performance will suck a little bit, as we'll be re-reading
> >> all
> >> the files that have a content filter. But at least it won't cause
> >> bogus
> >> data to be inserted into the repository on every merge commit.
> >
> > This would appear to still leave an api in place that isn't content
> > filter aware.
> >
> > I appreciate that we need to fix this promptly, but it doesn't seem
> > complete without fixing / deprecating the get_content_summary api; and
> > if we fix it commit doesn't need to change.
> >
> > -Rob
> >
>
> So the thing is we can't know the size of the file without either
> caching it into the dirstate or running the content filters.
>
> With this patch, I believe it will fall down to the next check, which is
> checking if the cached sha1 matches the sha1 of the only parent text,
> and thus not require us to read the file at all. (So my earlier
> statement about it re-reading all texts would be false.)
>
> If you want to just deprecate the non 'record-iter-changes' path, that
> is reasonable. I'm not sure the specific effect on commit, etc.

All I'm saying is that stopping-the-use-of-a-bad-API is not sufficient.
The /cause/ of the bug remains, and that API should be
deprecated/deleted/made to raise when content filters are in use, or
something along those lines.

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote :

FWIW, while poolie was reviewing the content filtering changes and assisting with the associated dirstate changes, I'm pretty sure he suggested that path_content_summary() ought to go one day. As best I can tell, there are only two client calls in bzrlib to tree.path_content_summary():

* one in commit.py
* one in transform.py (as it's implementing it's own path_content_summary())

In the second case, there's one call to transform.path_content_summary() and it only uses the kind field in the tuple, not the size or other fields.

So I think deprecating path_content_summary() and the non "record-iter-changes" commit path using it is feasible. Does this need to wait until Robert's selective file commit patch lands first or not?

If there's agreement in principle, I'll check the code bases of the popular plugins for any client calls and ask on the list as to whether there's any objections to it going.

Revision history for this message
Robert Collins (lifeless) wrote :

On Wed, 2009-08-19 at 06:48 +0000, Ian Clatworthy wrote:
>
> So I think deprecating path_content_summary() and the non
> "record-iter-
> changes" commit path using it is feasible. Does this need to wait
> until
> Robert's selective file commit patch lands first or not?

Its orthogonal - should be no disruption to my work which is all about
iter_changes.

-Rob

Revision history for this message
Martin Pool (mbp) wrote :

So is this now fixed by John's patch? If so, but followon work is needed to remove a dangerous API let's turn that into a separate bug.

Is this still critical and still needed for 2.0?

Revision history for this message
Martin Pool (mbp) wrote :

Robert says on irc:

This can still be reproduced by using 'bzr commit --excludes' or (before his patch lands) by doing 'bzr commit FILE'.

It doesn't cause loss of history but it does cause us to store files as changed when they're not really, which makes things slow, and makes the graph history too long. So it is pretty bad.

Changed in bzr:
assignee: nobody → Martin Pool (mbp)
status: Triaged → Confirmed
Revision history for this message
Martin Pool (mbp) wrote :

John's patch

http://bazaar.launchpad.net/~jameinel/bzr/1.17-content-filtering-commit/revision/4530

seems reasonable, but seems to indicate that we'd want a test for record_entry_contents (or a larger function, up to commit), checking that it gets the canonical size from the tree, not the convenient size. Even if this interface is renamed, there'd still be a chance commit was using the wrong one and we ought to test it.

Changed in bzr:
status: Confirmed → In Progress
Revision history for this message
Robert Collins (lifeless) wrote :

On Mon, 2009-08-24 at 01:09 +0000, Martin Pool wrote:
>
> seems reasonable, but seems to indicate that we'd want a test for
> record_entry_contents (or a larger function, up to commit), checking
> that it gets the canonical size from the tree, not the convenient
> size.
> Even if this interface is renamed, there'd still be a chance commit
> was
> using the wrong one and we ought to test it.

record_entry_contents is the right method to test; testing commit will
run the risk of using the fast code path and not actually testing the
code we want to.

We should also rename the API and drop the size field from it. Which
will prevent it being used wrongly.

-Rob

Revision history for this message
Martin Pool (mbp) wrote :

I'm still wondering whether this is meant to return the sha1 of the canonical or convenient content.

The steps seem to be:

1- rename get_path_content_summary to say get_canonical_content_summary; remove the size field. So then it will return (kind, exec, sha1_or_link_target).

2- there don't seem to be any record_entry_contents tests yet but we could add some, presumably per_repository.

3- if the test of step 2 is failing, we could merge john's branch and expect that it will then pass.

Ian pointed out that bzr-svn and bzr-git might rely on this interface. It would be nice to avoid breaking them. If they provide this interface but don't call it, we might be able to have a clean migration path.

I'm wondering here if we should actually have another Tree object that represents the filtered view of the tree, rather than proliferating methods with filtered=bool parameters. But that can be done separately.

The code that does filtering, eg internal_size_sha_file_byname seems to return both the size and the hash so it seems like any problem that affects one should affect the other.

I think before changing get_path_content_summary we need to be clear about what the code that's currently using it expects the behaviour to be, whether that's to return the filtered or unfiltered content.

Revision history for this message
Martin Pool (mbp) wrote :

It seems like we agree that path_content_summary, being a pre-content-filtering interface, ought to speak about the canonical form. So I added a test that it should, and it fails.

The generic WorkingTree path_content_summary code goes through self._sha_from_stat so this ought to go through the SHA1 provider that can tell you about content filtering.

(Incidentally it looks a lot like this will fail on pre-dirstate working trees through _sha_from_stat being undefined; maybe this code can never be reached in that case.)

path_content_summary seems to not actually return the hash, because the contract allows it to return None if it would be expensive to compute. In the case of filters its reasonable to assume that it would be expensive. This might mean my test has insufficient coverage if there are cases where it does both filter and cache.

At the moment WorkingTree.path_content_summary always directly uses the st_size as the size, but in fact it needs to got through something that can end up calling internal_size_sha_file_byname. However, that's going to end up filtering the whole file just to throw away the results, which is not great. Indeed there are other places in WorkingTree that seem to make the same assumption that they can just use the stat value. So that perhaps brings us back to just saying this interface should be removed altogether.

bzr-svn provides path_content_summary (nearly copy-and-pasted?) in svn workingtrees, and bzr-git doesn't use it at all in trunk afaics. So we should be ok to remove it.

The commit use of path_content_summary is very basic, really it only strongly cares about the kind. It does pass the content_summary down to record_entry_contents, which uses the link target etc but we could fudge that, or just remove the size.

TreeTransform also uses it and that might be harder to detangle. It uses it in _comparison_data, but only returns the kind and x-bit, so that's easy. It also has its own PreviewTree.path_content_summary which I doesn't seem to know about filtering, and it can probably cope without calling it.

So that's all good.

This has the benefit of path_content_summary never promising to return something that might require reading the whole file, which is likely to cause wasted work.

If you cut out the bits that path_content_summary can't quickly compute all it's doing is detecting the type and the x bit. It is useful to do those together so as to get a single stat call. Some of the other things like symlink targets take a separate io anyhow and the caller may not want them.

We'd need a new test for get_kind_and_executable_by_path.

bzr-svn and git do use record_entry_contents but don't look at the contents_summary. It's probably still better to keep the same form and just blank out the fields that are likely to be wrong?

Revision history for this message
Martin Pool (mbp) wrote :

My attached branch changes bzrlib.commit to not use path_contents_summary. It adds a new tree method get_kind_and_executable_from_path so that we can observe the kind and execute bit in one stat call, but so that we don't promise to do any more work than that.

I still need to add a test for that new method, deprecate path_contents_summary, and fix up the TreeTransform use of that method.

Revision history for this message
Robert Collins (lifeless) wrote :

On Mon, 2009-08-24 at 09:14 +0000, Martin Pool wrote:

> My attached branch changes bzrlib.commit to not use
> path_contents_summary. It adds a new tree method
> get_kind_and_executable_from_path so that we can observe the kind and
> execute bit in one stat call, but so that we don't promise to do any
> more work than that.
>
> I still need to add a test for that new method, deprecate
> path_contents_summary, and fix up the TreeTransform use of that
> method.

Is the semantics of the new method clearly defined vis-a-vis content
filters? E.g. a filter that forces all files without a #! to be
non-executable.

-Rob

Revision history for this message
Martin Pool (mbp) wrote :

The uses of path_content_summary in the test suite seem to be:

1- test_commit_builder, that use it as input to record_entry_contents: we could change them to pass in the expected (generally pretty simple) value instead

2- test_path_content_summary - these probably want to be updated to get_kind_and_executable

3- as for #2 but in test_transform

I think overall this is not an inherently dangerous api, but it is also not one that we particularly want to encourage people to use because it will tend to do too much work in reading in the whole file. What is dangerous here is that there's no clear systematic distinction between when we're talking about the convenient and when about the canonical forms.

Revision history for this message
Frits Jalvingh (fjalvingh) wrote :

> It doesn't cause loss of history but it does cause us to store files as changed when they're not really, which makes
> things slow, and makes the graph history too long. So it is pretty bad.
I would like to say that it is more than bad: it effectively causes the destruction (at high speed, in our case in a month) of a repository (in my view pulls/pushes of over 2 hours, and commits taking half an hour mean your repository is unusable), and because it cannot be undone this means the only way out is to recreate everything from scratch, loosing all history in the process. In my view that is disastrous. The impact of this problem on a team of about 15 man has been quite big in terms of work hours lost. I certainly hope no one else currently goes live with crlf content filtering....

I still need to test the fix (will do it, but later); this discussion and some earlier indications I got from jam made me very unsure whether this fix is something to try on production; I currently put a lot of effort of testing the 2a format instead.

Revision history for this message
Martin Pool (mbp) wrote :

2009/8/24 Robert Collins <email address hidden>:
> Is the semantics of the new method clearly defined vis-a-vis content
> filters? E.g. a filter that forces all files without a #! to be
> non-executable.

I don't think content filters at the moment can change the executable
bit. It might be useful if they could. That seems to require more of
this separation of "the tree on disk" vs "the filtered tree" that I'm
talking about. I don't think this change makes it harder.

--
Martin <http://launchpad.net/~mbp/>

Revision history for this message
Robert Collins (lifeless) wrote :

On Mon, 2009-08-24 at 09:32 +0000, Martin Pool wrote:
> The uses of path_content_summary in the test suite seem to be:
>
> 1- test_commit_builder, that use it as input to record_entry_contents:
> we could change them to pass in the expected (generally pretty simple)
> value instead
>
> 2- test_path_content_summary - these probably want to be updated to
> get_kind_and_executable
>
> 3- as for #2 but in test_transform
>
> I think overall this is not an inherently dangerous api, but it is also
> not one that we particularly want to encourage people to use because it
> will tend to do too much work in reading in the whole file. What is
> dangerous here is that there's no clear systematic distinction between
> when we're talking about the convenient and when about the canonical
> forms.

So, its worth noting I think, that we got significant performance
increases by path_content_summary, because it avoided path computation
multiple times inside tree; the new commit code path largely addresses
that :- we're actually rolling back a change here ;)

-Rob

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Martin Pool wrote:
> I'm still wondering whether this is meant to return the sha1 of the
> canonical or convenient content.

We only deal in canonical sha1s.

The sha1 we store in the dirstate is the canonical form.

>
> The steps seem to be:

...

> The code that does filtering, eg internal_size_sha_file_byname seems to
> return both the size and the hash so it seems like any problem that
> affects one should affect the other.
>
> I think before changing get_path_content_summary we need to be clear
> about what the code that's currently using it expects the behaviour to
> be, whether that's to return the filtered or unfiltered content.
>

So path_content_summary is doing:

st = os.lstat(path)
size = st.st_size

sha1_and_other_info = self._check_the_hashcache(path, st)

In other words, it is ignoring the data stored in the dirstate when
returning the size of the file. I believe we store the size into the
dirstate in such a way that we could return it, rather than looking at
st.st_size. Though I'm not positive. (We do for old revisions, but I
would have thought we wouldn't for the *current* column.)

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkqSr1sACgkQJdeBCYSNAANdnwCgzsRbuponDlkb3J5LCMn5qrsl
n7sAn18N3wcaZYVdsbhkfS4FmL43nPsz
=HCRK
-----END PGP SIGNATURE-----

Revision history for this message
John A Meinel (jameinel) wrote :

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Collins wrote:
> On Mon, 2009-08-24 at 09:32 +0000, Martin Pool wrote:
>> The uses of path_content_summary in the test suite seem to be:
>>
>> 1- test_commit_builder, that use it as input to record_entry_contents:
>> we could change them to pass in the expected (generally pretty simple)
>> value instead
>>
>> 2- test_path_content_summary - these probably want to be updated to
>> get_kind_and_executable
>>
>> 3- as for #2 but in test_transform
>>
>> I think overall this is not an inherently dangerous api, but it is also
>> not one that we particularly want to encourage people to use because it
>> will tend to do too much work in reading in the whole file. What is
>> dangerous here is that there's no clear systematic distinction between
>> when we're talking about the convenient and when about the canonical
>> forms.
>
> So, its worth noting I think, that we got significant performance
> increases by path_content_summary, because it avoided path computation
> multiple times inside tree; the new commit code path largely addresses
> that :- we're actually rolling back a change here ;)
>
> -Rob
>

So the other thing that path_content_summary returns is the sha1 of the
file if it has it available from the dirstate. Which *is* useful and
important to be able to do (otherwise you have to read the file to
determine that fact.)

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkqSsBEACgkQJdeBCYSNAANAgQCgm2j3QgpuuaU1IE4aQfdcoOON
CFsAoMwjC3aHwLZKa1OgjglHUC5FVA4P
=BLNe
-----END PGP SIGNATURE-----

Revision history for this message
Martin Pool (mbp) wrote :

I'm going to consolidate the content filtering docs from other places or email so that we have a clear statement of how it's supposed to work.

Revision history for this message
Martin Pool (mbp) wrote :

2009/8/24 Robert Collins <email address hidden>:
> So, its worth noting I think, that we got significant performance
> increases by path_content_summary, because it avoided path computation
> multiple times inside tree; the new commit code path largely addresses
> that :- we're actually rolling back a change here ;)

Right, I realized that after posting last night. We need to make sure
that if the canonical hash is known, we use it.

The problems in path_content_summary can be seen as:

 * it does not consistently use the canonical form
 * it's contract is meant to be "return the size, and return the SHA1
if you can do that cheaply", but in fact it needs to be allowed to not
return the size if it can't do it cheaply

I'm adding some developer documentation about content filtering now to
make some of the issues discussed here more permanent and explicit.

--
Martin <http://launchpad.net/~mbp/>

Revision history for this message
Martin Pool (mbp) wrote :

The doc update is in https://code.edge.launchpad.net/~mbp/bzr/doc/+merge/10638

The dirstate file does promise to hold the size of the working tree copy, presumably the size of the canonical form. However, it doesn't actually do so :-( there are multiple places where it compares it to or assigns it from the stat value size. So there's no reliable way to know the size of the canonical form other than by filtering it.

(Even if we fixed all the callers people with existing trees might keep having problems with it.)

What I'm going to do overall is: change path_content_summary so that it doesn't return the size unless it's known to be correct; in fact don't return it at all for formats that support content filtering. The change commit so that it doesn't look at the size. This should keep performance similar to what we have at present.

Revision history for this message
Martin Pool (mbp) wrote :

mp posted https://code.edge.launchpad.net/~mbp/bzr/415508-content-filtering/+merge/10640

I'm not totally satisfied with the level of testing of this - I feel there should be an overall smoketest or blackbox test that commit works with content filtering, and there really is not at the moment. But, that could possibly be tackled separately.

Revision history for this message
Martin Pool (mbp) wrote :

Bug 418458 for wanting integration tests and bug 419038 for having better testing of record_entry_contents.

Revision history for this message
Martin Pool (mbp) wrote :

Fixed in trunk and 2.0.

Changed in bzr:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.