Skip to content

More commit-graph/Bloom filter improvements #659

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

derrickstolee
Copy link

@derrickstolee derrickstolee commented Jun 15, 2020

This builds on sg/commit-graph-cleanups, which took several patches from Szeder's series [1] and applied them almost directly to a more-recent version of Git [2].

[1] https://lore.kernel.org/git/[email protected]/
[2] https://lore.kernel.org/git/[email protected]/

This series adds a few extra improvements, several of which are rooted in Szeder's original series. I maintained his authorship and sign-off, even though the patches did not apply or cherry-pick at all.

(In v2, I have removed the range-diff comparison to Szeder's series, so look at the v1 cover letter for that.)

The patches have been significantly reordered. René pointed out (and Szeder discovered in the old thread) that we are not re-using the bloom_filter_settings from the existing commit-graph when writing a new one.

  1. commit-graph: place bloom_settings in context
  2. commit-graph: change test to die on parse, not load

These are mostly the same, except we now use a pointer to the settings in the commit-graph write context.

  1. bloom: get_bloom_filter() cleanups

This new patch is a subtle change in behavior that will become relevant in the very next patch. In fact, if we swap patch 3 and 4, then t4216-log-bloom.sh fails with a segfault due to a NULL filter.

  1. commit-graph: persist existence of changed-paths

This patch is now updated to use the existing changed-path filter settings.

  1. commit-graph: unify the signatures of all write_graph_chunk_*() functions
  2. commit-graph: simplify chunk writes into loop
  3. commit-graph: check chunk sizes after writing

These are all the same as before.

  1. revision.c: fix whitespace

This patch is the cleanup part of Taylor's patch.

  1. revision: empty pathspecs should not use Bloom filters

Here is Taylor's fix for empty pathspecs.

  1. commit-graph: check all leading directories in changed path Bloom filters
  2. bloom: enforce a minimum size of 8 bytes

Finally, we get these performance patches. Patch 10 is updated to have the better logic around directory separators and empty paths. Also, the list of Bloom keys is ordered with the deepest path first. That has some tiny performance benefits for deep paths since we can short-circuit the multi-key checks more often. That code path is much faster than the tree parsing, so it is hard to measure any change.

Updates in V3:

  • Responded to René's feedback.
  • Fixed the test in Patch 4 to use GIT_TEST_ variables and extend the GIT_TRACE2 depth to work with 'seen' branch.

Update in V4;

  • Fixed the bug with "too large" commits. Test is added. The fixup! I sent earlier doesn't actually squash cleanly, so I resolved the conflicts during the rebase.

Thanks,
-Stolee

Cc: [email protected], [email protected], [email protected]
cc: SZEDER Gábor [email protected]
cc: Taylor Blau [email protected]
cc: Derrick Stolee [email protected]

@derrickstolee
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 15, 2020

Submitted as [email protected]

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 17, 2020

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <[email protected]> writes:

> This builds on sg/commit-graph-cleanups,...

How ready is that topic, do you think?  I'd rather not to see too
many patches piled on top of what is not even in 'next', but I do
not remember it reviewed seriously (I did take a look or two at it
myself before queuing the series, but that does not quite count).

Will queue to extend the topic for now.

Thanks.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 18, 2020

This branch is now known as sg/commit-graph-cleanups.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 18, 2020

This patch series was integrated into pu via git@7ac860a.

@gitgitgadget gitgitgadget bot added the pu label Jun 18, 2020
@gitgitgadget
Copy link

gitgitgadget bot commented Jun 18, 2020

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 6/17/2020 5:21 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <[email protected]> writes:
> 
>> This builds on sg/commit-graph-cleanups,...
> 
> How ready is that topic, do you think?  I'd rather not to see too
> many patches piled on top of what is not even in 'next', but I do
> not remember it reviewed seriously (I did take a look or two at it
> myself before queuing the series, but that does not quite count).

That topic was my attempt to apply the "easy and obvious" changes
from Szeder's proof-of-concept. In some sense, you could consider
them authored by Szeder and reviewed by me. But also, I did need to
tweak some things, so some review from others would be helpful.

> Will queue to extend the topic for now.

I'll refrain from pushing any more in this direction until more
review comes along. I found myself with a need to do something
productive and familiar, so I started doing commit-graph
performance stuff.

Thanks,
-Stolee

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 18, 2020

This patch series was integrated into pu via git@7fd559e.

@@ -882,6 +882,7 @@ struct write_commit_graph_context {

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, René Scharfe wrote (reply to this):

Am 15.06.20 um 22:14 schrieb Derrick Stolee via GitGitGadget:
> From: Derrick Stolee <[email protected]>
>
> Place an instance of struct bloom_settings into the struct
> write_commit_graph_context. This allows simplifying the function
> prototype of write_graph_chunk_bloom_data(). This will allow us
> to combine the function prototypes and use function pointers to
> simplify write_commit_graph_file().
>
> Reported-by: SZEDER Gábor <[email protected]>
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
>  commit-graph.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 887837e8826..05b7035d8d5 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -882,6 +882,7 @@ struct write_commit_graph_context {
>
>  	const struct split_commit_graph_opts *split_opts;
>  	size_t total_bloom_filter_data_size;
> +	struct bloom_filter_settings bloom_settings;

That structure is quite busy already, so adding one more member wouldn't
matter much.

Passing so many things to lots of functions makes it harder to argue
about them, though, as all of them effectively become part of their
signature, and you have to look at their implementation to see which
pseudo-parameters they actually use.  It's like a God object.

>  };
>
>  static void write_graph_chunk_fanout(struct hashfile *f,
> @@ -1103,8 +1104,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
>  }
>
>  static void write_graph_chunk_bloom_data(struct hashfile *f,
> -					 struct write_commit_graph_context *ctx,
> -					 const struct bloom_filter_settings *settings)
> +					 struct write_commit_graph_context *ctx)
>  {
>  	struct commit **list = ctx->commits.list;
>  	struct commit **last = ctx->commits.list + ctx->commits.nr;
> @@ -1116,9 +1116,9 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
>  			_("Writing changed paths Bloom filters data"),
>  			ctx->commits.nr);
>
> -	hashwrite_be32(f, settings->hash_version);
> -	hashwrite_be32(f, settings->num_hashes);
> -	hashwrite_be32(f, settings->bits_per_entry);
> +	hashwrite_be32(f, ctx->bloom_settings.hash_version);
> +	hashwrite_be32(f, ctx->bloom_settings.num_hashes);
> +	hashwrite_be32(f, ctx->bloom_settings.bits_per_entry);
>
>  	while (list < last) {
>  		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
> @@ -1541,6 +1541,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  	struct object_id file_hash;
>  	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>
> +	ctx->bloom_settings = bloom_settings;

So we use the defaults, no customization?  Then you could simply move
the declaration of bloom_settings from write_commit_graph_file() to
write_graph_chunk_bloom_data().  Glancing at pu I don't see additional
uses there, so no need to put it into the context (yet?).

René

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 6/18/2020 4:30 PM, René Scharfe wrote:
> Am 15.06.20 um 22:14 schrieb Derrick Stolee via GitGitGadget:
>> From: Derrick Stolee <[email protected]>
>>
>> Place an instance of struct bloom_settings into the struct
>> write_commit_graph_context. This allows simplifying the function
>> prototype of write_graph_chunk_bloom_data(). This will allow us
>> to combine the function prototypes and use function pointers to
>> simplify write_commit_graph_file().
>>
>> Reported-by: SZEDER Gábor <[email protected]>
>> Signed-off-by: Derrick Stolee <[email protected]>
>> ---
>>  commit-graph.c | 14 ++++++++------
>>  1 file changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/commit-graph.c b/commit-graph.c
>> index 887837e8826..05b7035d8d5 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -882,6 +882,7 @@ struct write_commit_graph_context {
>>
>>  	const struct split_commit_graph_opts *split_opts;
>>  	size_t total_bloom_filter_data_size;
>> +	struct bloom_filter_settings bloom_settings;
> 
> That structure is quite busy already, so adding one more member wouldn't
> matter much.
> 
> Passing so many things to lots of functions makes it harder to argue
> about them, though, as all of them effectively become part of their
> signature, and you have to look at their implementation to see which
> pseudo-parameters they actually use.  It's like a God object.

Correct. The write_commit_graph_context _is_ a God object for the
commit-graph write. The good news is that it is limited only to
commit-graph.c and the write operations therein. Hopefully, the
code organization benefits enough from this structure to justify
the massive struct.

In contrast, it's still smaller and more contained than
"struct rev_info"!

>>  };
>>
>>  static void write_graph_chunk_fanout(struct hashfile *f,
>> @@ -1103,8 +1104,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
>>  }
>>
>>  static void write_graph_chunk_bloom_data(struct hashfile *f,
>> -					 struct write_commit_graph_context *ctx,
>> -					 const struct bloom_filter_settings *settings)
>> +					 struct write_commit_graph_context *ctx)
>>  {
>>  	struct commit **list = ctx->commits.list;
>>  	struct commit **last = ctx->commits.list + ctx->commits.nr;
>> @@ -1116,9 +1116,9 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
>>  			_("Writing changed paths Bloom filters data"),
>>  			ctx->commits.nr);
>>
>> -	hashwrite_be32(f, settings->hash_version);
>> -	hashwrite_be32(f, settings->num_hashes);
>> -	hashwrite_be32(f, settings->bits_per_entry);
>> +	hashwrite_be32(f, ctx->bloom_settings.hash_version);
>> +	hashwrite_be32(f, ctx->bloom_settings.num_hashes);
>> +	hashwrite_be32(f, ctx->bloom_settings.bits_per_entry);
>>
>>  	while (list < last) {
>>  		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>> @@ -1541,6 +1541,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>>  	struct object_id file_hash;
>>  	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>>
>> +	ctx->bloom_settings = bloom_settings;
> 
> So we use the defaults, no customization?  Then you could simply move
> the declaration of bloom_settings from write_commit_graph_file() to
> write_graph_chunk_bloom_data().  Glancing at pu I don't see additional
> uses there, so no need to put it into the context (yet?).

It certainly is not customized by a user (yet). However, you do make an
excellent point that I need to be more careful here! Patch 8
(commit-graph: persist existence of changed-paths) needs to load the
bloom_filter_settings from the existing commit-graph so we can be
future-proof from a future version customizing the settings inside the
commit-graph file!

This means that in v2 I'll move patches 7 & 8 to be after patch 1 and
add a test to verify the filter settings are preserved (after manually
changing the data in the file).

Thanks!
-Stolee

@@ -882,10 +882,11 @@ struct write_commit_graph_context {

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, René Scharfe wrote (reply to this):

Am 15.06.20 um 22:14 schrieb SZEDER Gábor via GitGitGadget:
> -static void write_graph_chunk_oids(struct hashfile *f, int hash_len,
> -				   struct write_commit_graph_context *ctx)
> +static int write_graph_chunk_oids(struct hashfile *f,
> +				  struct write_commit_graph_context *ctx)
>  {
>  	struct commit **list = ctx->commits.list;
>  	int count;
>  	for (count = 0; count < ctx->commits.nr; count++, list++) {
>  		display_progress(ctx->progress, ++ctx->progress_cnt);
> -		hashwrite(f, (*list)->object.oid.hash, (int)hash_len);
> +		hashwrite(f, (*list)->object.oid.hash, (int)the_hash_algo->rawsz);

Before the cast was forcing an int into an int (huh?), now it forces a
size_t into an int, but hashwrite() expects an unsigned int.  Do we
really need that cast?

>  	}
> +
> +	return 0;
>  }
>
>  static const unsigned char *commit_to_sha1(size_t index, void *table)
> @@ -926,8 +930,8 @@ static const unsigned char *commit_to_sha1(size_t index, void *table)
>  	return commits[index]->object.oid.hash;
>  }
>
> -static void write_graph_chunk_data(struct hashfile *f, int hash_len,
> -				   struct write_commit_graph_context *ctx)
> +static int write_graph_chunk_data(struct hashfile *f,
> +				  struct write_commit_graph_context *ctx)
>  {
>  	struct commit **list = ctx->commits.list;
>  	struct commit **last = ctx->commits.list + ctx->commits.nr;
> @@ -944,7 +948,7 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
>  			die(_("unable to parse commit %s"),
>  				oid_to_hex(&(*list)->object.oid));
>  		tree = get_commit_tree_oid(*list);
> -		hashwrite(f, tree->hash, hash_len);
> +		hashwrite(f, tree->hash, the_hash_algo->rawsz);

... and here's the answer: No, we don't need to cast.

René

@@ -882,10 +882,11 @@ struct write_commit_graph_context {

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, René Scharfe wrote (reply to this):

Am 15.06.20 um 22:14 schrieb SZEDER Gábor via GitGitGadget:
> From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <[email protected]>
>
> In write_commit_graph_file() we now have one block of code filling the
> array of 'struct chunk_info' with the IDs and sizes of chunks to be
> written, and an other block of code calling the functions responsible
> for writing individual chunks.  In case of optional chunks like Extra
> Edge List an Base Graphs List there is also a condition checking
> whether that chunk is necessary/desired, and that same condition is
> repeated in both blocks of code. Other, newer chunks have similar
> optional conditions.
>
> Eliminate these repeated conditions by storing the function pointers
> responsible for writing individual chunks in the 'struct chunk_info'
> array as well, and calling them in a loop to write the commit-graph
> file.  This will open up the possibility for a bit of foolproofing in
> the following patch.

OK.  An alternative would be a switch in the loop that calls the right
function based on the chunk id.  That would not require uniform
interfaces for all write functions; patch 2 would not be necessary.

>
> Signed-off-by: SZEDER Gábor <[email protected]>
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
>  commit-graph.c | 31 +++++++++++++++++++------------
>  1 file changed, 19 insertions(+), 12 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 3bae1e52ed0..78e023be664 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1532,9 +1532,13 @@ static int write_graph_chunk_base(struct hashfile *f,
>  	return 0;
>  }
>
> +typedef int (*chunk_write_fn)(struct hashfile *f,
> +			      struct write_commit_graph_context *ctx);
> +
>  struct chunk_info {
>  	uint32_t id;
>  	uint64_t size;
> +	chunk_write_fn write_fn;
>  };
>
>  static int write_commit_graph_file(struct write_commit_graph_context *ctx)
> @@ -1591,27 +1595,34 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>
>  	chunks[0].id = GRAPH_CHUNKID_OIDFANOUT;
>  	chunks[0].size = GRAPH_FANOUT_SIZE;
> +	chunks[0].write_fn = write_graph_chunk_fanout;
>  	chunks[1].id = GRAPH_CHUNKID_OIDLOOKUP;
>  	chunks[1].size = hashsz * ctx->commits.nr;
> +	chunks[1].write_fn = write_graph_chunk_oids;
>  	chunks[2].id = GRAPH_CHUNKID_DATA;
>  	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
> +	chunks[2].write_fn = write_graph_chunk_data;
>  	if (ctx->num_extra_edges) {
>  		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
>  		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
> +		chunks[num_chunks].write_fn = write_graph_chunk_extra_edges;
>  		num_chunks++;
>  	}
>  	if (ctx->changed_paths) {
>  		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMINDEXES;
>  		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
> +		chunks[num_chunks].write_fn = write_graph_chunk_bloom_indexes;
>  		num_chunks++;
>  		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMDATA;
>  		chunks[num_chunks].size = sizeof(uint32_t) * 3
>  					  + ctx->total_bloom_filter_data_size;
> +		chunks[num_chunks].write_fn = write_graph_chunk_bloom_data;
>  		num_chunks++;
>  	}
>  	if (ctx->num_commit_graphs_after > 1) {
>  		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
>  		chunks[num_chunks].size = hashsz * (ctx->num_commit_graphs_after - 1);
> +		chunks[num_chunks].write_fn = write_graph_chunk_base;
>  		num_chunks++;
>  	}
>
> @@ -1647,19 +1658,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  			progress_title.buf,
>  			num_chunks * ctx->commits.nr);
>  	}
> -	write_graph_chunk_fanout(f, ctx);
> -	write_graph_chunk_oids(f, ctx);
> -	write_graph_chunk_data(f, ctx);
> -	if (ctx->num_extra_edges)
> -		write_graph_chunk_extra_edges(f, ctx);
> -	if (ctx->changed_paths) {
> -		write_graph_chunk_bloom_indexes(f, ctx);
> -		write_graph_chunk_bloom_data(f, ctx);
> -	}
> -	if (ctx->num_commit_graphs_after > 1 &&
> -	    write_graph_chunk_base(f, ctx)) {
> -		return -1;
> +
> +	for (i = 0; i < num_chunks; i++) {
> +		if (chunks[i].write_fn(f, ctx)) {
> +			error(_("failed writing chunk with id %"PRIx32""),
> +			      chunks[i].id);

This error message is new and not mentioned in the commit message.
write_graph_chunk_base() seems to be the only write function that can
return something else than 0, and it already reports an error in that
case.  So do we really want the one here as well?

René

@@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
{
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, René Scharfe wrote (reply to this):

Am 15.06.20 um 22:14 schrieb SZEDER Gábor via GitGitGadget:
> From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <[email protected]>
>
> The file 'dir/subdir/file' can only be modified if its leading
> directories 'dir' and 'dir/subdir' are modified as well.
>
> So when checking modified path Bloom filters looking for commits
> modifying a path with multiple path components, then check not only
> the full path in the Bloom filters, but all its leading directories as
> well.  Take care to check these paths in "deepest first" order,
> because it's the full path that is least likely to be modified, and
> the Bloom filter queries can short circuit sooner.
>
> This can significantly reduce the average false positive rate, by
> about an order of magnitude or three(!), and can further speed up
> pathspec-limited revision walks.  The table below compares the average
> false positive rate and runtime of
>
>   git rev-list HEAD -- "$path"
>
> before and after this change for 5000+ randomly* selected paths from
> each repository:
>
>                     Average false           Average        Average
>                     positive rate           runtime        runtime
>                   before     after     before     after   difference
>   ------------------------------------------------------------------
>   git             3.220%   0.7853%     0.0558s   0.0387s   -30.6%
>   linux           2.453%   0.0296%     0.1046s   0.0766s   -26.8%
>   tensorflow      2.536%   0.6977%     0.0594s   0.0420s   -29.2%

Nice!

>
> *Path selection was done with the following pipeline:
>
> 	git ls-tree -r --name-only HEAD | sort -R | head -n 5000
>
> The improvements in runtime are much smaller than the improvements in
> average false positive rate, as we are clearly reaching diminishing
> returns here.  However, all these timings depend on that accessing
> tree objects is reasonably fast (warm caches).  If we had a partial
> clone and the tree objects had to be fetched from a promisor remote,
> e.g.:
>
>   $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
>   $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
>         commit-graph write --reachable
>   $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
>   $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
>         rev-list HEAD -- "$path"
>
> then checking all leading path component can reduce the runtime from
> over an hour to a few seconds (and this is with the clone and the
> promisor on the same machine).
>
> This adjusts the tracing values in t4216-log-bloom.sh, which provides a
> concrete way to notice the improvement.
>
> Signed-off-by: SZEDER Gábor <[email protected]>
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
>  revision.c           | 35 ++++++++++++++++++++++++++---------
>  revision.h           |  6 ++++--
>  t/t4216-log-bloom.sh |  2 +-
>  3 files changed, 31 insertions(+), 12 deletions(-)
>
> diff --git a/revision.c b/revision.c
> index c644c660917..027ae3982b4 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>  {
>  	struct pathspec_item *pi;
>  	char *path_alloc = NULL;
> -	const char *path;
> +	const char *path, *p;
>  	int last_index;
> -	int len;
> +	size_t len;
> +	int path_component_nr = 0, j;
>
>  	if (!revs->commits)
>  		return;
> @@ -705,8 +706,22 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>
>  	len = strlen(path);
>
> -	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
> -	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
> +	p = path;
> +	do {
> +		p = strchrnul(p + 1, '/');
> +		path_component_nr++;
> +	} while (p - path < len);

Hmm, that "+ 1" makes me a bit nervous.  Can we be sure that path is not
an empty string?

And shouldn't we use is_dir_sep() or find_last_dir_sep() instead of
hard-coding a slash?

> +
> +	revs->bloom_keys_nr = path_component_nr;
> +	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
> +
> +	p = path;
> +	for (j = 0; j < revs->bloom_keys_nr; j++) {
> +		p = strchrnul(p + 1, '/');

Same here, of course.

Also note that this puts shorter sub-strings first.


> +
> +		fill_bloom_key(path, p - path, &revs->bloom_keys[j],
> +			       revs->bloom_filter_settings);
> +	}
>
>  	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
>  		atexit(trace2_bloom_filter_statistics_atexit);
> @@ -720,7 +735,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
>  						 struct commit *commit)
>  {
>  	struct bloom_filter *filter;
> -	int result;
> +	int result = 1, j;
>
>  	if (!revs->repo->objects->commit_graph)
>  		return -1;
> @@ -740,9 +755,11 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
>  		return -1;
>  	}
>
> -	result = bloom_filter_contains(filter,
> -				       revs->bloom_key,
> -				       revs->bloom_filter_settings);
> +	for (j = 0; result && j < revs->bloom_keys_nr; j++) {
> +		result = bloom_filter_contains(filter,
> +					       &revs->bloom_keys[j],
> +					       revs->bloom_filter_settings);
> +	}

This checks shorter sub-strings first, contradicting the "deepest first"
strategy mentioned in the commit message.

This can easily be fixed by inverting the traversal of one of the loops,
of course.  Or perhaps do something like this?

	revs->bloom_keys = NULL;
	revs->bloom_keys_nr = 0;
	strbuf_add(&path, pi->match, pi->len);
	strbuf_trim_trailing_dir_sep(&path);
	for (;;) {
		const char *sep;
		ALLOC_GROW(revs->bloom_keys, revs->bloom_keys_nr + 1, alloc);
		fill_bloom_key(path.buf, path.len,
			       &revs->bloom_keys[revs->bloom_keys_nr++],
			       revs->bloom_filter_settings);
		sep = find_last_dir_sep(path.buf);
		if (!sep)
			break;
		strbuf_setlen(&path, sep - path.buf);
	}
	strbuf_release(&path);

The find_last_dir_sep() calls scan the first part of the string over and
over, which is a bit silly.  A strbuf_trim_trailing_path_component()
could start at the end and scan backwards if that turns out to be an
actual problem.

ALLOC_GROW wastes memory on revs->bloom_keys, and reallocating instead
of allocating the right size from the start has a cost as well, but I'd
expect this to be dwarfed by the actual revision walk.

>
>  	if (result)
>  		count_bloom_filter_maybe++;
> @@ -782,7 +799,7 @@ static int rev_compare_tree(struct rev_info *revs,
>  			return REV_TREE_SAME;
>  	}
>
> -	if (revs->bloom_key && !nth_parent) {
> +	if (revs->bloom_keys_nr && !nth_parent) {
>  		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
>
>  		if (bloom_ret == 0)
> diff --git a/revision.h b/revision.h
> index 7c026fe41fc..abbfb4ab59a 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -295,8 +295,10 @@ struct rev_info {
>  	struct topo_walk_info *topo_walk_info;
>
>  	/* Commit graph bloom filter fields */
> -	/* The bloom filter key for the pathspec */
> -	struct bloom_key *bloom_key;
> +	/* The bloom filter key(s) for the pathspec */
> +	struct bloom_key *bloom_keys;
> +	int bloom_keys_nr;
> +
>  	/*
>  	 * The bloom filter settings used to generate the key.
>  	 * This is loaded from the commit-graph being used.
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> index c7011f33e2c..c13b97d3bda 100755
> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -142,7 +142,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
>
>  test_bloom_filters_used_when_some_filters_are_missing () {
>  	log_args=$1
> -	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":8,\"definitely_not\":6"
> +	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":8"
>  	setup "$log_args" &&
>  	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
>  	test_cmp log_wo_bloom log_w_bloom
>

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, René Scharfe wrote (reply to this):

Am 18.06.20 um 22:31 schrieb René Scharfe:
> Am 15.06.20 um 22:14 schrieb SZEDER Gábor via GitGitGadget:
>> --- a/revision.c
>> +++ b/revision.c
>> @@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>>  {
>>  	struct pathspec_item *pi;
>>  	char *path_alloc = NULL;
>> -	const char *path;
>> +	const char *path, *p;
>>  	int last_index;
>> -	int len;
>> +	size_t len;
>> +	int path_component_nr = 0, j;
>>
>>  	if (!revs->commits)
>>  		return;
>> @@ -705,8 +706,22 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>>
>>  	len = strlen(path);
>>
>> -	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
>> -	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
>> +	p = path;
>> +	do {
>> +		p = strchrnul(p + 1, '/');
>> +		path_component_nr++;
>> +	} while (p - path < len);

> And shouldn't we use is_dir_sep() or find_last_dir_sep() instead of
> hard-coding a slash?

Not necessarily.  Paths should be normalized to use one specific
separator, probably slash, both when building and querying the Bloom
filter.  Otherwise a filter that knows e.g. "foo/bar" could confidently
claim that "foo\bar" does not match.  If this is done in a previous
step then using a literal slash here would be correct.

René

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 18, 2020

This patch series was integrated into pu via git@6c75447.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 19, 2020

This patch series was integrated into pu via git@518b032.

@@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
{
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

Hi Stolee,

On Mon, Jun 15, 2020 at 08:14:50PM +0000, SZEDER Gábor via GitGitGadget wrote:
> From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <[email protected]>
>
> The file 'dir/subdir/file' can only be modified if its leading
> directories 'dir' and 'dir/subdir' are modified as well.
>
> So when checking modified path Bloom filters looking for commits
> modifying a path with multiple path components, then check not only
> the full path in the Bloom filters, but all its leading directories as
> well.  Take care to check these paths in "deepest first" order,
> because it's the full path that is least likely to be modified, and
> the Bloom filter queries can short circuit sooner.
>
> This can significantly reduce the average false positive rate, by
> about an order of magnitude or three(!), and can further speed up
> pathspec-limited revision walks.  The table below compares the average
> false positive rate and runtime of
>
>   git rev-list HEAD -- "$path"
>
> before and after this change for 5000+ randomly* selected paths from
> each repository:
>
>                     Average false           Average        Average
>                     positive rate           runtime        runtime
>                   before     after     before     after   difference
>   ------------------------------------------------------------------
>   git             3.220%   0.7853%     0.0558s   0.0387s   -30.6%
>   linux           2.453%   0.0296%     0.1046s   0.0766s   -26.8%
>   tensorflow      2.536%   0.6977%     0.0594s   0.0420s   -29.2%
>
> *Path selection was done with the following pipeline:
>
> 	git ls-tree -r --name-only HEAD | sort -R | head -n 5000
>
> The improvements in runtime are much smaller than the improvements in
> average false positive rate, as we are clearly reaching diminishing
> returns here.  However, all these timings depend on that accessing
> tree objects is reasonably fast (warm caches).  If we had a partial
> clone and the tree objects had to be fetched from a promisor remote,
> e.g.:
>
>   $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
>   $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
>         commit-graph write --reachable
>   $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
>   $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
>         rev-list HEAD -- "$path"
>
> then checking all leading path component can reduce the runtime from
> over an hour to a few seconds (and this is with the clone and the
> promisor on the same machine).
>
> This adjusts the tracing values in t4216-log-bloom.sh, which provides a
> concrete way to notice the improvement.
>
> Signed-off-by: SZEDER Gábor <[email protected]>
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
>  revision.c           | 35 ++++++++++++++++++++++++++---------
>  revision.h           |  6 ++++--
>  t/t4216-log-bloom.sh |  2 +-
>  3 files changed, 31 insertions(+), 12 deletions(-)
>
> diff --git a/revision.c b/revision.c
> index c644c660917..027ae3982b4 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>  {
>  	struct pathspec_item *pi;
>  	char *path_alloc = NULL;
> -	const char *path;
> +	const char *path, *p;
>  	int last_index;
> -	int len;
> +	size_t len;
> +	int path_component_nr = 0, j;
>
>  	if (!revs->commits)
>  		return;
> @@ -705,8 +706,22 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>
>  	len = strlen(path);
>
> -	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
> -	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
> +	p = path;
> +	do {
> +		p = strchrnul(p + 1, '/');
> +		path_component_nr++;
> +	} while (p - path < len);
> +
> +	revs->bloom_keys_nr = path_component_nr;
> +	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
> +
> +	p = path;
> +	for (j = 0; j < revs->bloom_keys_nr; j++) {
> +		p = strchrnul(p + 1, '/');
> +
> +		fill_bloom_key(path, p - path, &revs->bloom_keys[j],
> +			       revs->bloom_filter_settings);
> +	}
>

Somewhat related to our off-list discussion yesterday, there is a bug in
both 2.27 and this patch which produces incorrect results when (1)
Bloom filters are enabled, and (2) we are doing a revision walk from
root with the pathspec '.'.

What appears to be going on is that our normalization takes '.' -> '',
and then we form a Bloom key based on the empty string, which will
return 'definitely not' when querying the Bloom filter some of the time,
which should never happen. This is a consequence of never inserting the
empty key into the Bloom filter upon generation.

As a result, I have patched this in GitHub's fork (which is currently
based on 2.27 and doesn't have these patches yet) by doing an early
return when 'strlen(path) == 0'. Since it looks like these patches are
going to land, here is some clean-up and a fix for the bug that you
should feel free to test with and apply on top:

--- >8 ---

diff --git a/revision.c b/revision.c
index 8bd383b1dd..123e72698d 100644
--- a/revision.c
+++ b/revision.c
@@ -670,10 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 {
        struct pathspec_item *pi;
        char *path_alloc = NULL;
-       const char *path, *p;
+       char *path, *p;
        int last_index;
        size_t len;
-       int path_component_nr = 0, j;
+       int path_component_nr = 1, j;

        if (!revs->commits)
                return;
@@ -698,29 +698,33 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)

        /* remove single trailing slash from path, if needed */
        if (pi->match[last_index] == '/') {
-           path_alloc = xstrdup(pi->match);
-           path_alloc[last_index] = '\0';
-           path = path_alloc;
-       } else
-           path = pi->match;
+               path_alloc = xstrdup(pi->match);
+               path_alloc[last_index] = '\0';
+               path = path_alloc;
+       } else {
+               path = pi->match;
+               len = pi->len;
+       }

-       len = strlen(path);
+       if (!len)
+               return;

-       p = path;
        do {
-               p = strchrnul(p + 1, '/');
-               path_component_nr++;
-       } while (p - path < len);
+               if (is_dir_sep(*p)) {
+                       *p = '\0';
+                       path_component_nr++;
+               }
+       } while (*p++);

        revs->bloom_keys_nr = path_component_nr;
        ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);

        p = path;
        for (j = 0; j < revs->bloom_keys_nr; j++) {
-               p = strchrnul(p + 1, '/');
-
-               fill_bloom_key(path, p - path, &revs->bloom_keys[j],
+               size_t plen = strlen(p);
+               fill_bloom_key(p, plen, &revs->bloom_keys[j],
                               revs->bloom_filter_settings);
+               p += plen;
        }

        if (trace2_is_enabled() && !bloom_filter_atexit_registered) {

>  	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
>  		atexit(trace2_bloom_filter_statistics_atexit);
> @@ -720,7 +735,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
>  						 struct commit *commit)
>  {
>  	struct bloom_filter *filter;
> -	int result;
> +	int result = 1, j;
>
>  	if (!revs->repo->objects->commit_graph)
>  		return -1;
> @@ -740,9 +755,11 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
>  		return -1;
>  	}
>
> -	result = bloom_filter_contains(filter,
> -				       revs->bloom_key,
> -				       revs->bloom_filter_settings);
> +	for (j = 0; result && j < revs->bloom_keys_nr; j++) {
> +		result = bloom_filter_contains(filter,
> +					       &revs->bloom_keys[j],
> +					       revs->bloom_filter_settings);
> +	}
>
>  	if (result)
>  		count_bloom_filter_maybe++;
> @@ -782,7 +799,7 @@ static int rev_compare_tree(struct rev_info *revs,
>  			return REV_TREE_SAME;
>  	}
>
> -	if (revs->bloom_key && !nth_parent) {
> +	if (revs->bloom_keys_nr && !nth_parent) {
>  		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
>
>  		if (bloom_ret == 0)
> diff --git a/revision.h b/revision.h
> index 7c026fe41fc..abbfb4ab59a 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -295,8 +295,10 @@ struct rev_info {
>  	struct topo_walk_info *topo_walk_info;
>
>  	/* Commit graph bloom filter fields */
> -	/* The bloom filter key for the pathspec */
> -	struct bloom_key *bloom_key;
> +	/* The bloom filter key(s) for the pathspec */
> +	struct bloom_key *bloom_keys;
> +	int bloom_keys_nr;
> +
>  	/*
>  	 * The bloom filter settings used to generate the key.
>  	 * This is loaded from the commit-graph being used.
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> index c7011f33e2c..c13b97d3bda 100755
> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -142,7 +142,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
>
>  test_bloom_filters_used_when_some_filters_are_missing () {
>  	log_args=$1
> -	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":8,\"definitely_not\":6"
> +	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":8"
>  	setup "$log_args" &&
>  	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
>  	test_cmp log_wo_bloom log_w_bloom
> --
> gitgitgadget
>
Thanks,
Taylor

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

On Fri, Jun 19, 2020 at 11:17:17AM -0600, Taylor Blau wrote:
> Hi Stolee,
>
> On Mon, Jun 15, 2020 at 08:14:50PM +0000, SZEDER Gábor via GitGitGadget wrote:
> > From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <[email protected]>
> >
> > The file 'dir/subdir/file' can only be modified if its leading
> > directories 'dir' and 'dir/subdir' are modified as well.
> >
> > So when checking modified path Bloom filters looking for commits
> > modifying a path with multiple path components, then check not only
> > the full path in the Bloom filters, but all its leading directories as
> > well.  Take care to check these paths in "deepest first" order,
> > because it's the full path that is least likely to be modified, and
> > the Bloom filter queries can short circuit sooner.
> >
> > This can significantly reduce the average false positive rate, by
> > about an order of magnitude or three(!), and can further speed up
> > pathspec-limited revision walks.  The table below compares the average
> > false positive rate and runtime of
> >
> >   git rev-list HEAD -- "$path"
> >
> > before and after this change for 5000+ randomly* selected paths from
> > each repository:
> >
> >                     Average false           Average        Average
> >                     positive rate           runtime        runtime
> >                   before     after     before     after   difference
> >   ------------------------------------------------------------------
> >   git             3.220%   0.7853%     0.0558s   0.0387s   -30.6%
> >   linux           2.453%   0.0296%     0.1046s   0.0766s   -26.8%
> >   tensorflow      2.536%   0.6977%     0.0594s   0.0420s   -29.2%
> >
> > *Path selection was done with the following pipeline:
> >
> > 	git ls-tree -r --name-only HEAD | sort -R | head -n 5000
> >
> > The improvements in runtime are much smaller than the improvements in
> > average false positive rate, as we are clearly reaching diminishing
> > returns here.  However, all these timings depend on that accessing
> > tree objects is reasonably fast (warm caches).  If we had a partial
> > clone and the tree objects had to be fetched from a promisor remote,
> > e.g.:
> >
> >   $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
> >   $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
> >         commit-graph write --reachable
> >   $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
> >   $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
> >         rev-list HEAD -- "$path"
> >
> > then checking all leading path component can reduce the runtime from
> > over an hour to a few seconds (and this is with the clone and the
> > promisor on the same machine).
> >
> > This adjusts the tracing values in t4216-log-bloom.sh, which provides a
> > concrete way to notice the improvement.
> >
> > Signed-off-by: SZEDER Gábor <[email protected]>
> > Signed-off-by: Derrick Stolee <[email protected]>
> > ---
> >  revision.c           | 35 ++++++++++++++++++++++++++---------
> >  revision.h           |  6 ++++--
> >  t/t4216-log-bloom.sh |  2 +-
> >  3 files changed, 31 insertions(+), 12 deletions(-)
> >
> > diff --git a/revision.c b/revision.c
> > index c644c660917..027ae3982b4 100644
> > --- a/revision.c
> > +++ b/revision.c
> > @@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
> >  {
> >  	struct pathspec_item *pi;
> >  	char *path_alloc = NULL;
> > -	const char *path;
> > +	const char *path, *p;
> >  	int last_index;
> > -	int len;
> > +	size_t len;
> > +	int path_component_nr = 0, j;
> >
> >  	if (!revs->commits)
> >  		return;
> > @@ -705,8 +706,22 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
> >
> >  	len = strlen(path);
> >
> > -	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
> > -	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
> > +	p = path;
> > +	do {
> > +		p = strchrnul(p + 1, '/');
> > +		path_component_nr++;
> > +	} while (p - path < len);
> > +
> > +	revs->bloom_keys_nr = path_component_nr;
> > +	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
> > +
> > +	p = path;
> > +	for (j = 0; j < revs->bloom_keys_nr; j++) {
> > +		p = strchrnul(p + 1, '/');
> > +
> > +		fill_bloom_key(path, p - path, &revs->bloom_keys[j],
> > +			       revs->bloom_filter_settings);
> > +	}
> >
>
> Somewhat related to our off-list discussion yesterday, there is a bug in
> both 2.27 and this patch which produces incorrect results when (1)
> Bloom filters are enabled, and (2) we are doing a revision walk from
> root with the pathspec '.'.
>
> What appears to be going on is that our normalization takes '.' -> '',
> and then we form a Bloom key based on the empty string, which will
> return 'definitely not' when querying the Bloom filter some of the time,
> which should never happen. This is a consequence of never inserting the
> empty key into the Bloom filter upon generation.
>
> As a result, I have patched this in GitHub's fork (which is currently
> based on 2.27 and doesn't have these patches yet) by doing an early
> return when 'strlen(path) == 0'. Since it looks like these patches are
> going to land, here is some clean-up and a fix for the bug that you
> should feel free to test with and apply on top:
>
> --- >8 ---
>
> diff --git a/revision.c b/revision.c
> index 8bd383b1dd..123e72698d 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -670,10 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>  {
>         struct pathspec_item *pi;
>         char *path_alloc = NULL;
> -       const char *path, *p;
> +       char *path, *p;
>         int last_index;
>         size_t len;
> -       int path_component_nr = 0, j;
> +       int path_component_nr = 1, j;
>
>         if (!revs->commits)
>                 return;
> @@ -698,29 +698,33 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>
>         /* remove single trailing slash from path, if needed */
>         if (pi->match[last_index] == '/') {
> -           path_alloc = xstrdup(pi->match);
> -           path_alloc[last_index] = '\0';
> -           path = path_alloc;
> -       } else
> -           path = pi->match;
> +               path_alloc = xstrdup(pi->match);
> +               path_alloc[last_index] = '\0';
> +               path = path_alloc;
> +       } else {
> +               path = pi->match;
> +               len = pi->len;
> +       }
>
> -       len = strlen(path);
> +       if (!len)
> +               return;

I should note that _this_ is the critical fix, and it should fix the bug
if you only applied just this hunk.

Everything else is purely style clean-ups on top (ranging from the four
spaces used instead of a tab, to some string processing niceties that I
_think_ should address Rene's concern, although I'm not sure if an
actual bug is lurking there or not...)

> -       p = path;
>         do {
> -               p = strchrnul(p + 1, '/');
> -               path_component_nr++;
> -       } while (p - path < len);
> +               if (is_dir_sep(*p)) {
> +                       *p = '\0';
> +                       path_component_nr++;
> +               }
> +       } while (*p++);
>
>         revs->bloom_keys_nr = path_component_nr;
>         ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
>
>         p = path;
>         for (j = 0; j < revs->bloom_keys_nr; j++) {
> -               p = strchrnul(p + 1, '/');
> -
> -               fill_bloom_key(path, p - path, &revs->bloom_keys[j],
> +               size_t plen = strlen(p);
> +               fill_bloom_key(p, plen, &revs->bloom_keys[j],
>                                revs->bloom_filter_settings);
> +               p += plen;
>         }
>
>         if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
>
> >  	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
> >  		atexit(trace2_bloom_filter_statistics_atexit);
> > @@ -720,7 +735,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
> >  						 struct commit *commit)
> >  {
> >  	struct bloom_filter *filter;
> > -	int result;
> > +	int result = 1, j;
> >
> >  	if (!revs->repo->objects->commit_graph)
> >  		return -1;
> > @@ -740,9 +755,11 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
> >  		return -1;
> >  	}
> >
> > -	result = bloom_filter_contains(filter,
> > -				       revs->bloom_key,
> > -				       revs->bloom_filter_settings);
> > +	for (j = 0; result && j < revs->bloom_keys_nr; j++) {
> > +		result = bloom_filter_contains(filter,
> > +					       &revs->bloom_keys[j],
> > +					       revs->bloom_filter_settings);
> > +	}
> >
> >  	if (result)
> >  		count_bloom_filter_maybe++;
> > @@ -782,7 +799,7 @@ static int rev_compare_tree(struct rev_info *revs,
> >  			return REV_TREE_SAME;
> >  	}
> >
> > -	if (revs->bloom_key && !nth_parent) {
> > +	if (revs->bloom_keys_nr && !nth_parent) {
> >  		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
> >
> >  		if (bloom_ret == 0)
> > diff --git a/revision.h b/revision.h
> > index 7c026fe41fc..abbfb4ab59a 100644
> > --- a/revision.h
> > +++ b/revision.h
> > @@ -295,8 +295,10 @@ struct rev_info {
> >  	struct topo_walk_info *topo_walk_info;
> >
> >  	/* Commit graph bloom filter fields */
> > -	/* The bloom filter key for the pathspec */
> > -	struct bloom_key *bloom_key;
> > +	/* The bloom filter key(s) for the pathspec */
> > +	struct bloom_key *bloom_keys;
> > +	int bloom_keys_nr;
> > +
> >  	/*
> >  	 * The bloom filter settings used to generate the key.
> >  	 * This is loaded from the commit-graph being used.
> > diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> > index c7011f33e2c..c13b97d3bda 100755
> > --- a/t/t4216-log-bloom.sh
> > +++ b/t/t4216-log-bloom.sh
> > @@ -142,7 +142,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
> >
> >  test_bloom_filters_used_when_some_filters_are_missing () {
> >  	log_args=$1
> > -	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":8,\"definitely_not\":6"
> > +	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":8"
> >  	setup "$log_args" &&
> >  	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
> >  	test_cmp log_wo_bloom log_w_bloom
> > --
> > gitgitgadget
> >
> Thanks,
> Taylor
Thanks,
Taylor

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 6/19/2020 1:17 PM, Taylor Blau wrote:
> Hi Stolee,
> 
> On Mon, Jun 15, 2020 at 08:14:50PM +0000, SZEDER Gábor via GitGitGadget wrote:
>> From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <[email protected]>
>>
>> The file 'dir/subdir/file' can only be modified if its leading
>> directories 'dir' and 'dir/subdir' are modified as well.
>>
>> So when checking modified path Bloom filters looking for commits
>> modifying a path with multiple path components, then check not only
>> the full path in the Bloom filters, but all its leading directories as
>> well.  Take care to check these paths in "deepest first" order,
>> because it's the full path that is least likely to be modified, and
>> the Bloom filter queries can short circuit sooner.
>>
>> This can significantly reduce the average false positive rate, by
>> about an order of magnitude or three(!), and can further speed up
>> pathspec-limited revision walks.  The table below compares the average
>> false positive rate and runtime of
>>
>>   git rev-list HEAD -- "$path"
>>
>> before and after this change for 5000+ randomly* selected paths from
>> each repository:
>>
>>                     Average false           Average        Average
>>                     positive rate           runtime        runtime
>>                   before     after     before     after   difference
>>   ------------------------------------------------------------------
>>   git             3.220%   0.7853%     0.0558s   0.0387s   -30.6%
>>   linux           2.453%   0.0296%     0.1046s   0.0766s   -26.8%
>>   tensorflow      2.536%   0.6977%     0.0594s   0.0420s   -29.2%
>>
>> *Path selection was done with the following pipeline:
>>
>> 	git ls-tree -r --name-only HEAD | sort -R | head -n 5000
>>
>> The improvements in runtime are much smaller than the improvements in
>> average false positive rate, as we are clearly reaching diminishing
>> returns here.  However, all these timings depend on that accessing
>> tree objects is reasonably fast (warm caches).  If we had a partial
>> clone and the tree objects had to be fetched from a promisor remote,
>> e.g.:
>>
>>   $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
>>   $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
>>         commit-graph write --reachable
>>   $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
>>   $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
>>         rev-list HEAD -- "$path"
>>
>> then checking all leading path component can reduce the runtime from
>> over an hour to a few seconds (and this is with the clone and the
>> promisor on the same machine).
>>
>> This adjusts the tracing values in t4216-log-bloom.sh, which provides a
>> concrete way to notice the improvement.
>>
>> Signed-off-by: SZEDER Gábor <[email protected]>
>> Signed-off-by: Derrick Stolee <[email protected]>
>> ---
>>  revision.c           | 35 ++++++++++++++++++++++++++---------
>>  revision.h           |  6 ++++--
>>  t/t4216-log-bloom.sh |  2 +-
>>  3 files changed, 31 insertions(+), 12 deletions(-)
>>
>> diff --git a/revision.c b/revision.c
>> index c644c660917..027ae3982b4 100644
>> --- a/revision.c
>> +++ b/revision.c
>> @@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>>  {
>>  	struct pathspec_item *pi;
>>  	char *path_alloc = NULL;
>> -	const char *path;
>> +	const char *path, *p;
>>  	int last_index;
>> -	int len;
>> +	size_t len;
>> +	int path_component_nr = 0, j;
>>
>>  	if (!revs->commits)
>>  		return;
>> @@ -705,8 +706,22 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>>
>>  	len = strlen(path);
>>
>> -	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
>> -	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
>> +	p = path;
>> +	do {
>> +		p = strchrnul(p + 1, '/');
>> +		path_component_nr++;
>> +	} while (p - path < len);
>> +
>> +	revs->bloom_keys_nr = path_component_nr;
>> +	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
>> +
>> +	p = path;
>> +	for (j = 0; j < revs->bloom_keys_nr; j++) {
>> +		p = strchrnul(p + 1, '/');
>> +
>> +		fill_bloom_key(path, p - path, &revs->bloom_keys[j],
>> +			       revs->bloom_filter_settings);
>> +	}
>>
> 
> Somewhat related to our off-list discussion yesterday, there is a bug in
> both 2.27 and this patch which produces incorrect results when (1)
> Bloom filters are enabled, and (2) we are doing a revision walk from
> root with the pathspec '.'.
> 
> What appears to be going on is that our normalization takes '.' -> '',
> and then we form a Bloom key based on the empty string, which will
> return 'definitely not' when querying the Bloom filter some of the time,
> which should never happen. This is a consequence of never inserting the
> empty key into the Bloom filter upon generation.
> 
> As a result, I have patched this in GitHub's fork (which is currently
> based on 2.27 and doesn't have these patches yet) by doing an early
> return when 'strlen(path) == 0'. Since it looks like these patches are
> going to land, here is some clean-up and a fix for the bug that you
> should feel free to test with and apply on top:
> 
> --- >8 ---
> 
> diff --git a/revision.c b/revision.c
> index 8bd383b1dd..123e72698d 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -670,10 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>  {
>         struct pathspec_item *pi;
>         char *path_alloc = NULL;
> -       const char *path, *p;
> +       char *path, *p;
>         int last_index;
>         size_t len;
> -       int path_component_nr = 0, j;
> +       int path_component_nr = 1, j;
> 
>         if (!revs->commits)
>                 return;
> @@ -698,29 +698,33 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
> 
>         /* remove single trailing slash from path, if needed */
>         if (pi->match[last_index] == '/') {
> -           path_alloc = xstrdup(pi->match);
> -           path_alloc[last_index] = '\0';
> -           path = path_alloc;
> -       } else
> -           path = pi->match;
> +               path_alloc = xstrdup(pi->match);
> +               path_alloc[last_index] = '\0';
> +               path = path_alloc;
> +       } else {
> +               path = pi->match;
> +               len = pi->len;
> +       }
> 
> -       len = strlen(path);
> +       if (!len)
> +               return;
> 
> -       p = path;
>         do {
> -               p = strchrnul(p + 1, '/');
> -               path_component_nr++;
> -       } while (p - path < len);
> +               if (is_dir_sep(*p)) {
> +                       *p = '\0';
> +                       path_component_nr++;
> +               }
> +       } while (*p++);
> 
>         revs->bloom_keys_nr = path_component_nr;
>         ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
> 
>         p = path;
>         for (j = 0; j < revs->bloom_keys_nr; j++) {
> -               p = strchrnul(p + 1, '/');
> -
> -               fill_bloom_key(path, p - path, &revs->bloom_keys[j],
> +               size_t plen = strlen(p);
> +               fill_bloom_key(p, plen, &revs->bloom_keys[j],
>                                revs->bloom_filter_settings);
> +               p += plen;

I don't think this is correct at all. Looking at it, it seems
that it would take a path "A/B/C" and add keys for "A", "B", and
"C" instead of "A", "A/B", and "A/B/C".

Looking more closely, there are a few issues that makes it clear
why you didn't see a failing test:

1. You use "while (*p++)" instead of "while (*++p)" so the scan
   terminates after the first directory split. (So only "A" is
   added, which won't fail, but will be slower than intended.)

Changing that, we see the next problem:

2. You use "p += plen" instead of "p += plen + 1". This causes
   the filters to add "A", "", and "" (because we don't skip the
   terminating character). This _is_ incorrect and would result
   in test failures.

Changing that, we then see "A", "B", and "C" are added as keys.

I'm going to take the style issues that you presented, and
change them in one commit (reported-by you) and then the
if (!len) fix in a separate patch.

I'll update the scan loops to use is_dir_sep() accordingly
in this patch.

Thanks,
-Stolee

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 19, 2020

This patch series was integrated into pu via git@97ed3c6.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 23, 2020

This branch is now known as ds/commit-graph-bloom-updates.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 23, 2020

This patch series was integrated into pu via git@fc5ffbe.

Place an instance of struct bloom_settings into the struct
write_commit_graph_context. This allows simplifying the function
prototype of write_graph_chunk_bloom_data(). This will allow us
to combine the function prototypes and use function pointers to
simplify write_commit_graph_file().

By using a pointer, we can later replace the settings to match those
that exist in the current commit-graph, in case a future Git version
allows customization of these parameters.

Reported-by: SZEDER Gábor <[email protected]>
Signed-off-by: Derrick Stolee <[email protected]>
43d3561 (commit-graph write: don't die if the existing graph is corrupt,
2019-03-25) introduced the GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD environment
variable. This was created to verify that commit-graph was not loaded
when writing a new non-incremental commit-graph.

An upcoming change wants to load a commit-graph in some valuable cases,
but we want to maintain that we don't trust the commit-graph data when
writing our new file. Instead of dying on load, instead die if we ever
try to parse a commit from the commit-graph. This functionally verifies
the same intended behavior, but allows a more advanced feature in the
next change.

Signed-off-by: Derrick Stolee <[email protected]>
@derrickstolee derrickstolee force-pushed the bloom-2 branch 2 times, most recently from 485e6bf to 1022c0a Compare June 23, 2020 17:45
@derrickstolee
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 23, 2020

Submitted as [email protected]

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 24, 2020

This patch series was integrated into seen via git@3af73ec.

@gitgitgadget gitgitgadget bot added the seen label Jun 24, 2020
@gitgitgadget
Copy link

gitgitgadget bot commented Jun 24, 2020

This patch series was integrated into seen via git@0812eda.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 24, 2020

This patch series was integrated into seen via git@1b5d3d8.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 13, 2020

This patch series was integrated into seen via git@bd95731.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 14, 2020

This patch series was integrated into seen via git@277bc3c.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 15, 2020

This patch series was integrated into seen via git@ee1d34d.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 16, 2020

This patch series was integrated into seen via git@2188883.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 16, 2020

This patch series was integrated into seen via git@bf511ce.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 17, 2020

This patch series was integrated into seen via git@57cb778.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 17, 2020

This patch series was integrated into seen via git@179c51a.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 18, 2020

This patch series was integrated into seen via git@4ff1624.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 19, 2020

This patch series was integrated into seen via git@8157d78.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 20, 2020

This patch series was integrated into seen via git@522b5b8.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 21, 2020

This patch series was integrated into seen via git@9bd7fbc.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 21, 2020

This patch series was integrated into seen via git@bd9b7d8.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 22, 2020

This patch series was integrated into seen via git@ceb72d1.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 22, 2020

This patch series was integrated into seen via git@a652433.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 24, 2020

This patch series was integrated into seen via git@bb7e490.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 24, 2020

This patch series was integrated into seen via git@097b424.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 26, 2020

This patch series was integrated into seen via git@d4c116d.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 27, 2020

This patch series was integrated into seen via git@2cf3d20.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 30, 2020

This patch series was integrated into seen via git@70cdbbe.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 30, 2020

This patch series was integrated into next via git@70cdbbe.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 30, 2020

This patch series was integrated into master via git@70cdbbe.

@gitgitgadget gitgitgadget bot added the master label Jul 30, 2020
@gitgitgadget gitgitgadget bot closed this Jul 30, 2020
@gitgitgadget
Copy link

gitgitgadget bot commented Jul 30, 2020

Closed via 70cdbbe.

@@ -60,7 +60,10 @@ existing commit-graph file.
With the `--changed-paths` option, compute and write information about the
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Wed, Jul 01, 2020 at 01:27:24PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <[email protected]>
> 
> The changed-path Bloom filters were released in v2.27.0, but have a
> significant drawback. A user can opt-in to writing the changed-path
> filters using the "--changed-paths" option to "git commit-graph write"
> but the next write will drop the filters unless that option is
> specified.
> 
> This becomes even more important when considering the interaction with
> gc.writeCommitGraph (on by default) or fetch.writeCommitGraph (part of
> features.experimental). These config options trigger commit-graph writes
> that the user did not signal, and hence there is no --changed-paths
> option available.
> 
> Allow a user that opts-in to the changed-path filters to persist the
> property of "my commit-graph has changed-path filters" automatically. A
> user can drop filters using the --no-changed-paths option.

The above parts of the commit message and the corresponding changes
are OK, but ...

> In the process, we need to be extremely careful to match the Bloom
> filter settings as specified by the commit-graph. This will allow future
> versions of Git to customize these settings, and the version with this
> change will persist those settings as commit-graphs are rewritten on
> top.

As pointed out in my original bug report [1], modified path Bloom
filters are computed with hardcoded settings in
bloom.c:get_bloom_filter().  Since this patch does not touch bloom.c
at all, it still computes Bloom filters with those hardcoded settings,
and, consequently, despite the commit message's claims, it does not
persist the settings in the existing commit-graph.

[1] https://public-inbox.org/git/[email protected]/

> Use the trace2 API to signal the settings used during the write, and
> check that output in a test after manually adjusting the correct bytes
> in the commit-graph file.

This test is insufficient, as it only checks what settings trace2
believes the Bloom filters are computed with, not what settings they
are actually computed with; that's why it succeeded while the bug
whose absence it was supposed to ensure was still there.

More robust tests should instead look at what actually gets written to
the commit-graph, and how that is interpreted during pathspec-limited
revision walks.

> Signed-off-by: Derrick Stolee <[email protected]>

A "Reported-by: me" trailer would have been appropriate here.

> ---
>  Documentation/git-commit-graph.txt |  5 +++-
>  builtin/commit-graph.c             |  5 +++-
>  commit-graph.c                     | 45 ++++++++++++++++++++++++++++--
>  commit-graph.h                     |  1 +
>  t/t4216-log-bloom.sh               | 17 ++++++++++-
>  5 files changed, 67 insertions(+), 6 deletions(-)

Anyway, this is now partially fixed in 9a7a9ed10d (bloom: use provided
'struct bloom_filter_settings', 2020-09-16), though, unfortunately,
its commit message is not quite clear on this.  Alas, that's only a
partial fix, because we still only look at the top level commit-graph
file for existing Bloom filter settings.  However, deeper commit-graph
layers can contain Bloom filters with non-default settings even when
the top level doesn't, and these failing tests below demonstrate:

  ---  >8  ---

#!/bin/sh

test_description='test'

. ./test-lib.sh

test_expect_success 'setup' '
	git commit --allow-empty -m "Bloom filters are written but ignored for root commits :(" &&
	for i in 1 2 3
	do
		echo $i >file &&
		git add file &&
		git commit -m "$i" || return 1
	done &&
	git log --oneline --no-decorate -- file >expect
'

test_expect_success 'split' '
	# Compute Bloom filters with "unusual" settings.
	git rev-parse HEAD^^ | GIT_TEST_BLOOM_SETTINGS_NUM_HASHES=3 git commit-graph write --stdin-commits --changed-paths --split &&
	# A commit-graph layer without Bloom filters "hides" the layers
	# below ...
	git rev-parse HEAD^ | git commit-graph write --stdin-commits --no-changed-paths --split=no-merge &&
	# ... so this does not look at existing Bloom filters and their
	# settings in the bottom commit-graph layer and computes new
	# Bloom filters using the default 7 hashes.
	git rev-parse HEAD | git commit-graph write --stdin-commits --changed-paths --split=no-merge &&

	# Just to make sure that there are as many graph layers as I
	# think there should be.
	test_line_count = 3 .git/objects/info/commit-graphs/commit-graph-chain &&

	# This checks Bloom filters using settings in the top layer,
	# thus misses commits modifying file in the bottom commit-graph
	# layer.
	git log --oneline --no-decorate -- file >actual &&
	test_cmp expect actual
'

test_expect_success 'merged' '
	# This merges all existing layers, and computes missing Bloom
	# filters with the settings in the top layer, without noticing
	# that filters in the bottom layer were computed with different
	# settings.
	git commit-graph write --reachable --changed-paths &&

	# Just to make sure...
	test_path_is_file .git/objects/info/commit-graph &&

	# This misses commits modifying file that were merged from the
	# bottom commit-graph layer.
	git log --oneline --no-decorate -- file >actual &&
	test_cmp expect actual
'

test_done

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

On Thu, Oct 15, 2020 at 03:21:47PM +0200, SZEDER Gábor wrote:
> As pointed out in my original bug report [1], modified path Bloom
> filters are computed with hardcoded settings in
> bloom.c:get_bloom_filter().  Since this patch does not touch bloom.c
> at all, it still computes Bloom filters with those hardcoded settings,
> and, consequently, despite the commit message's claims, it does not
> persist the settings in the existing commit-graph.
>
> [1] https://public-inbox.org/git/[email protected]/

Right. It is worth mentioning here (as you do below) that this was fixed
as of 9a7a9ed10d (bloom: use provided 'struct bloom_filter_settings',
2020-09-16).

> > Use the trace2 API to signal the settings used during the write, and
> > check that output in a test after manually adjusting the correct bytes
> > in the commit-graph file.
>
> This test is insufficient, as it only checks what settings trace2
> believes the Bloom filters are computed with, not what settings they
> are actually computed with; that's why it succeeded while the bug
> whose absence it was supposed to ensure was still there.
>
> More robust tests should instead look at what actually gets written to
> the commit-graph, and how that is interpreted during pathspec-limited
> revision walks.

Thanks for the test! That saved me a little bit of work trying to set up
the scenario you're describing in a reproducible way. I think that the
third test can be fixed relatively easily. The crux of the issue there
is that we are computing Bloom filters for commits, some of which
already had Bloom filters computed in an earlier layer, but with
different settings than the one that we're using to write the current
layer.

So, we need to be more aggressively checking the Bloom filter settings
in any layer we want to reuse a Bloom filter out of before reusing it
verbatim in the current layer. The patch below the scissors line is
sufficient to do that, and it causes the third test to start passing.

...But, teaching the revision machinery how to handle a walk through
commits in multiple commit-graph layers with incompatible Bloom filter
settings is trickier. Right now we compute all of the Bloom keys up
front using the Bloom filter settings in the top layer. I think we'd
probably want to change this to lazily compute those keys by:

  - Keeping around the contents of the Bloom keys, i.e., the pathspecs
    themselves.

  - Keeping a hashmap from Bloom filter settings to computed Bloom keys
    corresponding to those settings.

  - Lazily filling in that hashmap as we traverse through different
    commits.

At least, that's the best way that I can think to do it. It does get
kind of slow, though; we'd have to scan every commit graph layer at each
commit in the worst case to find the actual 'struct commit_graph *' in
order to get its Bloom filter settings. So, I think that's sort of
show-stoppingly slow, and that we should probably think more deeply
about it before taking up that direction.

Maybe Stolee has some better thoughts about how we can quickly go from a
commit to either (a) the commit-graph struct that that commit is stored
in, or (b) the Bloom filter settings belonging to that struct.

Thanks,
Taylor

--- >8 ---

Subject: [PATCH] bloom: recompute filters with incompatible settings
---
 bloom.c        | 21 +++++++++++++++++++--
 commit-graph.c |  4 ++--
 2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/bloom.c b/bloom.c
index 68c73200a5..30da128474 100644
--- a/bloom.c
+++ b/bloom.c
@@ -30,7 +30,8 @@ static inline unsigned char get_bitmask(uint32_t pos)

 static int load_bloom_filter_from_graph(struct commit_graph *g,
 					struct bloom_filter *filter,
-					struct commit *c)
+					struct commit *c,
+					const struct bloom_filter_settings *settings)
 {
 	uint32_t lex_pos, start_index, end_index;
 	uint32_t graph_pos = commit_graph_position(c);
@@ -42,6 +43,21 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
 	if (!g->chunk_bloom_indexes)
 		return 0;

+	if (settings) {
+		struct bloom_filter_settings *graph_settings = g->bloom_filter_settings;
+		/*
+		 * Check that the settings used to generate new Bloom filters is
+		 * compatible with the settings Bloom filters in this
+		 * commit-graph layer were generated with.
+		 */
+		if (settings->hash_version != graph_settings->hash_version)
+			return 0;
+		if (settings->num_hashes != graph_settings->num_hashes)
+			return 0;
+		if (settings->bits_per_entry != graph_settings->bits_per_entry)
+			return 0;
+	}
+
 	lex_pos = graph_pos - g->num_commits_in_base;

 	end_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos);
@@ -205,7 +221,8 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 	if (!filter->data) {
 		load_commit_graph_info(r, c);
 		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH)
-			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
+			load_bloom_filter_from_graph(r->objects->commit_graph,
+						     filter, c, settings);
 	}

 	if (filter->data && filter->len)
diff --git a/commit-graph.c b/commit-graph.c
index cb042bdba8..afe14af2a3 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1188,7 +1188,7 @@ static int write_graph_chunk_bloom_indexes(struct hashfile *f,
 	uint32_t cur_pos = 0;

 	while (list < last) {
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
+		struct bloom_filter *filter = get_or_compute_bloom_filter(ctx->r, *list, 0, ctx->bloom_settings, NULL);
 		size_t len = filter ? filter->len : 0;
 		cur_pos += len;
 		display_progress(ctx->progress, ++ctx->progress_cnt);
@@ -1228,7 +1228,7 @@ static int write_graph_chunk_bloom_data(struct hashfile *f,
 	hashwrite_be32(f, ctx->bloom_settings->bits_per_entry);

 	while (list < last) {
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
+		struct bloom_filter *filter = get_or_compute_bloom_filter(ctx->r, *list, 0, ctx->bloom_settings, NULL);
 		size_t len = filter ? filter->len : 0;

 		display_progress(ctx->progress, ++ctx->progress_cnt);
--
2.29.0.rc1.dirty

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 10/15/2020 5:41 PM, Taylor Blau wrote:
> So, we need to be more aggressively checking the Bloom filter settings
> in any layer we want to reuse a Bloom filter out of before reusing it
> verbatim in the current layer. The patch below the scissors line is
> sufficient to do that, and it causes the third test to start passing.

I think there are three things we should keep in mind:

1. Incompatible Bloom filter settings between layers should be seen
   as _inconsistent data_ as Git should not be writing incremental
   commit-graph files with inconsistent Bloom filter settings. Thus,
   when reading the commit-graph chain we should prevent incompatible
   filters from being used. One way to do this is to notice different
   settings and completely disable Bloom filters. The other way would
   be to take the settings from the first layer with filters and then
   clear the chunk_bloom_indexes and chunk_bloom_data fields for the
   layers that don't agree. This fits with an expectation that lower
   layers are larger, so more filters can be used in that situation.

2. We should be sure that Git will not agree to write incompatible
   settings between layers of a commit-graph chain. Even more, it
   should definitely not re-use filters when merging layers with
   incompatible filter values. The strategy above to ignore Bloom
   filters in incompatible upper layers is enough of a guard against
   the "merge layers" situation.

3. Allowing users (or future Git developers) to adjust the default
   Bloom filter settings is one that is good to do for future-proofing,
   but not one that I expect to be widely used (any gains here are
   minuscule compared to the results already achieved with the defaults).
   On top of that, including incompatible settings across layers is even
   less likely to be a real use case. We should not be straining to make
   the code even worse for this imaginary scenario.

With that said...
 
> ...But, teaching the revision machinery how to handle a walk through
> commits in multiple commit-graph layers with incompatible Bloom filter
> settings is trickier. Right now we compute all of the Bloom keys up
> front using the Bloom filter settings in the top layer. I think we'd
> probably want to change this to lazily compute those keys by:
It would probably be easiest to compute an array of bloom_key structs
(per path) where the index corresponds to the depth of the commit-graph
layer. You can then access the correct key after you have identified
which layer the commit is from.

> At least, that's the best way that I can think to do it. It does get
> kind of slow, though; we'd have to scan every commit graph layer at each
> commit in the worst case to find the actual 'struct commit_graph *' in
> order to get its Bloom filter settings. So, I think that's sort of
> show-stoppingly slow, and that we should probably think more deeply
> about it before taking up that direction.
> 
> Maybe Stolee has some better thoughts about how we can quickly go from a
> commit to either (a) the commit-graph struct that that commit is stored
> in, or (b) the Bloom filter settings belonging to that struct.

We already have code that finds a commit in a commit-graph layer
based on its integer position by iterating until num_commits_in_base
is below our position. The lexicographic position within that layer
is found by subtracting num_commits_in_base.

For us, we would simply need:

int get_commit_graph_layer(struct commit_graph *g, uint32_t pos)
{
	uint32_t layer_index = 0;

	while (g && pos < g->num_commits_in_base) {
		g = g->base_graph;
		layer_index++;
	}

	return layer_index;
}

You could then use the response from get_commit_graph_layer()
to load the correct Bloom key.

Again, I strongly suggest _not_ working hard to support this
case. We should only put in proper safeguards to prevent data
like this being written and protect a Git process that might
stumble across data in this shape.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

On Thu, Oct 15, 2020 at 10:18:47PM -0400, Derrick Stolee wrote:
> On 10/15/2020 5:41 PM, Taylor Blau wrote:
> > So, we need to be more aggressively checking the Bloom filter settings
> > in any layer we want to reuse a Bloom filter out of before reusing it
> > verbatim in the current layer. The patch below the scissors line is
> > sufficient to do that, and it causes the third test to start passing.
>
> I think there are three things we should keep in mind:
>
> 1. Incompatible Bloom filter settings between layers should be seen
>    as _inconsistent data_ as Git should not be writing incremental
>    commit-graph files with inconsistent Bloom filter settings. Thus,
>    when reading the commit-graph chain we should prevent incompatible
>    filters from being used. One way to do this is to notice different
>    settings and completely disable Bloom filters. The other way would
>    be to take the settings from the first layer with filters and then
>    clear the chunk_bloom_indexes and chunk_bloom_data fields for the
>    layers that don't agree. This fits with an expectation that lower
>    layers are larger, so more filters can be used in that situation.

Sure; I'd be fine with only allowing filters computed with the settings
present in the lowest or largest layer in the event that multiple layers
exist with incompatible settings.

I'm trying to point us towards a direction of not optimizing too far
along a direction that we're unlikely to take, while also trying to do
something relatively non-invasive to make it possible for a version of
Git to change the default Bloom settings. That is, if a user is writing
split commit-graphs, and we change the default Bloom settings, they
shouldn't have to recompute or merge down all of their Bloom filters.

If that's something that we never think is going to happen, I'm fine
with not thinking too hard about it. But, I also don't want to paint
ourselves into a corner, so I think something like the patch I wrote in
the email that you're replying to actually may be worth pursuing
further. I dunno. Definitely after 2.29, though.

> 2. We should be sure that Git will not agree to write incompatible
>    settings between layers of a commit-graph chain. Even more, it
>    should definitely not re-use filters when merging layers with
>    incompatible filter values. The strategy above to ignore Bloom
>    filters in incompatible upper layers is enough of a guard against
>    the "merge layers" situation.

Yeah, this would be fine with me.

> 3. Allowing users (or future Git developers) to adjust the default
>    Bloom filter settings is one that is good to do for future-proofing,
>    but not one that I expect to be widely used (any gains here are
>    minuscule compared to the results already achieved with the defaults).
>    On top of that, including incompatible settings across layers is even
>    less likely to be a real use case. We should not be straining to make
>    the code even worse for this imaginary scenario.

Right, I think we're pretty much in agreement here. Doing something easy
to make sure that we don't run into a wall seems to make sense, but I
think modifying the revision walk machinery to keep track of multiple
Bloom keys computed with different settings corresponding to the set of
Bloom filter settings in commit-graph layers is probably too far in that
direction.

For what it's worth, I was mainly talking about it to say that it would
be more effort than it's probably worth to do. There's also nothing that
we're currently discussing that would prevent us from taking that same
direction up in six months from now.

Thanks,
Taylor

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 10/15/2020 11:18 PM, Taylor Blau wrote:
> On Thu, Oct 15, 2020 at 10:18:47PM -0400, Derrick Stolee wrote:
>> On 10/15/2020 5:41 PM, Taylor Blau wrote:
>>> So, we need to be more aggressively checking the Bloom filter settings
>>> in any layer we want to reuse a Bloom filter out of before reusing it
>>> verbatim in the current layer. The patch below the scissors line is
>>> sufficient to do that, and it causes the third test to start passing.
>>
>> I think there are three things we should keep in mind:
>>
>> 1. Incompatible Bloom filter settings between layers should be seen
>>    as _inconsistent data_ as Git should not be writing incremental
>>    commit-graph files with inconsistent Bloom filter settings. Thus,
>>    when reading the commit-graph chain we should prevent incompatible
>>    filters from being used. One way to do this is to notice different
>>    settings and completely disable Bloom filters. The other way would
>>    be to take the settings from the first layer with filters and then
>>    clear the chunk_bloom_indexes and chunk_bloom_data fields for the
>>    layers that don't agree. This fits with an expectation that lower
>>    layers are larger, so more filters can be used in that situation.
> 
> Sure; I'd be fine with only allowing filters computed with the settings
> present in the lowest or largest layer in the event that multiple layers
> exist with incompatible settings.
> 
> I'm trying to point us towards a direction of not optimizing too far
> along a direction that we're unlikely to take, while also trying to do
> something relatively non-invasive to make it possible for a version of
> Git to change the default Bloom settings. That is, if a user is writing
> split commit-graphs, and we change the default Bloom settings, they
> shouldn't have to recompute or merge down all of their Bloom filters.

They would need to recompute when they merge layers, which introduces
a big question about how we should handle such a case.

> If that's something that we never think is going to happen, I'm fine
> with not thinking too hard about it. But, I also don't want to paint
> ourselves into a corner, so I think something like the patch I wrote in
> the email that you're replying to actually may be worth pursuing
> further. I dunno. Definitely after 2.29, though.

I think the proposed "react properly to this unlikely situation"
is a good way to prevent getting locked into our choices now. It
makes it possible for "old" Git versions (2.30 until we decide to
allow this mix) to interact with the inconsistent settings without
failure.

We don't need to do the 100% "optimal" case of using all filters
in order to enable this choice in the future.
 
[...]

> For what it's worth, I was mainly talking about it to say that it would
> be more effort than it's probably worth to do. There's also nothing that
> we're currently discussing that would prevent us from taking that same
> direction up in six months from now.

Yes, I just want to make sure that everyone agrees there is a
middle ground without saying that inconsistent filter settings
across layers is a "fully supported" feature. If someone wants
to tackle the work to make it a desirable state, then they can
try that (with great care).

Thanks,
-Stolee

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 15, 2020

User SZEDER Gábor <[email protected]> has been added to the cc: list.

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 15, 2020

User Taylor Blau <[email protected]> has been added to the cc: list.

@gitgitgadget
Copy link

gitgitgadget bot commented Oct 16, 2020

User Derrick Stolee <[email protected]> has been added to the cc: list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants