More commit-graph/Bloom filter improvements #659

derrickstolee · 2020-06-15T19:41:53Z

This builds on sg/commit-graph-cleanups, which took several patches from Szeder's series [1] and applied them almost directly to a more-recent version of Git [2].

[1] https://lore.kernel.org/git/[email protected]/
[2] https://lore.kernel.org/git/[email protected]/

This series adds a few extra improvements, several of which are rooted in Szeder's original series. I maintained his authorship and sign-off, even though the patches did not apply or cherry-pick at all.

(In v2, I have removed the range-diff comparison to Szeder's series, so look at the v1 cover letter for that.)

The patches have been significantly reordered. René pointed out (and Szeder discovered in the old thread) that we are not re-using the bloom_filter_settings from the existing commit-graph when writing a new one.

commit-graph: place bloom_settings in context
commit-graph: change test to die on parse, not load

These are mostly the same, except we now use a pointer to the settings in the commit-graph write context.

bloom: get_bloom_filter() cleanups

This new patch is a subtle change in behavior that will become relevant in the very next patch. In fact, if we swap patch 3 and 4, then t4216-log-bloom.sh fails with a segfault due to a NULL filter.

commit-graph: persist existence of changed-paths

This patch is now updated to use the existing changed-path filter settings.

commit-graph: unify the signatures of all write_graph_chunk_*() functions
commit-graph: simplify chunk writes into loop
commit-graph: check chunk sizes after writing

These are all the same as before.

revision.c: fix whitespace

This patch is the cleanup part of Taylor's patch.

revision: empty pathspecs should not use Bloom filters

Here is Taylor's fix for empty pathspecs.

commit-graph: check all leading directories in changed path Bloom filters
bloom: enforce a minimum size of 8 bytes

Finally, we get these performance patches. Patch 10 is updated to have the better logic around directory separators and empty paths. Also, the list of Bloom keys is ordered with the deepest path first. That has some tiny performance benefits for deep paths since we can short-circuit the multi-key checks more often. That code path is much faster than the tree parsing, so it is hard to measure any change.

Updates in V3:

Responded to René's feedback.
Fixed the test in Patch 4 to use GIT_TEST_ variables and extend the GIT_TRACE2 depth to work with 'seen' branch.

Update in V4;

Fixed the bug with "too large" commits. Test is added. The fixup! I sent earlier doesn't actually squash cleanly, so I resolved the conflicts during the rebase.

Thanks,
-Stolee

Cc: [email protected], [email protected], [email protected]
cc: SZEDER Gábor [email protected]
cc: Taylor Blau [email protected]
cc: Derrick Stolee [email protected]

derrickstolee · 2020-06-15T20:13:58Z

/submit

gitgitgadget · 2020-06-15T20:15:21Z

Submitted as [email protected]

gitgitgadget · 2020-06-17T21:23:33Z

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <[email protected]> writes:

> This builds on sg/commit-graph-cleanups,...

How ready is that topic, do you think?  I'd rather not to see too
many patches piled on top of what is not even in 'next', but I do
not remember it reviewed seriously (I did take a look or two at it
myself before queuing the series, but that does not quite count).

Will queue to extend the topic for now.

Thanks.

gitgitgadget · 2020-06-18T01:02:28Z

This branch is now known as sg/commit-graph-cleanups.

gitgitgadget · 2020-06-18T01:02:30Z

This patch series was integrated into pu via git@7ac860a.

gitgitgadget · 2020-06-18T01:57:03Z

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 6/17/2020 5:21 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <[email protected]> writes:
> 
>> This builds on sg/commit-graph-cleanups,...
> 
> How ready is that topic, do you think?  I'd rather not to see too
> many patches piled on top of what is not even in 'next', but I do
> not remember it reviewed seriously (I did take a look or two at it
> myself before queuing the series, but that does not quite count).

That topic was my attempt to apply the "easy and obvious" changes
from Szeder's proof-of-concept. In some sense, you could consider
them authored by Szeder and reviewed by me. But also, I did need to
tweak some things, so some review from others would be helpful.

> Will queue to extend the topic for now.

I'll refrain from pushing any more in this direction until more
review comes along. I found myself with a need to do something
productive and familiar, so I started doing commit-graph
performance stuff.

Thanks,
-Stolee

gitgitgadget · 2020-06-18T06:29:39Z

This patch series was integrated into pu via git@7fd559e.

gitgitgadget · 2020-06-18T20:31:27Z

commit-graph.c

@@ -882,6 +882,7 @@ struct write_commit_graph_context {



On the Git mailing list, René Scharfe wrote (reply to this):

Am 15.06.20 um 22:14 schrieb Derrick Stolee via GitGitGadget: > From: Derrick Stolee <[email protected]> > > Place an instance of struct bloom_settings into the struct > write_commit_graph_context. This allows simplifying the function > prototype of write_graph_chunk_bloom_data(). This will allow us > to combine the function prototypes and use function pointers to > simplify write_commit_graph_file(). > > Reported-by: SZEDER Gábor <[email protected]> > Signed-off-by: Derrick Stolee <[email protected]> > --- > commit-graph.c | 14 ++++++++------ > 1 file changed, 8 insertions(+), 6 deletions(-) > > diff --git a/commit-graph.c b/commit-graph.c > index 887837e8826..05b7035d8d5 100644 > --- a/commit-graph.c > +++ b/commit-graph.c > @@ -882,6 +882,7 @@ struct write_commit_graph_context { > > const struct split_commit_graph_opts *split_opts; > size_t total_bloom_filter_data_size; > + struct bloom_filter_settings bloom_settings; That structure is quite busy already, so adding one more member wouldn't matter much. Passing so many things to lots of functions makes it harder to argue about them, though, as all of them effectively become part of their signature, and you have to look at their implementation to see which pseudo-parameters they actually use. It's like a God object. > }; > > static void write_graph_chunk_fanout(struct hashfile *f, > @@ -1103,8 +1104,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f, > } > > static void write_graph_chunk_bloom_data(struct hashfile *f, > - struct write_commit_graph_context *ctx, > - const struct bloom_filter_settings *settings) > + struct write_commit_graph_context *ctx) > { > struct commit **list = ctx->commits.list; > struct commit **last = ctx->commits.list + ctx->commits.nr; > @@ -1116,9 +1116,9 @@ static void write_graph_chunk_bloom_data(struct hashfile *f, > _("Writing changed paths Bloom filters data"), > ctx->commits.nr); > > - hashwrite_be32(f, settings->hash_version); > - hashwrite_be32(f, settings->num_hashes); > - hashwrite_be32(f, settings->bits_per_entry); > + hashwrite_be32(f, ctx->bloom_settings.hash_version); > + hashwrite_be32(f, ctx->bloom_settings.num_hashes); > + hashwrite_be32(f, ctx->bloom_settings.bits_per_entry); > > while (list < last) { > struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0); > @@ -1541,6 +1541,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx) > struct object_id file_hash; > const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS; > > + ctx->bloom_settings = bloom_settings; So we use the defaults, no customization? Then you could simply move the declaration of bloom_settings from write_commit_graph_file() to write_graph_chunk_bloom_data(). Glancing at pu I don't see additional uses there, so no need to put it into the context (yet?). René

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 6/18/2020 4:30 PM, René Scharfe wrote: > Am 15.06.20 um 22:14 schrieb Derrick Stolee via GitGitGadget: >> From: Derrick Stolee <[email protected]> >> >> Place an instance of struct bloom_settings into the struct >> write_commit_graph_context. This allows simplifying the function >> prototype of write_graph_chunk_bloom_data(). This will allow us >> to combine the function prototypes and use function pointers to >> simplify write_commit_graph_file(). >> >> Reported-by: SZEDER Gábor <[email protected]> >> Signed-off-by: Derrick Stolee <[email protected]> >> --- >> commit-graph.c | 14 ++++++++------ >> 1 file changed, 8 insertions(+), 6 deletions(-) >> >> diff --git a/commit-graph.c b/commit-graph.c >> index 887837e8826..05b7035d8d5 100644 >> --- a/commit-graph.c >> +++ b/commit-graph.c >> @@ -882,6 +882,7 @@ struct write_commit_graph_context { >> >> const struct split_commit_graph_opts *split_opts; >> size_t total_bloom_filter_data_size; >> + struct bloom_filter_settings bloom_settings; > > That structure is quite busy already, so adding one more member wouldn't > matter much. > > Passing so many things to lots of functions makes it harder to argue > about them, though, as all of them effectively become part of their > signature, and you have to look at their implementation to see which > pseudo-parameters they actually use. It's like a God object. Correct. The write_commit_graph_context _is_ a God object for the commit-graph write. The good news is that it is limited only to commit-graph.c and the write operations therein. Hopefully, the code organization benefits enough from this structure to justify the massive struct. In contrast, it's still smaller and more contained than "struct rev_info"! >> }; >> >> static void write_graph_chunk_fanout(struct hashfile *f, >> @@ -1103,8 +1104,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f, >> } >> >> static void write_graph_chunk_bloom_data(struct hashfile *f, >> - struct write_commit_graph_context *ctx, >> - const struct bloom_filter_settings *settings) >> + struct write_commit_graph_context *ctx) >> { >> struct commit **list = ctx->commits.list; >> struct commit **last = ctx->commits.list + ctx->commits.nr; >> @@ -1116,9 +1116,9 @@ static void write_graph_chunk_bloom_data(struct hashfile *f, >> _("Writing changed paths Bloom filters data"), >> ctx->commits.nr); >> >> - hashwrite_be32(f, settings->hash_version); >> - hashwrite_be32(f, settings->num_hashes); >> - hashwrite_be32(f, settings->bits_per_entry); >> + hashwrite_be32(f, ctx->bloom_settings.hash_version); >> + hashwrite_be32(f, ctx->bloom_settings.num_hashes); >> + hashwrite_be32(f, ctx->bloom_settings.bits_per_entry); >> >> while (list < last) { >> struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0); >> @@ -1541,6 +1541,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx) >> struct object_id file_hash; >> const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS; >> >> + ctx->bloom_settings = bloom_settings; > > So we use the defaults, no customization? Then you could simply move > the declaration of bloom_settings from write_commit_graph_file() to > write_graph_chunk_bloom_data(). Glancing at pu I don't see additional > uses there, so no need to put it into the context (yet?). It certainly is not customized by a user (yet). However, you do make an excellent point that I need to be more careful here! Patch 8 (commit-graph: persist existence of changed-paths) needs to load the bloom_filter_settings from the existing commit-graph so we can be future-proof from a future version customizing the settings inside the commit-graph file! This means that in v2 I'll move patches 7 & 8 to be after patch 1 and add a test to verify the filter settings are preserved (after manually changing the data in the file). Thanks! -Stolee

gitgitgadget · 2020-06-18T20:31:29Z

commit-graph.c

@@ -882,10 +882,11 @@ struct write_commit_graph_context {



On the Git mailing list, René Scharfe wrote (reply to this):

Am 15.06.20 um 22:14 schrieb SZEDER Gábor via GitGitGadget: > -static void write_graph_chunk_oids(struct hashfile *f, int hash_len, > - struct write_commit_graph_context *ctx) > +static int write_graph_chunk_oids(struct hashfile *f, > + struct write_commit_graph_context *ctx) > { > struct commit **list = ctx->commits.list; > int count; > for (count = 0; count < ctx->commits.nr; count++, list++) { > display_progress(ctx->progress, ++ctx->progress_cnt); > - hashwrite(f, (*list)->object.oid.hash, (int)hash_len); > + hashwrite(f, (*list)->object.oid.hash, (int)the_hash_algo->rawsz); Before the cast was forcing an int into an int (huh?), now it forces a size_t into an int, but hashwrite() expects an unsigned int. Do we really need that cast? > } > + > + return 0; > } > > static const unsigned char *commit_to_sha1(size_t index, void *table) > @@ -926,8 +930,8 @@ static const unsigned char *commit_to_sha1(size_t index, void *table) > return commits[index]->object.oid.hash; > } > > -static void write_graph_chunk_data(struct hashfile *f, int hash_len, > - struct write_commit_graph_context *ctx) > +static int write_graph_chunk_data(struct hashfile *f, > + struct write_commit_graph_context *ctx) > { > struct commit **list = ctx->commits.list; > struct commit **last = ctx->commits.list + ctx->commits.nr; > @@ -944,7 +948,7 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len, > die(_("unable to parse commit %s"), > oid_to_hex(&(*list)->object.oid)); > tree = get_commit_tree_oid(*list); > - hashwrite(f, tree->hash, hash_len); > + hashwrite(f, tree->hash, the_hash_algo->rawsz); ... and here's the answer: No, we don't need to cast. René

gitgitgadget · 2020-06-18T20:31:30Z

commit-graph.c

@@ -882,10 +882,11 @@ struct write_commit_graph_context {



On the Git mailing list, René Scharfe wrote (reply to this):

Am 15.06.20 um 22:14 schrieb SZEDER Gábor via GitGitGadget: > From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <[email protected]> > > In write_commit_graph_file() we now have one block of code filling the > array of 'struct chunk_info' with the IDs and sizes of chunks to be > written, and an other block of code calling the functions responsible > for writing individual chunks. In case of optional chunks like Extra > Edge List an Base Graphs List there is also a condition checking > whether that chunk is necessary/desired, and that same condition is > repeated in both blocks of code. Other, newer chunks have similar > optional conditions. > > Eliminate these repeated conditions by storing the function pointers > responsible for writing individual chunks in the 'struct chunk_info' > array as well, and calling them in a loop to write the commit-graph > file. This will open up the possibility for a bit of foolproofing in > the following patch. OK. An alternative would be a switch in the loop that calls the right function based on the chunk id. That would not require uniform interfaces for all write functions; patch 2 would not be necessary. > > Signed-off-by: SZEDER Gábor <[email protected]> > Signed-off-by: Derrick Stolee <[email protected]> > --- > commit-graph.c | 31 +++++++++++++++++++------------ > 1 file changed, 19 insertions(+), 12 deletions(-) > > diff --git a/commit-graph.c b/commit-graph.c > index 3bae1e52ed0..78e023be664 100644 > --- a/commit-graph.c > +++ b/commit-graph.c > @@ -1532,9 +1532,13 @@ static int write_graph_chunk_base(struct hashfile *f, > return 0; > } > > +typedef int (*chunk_write_fn)(struct hashfile *f, > + struct write_commit_graph_context *ctx); > + > struct chunk_info { > uint32_t id; > uint64_t size; > + chunk_write_fn write_fn; > }; > > static int write_commit_graph_file(struct write_commit_graph_context *ctx) > @@ -1591,27 +1595,34 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx) > > chunks[0].id = GRAPH_CHUNKID_OIDFANOUT; > chunks[0].size = GRAPH_FANOUT_SIZE; > + chunks[0].write_fn = write_graph_chunk_fanout; > chunks[1].id = GRAPH_CHUNKID_OIDLOOKUP; > chunks[1].size = hashsz * ctx->commits.nr; > + chunks[1].write_fn = write_graph_chunk_oids; > chunks[2].id = GRAPH_CHUNKID_DATA; > chunks[2].size = (hashsz + 16) * ctx->commits.nr; > + chunks[2].write_fn = write_graph_chunk_data; > if (ctx->num_extra_edges) { > chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES; > chunks[num_chunks].size = 4 * ctx->num_extra_edges; > + chunks[num_chunks].write_fn = write_graph_chunk_extra_edges; > num_chunks++; > } > if (ctx->changed_paths) { > chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMINDEXES; > chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr; > + chunks[num_chunks].write_fn = write_graph_chunk_bloom_indexes; > num_chunks++; > chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMDATA; > chunks[num_chunks].size = sizeof(uint32_t) * 3 > + ctx->total_bloom_filter_data_size; > + chunks[num_chunks].write_fn = write_graph_chunk_bloom_data; > num_chunks++; > } > if (ctx->num_commit_graphs_after > 1) { > chunks[num_chunks].id = GRAPH_CHUNKID_BASE; > chunks[num_chunks].size = hashsz * (ctx->num_commit_graphs_after - 1); > + chunks[num_chunks].write_fn = write_graph_chunk_base; > num_chunks++; > } > > @@ -1647,19 +1658,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx) > progress_title.buf, > num_chunks * ctx->commits.nr); > } > - write_graph_chunk_fanout(f, ctx); > - write_graph_chunk_oids(f, ctx); > - write_graph_chunk_data(f, ctx); > - if (ctx->num_extra_edges) > - write_graph_chunk_extra_edges(f, ctx); > - if (ctx->changed_paths) { > - write_graph_chunk_bloom_indexes(f, ctx); > - write_graph_chunk_bloom_data(f, ctx); > - } > - if (ctx->num_commit_graphs_after > 1 && > - write_graph_chunk_base(f, ctx)) { > - return -1; > + > + for (i = 0; i < num_chunks; i++) { > + if (chunks[i].write_fn(f, ctx)) { > + error(_("failed writing chunk with id %"PRIx32""), > + chunks[i].id); This error message is new and not mentioned in the commit message. write_graph_chunk_base() seems to be the only write function that can return something else than 0, and it already reports an error in that case. So do we really want the one here as well? René

gitgitgadget · 2020-06-18T20:34:27Z

revision.c

@@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 {


On the Git mailing list, René Scharfe wrote (reply to this):

Am 15.06.20 um 22:14 schrieb SZEDER Gábor via GitGitGadget: > From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <[email protected]> > > The file 'dir/subdir/file' can only be modified if its leading > directories 'dir' and 'dir/subdir' are modified as well. > > So when checking modified path Bloom filters looking for commits > modifying a path with multiple path components, then check not only > the full path in the Bloom filters, but all its leading directories as > well. Take care to check these paths in "deepest first" order, > because it's the full path that is least likely to be modified, and > the Bloom filter queries can short circuit sooner. > > This can significantly reduce the average false positive rate, by > about an order of magnitude or three(!), and can further speed up > pathspec-limited revision walks. The table below compares the average > false positive rate and runtime of > > git rev-list HEAD -- "$path" > > before and after this change for 5000+ randomly* selected paths from > each repository: > > Average false Average Average > positive rate runtime runtime > before after before after difference > ------------------------------------------------------------------ > git 3.220% 0.7853% 0.0558s 0.0387s -30.6% > linux 2.453% 0.0296% 0.1046s 0.0766s -26.8% > tensorflow 2.536% 0.6977% 0.0594s 0.0420s -29.2% Nice! > > *Path selection was done with the following pipeline: > > git ls-tree -r --name-only HEAD | sort -R | head -n 5000 > > The improvements in runtime are much smaller than the improvements in > average false positive rate, as we are clearly reaching diminishing > returns here. However, all these timings depend on that accessing > tree objects is reasonably fast (warm caches). If we had a partial > clone and the tree objects had to be fetched from a promisor remote, > e.g.: > > $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git > $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \ > commit-graph write --reachable > $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/ > $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \ > rev-list HEAD -- "$path" > > then checking all leading path component can reduce the runtime from > over an hour to a few seconds (and this is with the clone and the > promisor on the same machine). > > This adjusts the tracing values in t4216-log-bloom.sh, which provides a > concrete way to notice the improvement. > > Signed-off-by: SZEDER Gábor <[email protected]> > Signed-off-by: Derrick Stolee <[email protected]> > --- > revision.c | 35 ++++++++++++++++++++++++++--------- > revision.h | 6 ++++-- > t/t4216-log-bloom.sh | 2 +- > 3 files changed, 31 insertions(+), 12 deletions(-) > > diff --git a/revision.c b/revision.c > index c644c660917..027ae3982b4 100644 > --- a/revision.c > +++ b/revision.c > @@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) > { > struct pathspec_item *pi; > char *path_alloc = NULL; > - const char *path; > + const char *path, *p; > int last_index; > - int len; > + size_t len; > + int path_component_nr = 0, j; > > if (!revs->commits) > return; > @@ -705,8 +706,22 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) > > len = strlen(path); > > - revs->bloom_key = xmalloc(sizeof(struct bloom_key)); > - fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings); > + p = path; > + do { > + p = strchrnul(p + 1, '/'); > + path_component_nr++; > + } while (p - path < len); Hmm, that "+ 1" makes me a bit nervous. Can we be sure that path is not an empty string? And shouldn't we use is_dir_sep() or find_last_dir_sep() instead of hard-coding a slash? > + > + revs->bloom_keys_nr = path_component_nr; > + ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr); > + > + p = path; > + for (j = 0; j < revs->bloom_keys_nr; j++) { > + p = strchrnul(p + 1, '/'); Same here, of course. Also note that this puts shorter sub-strings first. > + > + fill_bloom_key(path, p - path, &revs->bloom_keys[j], > + revs->bloom_filter_settings); > + } > > if (trace2_is_enabled() && !bloom_filter_atexit_registered) { > atexit(trace2_bloom_filter_statistics_atexit); > @@ -720,7 +735,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs, > struct commit *commit) > { > struct bloom_filter *filter; > - int result; > + int result = 1, j; > > if (!revs->repo->objects->commit_graph) > return -1; > @@ -740,9 +755,11 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs, > return -1; > } > > - result = bloom_filter_contains(filter, > - revs->bloom_key, > - revs->bloom_filter_settings); > + for (j = 0; result && j < revs->bloom_keys_nr; j++) { > + result = bloom_filter_contains(filter, > + &revs->bloom_keys[j], > + revs->bloom_filter_settings); > + } This checks shorter sub-strings first, contradicting the "deepest first" strategy mentioned in the commit message. This can easily be fixed by inverting the traversal of one of the loops, of course. Or perhaps do something like this? revs->bloom_keys = NULL; revs->bloom_keys_nr = 0; strbuf_add(&path, pi->match, pi->len); strbuf_trim_trailing_dir_sep(&path); for (;;) { const char *sep; ALLOC_GROW(revs->bloom_keys, revs->bloom_keys_nr + 1, alloc); fill_bloom_key(path.buf, path.len, &revs->bloom_keys[revs->bloom_keys_nr++], revs->bloom_filter_settings); sep = find_last_dir_sep(path.buf); if (!sep) break; strbuf_setlen(&path, sep - path.buf); } strbuf_release(&path); The find_last_dir_sep() calls scan the first part of the string over and over, which is a bit silly. A strbuf_trim_trailing_path_component() could start at the end and scan backwards if that turns out to be an actual problem. ALLOC_GROW wastes memory on revs->bloom_keys, and reallocating instead of allocating the right size from the start has a cost as well, but I'd expect this to be dwarfed by the actual revision walk. > > if (result) > count_bloom_filter_maybe++; > @@ -782,7 +799,7 @@ static int rev_compare_tree(struct rev_info *revs, > return REV_TREE_SAME; > } > > - if (revs->bloom_key && !nth_parent) { > + if (revs->bloom_keys_nr && !nth_parent) { > bloom_ret = check_maybe_different_in_bloom_filter(revs, commit); > > if (bloom_ret == 0) > diff --git a/revision.h b/revision.h > index 7c026fe41fc..abbfb4ab59a 100644 > --- a/revision.h > +++ b/revision.h > @@ -295,8 +295,10 @@ struct rev_info { > struct topo_walk_info *topo_walk_info; > > /* Commit graph bloom filter fields */ > - /* The bloom filter key for the pathspec */ > - struct bloom_key *bloom_key; > + /* The bloom filter key(s) for the pathspec */ > + struct bloom_key *bloom_keys; > + int bloom_keys_nr; > + > /* > * The bloom filter settings used to generate the key. > * This is loaded from the commit-graph being used. > diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh > index c7011f33e2c..c13b97d3bda 100755 > --- a/t/t4216-log-bloom.sh > +++ b/t/t4216-log-bloom.sh > @@ -142,7 +142,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' ' > > test_bloom_filters_used_when_some_filters_are_missing () { > log_args=$1 > - bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":8,\"definitely_not\":6" > + bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":8" > setup "$log_args" && > grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" && > test_cmp log_wo_bloom log_w_bloom >

On the Git mailing list, René Scharfe wrote (reply to this):

Am 18.06.20 um 22:31 schrieb René Scharfe: > Am 15.06.20 um 22:14 schrieb SZEDER Gábor via GitGitGadget: >> --- a/revision.c >> +++ b/revision.c >> @@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) >> { >> struct pathspec_item *pi; >> char *path_alloc = NULL; >> - const char *path; >> + const char *path, *p; >> int last_index; >> - int len; >> + size_t len; >> + int path_component_nr = 0, j; >> >> if (!revs->commits) >> return; >> @@ -705,8 +706,22 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) >> >> len = strlen(path); >> >> - revs->bloom_key = xmalloc(sizeof(struct bloom_key)); >> - fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings); >> + p = path; >> + do { >> + p = strchrnul(p + 1, '/'); >> + path_component_nr++; >> + } while (p - path < len); > And shouldn't we use is_dir_sep() or find_last_dir_sep() instead of > hard-coding a slash? Not necessarily. Paths should be normalized to use one specific separator, probably slash, both when building and querying the Bloom filter. Otherwise a filter that knows e.g. "foo/bar" could confidently claim that "foo\bar" does not match. If this is done in a previous step then using a literal slash here would be correct. René

gitgitgadget · 2020-06-18T23:45:42Z

This patch series was integrated into pu via git@6c75447.

gitgitgadget · 2020-06-19T06:14:25Z

This patch series was integrated into pu via git@518b032.

gitgitgadget · 2020-06-19T17:18:45Z

revision.c

@@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
 {


On the Git mailing list, Taylor Blau wrote (reply to this):

Hi Stolee, On Mon, Jun 15, 2020 at 08:14:50PM +0000, SZEDER Gábor via GitGitGadget wrote: > From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <[email protected]> > > The file 'dir/subdir/file' can only be modified if its leading > directories 'dir' and 'dir/subdir' are modified as well. > > So when checking modified path Bloom filters looking for commits > modifying a path with multiple path components, then check not only > the full path in the Bloom filters, but all its leading directories as > well. Take care to check these paths in "deepest first" order, > because it's the full path that is least likely to be modified, and > the Bloom filter queries can short circuit sooner. > > This can significantly reduce the average false positive rate, by > about an order of magnitude or three(!), and can further speed up > pathspec-limited revision walks. The table below compares the average > false positive rate and runtime of > > git rev-list HEAD -- "$path" > > before and after this change for 5000+ randomly* selected paths from > each repository: > > Average false Average Average > positive rate runtime runtime > before after before after difference > ------------------------------------------------------------------ > git 3.220% 0.7853% 0.0558s 0.0387s -30.6% > linux 2.453% 0.0296% 0.1046s 0.0766s -26.8% > tensorflow 2.536% 0.6977% 0.0594s 0.0420s -29.2% > > *Path selection was done with the following pipeline: > > git ls-tree -r --name-only HEAD | sort -R | head -n 5000 > > The improvements in runtime are much smaller than the improvements in > average false positive rate, as we are clearly reaching diminishing > returns here. However, all these timings depend on that accessing > tree objects is reasonably fast (warm caches). If we had a partial > clone and the tree objects had to be fetched from a promisor remote, > e.g.: > > $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git > $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \ > commit-graph write --reachable > $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/ > $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \ > rev-list HEAD -- "$path" > > then checking all leading path component can reduce the runtime from > over an hour to a few seconds (and this is with the clone and the > promisor on the same machine). > > This adjusts the tracing values in t4216-log-bloom.sh, which provides a > concrete way to notice the improvement. > > Signed-off-by: SZEDER Gábor <[email protected]> > Signed-off-by: Derrick Stolee <[email protected]> > --- > revision.c | 35 ++++++++++++++++++++++++++--------- > revision.h | 6 ++++-- > t/t4216-log-bloom.sh | 2 +- > 3 files changed, 31 insertions(+), 12 deletions(-) > > diff --git a/revision.c b/revision.c > index c644c660917..027ae3982b4 100644 > --- a/revision.c > +++ b/revision.c > @@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) > { > struct pathspec_item *pi; > char *path_alloc = NULL; > - const char *path; > + const char *path, *p; > int last_index; > - int len; > + size_t len; > + int path_component_nr = 0, j; > > if (!revs->commits) > return; > @@ -705,8 +706,22 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) > > len = strlen(path); > > - revs->bloom_key = xmalloc(sizeof(struct bloom_key)); > - fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings); > + p = path; > + do { > + p = strchrnul(p + 1, '/'); > + path_component_nr++; > + } while (p - path < len); > + > + revs->bloom_keys_nr = path_component_nr; > + ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr); > + > + p = path; > + for (j = 0; j < revs->bloom_keys_nr; j++) { > + p = strchrnul(p + 1, '/'); > + > + fill_bloom_key(path, p - path, &revs->bloom_keys[j], > + revs->bloom_filter_settings); > + } > Somewhat related to our off-list discussion yesterday, there is a bug in both 2.27 and this patch which produces incorrect results when (1) Bloom filters are enabled, and (2) we are doing a revision walk from root with the pathspec '.'. What appears to be going on is that our normalization takes '.' -> '', and then we form a Bloom key based on the empty string, which will return 'definitely not' when querying the Bloom filter some of the time, which should never happen. This is a consequence of never inserting the empty key into the Bloom filter upon generation. As a result, I have patched this in GitHub's fork (which is currently based on 2.27 and doesn't have these patches yet) by doing an early return when 'strlen(path) == 0'. Since it looks like these patches are going to land, here is some clean-up and a fix for the bug that you should feel free to test with and apply on top: --- >8 --- diff --git a/revision.c b/revision.c index 8bd383b1dd..123e72698d 100644 --- a/revision.c +++ b/revision.c @@ -670,10 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) { struct pathspec_item *pi; char *path_alloc = NULL; - const char *path, *p; + char *path, *p; int last_index; size_t len; - int path_component_nr = 0, j; + int path_component_nr = 1, j; if (!revs->commits) return; @@ -698,29 +698,33 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) /* remove single trailing slash from path, if needed */ if (pi->match[last_index] == '/') { - path_alloc = xstrdup(pi->match); - path_alloc[last_index] = '\0'; - path = path_alloc; - } else - path = pi->match; + path_alloc = xstrdup(pi->match); + path_alloc[last_index] = '\0'; + path = path_alloc; + } else { + path = pi->match; + len = pi->len; + } - len = strlen(path); + if (!len) + return; - p = path; do { - p = strchrnul(p + 1, '/'); - path_component_nr++; - } while (p - path < len); + if (is_dir_sep(*p)) { + *p = '\0'; + path_component_nr++; + } + } while (*p++); revs->bloom_keys_nr = path_component_nr; ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr); p = path; for (j = 0; j < revs->bloom_keys_nr; j++) { - p = strchrnul(p + 1, '/'); - - fill_bloom_key(path, p - path, &revs->bloom_keys[j], + size_t plen = strlen(p); + fill_bloom_key(p, plen, &revs->bloom_keys[j], revs->bloom_filter_settings); + p += plen; } if (trace2_is_enabled() && !bloom_filter_atexit_registered) { > if (trace2_is_enabled() && !bloom_filter_atexit_registered) { > atexit(trace2_bloom_filter_statistics_atexit); > @@ -720,7 +735,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs, > struct commit *commit) > { > struct bloom_filter *filter; > - int result; > + int result = 1, j; > > if (!revs->repo->objects->commit_graph) > return -1; > @@ -740,9 +755,11 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs, > return -1; > } > > - result = bloom_filter_contains(filter, > - revs->bloom_key, > - revs->bloom_filter_settings); > + for (j = 0; result && j < revs->bloom_keys_nr; j++) { > + result = bloom_filter_contains(filter, > + &revs->bloom_keys[j], > + revs->bloom_filter_settings); > + } > > if (result) > count_bloom_filter_maybe++; > @@ -782,7 +799,7 @@ static int rev_compare_tree(struct rev_info *revs, > return REV_TREE_SAME; > } > > - if (revs->bloom_key && !nth_parent) { > + if (revs->bloom_keys_nr && !nth_parent) { > bloom_ret = check_maybe_different_in_bloom_filter(revs, commit); > > if (bloom_ret == 0) > diff --git a/revision.h b/revision.h > index 7c026fe41fc..abbfb4ab59a 100644 > --- a/revision.h > +++ b/revision.h > @@ -295,8 +295,10 @@ struct rev_info { > struct topo_walk_info *topo_walk_info; > > /* Commit graph bloom filter fields */ > - /* The bloom filter key for the pathspec */ > - struct bloom_key *bloom_key; > + /* The bloom filter key(s) for the pathspec */ > + struct bloom_key *bloom_keys; > + int bloom_keys_nr; > + > /* > * The bloom filter settings used to generate the key. > * This is loaded from the commit-graph being used. > diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh > index c7011f33e2c..c13b97d3bda 100755 > --- a/t/t4216-log-bloom.sh > +++ b/t/t4216-log-bloom.sh > @@ -142,7 +142,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' ' > > test_bloom_filters_used_when_some_filters_are_missing () { > log_args=$1 > - bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":8,\"definitely_not\":6" > + bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":8" > setup "$log_args" && > grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" && > test_cmp log_wo_bloom log_w_bloom > -- > gitgitgadget > Thanks, Taylor

On the Git mailing list, Taylor Blau wrote (reply to this):

On Fri, Jun 19, 2020 at 11:17:17AM -0600, Taylor Blau wrote: > Hi Stolee, > > On Mon, Jun 15, 2020 at 08:14:50PM +0000, SZEDER Gábor via GitGitGadget wrote: > > From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <[email protected]> > > > > The file 'dir/subdir/file' can only be modified if its leading > > directories 'dir' and 'dir/subdir' are modified as well. > > > > So when checking modified path Bloom filters looking for commits > > modifying a path with multiple path components, then check not only > > the full path in the Bloom filters, but all its leading directories as > > well. Take care to check these paths in "deepest first" order, > > because it's the full path that is least likely to be modified, and > > the Bloom filter queries can short circuit sooner. > > > > This can significantly reduce the average false positive rate, by > > about an order of magnitude or three(!), and can further speed up > > pathspec-limited revision walks. The table below compares the average > > false positive rate and runtime of > > > > git rev-list HEAD -- "$path" > > > > before and after this change for 5000+ randomly* selected paths from > > each repository: > > > > Average false Average Average > > positive rate runtime runtime > > before after before after difference > > ------------------------------------------------------------------ > > git 3.220% 0.7853% 0.0558s 0.0387s -30.6% > > linux 2.453% 0.0296% 0.1046s 0.0766s -26.8% > > tensorflow 2.536% 0.6977% 0.0594s 0.0420s -29.2% > > > > *Path selection was done with the following pipeline: > > > > git ls-tree -r --name-only HEAD | sort -R | head -n 5000 > > > > The improvements in runtime are much smaller than the improvements in > > average false positive rate, as we are clearly reaching diminishing > > returns here. However, all these timings depend on that accessing > > tree objects is reasonably fast (warm caches). If we had a partial > > clone and the tree objects had to be fetched from a promisor remote, > > e.g.: > > > > $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git > > $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \ > > commit-graph write --reachable > > $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/ > > $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \ > > rev-list HEAD -- "$path" > > > > then checking all leading path component can reduce the runtime from > > over an hour to a few seconds (and this is with the clone and the > > promisor on the same machine). > > > > This adjusts the tracing values in t4216-log-bloom.sh, which provides a > > concrete way to notice the improvement. > > > > Signed-off-by: SZEDER Gábor <[email protected]> > > Signed-off-by: Derrick Stolee <[email protected]> > > --- > > revision.c | 35 ++++++++++++++++++++++++++--------- > > revision.h | 6 ++++-- > > t/t4216-log-bloom.sh | 2 +- > > 3 files changed, 31 insertions(+), 12 deletions(-) > > > > diff --git a/revision.c b/revision.c > > index c644c660917..027ae3982b4 100644 > > --- a/revision.c > > +++ b/revision.c > > @@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) > > { > > struct pathspec_item *pi; > > char *path_alloc = NULL; > > - const char *path; > > + const char *path, *p; > > int last_index; > > - int len; > > + size_t len; > > + int path_component_nr = 0, j; > > > > if (!revs->commits) > > return; > > @@ -705,8 +706,22 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) > > > > len = strlen(path); > > > > - revs->bloom_key = xmalloc(sizeof(struct bloom_key)); > > - fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings); > > + p = path; > > + do { > > + p = strchrnul(p + 1, '/'); > > + path_component_nr++; > > + } while (p - path < len); > > + > > + revs->bloom_keys_nr = path_component_nr; > > + ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr); > > + > > + p = path; > > + for (j = 0; j < revs->bloom_keys_nr; j++) { > > + p = strchrnul(p + 1, '/'); > > + > > + fill_bloom_key(path, p - path, &revs->bloom_keys[j], > > + revs->bloom_filter_settings); > > + } > > > > Somewhat related to our off-list discussion yesterday, there is a bug in > both 2.27 and this patch which produces incorrect results when (1) > Bloom filters are enabled, and (2) we are doing a revision walk from > root with the pathspec '.'. > > What appears to be going on is that our normalization takes '.' -> '', > and then we form a Bloom key based on the empty string, which will > return 'definitely not' when querying the Bloom filter some of the time, > which should never happen. This is a consequence of never inserting the > empty key into the Bloom filter upon generation. > > As a result, I have patched this in GitHub's fork (which is currently > based on 2.27 and doesn't have these patches yet) by doing an early > return when 'strlen(path) == 0'. Since it looks like these patches are > going to land, here is some clean-up and a fix for the bug that you > should feel free to test with and apply on top: > > --- >8 --- > > diff --git a/revision.c b/revision.c > index 8bd383b1dd..123e72698d 100644 > --- a/revision.c > +++ b/revision.c > @@ -670,10 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) > { > struct pathspec_item *pi; > char *path_alloc = NULL; > - const char *path, *p; > + char *path, *p; > int last_index; > size_t len; > - int path_component_nr = 0, j; > + int path_component_nr = 1, j; > > if (!revs->commits) > return; > @@ -698,29 +698,33 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) > > /* remove single trailing slash from path, if needed */ > if (pi->match[last_index] == '/') { > - path_alloc = xstrdup(pi->match); > - path_alloc[last_index] = '\0'; > - path = path_alloc; > - } else > - path = pi->match; > + path_alloc = xstrdup(pi->match); > + path_alloc[last_index] = '\0'; > + path = path_alloc; > + } else { > + path = pi->match; > + len = pi->len; > + } > > - len = strlen(path); > + if (!len) > + return; I should note that _this_ is the critical fix, and it should fix the bug if you only applied just this hunk. Everything else is purely style clean-ups on top (ranging from the four spaces used instead of a tab, to some string processing niceties that I _think_ should address Rene's concern, although I'm not sure if an actual bug is lurking there or not...) > - p = path; > do { > - p = strchrnul(p + 1, '/'); > - path_component_nr++; > - } while (p - path < len); > + if (is_dir_sep(*p)) { > + *p = '\0'; > + path_component_nr++; > + } > + } while (*p++); > > revs->bloom_keys_nr = path_component_nr; > ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr); > > p = path; > for (j = 0; j < revs->bloom_keys_nr; j++) { > - p = strchrnul(p + 1, '/'); > - > - fill_bloom_key(path, p - path, &revs->bloom_keys[j], > + size_t plen = strlen(p); > + fill_bloom_key(p, plen, &revs->bloom_keys[j], > revs->bloom_filter_settings); > + p += plen; > } > > if (trace2_is_enabled() && !bloom_filter_atexit_registered) { > > > if (trace2_is_enabled() && !bloom_filter_atexit_registered) { > > atexit(trace2_bloom_filter_statistics_atexit); > > @@ -720,7 +735,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs, > > struct commit *commit) > > { > > struct bloom_filter *filter; > > - int result; > > + int result = 1, j; > > > > if (!revs->repo->objects->commit_graph) > > return -1; > > @@ -740,9 +755,11 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs, > > return -1; > > } > > > > - result = bloom_filter_contains(filter, > > - revs->bloom_key, > > - revs->bloom_filter_settings); > > + for (j = 0; result && j < revs->bloom_keys_nr; j++) { > > + result = bloom_filter_contains(filter, > > + &revs->bloom_keys[j], > > + revs->bloom_filter_settings); > > + } > > > > if (result) > > count_bloom_filter_maybe++; > > @@ -782,7 +799,7 @@ static int rev_compare_tree(struct rev_info *revs, > > return REV_TREE_SAME; > > } > > > > - if (revs->bloom_key && !nth_parent) { > > + if (revs->bloom_keys_nr && !nth_parent) { > > bloom_ret = check_maybe_different_in_bloom_filter(revs, commit); > > > > if (bloom_ret == 0) > > diff --git a/revision.h b/revision.h > > index 7c026fe41fc..abbfb4ab59a 100644 > > --- a/revision.h > > +++ b/revision.h > > @@ -295,8 +295,10 @@ struct rev_info { > > struct topo_walk_info *topo_walk_info; > > > > /* Commit graph bloom filter fields */ > > - /* The bloom filter key for the pathspec */ > > - struct bloom_key *bloom_key; > > + /* The bloom filter key(s) for the pathspec */ > > + struct bloom_key *bloom_keys; > > + int bloom_keys_nr; > > + > > /* > > * The bloom filter settings used to generate the key. > > * This is loaded from the commit-graph being used. > > diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh > > index c7011f33e2c..c13b97d3bda 100755 > > --- a/t/t4216-log-bloom.sh > > +++ b/t/t4216-log-bloom.sh > > @@ -142,7 +142,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' ' > > > > test_bloom_filters_used_when_some_filters_are_missing () { > > log_args=$1 > > - bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":8,\"definitely_not\":6" > > + bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":8" > > setup "$log_args" && > > grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" && > > test_cmp log_wo_bloom log_w_bloom > > -- > > gitgitgadget > > > Thanks, > Taylor Thanks, Taylor

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 6/19/2020 1:17 PM, Taylor Blau wrote: > Hi Stolee, > > On Mon, Jun 15, 2020 at 08:14:50PM +0000, SZEDER Gábor via GitGitGadget wrote: >> From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <[email protected]> >> >> The file 'dir/subdir/file' can only be modified if its leading >> directories 'dir' and 'dir/subdir' are modified as well. >> >> So when checking modified path Bloom filters looking for commits >> modifying a path with multiple path components, then check not only >> the full path in the Bloom filters, but all its leading directories as >> well. Take care to check these paths in "deepest first" order, >> because it's the full path that is least likely to be modified, and >> the Bloom filter queries can short circuit sooner. >> >> This can significantly reduce the average false positive rate, by >> about an order of magnitude or three(!), and can further speed up >> pathspec-limited revision walks. The table below compares the average >> false positive rate and runtime of >> >> git rev-list HEAD -- "$path" >> >> before and after this change for 5000+ randomly* selected paths from >> each repository: >> >> Average false Average Average >> positive rate runtime runtime >> before after before after difference >> ------------------------------------------------------------------ >> git 3.220% 0.7853% 0.0558s 0.0387s -30.6% >> linux 2.453% 0.0296% 0.1046s 0.0766s -26.8% >> tensorflow 2.536% 0.6977% 0.0594s 0.0420s -29.2% >> >> *Path selection was done with the following pipeline: >> >> git ls-tree -r --name-only HEAD | sort -R | head -n 5000 >> >> The improvements in runtime are much smaller than the improvements in >> average false positive rate, as we are clearly reaching diminishing >> returns here. However, all these timings depend on that accessing >> tree objects is reasonably fast (warm caches). If we had a partial >> clone and the tree objects had to be fetched from a promisor remote, >> e.g.: >> >> $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git >> $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \ >> commit-graph write --reachable >> $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/ >> $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \ >> rev-list HEAD -- "$path" >> >> then checking all leading path component can reduce the runtime from >> over an hour to a few seconds (and this is with the clone and the >> promisor on the same machine). >> >> This adjusts the tracing values in t4216-log-bloom.sh, which provides a >> concrete way to notice the improvement. >> >> Signed-off-by: SZEDER Gábor <[email protected]> >> Signed-off-by: Derrick Stolee <[email protected]> >> --- >> revision.c | 35 ++++++++++++++++++++++++++--------- >> revision.h | 6 ++++-- >> t/t4216-log-bloom.sh | 2 +- >> 3 files changed, 31 insertions(+), 12 deletions(-) >> >> diff --git a/revision.c b/revision.c >> index c644c660917..027ae3982b4 100644 >> --- a/revision.c >> +++ b/revision.c >> @@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) >> { >> struct pathspec_item *pi; >> char *path_alloc = NULL; >> - const char *path; >> + const char *path, *p; >> int last_index; >> - int len; >> + size_t len; >> + int path_component_nr = 0, j; >> >> if (!revs->commits) >> return; >> @@ -705,8 +706,22 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) >> >> len = strlen(path); >> >> - revs->bloom_key = xmalloc(sizeof(struct bloom_key)); >> - fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings); >> + p = path; >> + do { >> + p = strchrnul(p + 1, '/'); >> + path_component_nr++; >> + } while (p - path < len); >> + >> + revs->bloom_keys_nr = path_component_nr; >> + ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr); >> + >> + p = path; >> + for (j = 0; j < revs->bloom_keys_nr; j++) { >> + p = strchrnul(p + 1, '/'); >> + >> + fill_bloom_key(path, p - path, &revs->bloom_keys[j], >> + revs->bloom_filter_settings); >> + } >> > > Somewhat related to our off-list discussion yesterday, there is a bug in > both 2.27 and this patch which produces incorrect results when (1) > Bloom filters are enabled, and (2) we are doing a revision walk from > root with the pathspec '.'. > > What appears to be going on is that our normalization takes '.' -> '', > and then we form a Bloom key based on the empty string, which will > return 'definitely not' when querying the Bloom filter some of the time, > which should never happen. This is a consequence of never inserting the > empty key into the Bloom filter upon generation. > > As a result, I have patched this in GitHub's fork (which is currently > based on 2.27 and doesn't have these patches yet) by doing an early > return when 'strlen(path) == 0'. Since it looks like these patches are > going to land, here is some clean-up and a fix for the bug that you > should feel free to test with and apply on top: > > --- >8 --- > > diff --git a/revision.c b/revision.c > index 8bd383b1dd..123e72698d 100644 > --- a/revision.c > +++ b/revision.c > @@ -670,10 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) > { > struct pathspec_item *pi; > char *path_alloc = NULL; > - const char *path, *p; > + char *path, *p; > int last_index; > size_t len; > - int path_component_nr = 0, j; > + int path_component_nr = 1, j; > > if (!revs->commits) > return; > @@ -698,29 +698,33 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) > > /* remove single trailing slash from path, if needed */ > if (pi->match[last_index] == '/') { > - path_alloc = xstrdup(pi->match); > - path_alloc[last_index] = '\0'; > - path = path_alloc; > - } else > - path = pi->match; > + path_alloc = xstrdup(pi->match); > + path_alloc[last_index] = '\0'; > + path = path_alloc; > + } else { > + path = pi->match; > + len = pi->len; > + } > > - len = strlen(path); > + if (!len) > + return; > > - p = path; > do { > - p = strchrnul(p + 1, '/'); > - path_component_nr++; > - } while (p - path < len); > + if (is_dir_sep(*p)) { > + *p = '\0'; > + path_component_nr++; > + } > + } while (*p++); > > revs->bloom_keys_nr = path_component_nr; > ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr); > > p = path; > for (j = 0; j < revs->bloom_keys_nr; j++) { > - p = strchrnul(p + 1, '/'); > - > - fill_bloom_key(path, p - path, &revs->bloom_keys[j], > + size_t plen = strlen(p); > + fill_bloom_key(p, plen, &revs->bloom_keys[j], > revs->bloom_filter_settings); > + p += plen; I don't think this is correct at all. Looking at it, it seems that it would take a path "A/B/C" and add keys for "A", "B", and "C" instead of "A", "A/B", and "A/B/C". Looking more closely, there are a few issues that makes it clear why you didn't see a failing test: 1. You use "while (*p++)" instead of "while (*++p)" so the scan terminates after the first directory split. (So only "A" is added, which won't fail, but will be slower than intended.) Changing that, we see the next problem: 2. You use "p += plen" instead of "p += plen + 1". This causes the filters to add "A", "", and "" (because we don't skip the terminating character). This _is_ incorrect and would result in test failures. Changing that, we then see "A", "B", and "C" are added as keys. I'm going to take the style issues that you presented, and change them in one commit (reported-by you) and then the if (!len) fix in a separate patch. I'll update the scan loops to use is_dir_sep() accordingly in this patch. Thanks, -Stolee

gitgitgadget · 2020-06-19T22:35:39Z

This patch series was integrated into pu via git@97ed3c6.

gitgitgadget · 2020-06-23T00:15:26Z

This branch is now known as ds/commit-graph-bloom-updates.

gitgitgadget · 2020-06-23T00:15:29Z

This patch series was integrated into pu via git@fc5ffbe.

Place an instance of struct bloom_settings into the struct write_commit_graph_context. This allows simplifying the function prototype of write_graph_chunk_bloom_data(). This will allow us to combine the function prototypes and use function pointers to simplify write_commit_graph_file(). By using a pointer, we can later replace the settings to match those that exist in the current commit-graph, in case a future Git version allows customization of these parameters. Reported-by: SZEDER Gábor <[email protected]> Signed-off-by: Derrick Stolee <[email protected]>

43d3561 (commit-graph write: don't die if the existing graph is corrupt, 2019-03-25) introduced the GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD environment variable. This was created to verify that commit-graph was not loaded when writing a new non-incremental commit-graph. An upcoming change wants to load a commit-graph in some valuable cases, but we want to maintain that we don't trust the commit-graph data when writing our new file. Instead of dying on load, instead die if we ever try to parse a commit from the commit-graph. This functionally verifies the same intended behavior, but allows a more advanced feature in the next change. Signed-off-by: Derrick Stolee <[email protected]>

derrickstolee · 2020-06-23T17:45:46Z

/submit

gitgitgadget · 2020-06-23T17:47:31Z

Submitted as [email protected]

gitgitgadget · 2020-06-24T03:05:37Z

This patch series was integrated into seen via git@3af73ec.

gitgitgadget · 2020-06-24T05:43:18Z

This patch series was integrated into seen via git@0812eda.

gitgitgadget · 2020-06-24T19:47:04Z

This patch series was integrated into seen via git@1b5d3d8.

gitgitgadget · 2020-07-13T22:46:56Z

This patch series was integrated into seen via git@bd95731.

gitgitgadget · 2020-07-14T22:00:56Z

This patch series was integrated into seen via git@277bc3c.

gitgitgadget · 2020-07-15T22:38:38Z

This patch series was integrated into seen via git@ee1d34d.

gitgitgadget · 2020-07-16T00:36:12Z

This patch series was integrated into seen via git@2188883.

gitgitgadget · 2020-07-16T23:14:40Z

This patch series was integrated into seen via git@bf511ce.

gitgitgadget · 2020-07-17T00:55:06Z

This patch series was integrated into seen via git@57cb778.

gitgitgadget · 2020-07-17T20:25:41Z

This patch series was integrated into seen via git@179c51a.

gitgitgadget · 2020-07-18T23:31:53Z

This patch series was integrated into seen via git@4ff1624.

gitgitgadget · 2020-07-19T01:24:20Z

This patch series was integrated into seen via git@8157d78.

gitgitgadget · 2020-07-20T21:18:43Z

This patch series was integrated into seen via git@522b5b8.

gitgitgadget · 2020-07-21T05:19:43Z

This patch series was integrated into seen via git@9bd7fbc.

gitgitgadget · 2020-07-21T22:12:14Z

This patch series was integrated into seen via git@bd9b7d8.

gitgitgadget · 2020-07-22T17:36:45Z

This patch series was integrated into seen via git@ceb72d1.

gitgitgadget · 2020-07-22T22:43:01Z

This patch series was integrated into seen via git@a652433.

gitgitgadget · 2020-07-24T21:38:45Z

This patch series was integrated into seen via git@bb7e490.

gitgitgadget · 2020-07-24T23:24:21Z

This patch series was integrated into seen via git@097b424.

gitgitgadget · 2020-07-26T20:21:30Z

This patch series was integrated into seen via git@d4c116d.

gitgitgadget · 2020-07-27T16:53:55Z

This patch series was integrated into seen via git@2cf3d20.

gitgitgadget · 2020-07-30T21:39:07Z

This patch series was integrated into seen via git@70cdbbe.

gitgitgadget · 2020-07-30T21:39:08Z

This patch series was integrated into next via git@70cdbbe.

gitgitgadget · 2020-07-30T21:39:09Z

This patch series was integrated into master via git@70cdbbe.

gitgitgadget · 2020-07-30T21:39:13Z

Closed via 70cdbbe.

gitgitgadget · 2020-10-15T13:24:41Z

Documentation/git-commit-graph.txt

@@ -60,7 +60,10 @@ existing commit-graph file.
 With the `--changed-paths` option, compute and write information about the


On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Wed, Jul 01, 2020 at 01:27:24PM +0000, Derrick Stolee via GitGitGadget wrote: > From: Derrick Stolee <[email protected]> > > The changed-path Bloom filters were released in v2.27.0, but have a > significant drawback. A user can opt-in to writing the changed-path > filters using the "--changed-paths" option to "git commit-graph write" > but the next write will drop the filters unless that option is > specified. > > This becomes even more important when considering the interaction with > gc.writeCommitGraph (on by default) or fetch.writeCommitGraph (part of > features.experimental). These config options trigger commit-graph writes > that the user did not signal, and hence there is no --changed-paths > option available. > > Allow a user that opts-in to the changed-path filters to persist the > property of "my commit-graph has changed-path filters" automatically. A > user can drop filters using the --no-changed-paths option. The above parts of the commit message and the corresponding changes are OK, but ... > In the process, we need to be extremely careful to match the Bloom > filter settings as specified by the commit-graph. This will allow future > versions of Git to customize these settings, and the version with this > change will persist those settings as commit-graphs are rewritten on > top. As pointed out in my original bug report [1], modified path Bloom filters are computed with hardcoded settings in bloom.c:get_bloom_filter(). Since this patch does not touch bloom.c at all, it still computes Bloom filters with those hardcoded settings, and, consequently, despite the commit message's claims, it does not persist the settings in the existing commit-graph. [1] https://public-inbox.org/git/[email protected]/ > Use the trace2 API to signal the settings used during the write, and > check that output in a test after manually adjusting the correct bytes > in the commit-graph file. This test is insufficient, as it only checks what settings trace2 believes the Bloom filters are computed with, not what settings they are actually computed with; that's why it succeeded while the bug whose absence it was supposed to ensure was still there. More robust tests should instead look at what actually gets written to the commit-graph, and how that is interpreted during pathspec-limited revision walks. > Signed-off-by: Derrick Stolee <[email protected]> A "Reported-by: me" trailer would have been appropriate here. > --- > Documentation/git-commit-graph.txt | 5 +++- > builtin/commit-graph.c | 5 +++- > commit-graph.c | 45 ++++++++++++++++++++++++++++-- > commit-graph.h | 1 + > t/t4216-log-bloom.sh | 17 ++++++++++- > 5 files changed, 67 insertions(+), 6 deletions(-) Anyway, this is now partially fixed in 9a7a9ed10d (bloom: use provided 'struct bloom_filter_settings', 2020-09-16), though, unfortunately, its commit message is not quite clear on this. Alas, that's only a partial fix, because we still only look at the top level commit-graph file for existing Bloom filter settings. However, deeper commit-graph layers can contain Bloom filters with non-default settings even when the top level doesn't, and these failing tests below demonstrate: --- >8 --- #!/bin/sh test_description='test' . ./test-lib.sh test_expect_success 'setup' ' git commit --allow-empty -m "Bloom filters are written but ignored for root commits :(" && for i in 1 2 3 do echo $i >file && git add file && git commit -m "$i" || return 1 done && git log --oneline --no-decorate -- file >expect ' test_expect_success 'split' ' # Compute Bloom filters with "unusual" settings. git rev-parse HEAD^^ | GIT_TEST_BLOOM_SETTINGS_NUM_HASHES=3 git commit-graph write --stdin-commits --changed-paths --split && # A commit-graph layer without Bloom filters "hides" the layers # below ... git rev-parse HEAD^ | git commit-graph write --stdin-commits --no-changed-paths --split=no-merge && # ... so this does not look at existing Bloom filters and their # settings in the bottom commit-graph layer and computes new # Bloom filters using the default 7 hashes. git rev-parse HEAD | git commit-graph write --stdin-commits --changed-paths --split=no-merge && # Just to make sure that there are as many graph layers as I # think there should be. test_line_count = 3 .git/objects/info/commit-graphs/commit-graph-chain && # This checks Bloom filters using settings in the top layer, # thus misses commits modifying file in the bottom commit-graph # layer. git log --oneline --no-decorate -- file >actual && test_cmp expect actual ' test_expect_success 'merged' ' # This merges all existing layers, and computes missing Bloom # filters with the settings in the top layer, without noticing # that filters in the bottom layer were computed with different # settings. git commit-graph write --reachable --changed-paths && # Just to make sure... test_path_is_file .git/objects/info/commit-graph && # This misses commits modifying file that were merged from the # bottom commit-graph layer. git log --oneline --no-decorate -- file >actual && test_cmp expect actual ' test_done

On the Git mailing list, Taylor Blau wrote (reply to this):

On Thu, Oct 15, 2020 at 03:21:47PM +0200, SZEDER Gábor wrote: > As pointed out in my original bug report [1], modified path Bloom > filters are computed with hardcoded settings in > bloom.c:get_bloom_filter(). Since this patch does not touch bloom.c > at all, it still computes Bloom filters with those hardcoded settings, > and, consequently, despite the commit message's claims, it does not > persist the settings in the existing commit-graph. > > [1] https://public-inbox.org/git/[email protected]/ Right. It is worth mentioning here (as you do below) that this was fixed as of 9a7a9ed10d (bloom: use provided 'struct bloom_filter_settings', 2020-09-16). > > Use the trace2 API to signal the settings used during the write, and > > check that output in a test after manually adjusting the correct bytes > > in the commit-graph file. > > This test is insufficient, as it only checks what settings trace2 > believes the Bloom filters are computed with, not what settings they > are actually computed with; that's why it succeeded while the bug > whose absence it was supposed to ensure was still there. > > More robust tests should instead look at what actually gets written to > the commit-graph, and how that is interpreted during pathspec-limited > revision walks. Thanks for the test! That saved me a little bit of work trying to set up the scenario you're describing in a reproducible way. I think that the third test can be fixed relatively easily. The crux of the issue there is that we are computing Bloom filters for commits, some of which already had Bloom filters computed in an earlier layer, but with different settings than the one that we're using to write the current layer. So, we need to be more aggressively checking the Bloom filter settings in any layer we want to reuse a Bloom filter out of before reusing it verbatim in the current layer. The patch below the scissors line is sufficient to do that, and it causes the third test to start passing. ...But, teaching the revision machinery how to handle a walk through commits in multiple commit-graph layers with incompatible Bloom filter settings is trickier. Right now we compute all of the Bloom keys up front using the Bloom filter settings in the top layer. I think we'd probably want to change this to lazily compute those keys by: - Keeping around the contents of the Bloom keys, i.e., the pathspecs themselves. - Keeping a hashmap from Bloom filter settings to computed Bloom keys corresponding to those settings. - Lazily filling in that hashmap as we traverse through different commits. At least, that's the best way that I can think to do it. It does get kind of slow, though; we'd have to scan every commit graph layer at each commit in the worst case to find the actual 'struct commit_graph *' in order to get its Bloom filter settings. So, I think that's sort of show-stoppingly slow, and that we should probably think more deeply about it before taking up that direction. Maybe Stolee has some better thoughts about how we can quickly go from a commit to either (a) the commit-graph struct that that commit is stored in, or (b) the Bloom filter settings belonging to that struct. Thanks, Taylor --- >8 --- Subject: [PATCH] bloom: recompute filters with incompatible settings --- bloom.c | 21 +++++++++++++++++++-- commit-graph.c | 4 ++-- 2 files changed, 21 insertions(+), 4 deletions(-) diff --git a/bloom.c b/bloom.c index 68c73200a5..30da128474 100644 --- a/bloom.c +++ b/bloom.c @@ -30,7 +30,8 @@ static inline unsigned char get_bitmask(uint32_t pos) static int load_bloom_filter_from_graph(struct commit_graph *g, struct bloom_filter *filter, - struct commit *c) + struct commit *c, + const struct bloom_filter_settings *settings) { uint32_t lex_pos, start_index, end_index; uint32_t graph_pos = commit_graph_position(c); @@ -42,6 +43,21 @@ static int load_bloom_filter_from_graph(struct commit_graph *g, if (!g->chunk_bloom_indexes) return 0; + if (settings) { + struct bloom_filter_settings *graph_settings = g->bloom_filter_settings; + /* + * Check that the settings used to generate new Bloom filters is + * compatible with the settings Bloom filters in this + * commit-graph layer were generated with. + */ + if (settings->hash_version != graph_settings->hash_version) + return 0; + if (settings->num_hashes != graph_settings->num_hashes) + return 0; + if (settings->bits_per_entry != graph_settings->bits_per_entry) + return 0; + } + lex_pos = graph_pos - g->num_commits_in_base; end_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos); @@ -205,7 +221,8 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r, if (!filter->data) { load_commit_graph_info(r, c); if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH) - load_bloom_filter_from_graph(r->objects->commit_graph, filter, c); + load_bloom_filter_from_graph(r->objects->commit_graph, + filter, c, settings); } if (filter->data && filter->len) diff --git a/commit-graph.c b/commit-graph.c index cb042bdba8..afe14af2a3 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -1188,7 +1188,7 @@ static int write_graph_chunk_bloom_indexes(struct hashfile *f, uint32_t cur_pos = 0; while (list < last) { - struct bloom_filter *filter = get_bloom_filter(ctx->r, *list); + struct bloom_filter *filter = get_or_compute_bloom_filter(ctx->r, *list, 0, ctx->bloom_settings, NULL); size_t len = filter ? filter->len : 0; cur_pos += len; display_progress(ctx->progress, ++ctx->progress_cnt); @@ -1228,7 +1228,7 @@ static int write_graph_chunk_bloom_data(struct hashfile *f, hashwrite_be32(f, ctx->bloom_settings->bits_per_entry); while (list < last) { - struct bloom_filter *filter = get_bloom_filter(ctx->r, *list); + struct bloom_filter *filter = get_or_compute_bloom_filter(ctx->r, *list, 0, ctx->bloom_settings, NULL); size_t len = filter ? filter->len : 0; display_progress(ctx->progress, ++ctx->progress_cnt); -- 2.29.0.rc1.dirty

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 10/15/2020 5:41 PM, Taylor Blau wrote: > So, we need to be more aggressively checking the Bloom filter settings > in any layer we want to reuse a Bloom filter out of before reusing it > verbatim in the current layer. The patch below the scissors line is > sufficient to do that, and it causes the third test to start passing. I think there are three things we should keep in mind: 1. Incompatible Bloom filter settings between layers should be seen as _inconsistent data_ as Git should not be writing incremental commit-graph files with inconsistent Bloom filter settings. Thus, when reading the commit-graph chain we should prevent incompatible filters from being used. One way to do this is to notice different settings and completely disable Bloom filters. The other way would be to take the settings from the first layer with filters and then clear the chunk_bloom_indexes and chunk_bloom_data fields for the layers that don't agree. This fits with an expectation that lower layers are larger, so more filters can be used in that situation. 2. We should be sure that Git will not agree to write incompatible settings between layers of a commit-graph chain. Even more, it should definitely not re-use filters when merging layers with incompatible filter values. The strategy above to ignore Bloom filters in incompatible upper layers is enough of a guard against the "merge layers" situation. 3. Allowing users (or future Git developers) to adjust the default Bloom filter settings is one that is good to do for future-proofing, but not one that I expect to be widely used (any gains here are minuscule compared to the results already achieved with the defaults). On top of that, including incompatible settings across layers is even less likely to be a real use case. We should not be straining to make the code even worse for this imaginary scenario. With that said... > ...But, teaching the revision machinery how to handle a walk through > commits in multiple commit-graph layers with incompatible Bloom filter > settings is trickier. Right now we compute all of the Bloom keys up > front using the Bloom filter settings in the top layer. I think we'd > probably want to change this to lazily compute those keys by: It would probably be easiest to compute an array of bloom_key structs (per path) where the index corresponds to the depth of the commit-graph layer. You can then access the correct key after you have identified which layer the commit is from. > At least, that's the best way that I can think to do it. It does get > kind of slow, though; we'd have to scan every commit graph layer at each > commit in the worst case to find the actual 'struct commit_graph *' in > order to get its Bloom filter settings. So, I think that's sort of > show-stoppingly slow, and that we should probably think more deeply > about it before taking up that direction. > > Maybe Stolee has some better thoughts about how we can quickly go from a > commit to either (a) the commit-graph struct that that commit is stored > in, or (b) the Bloom filter settings belonging to that struct. We already have code that finds a commit in a commit-graph layer based on its integer position by iterating until num_commits_in_base is below our position. The lexicographic position within that layer is found by subtracting num_commits_in_base. For us, we would simply need: int get_commit_graph_layer(struct commit_graph *g, uint32_t pos) { uint32_t layer_index = 0; while (g && pos < g->num_commits_in_base) { g = g->base_graph; layer_index++; } return layer_index; } You could then use the response from get_commit_graph_layer() to load the correct Bloom key. Again, I strongly suggest _not_ working hard to support this case. We should only put in proper safeguards to prevent data like this being written and protect a Git process that might stumble across data in this shape. Thanks, -Stolee

On the Git mailing list, Taylor Blau wrote (reply to this):

On Thu, Oct 15, 2020 at 10:18:47PM -0400, Derrick Stolee wrote: > On 10/15/2020 5:41 PM, Taylor Blau wrote: > > So, we need to be more aggressively checking the Bloom filter settings > > in any layer we want to reuse a Bloom filter out of before reusing it > > verbatim in the current layer. The patch below the scissors line is > > sufficient to do that, and it causes the third test to start passing. > > I think there are three things we should keep in mind: > > 1. Incompatible Bloom filter settings between layers should be seen > as _inconsistent data_ as Git should not be writing incremental > commit-graph files with inconsistent Bloom filter settings. Thus, > when reading the commit-graph chain we should prevent incompatible > filters from being used. One way to do this is to notice different > settings and completely disable Bloom filters. The other way would > be to take the settings from the first layer with filters and then > clear the chunk_bloom_indexes and chunk_bloom_data fields for the > layers that don't agree. This fits with an expectation that lower > layers are larger, so more filters can be used in that situation. Sure; I'd be fine with only allowing filters computed with the settings present in the lowest or largest layer in the event that multiple layers exist with incompatible settings. I'm trying to point us towards a direction of not optimizing too far along a direction that we're unlikely to take, while also trying to do something relatively non-invasive to make it possible for a version of Git to change the default Bloom settings. That is, if a user is writing split commit-graphs, and we change the default Bloom settings, they shouldn't have to recompute or merge down all of their Bloom filters. If that's something that we never think is going to happen, I'm fine with not thinking too hard about it. But, I also don't want to paint ourselves into a corner, so I think something like the patch I wrote in the email that you're replying to actually may be worth pursuing further. I dunno. Definitely after 2.29, though. > 2. We should be sure that Git will not agree to write incompatible > settings between layers of a commit-graph chain. Even more, it > should definitely not re-use filters when merging layers with > incompatible filter values. The strategy above to ignore Bloom > filters in incompatible upper layers is enough of a guard against > the "merge layers" situation. Yeah, this would be fine with me. > 3. Allowing users (or future Git developers) to adjust the default > Bloom filter settings is one that is good to do for future-proofing, > but not one that I expect to be widely used (any gains here are > minuscule compared to the results already achieved with the defaults). > On top of that, including incompatible settings across layers is even > less likely to be a real use case. We should not be straining to make > the code even worse for this imaginary scenario. Right, I think we're pretty much in agreement here. Doing something easy to make sure that we don't run into a wall seems to make sense, but I think modifying the revision walk machinery to keep track of multiple Bloom keys computed with different settings corresponding to the set of Bloom filter settings in commit-graph layers is probably too far in that direction. For what it's worth, I was mainly talking about it to say that it would be more effort than it's probably worth to do. There's also nothing that we're currently discussing that would prevent us from taking that same direction up in six months from now. Thanks, Taylor

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 10/15/2020 11:18 PM, Taylor Blau wrote: > On Thu, Oct 15, 2020 at 10:18:47PM -0400, Derrick Stolee wrote: >> On 10/15/2020 5:41 PM, Taylor Blau wrote: >>> So, we need to be more aggressively checking the Bloom filter settings >>> in any layer we want to reuse a Bloom filter out of before reusing it >>> verbatim in the current layer. The patch below the scissors line is >>> sufficient to do that, and it causes the third test to start passing. >> >> I think there are three things we should keep in mind: >> >> 1. Incompatible Bloom filter settings between layers should be seen >> as _inconsistent data_ as Git should not be writing incremental >> commit-graph files with inconsistent Bloom filter settings. Thus, >> when reading the commit-graph chain we should prevent incompatible >> filters from being used. One way to do this is to notice different >> settings and completely disable Bloom filters. The other way would >> be to take the settings from the first layer with filters and then >> clear the chunk_bloom_indexes and chunk_bloom_data fields for the >> layers that don't agree. This fits with an expectation that lower >> layers are larger, so more filters can be used in that situation. > > Sure; I'd be fine with only allowing filters computed with the settings > present in the lowest or largest layer in the event that multiple layers > exist with incompatible settings. > > I'm trying to point us towards a direction of not optimizing too far > along a direction that we're unlikely to take, while also trying to do > something relatively non-invasive to make it possible for a version of > Git to change the default Bloom settings. That is, if a user is writing > split commit-graphs, and we change the default Bloom settings, they > shouldn't have to recompute or merge down all of their Bloom filters. They would need to recompute when they merge layers, which introduces a big question about how we should handle such a case. > If that's something that we never think is going to happen, I'm fine > with not thinking too hard about it. But, I also don't want to paint > ourselves into a corner, so I think something like the patch I wrote in > the email that you're replying to actually may be worth pursuing > further. I dunno. Definitely after 2.29, though. I think the proposed "react properly to this unlikely situation" is a good way to prevent getting locked into our choices now. It makes it possible for "old" Git versions (2.30 until we decide to allow this mix) to interact with the inconsistent settings without failure. We don't need to do the 100% "optimal" case of using all filters in order to enable this choice in the future. [...] > For what it's worth, I was mainly talking about it to say that it would > be more effort than it's probably worth to do. There's also nothing that > we're currently discussing that would prevent us from taking that same > direction up in six months from now. Yes, I just want to make sure that everyone agrees there is a middle ground without saying that inconsistent filter settings across layers is a "fully supported" feature. If someone wants to tackle the work to make it a desirable state, then they can try that (with great care). Thanks, -Stolee

gitgitgadget · 2020-10-15T13:24:45Z

User SZEDER Gábor <[email protected]> has been added to the cc: list.

gitgitgadget · 2020-10-15T21:44:02Z

User Taylor Blau <[email protected]> has been added to the cc: list.

gitgitgadget · 2020-10-16T02:19:42Z

User Derrick Stolee <[email protected]> has been added to the cc: list.

derrickstolee force-pushed the bloom-2 branch from 8dccd0a to db5b8fe Compare June 15, 2020 19:42

gitgitgadget bot added the pu label Jun 18, 2020

gitgitgadget bot reviewed Jun 18, 2020

View reviewed changes

gitgitgadget bot reviewed Jun 19, 2020

View reviewed changes

derrickstolee force-pushed the bloom-2 branch from db5b8fe to ba668af Compare June 23, 2020 14:47

derrickstolee added 2 commits June 23, 2020 10:48

derrickstolee force-pushed the bloom-2 branch 2 times, most recently from 485e6bf to 1022c0a Compare June 23, 2020 17:45

gitgitgadget bot added the seen label Jun 24, 2020

This was referenced Jul 26, 2020

Weekly Digest (19 July, 2020 - 26 July, 2020) #682

Closed

Weekly Digest (19 July, 2020 - 26 July, 2020) #683

Closed

gitgitgadget bot added the master label Jul 30, 2020

gitgitgadget bot closed this Jul 30, 2020

gitgitgadget bot reviewed Oct 15, 2020

View reviewed changes

		@@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
		{

		@@ -60,7 +60,10 @@ existing commit-graph file.
		With the `--changed-paths` option, compute and write information about the

More commit-graph/Bloom filter improvements #659

More commit-graph/Bloom filter improvements #659

Conversation

derrickstolee commented Jun 15, 2020 • edited by gitgitgadget bot Loading

derrickstolee commented Jun 15, 2020

gitgitgadget bot commented Jun 15, 2020

gitgitgadget bot commented Jun 17, 2020

gitgitgadget bot commented Jun 18, 2020

gitgitgadget bot commented Jun 18, 2020

gitgitgadget bot commented Jun 18, 2020

gitgitgadget bot commented Jun 18, 2020

gitgitgadget bot Jun 18, 2020

Choose a reason for hiding this comment

gitgitgadget bot Jun 19, 2020

Choose a reason for hiding this comment

gitgitgadget bot Jun 18, 2020

Choose a reason for hiding this comment

gitgitgadget bot Jun 18, 2020

Choose a reason for hiding this comment

gitgitgadget bot Jun 18, 2020

Choose a reason for hiding this comment

gitgitgadget bot Jun 19, 2020

Choose a reason for hiding this comment

gitgitgadget bot commented Jun 18, 2020

gitgitgadget bot commented Jun 19, 2020

gitgitgadget bot Jun 19, 2020

Choose a reason for hiding this comment

gitgitgadget bot Jun 19, 2020

Choose a reason for hiding this comment

gitgitgadget bot Jun 23, 2020

Choose a reason for hiding this comment

gitgitgadget bot commented Jun 19, 2020

gitgitgadget bot commented Jun 23, 2020

gitgitgadget bot commented Jun 23, 2020

derrickstolee commented Jun 23, 2020

gitgitgadget bot commented Jun 23, 2020

gitgitgadget bot commented Jun 24, 2020

gitgitgadget bot commented Jun 24, 2020

gitgitgadget bot commented Jun 24, 2020

gitgitgadget bot commented Jul 13, 2020

gitgitgadget bot commented Jul 14, 2020

gitgitgadget bot commented Jul 15, 2020

gitgitgadget bot commented Jul 16, 2020

gitgitgadget bot commented Jul 16, 2020

gitgitgadget bot commented Jul 17, 2020

gitgitgadget bot commented Jul 17, 2020

gitgitgadget bot commented Jul 18, 2020

gitgitgadget bot commented Jul 19, 2020

gitgitgadget bot commented Jul 20, 2020

gitgitgadget bot commented Jul 21, 2020

gitgitgadget bot commented Jul 21, 2020

gitgitgadget bot commented Jul 22, 2020

gitgitgadget bot commented Jul 22, 2020

gitgitgadget bot commented Jul 24, 2020

gitgitgadget bot commented Jul 24, 2020

gitgitgadget bot commented Jul 26, 2020

gitgitgadget bot commented Jul 27, 2020

gitgitgadget bot commented Jul 30, 2020

gitgitgadget bot commented Jul 30, 2020

gitgitgadget bot commented Jul 30, 2020

gitgitgadget bot commented Jul 30, 2020

gitgitgadget bot Oct 15, 2020

Choose a reason for hiding this comment

gitgitgadget bot Oct 15, 2020

Choose a reason for hiding this comment

gitgitgadget bot Oct 16, 2020

Choose a reason for hiding this comment

gitgitgadget bot Oct 16, 2020

Choose a reason for hiding this comment

gitgitgadget bot Oct 16, 2020

Choose a reason for hiding this comment

gitgitgadget bot commented Oct 15, 2020

gitgitgadget bot commented Oct 15, 2020

gitgitgadget bot commented Oct 16, 2020

derrickstolee commented Jun 15, 2020 •

edited by gitgitgadget bot

Loading