-
Notifications
You must be signed in to change notification settings - Fork 140
More commit-graph/Bloom filter improvements #659
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More commit-graph/Bloom filter improvements #659
Conversation
/submit |
Submitted as [email protected] |
On the Git mailing list, Junio C Hamano wrote (reply to this):
|
This branch is now known as |
This patch series was integrated into pu via git@7ac860a. |
On the Git mailing list, Derrick Stolee wrote (reply to this):
|
This patch series was integrated into pu via git@7fd559e. |
@@ -882,6 +882,7 @@ struct write_commit_graph_context { | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, René Scharfe wrote (reply to this):
Am 15.06.20 um 22:14 schrieb Derrick Stolee via GitGitGadget:
> From: Derrick Stolee <[email protected]>
>
> Place an instance of struct bloom_settings into the struct
> write_commit_graph_context. This allows simplifying the function
> prototype of write_graph_chunk_bloom_data(). This will allow us
> to combine the function prototypes and use function pointers to
> simplify write_commit_graph_file().
>
> Reported-by: SZEDER Gábor <[email protected]>
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
> commit-graph.c | 14 ++++++++------
> 1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 887837e8826..05b7035d8d5 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -882,6 +882,7 @@ struct write_commit_graph_context {
>
> const struct split_commit_graph_opts *split_opts;
> size_t total_bloom_filter_data_size;
> + struct bloom_filter_settings bloom_settings;
That structure is quite busy already, so adding one more member wouldn't
matter much.
Passing so many things to lots of functions makes it harder to argue
about them, though, as all of them effectively become part of their
signature, and you have to look at their implementation to see which
pseudo-parameters they actually use. It's like a God object.
> };
>
> static void write_graph_chunk_fanout(struct hashfile *f,
> @@ -1103,8 +1104,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
> }
>
> static void write_graph_chunk_bloom_data(struct hashfile *f,
> - struct write_commit_graph_context *ctx,
> - const struct bloom_filter_settings *settings)
> + struct write_commit_graph_context *ctx)
> {
> struct commit **list = ctx->commits.list;
> struct commit **last = ctx->commits.list + ctx->commits.nr;
> @@ -1116,9 +1116,9 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
> _("Writing changed paths Bloom filters data"),
> ctx->commits.nr);
>
> - hashwrite_be32(f, settings->hash_version);
> - hashwrite_be32(f, settings->num_hashes);
> - hashwrite_be32(f, settings->bits_per_entry);
> + hashwrite_be32(f, ctx->bloom_settings.hash_version);
> + hashwrite_be32(f, ctx->bloom_settings.num_hashes);
> + hashwrite_be32(f, ctx->bloom_settings.bits_per_entry);
>
> while (list < last) {
> struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
> @@ -1541,6 +1541,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
> struct object_id file_hash;
> const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>
> + ctx->bloom_settings = bloom_settings;
So we use the defaults, no customization? Then you could simply move
the declaration of bloom_settings from write_commit_graph_file() to
write_graph_chunk_bloom_data(). Glancing at pu I don't see additional
uses there, so no need to put it into the context (yet?).
René
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Derrick Stolee wrote (reply to this):
On 6/18/2020 4:30 PM, René Scharfe wrote:
> Am 15.06.20 um 22:14 schrieb Derrick Stolee via GitGitGadget:
>> From: Derrick Stolee <[email protected]>
>>
>> Place an instance of struct bloom_settings into the struct
>> write_commit_graph_context. This allows simplifying the function
>> prototype of write_graph_chunk_bloom_data(). This will allow us
>> to combine the function prototypes and use function pointers to
>> simplify write_commit_graph_file().
>>
>> Reported-by: SZEDER Gábor <[email protected]>
>> Signed-off-by: Derrick Stolee <[email protected]>
>> ---
>> commit-graph.c | 14 ++++++++------
>> 1 file changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/commit-graph.c b/commit-graph.c
>> index 887837e8826..05b7035d8d5 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -882,6 +882,7 @@ struct write_commit_graph_context {
>>
>> const struct split_commit_graph_opts *split_opts;
>> size_t total_bloom_filter_data_size;
>> + struct bloom_filter_settings bloom_settings;
>
> That structure is quite busy already, so adding one more member wouldn't
> matter much.
>
> Passing so many things to lots of functions makes it harder to argue
> about them, though, as all of them effectively become part of their
> signature, and you have to look at their implementation to see which
> pseudo-parameters they actually use. It's like a God object.
Correct. The write_commit_graph_context _is_ a God object for the
commit-graph write. The good news is that it is limited only to
commit-graph.c and the write operations therein. Hopefully, the
code organization benefits enough from this structure to justify
the massive struct.
In contrast, it's still smaller and more contained than
"struct rev_info"!
>> };
>>
>> static void write_graph_chunk_fanout(struct hashfile *f,
>> @@ -1103,8 +1104,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
>> }
>>
>> static void write_graph_chunk_bloom_data(struct hashfile *f,
>> - struct write_commit_graph_context *ctx,
>> - const struct bloom_filter_settings *settings)
>> + struct write_commit_graph_context *ctx)
>> {
>> struct commit **list = ctx->commits.list;
>> struct commit **last = ctx->commits.list + ctx->commits.nr;
>> @@ -1116,9 +1116,9 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
>> _("Writing changed paths Bloom filters data"),
>> ctx->commits.nr);
>>
>> - hashwrite_be32(f, settings->hash_version);
>> - hashwrite_be32(f, settings->num_hashes);
>> - hashwrite_be32(f, settings->bits_per_entry);
>> + hashwrite_be32(f, ctx->bloom_settings.hash_version);
>> + hashwrite_be32(f, ctx->bloom_settings.num_hashes);
>> + hashwrite_be32(f, ctx->bloom_settings.bits_per_entry);
>>
>> while (list < last) {
>> struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>> @@ -1541,6 +1541,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>> struct object_id file_hash;
>> const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>>
>> + ctx->bloom_settings = bloom_settings;
>
> So we use the defaults, no customization? Then you could simply move
> the declaration of bloom_settings from write_commit_graph_file() to
> write_graph_chunk_bloom_data(). Glancing at pu I don't see additional
> uses there, so no need to put it into the context (yet?).
It certainly is not customized by a user (yet). However, you do make an
excellent point that I need to be more careful here! Patch 8
(commit-graph: persist existence of changed-paths) needs to load the
bloom_filter_settings from the existing commit-graph so we can be
future-proof from a future version customizing the settings inside the
commit-graph file!
This means that in v2 I'll move patches 7 & 8 to be after patch 1 and
add a test to verify the filter settings are preserved (after manually
changing the data in the file).
Thanks!
-Stolee
@@ -882,10 +882,11 @@ struct write_commit_graph_context { | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, René Scharfe wrote (reply to this):
Am 15.06.20 um 22:14 schrieb SZEDER Gábor via GitGitGadget:
> -static void write_graph_chunk_oids(struct hashfile *f, int hash_len,
> - struct write_commit_graph_context *ctx)
> +static int write_graph_chunk_oids(struct hashfile *f,
> + struct write_commit_graph_context *ctx)
> {
> struct commit **list = ctx->commits.list;
> int count;
> for (count = 0; count < ctx->commits.nr; count++, list++) {
> display_progress(ctx->progress, ++ctx->progress_cnt);
> - hashwrite(f, (*list)->object.oid.hash, (int)hash_len);
> + hashwrite(f, (*list)->object.oid.hash, (int)the_hash_algo->rawsz);
Before the cast was forcing an int into an int (huh?), now it forces a
size_t into an int, but hashwrite() expects an unsigned int. Do we
really need that cast?
> }
> +
> + return 0;
> }
>
> static const unsigned char *commit_to_sha1(size_t index, void *table)
> @@ -926,8 +930,8 @@ static const unsigned char *commit_to_sha1(size_t index, void *table)
> return commits[index]->object.oid.hash;
> }
>
> -static void write_graph_chunk_data(struct hashfile *f, int hash_len,
> - struct write_commit_graph_context *ctx)
> +static int write_graph_chunk_data(struct hashfile *f,
> + struct write_commit_graph_context *ctx)
> {
> struct commit **list = ctx->commits.list;
> struct commit **last = ctx->commits.list + ctx->commits.nr;
> @@ -944,7 +948,7 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
> die(_("unable to parse commit %s"),
> oid_to_hex(&(*list)->object.oid));
> tree = get_commit_tree_oid(*list);
> - hashwrite(f, tree->hash, hash_len);
> + hashwrite(f, tree->hash, the_hash_algo->rawsz);
... and here's the answer: No, we don't need to cast.
René
@@ -882,10 +882,11 @@ struct write_commit_graph_context { | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, René Scharfe wrote (reply to this):
Am 15.06.20 um 22:14 schrieb SZEDER Gábor via GitGitGadget:
> From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <[email protected]>
>
> In write_commit_graph_file() we now have one block of code filling the
> array of 'struct chunk_info' with the IDs and sizes of chunks to be
> written, and an other block of code calling the functions responsible
> for writing individual chunks. In case of optional chunks like Extra
> Edge List an Base Graphs List there is also a condition checking
> whether that chunk is necessary/desired, and that same condition is
> repeated in both blocks of code. Other, newer chunks have similar
> optional conditions.
>
> Eliminate these repeated conditions by storing the function pointers
> responsible for writing individual chunks in the 'struct chunk_info'
> array as well, and calling them in a loop to write the commit-graph
> file. This will open up the possibility for a bit of foolproofing in
> the following patch.
OK. An alternative would be a switch in the loop that calls the right
function based on the chunk id. That would not require uniform
interfaces for all write functions; patch 2 would not be necessary.
>
> Signed-off-by: SZEDER Gábor <[email protected]>
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
> commit-graph.c | 31 +++++++++++++++++++------------
> 1 file changed, 19 insertions(+), 12 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 3bae1e52ed0..78e023be664 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1532,9 +1532,13 @@ static int write_graph_chunk_base(struct hashfile *f,
> return 0;
> }
>
> +typedef int (*chunk_write_fn)(struct hashfile *f,
> + struct write_commit_graph_context *ctx);
> +
> struct chunk_info {
> uint32_t id;
> uint64_t size;
> + chunk_write_fn write_fn;
> };
>
> static int write_commit_graph_file(struct write_commit_graph_context *ctx)
> @@ -1591,27 +1595,34 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>
> chunks[0].id = GRAPH_CHUNKID_OIDFANOUT;
> chunks[0].size = GRAPH_FANOUT_SIZE;
> + chunks[0].write_fn = write_graph_chunk_fanout;
> chunks[1].id = GRAPH_CHUNKID_OIDLOOKUP;
> chunks[1].size = hashsz * ctx->commits.nr;
> + chunks[1].write_fn = write_graph_chunk_oids;
> chunks[2].id = GRAPH_CHUNKID_DATA;
> chunks[2].size = (hashsz + 16) * ctx->commits.nr;
> + chunks[2].write_fn = write_graph_chunk_data;
> if (ctx->num_extra_edges) {
> chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
> chunks[num_chunks].size = 4 * ctx->num_extra_edges;
> + chunks[num_chunks].write_fn = write_graph_chunk_extra_edges;
> num_chunks++;
> }
> if (ctx->changed_paths) {
> chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMINDEXES;
> chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
> + chunks[num_chunks].write_fn = write_graph_chunk_bloom_indexes;
> num_chunks++;
> chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMDATA;
> chunks[num_chunks].size = sizeof(uint32_t) * 3
> + ctx->total_bloom_filter_data_size;
> + chunks[num_chunks].write_fn = write_graph_chunk_bloom_data;
> num_chunks++;
> }
> if (ctx->num_commit_graphs_after > 1) {
> chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
> chunks[num_chunks].size = hashsz * (ctx->num_commit_graphs_after - 1);
> + chunks[num_chunks].write_fn = write_graph_chunk_base;
> num_chunks++;
> }
>
> @@ -1647,19 +1658,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
> progress_title.buf,
> num_chunks * ctx->commits.nr);
> }
> - write_graph_chunk_fanout(f, ctx);
> - write_graph_chunk_oids(f, ctx);
> - write_graph_chunk_data(f, ctx);
> - if (ctx->num_extra_edges)
> - write_graph_chunk_extra_edges(f, ctx);
> - if (ctx->changed_paths) {
> - write_graph_chunk_bloom_indexes(f, ctx);
> - write_graph_chunk_bloom_data(f, ctx);
> - }
> - if (ctx->num_commit_graphs_after > 1 &&
> - write_graph_chunk_base(f, ctx)) {
> - return -1;
> +
> + for (i = 0; i < num_chunks; i++) {
> + if (chunks[i].write_fn(f, ctx)) {
> + error(_("failed writing chunk with id %"PRIx32""),
> + chunks[i].id);
This error message is new and not mentioned in the commit message.
write_graph_chunk_base() seems to be the only write function that can
return something else than 0, and it already reports an error in that
case. So do we really want the one here as well?
René
@@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, René Scharfe wrote (reply to this):
Am 15.06.20 um 22:14 schrieb SZEDER Gábor via GitGitGadget:
> From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <[email protected]>
>
> The file 'dir/subdir/file' can only be modified if its leading
> directories 'dir' and 'dir/subdir' are modified as well.
>
> So when checking modified path Bloom filters looking for commits
> modifying a path with multiple path components, then check not only
> the full path in the Bloom filters, but all its leading directories as
> well. Take care to check these paths in "deepest first" order,
> because it's the full path that is least likely to be modified, and
> the Bloom filter queries can short circuit sooner.
>
> This can significantly reduce the average false positive rate, by
> about an order of magnitude or three(!), and can further speed up
> pathspec-limited revision walks. The table below compares the average
> false positive rate and runtime of
>
> git rev-list HEAD -- "$path"
>
> before and after this change for 5000+ randomly* selected paths from
> each repository:
>
> Average false Average Average
> positive rate runtime runtime
> before after before after difference
> ------------------------------------------------------------------
> git 3.220% 0.7853% 0.0558s 0.0387s -30.6%
> linux 2.453% 0.0296% 0.1046s 0.0766s -26.8%
> tensorflow 2.536% 0.6977% 0.0594s 0.0420s -29.2%
Nice!
>
> *Path selection was done with the following pipeline:
>
> git ls-tree -r --name-only HEAD | sort -R | head -n 5000
>
> The improvements in runtime are much smaller than the improvements in
> average false positive rate, as we are clearly reaching diminishing
> returns here. However, all these timings depend on that accessing
> tree objects is reasonably fast (warm caches). If we had a partial
> clone and the tree objects had to be fetched from a promisor remote,
> e.g.:
>
> $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
> $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
> commit-graph write --reachable
> $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
> $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
> rev-list HEAD -- "$path"
>
> then checking all leading path component can reduce the runtime from
> over an hour to a few seconds (and this is with the clone and the
> promisor on the same machine).
>
> This adjusts the tracing values in t4216-log-bloom.sh, which provides a
> concrete way to notice the improvement.
>
> Signed-off-by: SZEDER Gábor <[email protected]>
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
> revision.c | 35 ++++++++++++++++++++++++++---------
> revision.h | 6 ++++--
> t/t4216-log-bloom.sh | 2 +-
> 3 files changed, 31 insertions(+), 12 deletions(-)
>
> diff --git a/revision.c b/revision.c
> index c644c660917..027ae3982b4 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
> {
> struct pathspec_item *pi;
> char *path_alloc = NULL;
> - const char *path;
> + const char *path, *p;
> int last_index;
> - int len;
> + size_t len;
> + int path_component_nr = 0, j;
>
> if (!revs->commits)
> return;
> @@ -705,8 +706,22 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>
> len = strlen(path);
>
> - revs->bloom_key = xmalloc(sizeof(struct bloom_key));
> - fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
> + p = path;
> + do {
> + p = strchrnul(p + 1, '/');
> + path_component_nr++;
> + } while (p - path < len);
Hmm, that "+ 1" makes me a bit nervous. Can we be sure that path is not
an empty string?
And shouldn't we use is_dir_sep() or find_last_dir_sep() instead of
hard-coding a slash?
> +
> + revs->bloom_keys_nr = path_component_nr;
> + ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
> +
> + p = path;
> + for (j = 0; j < revs->bloom_keys_nr; j++) {
> + p = strchrnul(p + 1, '/');
Same here, of course.
Also note that this puts shorter sub-strings first.
> +
> + fill_bloom_key(path, p - path, &revs->bloom_keys[j],
> + revs->bloom_filter_settings);
> + }
>
> if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
> atexit(trace2_bloom_filter_statistics_atexit);
> @@ -720,7 +735,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
> struct commit *commit)
> {
> struct bloom_filter *filter;
> - int result;
> + int result = 1, j;
>
> if (!revs->repo->objects->commit_graph)
> return -1;
> @@ -740,9 +755,11 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
> return -1;
> }
>
> - result = bloom_filter_contains(filter,
> - revs->bloom_key,
> - revs->bloom_filter_settings);
> + for (j = 0; result && j < revs->bloom_keys_nr; j++) {
> + result = bloom_filter_contains(filter,
> + &revs->bloom_keys[j],
> + revs->bloom_filter_settings);
> + }
This checks shorter sub-strings first, contradicting the "deepest first"
strategy mentioned in the commit message.
This can easily be fixed by inverting the traversal of one of the loops,
of course. Or perhaps do something like this?
revs->bloom_keys = NULL;
revs->bloom_keys_nr = 0;
strbuf_add(&path, pi->match, pi->len);
strbuf_trim_trailing_dir_sep(&path);
for (;;) {
const char *sep;
ALLOC_GROW(revs->bloom_keys, revs->bloom_keys_nr + 1, alloc);
fill_bloom_key(path.buf, path.len,
&revs->bloom_keys[revs->bloom_keys_nr++],
revs->bloom_filter_settings);
sep = find_last_dir_sep(path.buf);
if (!sep)
break;
strbuf_setlen(&path, sep - path.buf);
}
strbuf_release(&path);
The find_last_dir_sep() calls scan the first part of the string over and
over, which is a bit silly. A strbuf_trim_trailing_path_component()
could start at the end and scan backwards if that turns out to be an
actual problem.
ALLOC_GROW wastes memory on revs->bloom_keys, and reallocating instead
of allocating the right size from the start has a cost as well, but I'd
expect this to be dwarfed by the actual revision walk.
>
> if (result)
> count_bloom_filter_maybe++;
> @@ -782,7 +799,7 @@ static int rev_compare_tree(struct rev_info *revs,
> return REV_TREE_SAME;
> }
>
> - if (revs->bloom_key && !nth_parent) {
> + if (revs->bloom_keys_nr && !nth_parent) {
> bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
>
> if (bloom_ret == 0)
> diff --git a/revision.h b/revision.h
> index 7c026fe41fc..abbfb4ab59a 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -295,8 +295,10 @@ struct rev_info {
> struct topo_walk_info *topo_walk_info;
>
> /* Commit graph bloom filter fields */
> - /* The bloom filter key for the pathspec */
> - struct bloom_key *bloom_key;
> + /* The bloom filter key(s) for the pathspec */
> + struct bloom_key *bloom_keys;
> + int bloom_keys_nr;
> +
> /*
> * The bloom filter settings used to generate the key.
> * This is loaded from the commit-graph being used.
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> index c7011f33e2c..c13b97d3bda 100755
> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -142,7 +142,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
>
> test_bloom_filters_used_when_some_filters_are_missing () {
> log_args=$1
> - bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":8,\"definitely_not\":6"
> + bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":8"
> setup "$log_args" &&
> grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
> test_cmp log_wo_bloom log_w_bloom
>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, René Scharfe wrote (reply to this):
Am 18.06.20 um 22:31 schrieb René Scharfe:
> Am 15.06.20 um 22:14 schrieb SZEDER Gábor via GitGitGadget:
>> --- a/revision.c
>> +++ b/revision.c
>> @@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>> {
>> struct pathspec_item *pi;
>> char *path_alloc = NULL;
>> - const char *path;
>> + const char *path, *p;
>> int last_index;
>> - int len;
>> + size_t len;
>> + int path_component_nr = 0, j;
>>
>> if (!revs->commits)
>> return;
>> @@ -705,8 +706,22 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>>
>> len = strlen(path);
>>
>> - revs->bloom_key = xmalloc(sizeof(struct bloom_key));
>> - fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
>> + p = path;
>> + do {
>> + p = strchrnul(p + 1, '/');
>> + path_component_nr++;
>> + } while (p - path < len);
> And shouldn't we use is_dir_sep() or find_last_dir_sep() instead of
> hard-coding a slash?
Not necessarily. Paths should be normalized to use one specific
separator, probably slash, both when building and querying the Bloom
filter. Otherwise a filter that knows e.g. "foo/bar" could confidently
claim that "foo\bar" does not match. If this is done in a previous
step then using a literal slash here would be correct.
René
This patch series was integrated into pu via git@6c75447. |
This patch series was integrated into pu via git@518b032. |
@@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Taylor Blau wrote (reply to this):
Hi Stolee,
On Mon, Jun 15, 2020 at 08:14:50PM +0000, SZEDER Gábor via GitGitGadget wrote:
> From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <[email protected]>
>
> The file 'dir/subdir/file' can only be modified if its leading
> directories 'dir' and 'dir/subdir' are modified as well.
>
> So when checking modified path Bloom filters looking for commits
> modifying a path with multiple path components, then check not only
> the full path in the Bloom filters, but all its leading directories as
> well. Take care to check these paths in "deepest first" order,
> because it's the full path that is least likely to be modified, and
> the Bloom filter queries can short circuit sooner.
>
> This can significantly reduce the average false positive rate, by
> about an order of magnitude or three(!), and can further speed up
> pathspec-limited revision walks. The table below compares the average
> false positive rate and runtime of
>
> git rev-list HEAD -- "$path"
>
> before and after this change for 5000+ randomly* selected paths from
> each repository:
>
> Average false Average Average
> positive rate runtime runtime
> before after before after difference
> ------------------------------------------------------------------
> git 3.220% 0.7853% 0.0558s 0.0387s -30.6%
> linux 2.453% 0.0296% 0.1046s 0.0766s -26.8%
> tensorflow 2.536% 0.6977% 0.0594s 0.0420s -29.2%
>
> *Path selection was done with the following pipeline:
>
> git ls-tree -r --name-only HEAD | sort -R | head -n 5000
>
> The improvements in runtime are much smaller than the improvements in
> average false positive rate, as we are clearly reaching diminishing
> returns here. However, all these timings depend on that accessing
> tree objects is reasonably fast (warm caches). If we had a partial
> clone and the tree objects had to be fetched from a promisor remote,
> e.g.:
>
> $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
> $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
> commit-graph write --reachable
> $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
> $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
> rev-list HEAD -- "$path"
>
> then checking all leading path component can reduce the runtime from
> over an hour to a few seconds (and this is with the clone and the
> promisor on the same machine).
>
> This adjusts the tracing values in t4216-log-bloom.sh, which provides a
> concrete way to notice the improvement.
>
> Signed-off-by: SZEDER Gábor <[email protected]>
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
> revision.c | 35 ++++++++++++++++++++++++++---------
> revision.h | 6 ++++--
> t/t4216-log-bloom.sh | 2 +-
> 3 files changed, 31 insertions(+), 12 deletions(-)
>
> diff --git a/revision.c b/revision.c
> index c644c660917..027ae3982b4 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
> {
> struct pathspec_item *pi;
> char *path_alloc = NULL;
> - const char *path;
> + const char *path, *p;
> int last_index;
> - int len;
> + size_t len;
> + int path_component_nr = 0, j;
>
> if (!revs->commits)
> return;
> @@ -705,8 +706,22 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>
> len = strlen(path);
>
> - revs->bloom_key = xmalloc(sizeof(struct bloom_key));
> - fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
> + p = path;
> + do {
> + p = strchrnul(p + 1, '/');
> + path_component_nr++;
> + } while (p - path < len);
> +
> + revs->bloom_keys_nr = path_component_nr;
> + ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
> +
> + p = path;
> + for (j = 0; j < revs->bloom_keys_nr; j++) {
> + p = strchrnul(p + 1, '/');
> +
> + fill_bloom_key(path, p - path, &revs->bloom_keys[j],
> + revs->bloom_filter_settings);
> + }
>
Somewhat related to our off-list discussion yesterday, there is a bug in
both 2.27 and this patch which produces incorrect results when (1)
Bloom filters are enabled, and (2) we are doing a revision walk from
root with the pathspec '.'.
What appears to be going on is that our normalization takes '.' -> '',
and then we form a Bloom key based on the empty string, which will
return 'definitely not' when querying the Bloom filter some of the time,
which should never happen. This is a consequence of never inserting the
empty key into the Bloom filter upon generation.
As a result, I have patched this in GitHub's fork (which is currently
based on 2.27 and doesn't have these patches yet) by doing an early
return when 'strlen(path) == 0'. Since it looks like these patches are
going to land, here is some clean-up and a fix for the bug that you
should feel free to test with and apply on top:
--- >8 ---
diff --git a/revision.c b/revision.c
index 8bd383b1dd..123e72698d 100644
--- a/revision.c
+++ b/revision.c
@@ -670,10 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
{
struct pathspec_item *pi;
char *path_alloc = NULL;
- const char *path, *p;
+ char *path, *p;
int last_index;
size_t len;
- int path_component_nr = 0, j;
+ int path_component_nr = 1, j;
if (!revs->commits)
return;
@@ -698,29 +698,33 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
/* remove single trailing slash from path, if needed */
if (pi->match[last_index] == '/') {
- path_alloc = xstrdup(pi->match);
- path_alloc[last_index] = '\0';
- path = path_alloc;
- } else
- path = pi->match;
+ path_alloc = xstrdup(pi->match);
+ path_alloc[last_index] = '\0';
+ path = path_alloc;
+ } else {
+ path = pi->match;
+ len = pi->len;
+ }
- len = strlen(path);
+ if (!len)
+ return;
- p = path;
do {
- p = strchrnul(p + 1, '/');
- path_component_nr++;
- } while (p - path < len);
+ if (is_dir_sep(*p)) {
+ *p = '\0';
+ path_component_nr++;
+ }
+ } while (*p++);
revs->bloom_keys_nr = path_component_nr;
ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
p = path;
for (j = 0; j < revs->bloom_keys_nr; j++) {
- p = strchrnul(p + 1, '/');
-
- fill_bloom_key(path, p - path, &revs->bloom_keys[j],
+ size_t plen = strlen(p);
+ fill_bloom_key(p, plen, &revs->bloom_keys[j],
revs->bloom_filter_settings);
+ p += plen;
}
if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
> if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
> atexit(trace2_bloom_filter_statistics_atexit);
> @@ -720,7 +735,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
> struct commit *commit)
> {
> struct bloom_filter *filter;
> - int result;
> + int result = 1, j;
>
> if (!revs->repo->objects->commit_graph)
> return -1;
> @@ -740,9 +755,11 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
> return -1;
> }
>
> - result = bloom_filter_contains(filter,
> - revs->bloom_key,
> - revs->bloom_filter_settings);
> + for (j = 0; result && j < revs->bloom_keys_nr; j++) {
> + result = bloom_filter_contains(filter,
> + &revs->bloom_keys[j],
> + revs->bloom_filter_settings);
> + }
>
> if (result)
> count_bloom_filter_maybe++;
> @@ -782,7 +799,7 @@ static int rev_compare_tree(struct rev_info *revs,
> return REV_TREE_SAME;
> }
>
> - if (revs->bloom_key && !nth_parent) {
> + if (revs->bloom_keys_nr && !nth_parent) {
> bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
>
> if (bloom_ret == 0)
> diff --git a/revision.h b/revision.h
> index 7c026fe41fc..abbfb4ab59a 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -295,8 +295,10 @@ struct rev_info {
> struct topo_walk_info *topo_walk_info;
>
> /* Commit graph bloom filter fields */
> - /* The bloom filter key for the pathspec */
> - struct bloom_key *bloom_key;
> + /* The bloom filter key(s) for the pathspec */
> + struct bloom_key *bloom_keys;
> + int bloom_keys_nr;
> +
> /*
> * The bloom filter settings used to generate the key.
> * This is loaded from the commit-graph being used.
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> index c7011f33e2c..c13b97d3bda 100755
> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -142,7 +142,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
>
> test_bloom_filters_used_when_some_filters_are_missing () {
> log_args=$1
> - bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":8,\"definitely_not\":6"
> + bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":8"
> setup "$log_args" &&
> grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
> test_cmp log_wo_bloom log_w_bloom
> --
> gitgitgadget
>
Thanks,
Taylor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Taylor Blau wrote (reply to this):
On Fri, Jun 19, 2020 at 11:17:17AM -0600, Taylor Blau wrote:
> Hi Stolee,
>
> On Mon, Jun 15, 2020 at 08:14:50PM +0000, SZEDER Gábor via GitGitGadget wrote:
> > From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <[email protected]>
> >
> > The file 'dir/subdir/file' can only be modified if its leading
> > directories 'dir' and 'dir/subdir' are modified as well.
> >
> > So when checking modified path Bloom filters looking for commits
> > modifying a path with multiple path components, then check not only
> > the full path in the Bloom filters, but all its leading directories as
> > well. Take care to check these paths in "deepest first" order,
> > because it's the full path that is least likely to be modified, and
> > the Bloom filter queries can short circuit sooner.
> >
> > This can significantly reduce the average false positive rate, by
> > about an order of magnitude or three(!), and can further speed up
> > pathspec-limited revision walks. The table below compares the average
> > false positive rate and runtime of
> >
> > git rev-list HEAD -- "$path"
> >
> > before and after this change for 5000+ randomly* selected paths from
> > each repository:
> >
> > Average false Average Average
> > positive rate runtime runtime
> > before after before after difference
> > ------------------------------------------------------------------
> > git 3.220% 0.7853% 0.0558s 0.0387s -30.6%
> > linux 2.453% 0.0296% 0.1046s 0.0766s -26.8%
> > tensorflow 2.536% 0.6977% 0.0594s 0.0420s -29.2%
> >
> > *Path selection was done with the following pipeline:
> >
> > git ls-tree -r --name-only HEAD | sort -R | head -n 5000
> >
> > The improvements in runtime are much smaller than the improvements in
> > average false positive rate, as we are clearly reaching diminishing
> > returns here. However, all these timings depend on that accessing
> > tree objects is reasonably fast (warm caches). If we had a partial
> > clone and the tree objects had to be fetched from a promisor remote,
> > e.g.:
> >
> > $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
> > $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
> > commit-graph write --reachable
> > $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
> > $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
> > rev-list HEAD -- "$path"
> >
> > then checking all leading path component can reduce the runtime from
> > over an hour to a few seconds (and this is with the clone and the
> > promisor on the same machine).
> >
> > This adjusts the tracing values in t4216-log-bloom.sh, which provides a
> > concrete way to notice the improvement.
> >
> > Signed-off-by: SZEDER Gábor <[email protected]>
> > Signed-off-by: Derrick Stolee <[email protected]>
> > ---
> > revision.c | 35 ++++++++++++++++++++++++++---------
> > revision.h | 6 ++++--
> > t/t4216-log-bloom.sh | 2 +-
> > 3 files changed, 31 insertions(+), 12 deletions(-)
> >
> > diff --git a/revision.c b/revision.c
> > index c644c660917..027ae3982b4 100644
> > --- a/revision.c
> > +++ b/revision.c
> > @@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
> > {
> > struct pathspec_item *pi;
> > char *path_alloc = NULL;
> > - const char *path;
> > + const char *path, *p;
> > int last_index;
> > - int len;
> > + size_t len;
> > + int path_component_nr = 0, j;
> >
> > if (!revs->commits)
> > return;
> > @@ -705,8 +706,22 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
> >
> > len = strlen(path);
> >
> > - revs->bloom_key = xmalloc(sizeof(struct bloom_key));
> > - fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
> > + p = path;
> > + do {
> > + p = strchrnul(p + 1, '/');
> > + path_component_nr++;
> > + } while (p - path < len);
> > +
> > + revs->bloom_keys_nr = path_component_nr;
> > + ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
> > +
> > + p = path;
> > + for (j = 0; j < revs->bloom_keys_nr; j++) {
> > + p = strchrnul(p + 1, '/');
> > +
> > + fill_bloom_key(path, p - path, &revs->bloom_keys[j],
> > + revs->bloom_filter_settings);
> > + }
> >
>
> Somewhat related to our off-list discussion yesterday, there is a bug in
> both 2.27 and this patch which produces incorrect results when (1)
> Bloom filters are enabled, and (2) we are doing a revision walk from
> root with the pathspec '.'.
>
> What appears to be going on is that our normalization takes '.' -> '',
> and then we form a Bloom key based on the empty string, which will
> return 'definitely not' when querying the Bloom filter some of the time,
> which should never happen. This is a consequence of never inserting the
> empty key into the Bloom filter upon generation.
>
> As a result, I have patched this in GitHub's fork (which is currently
> based on 2.27 and doesn't have these patches yet) by doing an early
> return when 'strlen(path) == 0'. Since it looks like these patches are
> going to land, here is some clean-up and a fix for the bug that you
> should feel free to test with and apply on top:
>
> --- >8 ---
>
> diff --git a/revision.c b/revision.c
> index 8bd383b1dd..123e72698d 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -670,10 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
> {
> struct pathspec_item *pi;
> char *path_alloc = NULL;
> - const char *path, *p;
> + char *path, *p;
> int last_index;
> size_t len;
> - int path_component_nr = 0, j;
> + int path_component_nr = 1, j;
>
> if (!revs->commits)
> return;
> @@ -698,29 +698,33 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>
> /* remove single trailing slash from path, if needed */
> if (pi->match[last_index] == '/') {
> - path_alloc = xstrdup(pi->match);
> - path_alloc[last_index] = '\0';
> - path = path_alloc;
> - } else
> - path = pi->match;
> + path_alloc = xstrdup(pi->match);
> + path_alloc[last_index] = '\0';
> + path = path_alloc;
> + } else {
> + path = pi->match;
> + len = pi->len;
> + }
>
> - len = strlen(path);
> + if (!len)
> + return;
I should note that _this_ is the critical fix, and it should fix the bug
if you only applied just this hunk.
Everything else is purely style clean-ups on top (ranging from the four
spaces used instead of a tab, to some string processing niceties that I
_think_ should address Rene's concern, although I'm not sure if an
actual bug is lurking there or not...)
> - p = path;
> do {
> - p = strchrnul(p + 1, '/');
> - path_component_nr++;
> - } while (p - path < len);
> + if (is_dir_sep(*p)) {
> + *p = '\0';
> + path_component_nr++;
> + }
> + } while (*p++);
>
> revs->bloom_keys_nr = path_component_nr;
> ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
>
> p = path;
> for (j = 0; j < revs->bloom_keys_nr; j++) {
> - p = strchrnul(p + 1, '/');
> -
> - fill_bloom_key(path, p - path, &revs->bloom_keys[j],
> + size_t plen = strlen(p);
> + fill_bloom_key(p, plen, &revs->bloom_keys[j],
> revs->bloom_filter_settings);
> + p += plen;
> }
>
> if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
>
> > if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
> > atexit(trace2_bloom_filter_statistics_atexit);
> > @@ -720,7 +735,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
> > struct commit *commit)
> > {
> > struct bloom_filter *filter;
> > - int result;
> > + int result = 1, j;
> >
> > if (!revs->repo->objects->commit_graph)
> > return -1;
> > @@ -740,9 +755,11 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
> > return -1;
> > }
> >
> > - result = bloom_filter_contains(filter,
> > - revs->bloom_key,
> > - revs->bloom_filter_settings);
> > + for (j = 0; result && j < revs->bloom_keys_nr; j++) {
> > + result = bloom_filter_contains(filter,
> > + &revs->bloom_keys[j],
> > + revs->bloom_filter_settings);
> > + }
> >
> > if (result)
> > count_bloom_filter_maybe++;
> > @@ -782,7 +799,7 @@ static int rev_compare_tree(struct rev_info *revs,
> > return REV_TREE_SAME;
> > }
> >
> > - if (revs->bloom_key && !nth_parent) {
> > + if (revs->bloom_keys_nr && !nth_parent) {
> > bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
> >
> > if (bloom_ret == 0)
> > diff --git a/revision.h b/revision.h
> > index 7c026fe41fc..abbfb4ab59a 100644
> > --- a/revision.h
> > +++ b/revision.h
> > @@ -295,8 +295,10 @@ struct rev_info {
> > struct topo_walk_info *topo_walk_info;
> >
> > /* Commit graph bloom filter fields */
> > - /* The bloom filter key for the pathspec */
> > - struct bloom_key *bloom_key;
> > + /* The bloom filter key(s) for the pathspec */
> > + struct bloom_key *bloom_keys;
> > + int bloom_keys_nr;
> > +
> > /*
> > * The bloom filter settings used to generate the key.
> > * This is loaded from the commit-graph being used.
> > diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> > index c7011f33e2c..c13b97d3bda 100755
> > --- a/t/t4216-log-bloom.sh
> > +++ b/t/t4216-log-bloom.sh
> > @@ -142,7 +142,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
> >
> > test_bloom_filters_used_when_some_filters_are_missing () {
> > log_args=$1
> > - bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":8,\"definitely_not\":6"
> > + bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":8"
> > setup "$log_args" &&
> > grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
> > test_cmp log_wo_bloom log_w_bloom
> > --
> > gitgitgadget
> >
> Thanks,
> Taylor
Thanks,
Taylor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Derrick Stolee wrote (reply to this):
On 6/19/2020 1:17 PM, Taylor Blau wrote:
> Hi Stolee,
>
> On Mon, Jun 15, 2020 at 08:14:50PM +0000, SZEDER Gábor via GitGitGadget wrote:
>> From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <[email protected]>
>>
>> The file 'dir/subdir/file' can only be modified if its leading
>> directories 'dir' and 'dir/subdir' are modified as well.
>>
>> So when checking modified path Bloom filters looking for commits
>> modifying a path with multiple path components, then check not only
>> the full path in the Bloom filters, but all its leading directories as
>> well. Take care to check these paths in "deepest first" order,
>> because it's the full path that is least likely to be modified, and
>> the Bloom filter queries can short circuit sooner.
>>
>> This can significantly reduce the average false positive rate, by
>> about an order of magnitude or three(!), and can further speed up
>> pathspec-limited revision walks. The table below compares the average
>> false positive rate and runtime of
>>
>> git rev-list HEAD -- "$path"
>>
>> before and after this change for 5000+ randomly* selected paths from
>> each repository:
>>
>> Average false Average Average
>> positive rate runtime runtime
>> before after before after difference
>> ------------------------------------------------------------------
>> git 3.220% 0.7853% 0.0558s 0.0387s -30.6%
>> linux 2.453% 0.0296% 0.1046s 0.0766s -26.8%
>> tensorflow 2.536% 0.6977% 0.0594s 0.0420s -29.2%
>>
>> *Path selection was done with the following pipeline:
>>
>> git ls-tree -r --name-only HEAD | sort -R | head -n 5000
>>
>> The improvements in runtime are much smaller than the improvements in
>> average false positive rate, as we are clearly reaching diminishing
>> returns here. However, all these timings depend on that accessing
>> tree objects is reasonably fast (warm caches). If we had a partial
>> clone and the tree objects had to be fetched from a promisor remote,
>> e.g.:
>>
>> $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
>> $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
>> commit-graph write --reachable
>> $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
>> $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
>> rev-list HEAD -- "$path"
>>
>> then checking all leading path component can reduce the runtime from
>> over an hour to a few seconds (and this is with the clone and the
>> promisor on the same machine).
>>
>> This adjusts the tracing values in t4216-log-bloom.sh, which provides a
>> concrete way to notice the improvement.
>>
>> Signed-off-by: SZEDER Gábor <[email protected]>
>> Signed-off-by: Derrick Stolee <[email protected]>
>> ---
>> revision.c | 35 ++++++++++++++++++++++++++---------
>> revision.h | 6 ++++--
>> t/t4216-log-bloom.sh | 2 +-
>> 3 files changed, 31 insertions(+), 12 deletions(-)
>>
>> diff --git a/revision.c b/revision.c
>> index c644c660917..027ae3982b4 100644
>> --- a/revision.c
>> +++ b/revision.c
>> @@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>> {
>> struct pathspec_item *pi;
>> char *path_alloc = NULL;
>> - const char *path;
>> + const char *path, *p;
>> int last_index;
>> - int len;
>> + size_t len;
>> + int path_component_nr = 0, j;
>>
>> if (!revs->commits)
>> return;
>> @@ -705,8 +706,22 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>>
>> len = strlen(path);
>>
>> - revs->bloom_key = xmalloc(sizeof(struct bloom_key));
>> - fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
>> + p = path;
>> + do {
>> + p = strchrnul(p + 1, '/');
>> + path_component_nr++;
>> + } while (p - path < len);
>> +
>> + revs->bloom_keys_nr = path_component_nr;
>> + ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
>> +
>> + p = path;
>> + for (j = 0; j < revs->bloom_keys_nr; j++) {
>> + p = strchrnul(p + 1, '/');
>> +
>> + fill_bloom_key(path, p - path, &revs->bloom_keys[j],
>> + revs->bloom_filter_settings);
>> + }
>>
>
> Somewhat related to our off-list discussion yesterday, there is a bug in
> both 2.27 and this patch which produces incorrect results when (1)
> Bloom filters are enabled, and (2) we are doing a revision walk from
> root with the pathspec '.'.
>
> What appears to be going on is that our normalization takes '.' -> '',
> and then we form a Bloom key based on the empty string, which will
> return 'definitely not' when querying the Bloom filter some of the time,
> which should never happen. This is a consequence of never inserting the
> empty key into the Bloom filter upon generation.
>
> As a result, I have patched this in GitHub's fork (which is currently
> based on 2.27 and doesn't have these patches yet) by doing an early
> return when 'strlen(path) == 0'. Since it looks like these patches are
> going to land, here is some clean-up and a fix for the bug that you
> should feel free to test with and apply on top:
>
> --- >8 ---
>
> diff --git a/revision.c b/revision.c
> index 8bd383b1dd..123e72698d 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -670,10 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
> {
> struct pathspec_item *pi;
> char *path_alloc = NULL;
> - const char *path, *p;
> + char *path, *p;
> int last_index;
> size_t len;
> - int path_component_nr = 0, j;
> + int path_component_nr = 1, j;
>
> if (!revs->commits)
> return;
> @@ -698,29 +698,33 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>
> /* remove single trailing slash from path, if needed */
> if (pi->match[last_index] == '/') {
> - path_alloc = xstrdup(pi->match);
> - path_alloc[last_index] = '\0';
> - path = path_alloc;
> - } else
> - path = pi->match;
> + path_alloc = xstrdup(pi->match);
> + path_alloc[last_index] = '\0';
> + path = path_alloc;
> + } else {
> + path = pi->match;
> + len = pi->len;
> + }
>
> - len = strlen(path);
> + if (!len)
> + return;
>
> - p = path;
> do {
> - p = strchrnul(p + 1, '/');
> - path_component_nr++;
> - } while (p - path < len);
> + if (is_dir_sep(*p)) {
> + *p = '\0';
> + path_component_nr++;
> + }
> + } while (*p++);
>
> revs->bloom_keys_nr = path_component_nr;
> ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
>
> p = path;
> for (j = 0; j < revs->bloom_keys_nr; j++) {
> - p = strchrnul(p + 1, '/');
> -
> - fill_bloom_key(path, p - path, &revs->bloom_keys[j],
> + size_t plen = strlen(p);
> + fill_bloom_key(p, plen, &revs->bloom_keys[j],
> revs->bloom_filter_settings);
> + p += plen;
I don't think this is correct at all. Looking at it, it seems
that it would take a path "A/B/C" and add keys for "A", "B", and
"C" instead of "A", "A/B", and "A/B/C".
Looking more closely, there are a few issues that makes it clear
why you didn't see a failing test:
1. You use "while (*p++)" instead of "while (*++p)" so the scan
terminates after the first directory split. (So only "A" is
added, which won't fail, but will be slower than intended.)
Changing that, we see the next problem:
2. You use "p += plen" instead of "p += plen + 1". This causes
the filters to add "A", "", and "" (because we don't skip the
terminating character). This _is_ incorrect and would result
in test failures.
Changing that, we then see "A", "B", and "C" are added as keys.
I'm going to take the style issues that you presented, and
change them in one commit (reported-by you) and then the
if (!len) fix in a separate patch.
I'll update the scan loops to use is_dir_sep() accordingly
in this patch.
Thanks,
-Stolee
This patch series was integrated into pu via git@97ed3c6. |
This branch is now known as |
This patch series was integrated into pu via git@fc5ffbe. |
Place an instance of struct bloom_settings into the struct write_commit_graph_context. This allows simplifying the function prototype of write_graph_chunk_bloom_data(). This will allow us to combine the function prototypes and use function pointers to simplify write_commit_graph_file(). By using a pointer, we can later replace the settings to match those that exist in the current commit-graph, in case a future Git version allows customization of these parameters. Reported-by: SZEDER Gábor <[email protected]> Signed-off-by: Derrick Stolee <[email protected]>
43d3561 (commit-graph write: don't die if the existing graph is corrupt, 2019-03-25) introduced the GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD environment variable. This was created to verify that commit-graph was not loaded when writing a new non-incremental commit-graph. An upcoming change wants to load a commit-graph in some valuable cases, but we want to maintain that we don't trust the commit-graph data when writing our new file. Instead of dying on load, instead die if we ever try to parse a commit from the commit-graph. This functionally verifies the same intended behavior, but allows a more advanced feature in the next change. Signed-off-by: Derrick Stolee <[email protected]>
485e6bf
to
1022c0a
Compare
/submit |
Submitted as [email protected] |
This patch series was integrated into seen via git@3af73ec. |
This patch series was integrated into seen via git@0812eda. |
This patch series was integrated into seen via git@1b5d3d8. |
This patch series was integrated into seen via git@bd95731. |
This patch series was integrated into seen via git@277bc3c. |
This patch series was integrated into seen via git@ee1d34d. |
This patch series was integrated into seen via git@2188883. |
This patch series was integrated into seen via git@bf511ce. |
This patch series was integrated into seen via git@57cb778. |
This patch series was integrated into seen via git@179c51a. |
This patch series was integrated into seen via git@4ff1624. |
This patch series was integrated into seen via git@8157d78. |
This patch series was integrated into seen via git@522b5b8. |
This patch series was integrated into seen via git@9bd7fbc. |
This patch series was integrated into seen via git@bd9b7d8. |
This patch series was integrated into seen via git@ceb72d1. |
This patch series was integrated into seen via git@a652433. |
This patch series was integrated into seen via git@bb7e490. |
This patch series was integrated into seen via git@097b424. |
This patch series was integrated into seen via git@d4c116d. |
This patch series was integrated into seen via git@2cf3d20. |
This patch series was integrated into seen via git@70cdbbe. |
This patch series was integrated into next via git@70cdbbe. |
This patch series was integrated into master via git@70cdbbe. |
Closed via 70cdbbe. |
@@ -60,7 +60,10 @@ existing commit-graph file. | |||
With the `--changed-paths` option, compute and write information about the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, SZEDER Gábor wrote (reply to this):
On Wed, Jul 01, 2020 at 01:27:24PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <[email protected]>
>
> The changed-path Bloom filters were released in v2.27.0, but have a
> significant drawback. A user can opt-in to writing the changed-path
> filters using the "--changed-paths" option to "git commit-graph write"
> but the next write will drop the filters unless that option is
> specified.
>
> This becomes even more important when considering the interaction with
> gc.writeCommitGraph (on by default) or fetch.writeCommitGraph (part of
> features.experimental). These config options trigger commit-graph writes
> that the user did not signal, and hence there is no --changed-paths
> option available.
>
> Allow a user that opts-in to the changed-path filters to persist the
> property of "my commit-graph has changed-path filters" automatically. A
> user can drop filters using the --no-changed-paths option.
The above parts of the commit message and the corresponding changes
are OK, but ...
> In the process, we need to be extremely careful to match the Bloom
> filter settings as specified by the commit-graph. This will allow future
> versions of Git to customize these settings, and the version with this
> change will persist those settings as commit-graphs are rewritten on
> top.
As pointed out in my original bug report [1], modified path Bloom
filters are computed with hardcoded settings in
bloom.c:get_bloom_filter(). Since this patch does not touch bloom.c
at all, it still computes Bloom filters with those hardcoded settings,
and, consequently, despite the commit message's claims, it does not
persist the settings in the existing commit-graph.
[1] https://public-inbox.org/git/[email protected]/
> Use the trace2 API to signal the settings used during the write, and
> check that output in a test after manually adjusting the correct bytes
> in the commit-graph file.
This test is insufficient, as it only checks what settings trace2
believes the Bloom filters are computed with, not what settings they
are actually computed with; that's why it succeeded while the bug
whose absence it was supposed to ensure was still there.
More robust tests should instead look at what actually gets written to
the commit-graph, and how that is interpreted during pathspec-limited
revision walks.
> Signed-off-by: Derrick Stolee <[email protected]>
A "Reported-by: me" trailer would have been appropriate here.
> ---
> Documentation/git-commit-graph.txt | 5 +++-
> builtin/commit-graph.c | 5 +++-
> commit-graph.c | 45 ++++++++++++++++++++++++++++--
> commit-graph.h | 1 +
> t/t4216-log-bloom.sh | 17 ++++++++++-
> 5 files changed, 67 insertions(+), 6 deletions(-)
Anyway, this is now partially fixed in 9a7a9ed10d (bloom: use provided
'struct bloom_filter_settings', 2020-09-16), though, unfortunately,
its commit message is not quite clear on this. Alas, that's only a
partial fix, because we still only look at the top level commit-graph
file for existing Bloom filter settings. However, deeper commit-graph
layers can contain Bloom filters with non-default settings even when
the top level doesn't, and these failing tests below demonstrate:
--- >8 ---
#!/bin/sh
test_description='test'
. ./test-lib.sh
test_expect_success 'setup' '
git commit --allow-empty -m "Bloom filters are written but ignored for root commits :(" &&
for i in 1 2 3
do
echo $i >file &&
git add file &&
git commit -m "$i" || return 1
done &&
git log --oneline --no-decorate -- file >expect
'
test_expect_success 'split' '
# Compute Bloom filters with "unusual" settings.
git rev-parse HEAD^^ | GIT_TEST_BLOOM_SETTINGS_NUM_HASHES=3 git commit-graph write --stdin-commits --changed-paths --split &&
# A commit-graph layer without Bloom filters "hides" the layers
# below ...
git rev-parse HEAD^ | git commit-graph write --stdin-commits --no-changed-paths --split=no-merge &&
# ... so this does not look at existing Bloom filters and their
# settings in the bottom commit-graph layer and computes new
# Bloom filters using the default 7 hashes.
git rev-parse HEAD | git commit-graph write --stdin-commits --changed-paths --split=no-merge &&
# Just to make sure that there are as many graph layers as I
# think there should be.
test_line_count = 3 .git/objects/info/commit-graphs/commit-graph-chain &&
# This checks Bloom filters using settings in the top layer,
# thus misses commits modifying file in the bottom commit-graph
# layer.
git log --oneline --no-decorate -- file >actual &&
test_cmp expect actual
'
test_expect_success 'merged' '
# This merges all existing layers, and computes missing Bloom
# filters with the settings in the top layer, without noticing
# that filters in the bottom layer were computed with different
# settings.
git commit-graph write --reachable --changed-paths &&
# Just to make sure...
test_path_is_file .git/objects/info/commit-graph &&
# This misses commits modifying file that were merged from the
# bottom commit-graph layer.
git log --oneline --no-decorate -- file >actual &&
test_cmp expect actual
'
test_done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Taylor Blau wrote (reply to this):
On Thu, Oct 15, 2020 at 03:21:47PM +0200, SZEDER Gábor wrote:
> As pointed out in my original bug report [1], modified path Bloom
> filters are computed with hardcoded settings in
> bloom.c:get_bloom_filter(). Since this patch does not touch bloom.c
> at all, it still computes Bloom filters with those hardcoded settings,
> and, consequently, despite the commit message's claims, it does not
> persist the settings in the existing commit-graph.
>
> [1] https://public-inbox.org/git/[email protected]/
Right. It is worth mentioning here (as you do below) that this was fixed
as of 9a7a9ed10d (bloom: use provided 'struct bloom_filter_settings',
2020-09-16).
> > Use the trace2 API to signal the settings used during the write, and
> > check that output in a test after manually adjusting the correct bytes
> > in the commit-graph file.
>
> This test is insufficient, as it only checks what settings trace2
> believes the Bloom filters are computed with, not what settings they
> are actually computed with; that's why it succeeded while the bug
> whose absence it was supposed to ensure was still there.
>
> More robust tests should instead look at what actually gets written to
> the commit-graph, and how that is interpreted during pathspec-limited
> revision walks.
Thanks for the test! That saved me a little bit of work trying to set up
the scenario you're describing in a reproducible way. I think that the
third test can be fixed relatively easily. The crux of the issue there
is that we are computing Bloom filters for commits, some of which
already had Bloom filters computed in an earlier layer, but with
different settings than the one that we're using to write the current
layer.
So, we need to be more aggressively checking the Bloom filter settings
in any layer we want to reuse a Bloom filter out of before reusing it
verbatim in the current layer. The patch below the scissors line is
sufficient to do that, and it causes the third test to start passing.
...But, teaching the revision machinery how to handle a walk through
commits in multiple commit-graph layers with incompatible Bloom filter
settings is trickier. Right now we compute all of the Bloom keys up
front using the Bloom filter settings in the top layer. I think we'd
probably want to change this to lazily compute those keys by:
- Keeping around the contents of the Bloom keys, i.e., the pathspecs
themselves.
- Keeping a hashmap from Bloom filter settings to computed Bloom keys
corresponding to those settings.
- Lazily filling in that hashmap as we traverse through different
commits.
At least, that's the best way that I can think to do it. It does get
kind of slow, though; we'd have to scan every commit graph layer at each
commit in the worst case to find the actual 'struct commit_graph *' in
order to get its Bloom filter settings. So, I think that's sort of
show-stoppingly slow, and that we should probably think more deeply
about it before taking up that direction.
Maybe Stolee has some better thoughts about how we can quickly go from a
commit to either (a) the commit-graph struct that that commit is stored
in, or (b) the Bloom filter settings belonging to that struct.
Thanks,
Taylor
--- >8 ---
Subject: [PATCH] bloom: recompute filters with incompatible settings
---
bloom.c | 21 +++++++++++++++++++--
commit-graph.c | 4 ++--
2 files changed, 21 insertions(+), 4 deletions(-)
diff --git a/bloom.c b/bloom.c
index 68c73200a5..30da128474 100644
--- a/bloom.c
+++ b/bloom.c
@@ -30,7 +30,8 @@ static inline unsigned char get_bitmask(uint32_t pos)
static int load_bloom_filter_from_graph(struct commit_graph *g,
struct bloom_filter *filter,
- struct commit *c)
+ struct commit *c,
+ const struct bloom_filter_settings *settings)
{
uint32_t lex_pos, start_index, end_index;
uint32_t graph_pos = commit_graph_position(c);
@@ -42,6 +43,21 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
if (!g->chunk_bloom_indexes)
return 0;
+ if (settings) {
+ struct bloom_filter_settings *graph_settings = g->bloom_filter_settings;
+ /*
+ * Check that the settings used to generate new Bloom filters is
+ * compatible with the settings Bloom filters in this
+ * commit-graph layer were generated with.
+ */
+ if (settings->hash_version != graph_settings->hash_version)
+ return 0;
+ if (settings->num_hashes != graph_settings->num_hashes)
+ return 0;
+ if (settings->bits_per_entry != graph_settings->bits_per_entry)
+ return 0;
+ }
+
lex_pos = graph_pos - g->num_commits_in_base;
end_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos);
@@ -205,7 +221,8 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
if (!filter->data) {
load_commit_graph_info(r, c);
if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH)
- load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
+ load_bloom_filter_from_graph(r->objects->commit_graph,
+ filter, c, settings);
}
if (filter->data && filter->len)
diff --git a/commit-graph.c b/commit-graph.c
index cb042bdba8..afe14af2a3 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1188,7 +1188,7 @@ static int write_graph_chunk_bloom_indexes(struct hashfile *f,
uint32_t cur_pos = 0;
while (list < last) {
- struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
+ struct bloom_filter *filter = get_or_compute_bloom_filter(ctx->r, *list, 0, ctx->bloom_settings, NULL);
size_t len = filter ? filter->len : 0;
cur_pos += len;
display_progress(ctx->progress, ++ctx->progress_cnt);
@@ -1228,7 +1228,7 @@ static int write_graph_chunk_bloom_data(struct hashfile *f,
hashwrite_be32(f, ctx->bloom_settings->bits_per_entry);
while (list < last) {
- struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
+ struct bloom_filter *filter = get_or_compute_bloom_filter(ctx->r, *list, 0, ctx->bloom_settings, NULL);
size_t len = filter ? filter->len : 0;
display_progress(ctx->progress, ++ctx->progress_cnt);
--
2.29.0.rc1.dirty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Derrick Stolee wrote (reply to this):
On 10/15/2020 5:41 PM, Taylor Blau wrote:
> So, we need to be more aggressively checking the Bloom filter settings
> in any layer we want to reuse a Bloom filter out of before reusing it
> verbatim in the current layer. The patch below the scissors line is
> sufficient to do that, and it causes the third test to start passing.
I think there are three things we should keep in mind:
1. Incompatible Bloom filter settings between layers should be seen
as _inconsistent data_ as Git should not be writing incremental
commit-graph files with inconsistent Bloom filter settings. Thus,
when reading the commit-graph chain we should prevent incompatible
filters from being used. One way to do this is to notice different
settings and completely disable Bloom filters. The other way would
be to take the settings from the first layer with filters and then
clear the chunk_bloom_indexes and chunk_bloom_data fields for the
layers that don't agree. This fits with an expectation that lower
layers are larger, so more filters can be used in that situation.
2. We should be sure that Git will not agree to write incompatible
settings between layers of a commit-graph chain. Even more, it
should definitely not re-use filters when merging layers with
incompatible filter values. The strategy above to ignore Bloom
filters in incompatible upper layers is enough of a guard against
the "merge layers" situation.
3. Allowing users (or future Git developers) to adjust the default
Bloom filter settings is one that is good to do for future-proofing,
but not one that I expect to be widely used (any gains here are
minuscule compared to the results already achieved with the defaults).
On top of that, including incompatible settings across layers is even
less likely to be a real use case. We should not be straining to make
the code even worse for this imaginary scenario.
With that said...
> ...But, teaching the revision machinery how to handle a walk through
> commits in multiple commit-graph layers with incompatible Bloom filter
> settings is trickier. Right now we compute all of the Bloom keys up
> front using the Bloom filter settings in the top layer. I think we'd
> probably want to change this to lazily compute those keys by:
It would probably be easiest to compute an array of bloom_key structs
(per path) where the index corresponds to the depth of the commit-graph
layer. You can then access the correct key after you have identified
which layer the commit is from.
> At least, that's the best way that I can think to do it. It does get
> kind of slow, though; we'd have to scan every commit graph layer at each
> commit in the worst case to find the actual 'struct commit_graph *' in
> order to get its Bloom filter settings. So, I think that's sort of
> show-stoppingly slow, and that we should probably think more deeply
> about it before taking up that direction.
>
> Maybe Stolee has some better thoughts about how we can quickly go from a
> commit to either (a) the commit-graph struct that that commit is stored
> in, or (b) the Bloom filter settings belonging to that struct.
We already have code that finds a commit in a commit-graph layer
based on its integer position by iterating until num_commits_in_base
is below our position. The lexicographic position within that layer
is found by subtracting num_commits_in_base.
For us, we would simply need:
int get_commit_graph_layer(struct commit_graph *g, uint32_t pos)
{
uint32_t layer_index = 0;
while (g && pos < g->num_commits_in_base) {
g = g->base_graph;
layer_index++;
}
return layer_index;
}
You could then use the response from get_commit_graph_layer()
to load the correct Bloom key.
Again, I strongly suggest _not_ working hard to support this
case. We should only put in proper safeguards to prevent data
like this being written and protect a Git process that might
stumble across data in this shape.
Thanks,
-Stolee
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Taylor Blau wrote (reply to this):
On Thu, Oct 15, 2020 at 10:18:47PM -0400, Derrick Stolee wrote:
> On 10/15/2020 5:41 PM, Taylor Blau wrote:
> > So, we need to be more aggressively checking the Bloom filter settings
> > in any layer we want to reuse a Bloom filter out of before reusing it
> > verbatim in the current layer. The patch below the scissors line is
> > sufficient to do that, and it causes the third test to start passing.
>
> I think there are three things we should keep in mind:
>
> 1. Incompatible Bloom filter settings between layers should be seen
> as _inconsistent data_ as Git should not be writing incremental
> commit-graph files with inconsistent Bloom filter settings. Thus,
> when reading the commit-graph chain we should prevent incompatible
> filters from being used. One way to do this is to notice different
> settings and completely disable Bloom filters. The other way would
> be to take the settings from the first layer with filters and then
> clear the chunk_bloom_indexes and chunk_bloom_data fields for the
> layers that don't agree. This fits with an expectation that lower
> layers are larger, so more filters can be used in that situation.
Sure; I'd be fine with only allowing filters computed with the settings
present in the lowest or largest layer in the event that multiple layers
exist with incompatible settings.
I'm trying to point us towards a direction of not optimizing too far
along a direction that we're unlikely to take, while also trying to do
something relatively non-invasive to make it possible for a version of
Git to change the default Bloom settings. That is, if a user is writing
split commit-graphs, and we change the default Bloom settings, they
shouldn't have to recompute or merge down all of their Bloom filters.
If that's something that we never think is going to happen, I'm fine
with not thinking too hard about it. But, I also don't want to paint
ourselves into a corner, so I think something like the patch I wrote in
the email that you're replying to actually may be worth pursuing
further. I dunno. Definitely after 2.29, though.
> 2. We should be sure that Git will not agree to write incompatible
> settings between layers of a commit-graph chain. Even more, it
> should definitely not re-use filters when merging layers with
> incompatible filter values. The strategy above to ignore Bloom
> filters in incompatible upper layers is enough of a guard against
> the "merge layers" situation.
Yeah, this would be fine with me.
> 3. Allowing users (or future Git developers) to adjust the default
> Bloom filter settings is one that is good to do for future-proofing,
> but not one that I expect to be widely used (any gains here are
> minuscule compared to the results already achieved with the defaults).
> On top of that, including incompatible settings across layers is even
> less likely to be a real use case. We should not be straining to make
> the code even worse for this imaginary scenario.
Right, I think we're pretty much in agreement here. Doing something easy
to make sure that we don't run into a wall seems to make sense, but I
think modifying the revision walk machinery to keep track of multiple
Bloom keys computed with different settings corresponding to the set of
Bloom filter settings in commit-graph layers is probably too far in that
direction.
For what it's worth, I was mainly talking about it to say that it would
be more effort than it's probably worth to do. There's also nothing that
we're currently discussing that would prevent us from taking that same
direction up in six months from now.
Thanks,
Taylor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Derrick Stolee wrote (reply to this):
On 10/15/2020 11:18 PM, Taylor Blau wrote:
> On Thu, Oct 15, 2020 at 10:18:47PM -0400, Derrick Stolee wrote:
>> On 10/15/2020 5:41 PM, Taylor Blau wrote:
>>> So, we need to be more aggressively checking the Bloom filter settings
>>> in any layer we want to reuse a Bloom filter out of before reusing it
>>> verbatim in the current layer. The patch below the scissors line is
>>> sufficient to do that, and it causes the third test to start passing.
>>
>> I think there are three things we should keep in mind:
>>
>> 1. Incompatible Bloom filter settings between layers should be seen
>> as _inconsistent data_ as Git should not be writing incremental
>> commit-graph files with inconsistent Bloom filter settings. Thus,
>> when reading the commit-graph chain we should prevent incompatible
>> filters from being used. One way to do this is to notice different
>> settings and completely disable Bloom filters. The other way would
>> be to take the settings from the first layer with filters and then
>> clear the chunk_bloom_indexes and chunk_bloom_data fields for the
>> layers that don't agree. This fits with an expectation that lower
>> layers are larger, so more filters can be used in that situation.
>
> Sure; I'd be fine with only allowing filters computed with the settings
> present in the lowest or largest layer in the event that multiple layers
> exist with incompatible settings.
>
> I'm trying to point us towards a direction of not optimizing too far
> along a direction that we're unlikely to take, while also trying to do
> something relatively non-invasive to make it possible for a version of
> Git to change the default Bloom settings. That is, if a user is writing
> split commit-graphs, and we change the default Bloom settings, they
> shouldn't have to recompute or merge down all of their Bloom filters.
They would need to recompute when they merge layers, which introduces
a big question about how we should handle such a case.
> If that's something that we never think is going to happen, I'm fine
> with not thinking too hard about it. But, I also don't want to paint
> ourselves into a corner, so I think something like the patch I wrote in
> the email that you're replying to actually may be worth pursuing
> further. I dunno. Definitely after 2.29, though.
I think the proposed "react properly to this unlikely situation"
is a good way to prevent getting locked into our choices now. It
makes it possible for "old" Git versions (2.30 until we decide to
allow this mix) to interact with the inconsistent settings without
failure.
We don't need to do the 100% "optimal" case of using all filters
in order to enable this choice in the future.
[...]
> For what it's worth, I was mainly talking about it to say that it would
> be more effort than it's probably worth to do. There's also nothing that
> we're currently discussing that would prevent us from taking that same
> direction up in six months from now.
Yes, I just want to make sure that everyone agrees there is a
middle ground without saying that inconsistent filter settings
across layers is a "fully supported" feature. If someone wants
to tackle the work to make it a desirable state, then they can
try that (with great care).
Thanks,
-Stolee
User |
User |
User |
This builds on sg/commit-graph-cleanups, which took several patches from Szeder's series [1] and applied them almost directly to a more-recent version of Git [2].
[1] https://lore.kernel.org/git/[email protected]/
[2] https://lore.kernel.org/git/[email protected]/
This series adds a few extra improvements, several of which are rooted in Szeder's original series. I maintained his authorship and sign-off, even though the patches did not apply or cherry-pick at all.
(In v2, I have removed the range-diff comparison to Szeder's series, so look at the v1 cover letter for that.)
The patches have been significantly reordered. René pointed out (and Szeder discovered in the old thread) that we are not re-using the bloom_filter_settings from the existing commit-graph when writing a new one.
These are mostly the same, except we now use a pointer to the settings in the commit-graph write context.
This new patch is a subtle change in behavior that will become relevant in the very next patch. In fact, if we swap patch 3 and 4, then t4216-log-bloom.sh fails with a segfault due to a NULL filter.
This patch is now updated to use the existing changed-path filter settings.
These are all the same as before.
This patch is the cleanup part of Taylor's patch.
Here is Taylor's fix for empty pathspecs.
Finally, we get these performance patches. Patch 10 is updated to have the better logic around directory separators and empty paths. Also, the list of Bloom keys is ordered with the deepest path first. That has some tiny performance benefits for deep paths since we can short-circuit the multi-key checks more often. That code path is much faster than the tree parsing, so it is hard to measure any change.
Updates in V3:
Update in V4;
fixup!
I sent earlier doesn't actually squash cleanly, so I resolved the conflicts during the rebase.Thanks,
-Stolee
Cc: [email protected], [email protected], [email protected]
cc: SZEDER Gábor [email protected]
cc: Taylor Blau [email protected]
cc: Derrick Stolee [email protected]