Skip to content

More commit-graph/Bloom filter improvements #659

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

5 changes: 4 additions & 1 deletion Documentation/git-commit-graph.txt
Original file line number Diff line number Diff line change
Expand Up @@ -60,7 +60,10 @@ existing commit-graph file.
With the `--changed-paths` option, compute and write information about the
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Wed, Jul 01, 2020 at 01:27:24PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <[email protected]>
> 
> The changed-path Bloom filters were released in v2.27.0, but have a
> significant drawback. A user can opt-in to writing the changed-path
> filters using the "--changed-paths" option to "git commit-graph write"
> but the next write will drop the filters unless that option is
> specified.
> 
> This becomes even more important when considering the interaction with
> gc.writeCommitGraph (on by default) or fetch.writeCommitGraph (part of
> features.experimental). These config options trigger commit-graph writes
> that the user did not signal, and hence there is no --changed-paths
> option available.
> 
> Allow a user that opts-in to the changed-path filters to persist the
> property of "my commit-graph has changed-path filters" automatically. A
> user can drop filters using the --no-changed-paths option.

The above parts of the commit message and the corresponding changes
are OK, but ...

> In the process, we need to be extremely careful to match the Bloom
> filter settings as specified by the commit-graph. This will allow future
> versions of Git to customize these settings, and the version with this
> change will persist those settings as commit-graphs are rewritten on
> top.

As pointed out in my original bug report [1], modified path Bloom
filters are computed with hardcoded settings in
bloom.c:get_bloom_filter().  Since this patch does not touch bloom.c
at all, it still computes Bloom filters with those hardcoded settings,
and, consequently, despite the commit message's claims, it does not
persist the settings in the existing commit-graph.

[1] https://public-inbox.org/git/[email protected]/

> Use the trace2 API to signal the settings used during the write, and
> check that output in a test after manually adjusting the correct bytes
> in the commit-graph file.

This test is insufficient, as it only checks what settings trace2
believes the Bloom filters are computed with, not what settings they
are actually computed with; that's why it succeeded while the bug
whose absence it was supposed to ensure was still there.

More robust tests should instead look at what actually gets written to
the commit-graph, and how that is interpreted during pathspec-limited
revision walks.

> Signed-off-by: Derrick Stolee <[email protected]>

A "Reported-by: me" trailer would have been appropriate here.

> ---
>  Documentation/git-commit-graph.txt |  5 +++-
>  builtin/commit-graph.c             |  5 +++-
>  commit-graph.c                     | 45 ++++++++++++++++++++++++++++--
>  commit-graph.h                     |  1 +
>  t/t4216-log-bloom.sh               | 17 ++++++++++-
>  5 files changed, 67 insertions(+), 6 deletions(-)

Anyway, this is now partially fixed in 9a7a9ed10d (bloom: use provided
'struct bloom_filter_settings', 2020-09-16), though, unfortunately,
its commit message is not quite clear on this.  Alas, that's only a
partial fix, because we still only look at the top level commit-graph
file for existing Bloom filter settings.  However, deeper commit-graph
layers can contain Bloom filters with non-default settings even when
the top level doesn't, and these failing tests below demonstrate:

  ---  >8  ---

#!/bin/sh

test_description='test'

. ./test-lib.sh

test_expect_success 'setup' '
	git commit --allow-empty -m "Bloom filters are written but ignored for root commits :(" &&
	for i in 1 2 3
	do
		echo $i >file &&
		git add file &&
		git commit -m "$i" || return 1
	done &&
	git log --oneline --no-decorate -- file >expect
'

test_expect_success 'split' '
	# Compute Bloom filters with "unusual" settings.
	git rev-parse HEAD^^ | GIT_TEST_BLOOM_SETTINGS_NUM_HASHES=3 git commit-graph write --stdin-commits --changed-paths --split &&
	# A commit-graph layer without Bloom filters "hides" the layers
	# below ...
	git rev-parse HEAD^ | git commit-graph write --stdin-commits --no-changed-paths --split=no-merge &&
	# ... so this does not look at existing Bloom filters and their
	# settings in the bottom commit-graph layer and computes new
	# Bloom filters using the default 7 hashes.
	git rev-parse HEAD | git commit-graph write --stdin-commits --changed-paths --split=no-merge &&

	# Just to make sure that there are as many graph layers as I
	# think there should be.
	test_line_count = 3 .git/objects/info/commit-graphs/commit-graph-chain &&

	# This checks Bloom filters using settings in the top layer,
	# thus misses commits modifying file in the bottom commit-graph
	# layer.
	git log --oneline --no-decorate -- file >actual &&
	test_cmp expect actual
'

test_expect_success 'merged' '
	# This merges all existing layers, and computes missing Bloom
	# filters with the settings in the top layer, without noticing
	# that filters in the bottom layer were computed with different
	# settings.
	git commit-graph write --reachable --changed-paths &&

	# Just to make sure...
	test_path_is_file .git/objects/info/commit-graph &&

	# This misses commits modifying file that were merged from the
	# bottom commit-graph layer.
	git log --oneline --no-decorate -- file >actual &&
	test_cmp expect actual
'

test_done

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

On Thu, Oct 15, 2020 at 03:21:47PM +0200, SZEDER Gábor wrote:
> As pointed out in my original bug report [1], modified path Bloom
> filters are computed with hardcoded settings in
> bloom.c:get_bloom_filter().  Since this patch does not touch bloom.c
> at all, it still computes Bloom filters with those hardcoded settings,
> and, consequently, despite the commit message's claims, it does not
> persist the settings in the existing commit-graph.
>
> [1] https://public-inbox.org/git/[email protected]/

Right. It is worth mentioning here (as you do below) that this was fixed
as of 9a7a9ed10d (bloom: use provided 'struct bloom_filter_settings',
2020-09-16).

> > Use the trace2 API to signal the settings used during the write, and
> > check that output in a test after manually adjusting the correct bytes
> > in the commit-graph file.
>
> This test is insufficient, as it only checks what settings trace2
> believes the Bloom filters are computed with, not what settings they
> are actually computed with; that's why it succeeded while the bug
> whose absence it was supposed to ensure was still there.
>
> More robust tests should instead look at what actually gets written to
> the commit-graph, and how that is interpreted during pathspec-limited
> revision walks.

Thanks for the test! That saved me a little bit of work trying to set up
the scenario you're describing in a reproducible way. I think that the
third test can be fixed relatively easily. The crux of the issue there
is that we are computing Bloom filters for commits, some of which
already had Bloom filters computed in an earlier layer, but with
different settings than the one that we're using to write the current
layer.

So, we need to be more aggressively checking the Bloom filter settings
in any layer we want to reuse a Bloom filter out of before reusing it
verbatim in the current layer. The patch below the scissors line is
sufficient to do that, and it causes the third test to start passing.

...But, teaching the revision machinery how to handle a walk through
commits in multiple commit-graph layers with incompatible Bloom filter
settings is trickier. Right now we compute all of the Bloom keys up
front using the Bloom filter settings in the top layer. I think we'd
probably want to change this to lazily compute those keys by:

  - Keeping around the contents of the Bloom keys, i.e., the pathspecs
    themselves.

  - Keeping a hashmap from Bloom filter settings to computed Bloom keys
    corresponding to those settings.

  - Lazily filling in that hashmap as we traverse through different
    commits.

At least, that's the best way that I can think to do it. It does get
kind of slow, though; we'd have to scan every commit graph layer at each
commit in the worst case to find the actual 'struct commit_graph *' in
order to get its Bloom filter settings. So, I think that's sort of
show-stoppingly slow, and that we should probably think more deeply
about it before taking up that direction.

Maybe Stolee has some better thoughts about how we can quickly go from a
commit to either (a) the commit-graph struct that that commit is stored
in, or (b) the Bloom filter settings belonging to that struct.

Thanks,
Taylor

--- >8 ---

Subject: [PATCH] bloom: recompute filters with incompatible settings
---
 bloom.c        | 21 +++++++++++++++++++--
 commit-graph.c |  4 ++--
 2 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/bloom.c b/bloom.c
index 68c73200a5..30da128474 100644
--- a/bloom.c
+++ b/bloom.c
@@ -30,7 +30,8 @@ static inline unsigned char get_bitmask(uint32_t pos)

 static int load_bloom_filter_from_graph(struct commit_graph *g,
 					struct bloom_filter *filter,
-					struct commit *c)
+					struct commit *c,
+					const struct bloom_filter_settings *settings)
 {
 	uint32_t lex_pos, start_index, end_index;
 	uint32_t graph_pos = commit_graph_position(c);
@@ -42,6 +43,21 @@ static int load_bloom_filter_from_graph(struct commit_graph *g,
 	if (!g->chunk_bloom_indexes)
 		return 0;

+	if (settings) {
+		struct bloom_filter_settings *graph_settings = g->bloom_filter_settings;
+		/*
+		 * Check that the settings used to generate new Bloom filters is
+		 * compatible with the settings Bloom filters in this
+		 * commit-graph layer were generated with.
+		 */
+		if (settings->hash_version != graph_settings->hash_version)
+			return 0;
+		if (settings->num_hashes != graph_settings->num_hashes)
+			return 0;
+		if (settings->bits_per_entry != graph_settings->bits_per_entry)
+			return 0;
+	}
+
 	lex_pos = graph_pos - g->num_commits_in_base;

 	end_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos);
@@ -205,7 +221,8 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 	if (!filter->data) {
 		load_commit_graph_info(r, c);
 		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH)
-			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
+			load_bloom_filter_from_graph(r->objects->commit_graph,
+						     filter, c, settings);
 	}

 	if (filter->data && filter->len)
diff --git a/commit-graph.c b/commit-graph.c
index cb042bdba8..afe14af2a3 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1188,7 +1188,7 @@ static int write_graph_chunk_bloom_indexes(struct hashfile *f,
 	uint32_t cur_pos = 0;

 	while (list < last) {
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
+		struct bloom_filter *filter = get_or_compute_bloom_filter(ctx->r, *list, 0, ctx->bloom_settings, NULL);
 		size_t len = filter ? filter->len : 0;
 		cur_pos += len;
 		display_progress(ctx->progress, ++ctx->progress_cnt);
@@ -1228,7 +1228,7 @@ static int write_graph_chunk_bloom_data(struct hashfile *f,
 	hashwrite_be32(f, ctx->bloom_settings->bits_per_entry);

 	while (list < last) {
-		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
+		struct bloom_filter *filter = get_or_compute_bloom_filter(ctx->r, *list, 0, ctx->bloom_settings, NULL);
 		size_t len = filter ? filter->len : 0;

 		display_progress(ctx->progress, ++ctx->progress_cnt);
--
2.29.0.rc1.dirty

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 10/15/2020 5:41 PM, Taylor Blau wrote:
> So, we need to be more aggressively checking the Bloom filter settings
> in any layer we want to reuse a Bloom filter out of before reusing it
> verbatim in the current layer. The patch below the scissors line is
> sufficient to do that, and it causes the third test to start passing.

I think there are three things we should keep in mind:

1. Incompatible Bloom filter settings between layers should be seen
   as _inconsistent data_ as Git should not be writing incremental
   commit-graph files with inconsistent Bloom filter settings. Thus,
   when reading the commit-graph chain we should prevent incompatible
   filters from being used. One way to do this is to notice different
   settings and completely disable Bloom filters. The other way would
   be to take the settings from the first layer with filters and then
   clear the chunk_bloom_indexes and chunk_bloom_data fields for the
   layers that don't agree. This fits with an expectation that lower
   layers are larger, so more filters can be used in that situation.

2. We should be sure that Git will not agree to write incompatible
   settings between layers of a commit-graph chain. Even more, it
   should definitely not re-use filters when merging layers with
   incompatible filter values. The strategy above to ignore Bloom
   filters in incompatible upper layers is enough of a guard against
   the "merge layers" situation.

3. Allowing users (or future Git developers) to adjust the default
   Bloom filter settings is one that is good to do for future-proofing,
   but not one that I expect to be widely used (any gains here are
   minuscule compared to the results already achieved with the defaults).
   On top of that, including incompatible settings across layers is even
   less likely to be a real use case. We should not be straining to make
   the code even worse for this imaginary scenario.

With that said...
 
> ...But, teaching the revision machinery how to handle a walk through
> commits in multiple commit-graph layers with incompatible Bloom filter
> settings is trickier. Right now we compute all of the Bloom keys up
> front using the Bloom filter settings in the top layer. I think we'd
> probably want to change this to lazily compute those keys by:
It would probably be easiest to compute an array of bloom_key structs
(per path) where the index corresponds to the depth of the commit-graph
layer. You can then access the correct key after you have identified
which layer the commit is from.

> At least, that's the best way that I can think to do it. It does get
> kind of slow, though; we'd have to scan every commit graph layer at each
> commit in the worst case to find the actual 'struct commit_graph *' in
> order to get its Bloom filter settings. So, I think that's sort of
> show-stoppingly slow, and that we should probably think more deeply
> about it before taking up that direction.
> 
> Maybe Stolee has some better thoughts about how we can quickly go from a
> commit to either (a) the commit-graph struct that that commit is stored
> in, or (b) the Bloom filter settings belonging to that struct.

We already have code that finds a commit in a commit-graph layer
based on its integer position by iterating until num_commits_in_base
is below our position. The lexicographic position within that layer
is found by subtracting num_commits_in_base.

For us, we would simply need:

int get_commit_graph_layer(struct commit_graph *g, uint32_t pos)
{
	uint32_t layer_index = 0;

	while (g && pos < g->num_commits_in_base) {
		g = g->base_graph;
		layer_index++;
	}

	return layer_index;
}

You could then use the response from get_commit_graph_layer()
to load the correct Bloom key.

Again, I strongly suggest _not_ working hard to support this
case. We should only put in proper safeguards to prevent data
like this being written and protect a Git process that might
stumble across data in this shape.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

On Thu, Oct 15, 2020 at 10:18:47PM -0400, Derrick Stolee wrote:
> On 10/15/2020 5:41 PM, Taylor Blau wrote:
> > So, we need to be more aggressively checking the Bloom filter settings
> > in any layer we want to reuse a Bloom filter out of before reusing it
> > verbatim in the current layer. The patch below the scissors line is
> > sufficient to do that, and it causes the third test to start passing.
>
> I think there are three things we should keep in mind:
>
> 1. Incompatible Bloom filter settings between layers should be seen
>    as _inconsistent data_ as Git should not be writing incremental
>    commit-graph files with inconsistent Bloom filter settings. Thus,
>    when reading the commit-graph chain we should prevent incompatible
>    filters from being used. One way to do this is to notice different
>    settings and completely disable Bloom filters. The other way would
>    be to take the settings from the first layer with filters and then
>    clear the chunk_bloom_indexes and chunk_bloom_data fields for the
>    layers that don't agree. This fits with an expectation that lower
>    layers are larger, so more filters can be used in that situation.

Sure; I'd be fine with only allowing filters computed with the settings
present in the lowest or largest layer in the event that multiple layers
exist with incompatible settings.

I'm trying to point us towards a direction of not optimizing too far
along a direction that we're unlikely to take, while also trying to do
something relatively non-invasive to make it possible for a version of
Git to change the default Bloom settings. That is, if a user is writing
split commit-graphs, and we change the default Bloom settings, they
shouldn't have to recompute or merge down all of their Bloom filters.

If that's something that we never think is going to happen, I'm fine
with not thinking too hard about it. But, I also don't want to paint
ourselves into a corner, so I think something like the patch I wrote in
the email that you're replying to actually may be worth pursuing
further. I dunno. Definitely after 2.29, though.

> 2. We should be sure that Git will not agree to write incompatible
>    settings between layers of a commit-graph chain. Even more, it
>    should definitely not re-use filters when merging layers with
>    incompatible filter values. The strategy above to ignore Bloom
>    filters in incompatible upper layers is enough of a guard against
>    the "merge layers" situation.

Yeah, this would be fine with me.

> 3. Allowing users (or future Git developers) to adjust the default
>    Bloom filter settings is one that is good to do for future-proofing,
>    but not one that I expect to be widely used (any gains here are
>    minuscule compared to the results already achieved with the defaults).
>    On top of that, including incompatible settings across layers is even
>    less likely to be a real use case. We should not be straining to make
>    the code even worse for this imaginary scenario.

Right, I think we're pretty much in agreement here. Doing something easy
to make sure that we don't run into a wall seems to make sense, but I
think modifying the revision walk machinery to keep track of multiple
Bloom keys computed with different settings corresponding to the set of
Bloom filter settings in commit-graph layers is probably too far in that
direction.

For what it's worth, I was mainly talking about it to say that it would
be more effort than it's probably worth to do. There's also nothing that
we're currently discussing that would prevent us from taking that same
direction up in six months from now.

Thanks,
Taylor

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 10/15/2020 11:18 PM, Taylor Blau wrote:
> On Thu, Oct 15, 2020 at 10:18:47PM -0400, Derrick Stolee wrote:
>> On 10/15/2020 5:41 PM, Taylor Blau wrote:
>>> So, we need to be more aggressively checking the Bloom filter settings
>>> in any layer we want to reuse a Bloom filter out of before reusing it
>>> verbatim in the current layer. The patch below the scissors line is
>>> sufficient to do that, and it causes the third test to start passing.
>>
>> I think there are three things we should keep in mind:
>>
>> 1. Incompatible Bloom filter settings between layers should be seen
>>    as _inconsistent data_ as Git should not be writing incremental
>>    commit-graph files with inconsistent Bloom filter settings. Thus,
>>    when reading the commit-graph chain we should prevent incompatible
>>    filters from being used. One way to do this is to notice different
>>    settings and completely disable Bloom filters. The other way would
>>    be to take the settings from the first layer with filters and then
>>    clear the chunk_bloom_indexes and chunk_bloom_data fields for the
>>    layers that don't agree. This fits with an expectation that lower
>>    layers are larger, so more filters can be used in that situation.
> 
> Sure; I'd be fine with only allowing filters computed with the settings
> present in the lowest or largest layer in the event that multiple layers
> exist with incompatible settings.
> 
> I'm trying to point us towards a direction of not optimizing too far
> along a direction that we're unlikely to take, while also trying to do
> something relatively non-invasive to make it possible for a version of
> Git to change the default Bloom settings. That is, if a user is writing
> split commit-graphs, and we change the default Bloom settings, they
> shouldn't have to recompute or merge down all of their Bloom filters.

They would need to recompute when they merge layers, which introduces
a big question about how we should handle such a case.

> If that's something that we never think is going to happen, I'm fine
> with not thinking too hard about it. But, I also don't want to paint
> ourselves into a corner, so I think something like the patch I wrote in
> the email that you're replying to actually may be worth pursuing
> further. I dunno. Definitely after 2.29, though.

I think the proposed "react properly to this unlikely situation"
is a good way to prevent getting locked into our choices now. It
makes it possible for "old" Git versions (2.30 until we decide to
allow this mix) to interact with the inconsistent settings without
failure.

We don't need to do the 100% "optimal" case of using all filters
in order to enable this choice in the future.
 
[...]

> For what it's worth, I was mainly talking about it to say that it would
> be more effort than it's probably worth to do. There's also nothing that
> we're currently discussing that would prevent us from taking that same
> direction up in six months from now.

Yes, I just want to make sure that everyone agrees there is a
middle ground without saying that inconsistent filter settings
across layers is a "fully supported" feature. If someone wants
to tackle the work to make it a desirable state, then they can
try that (with great care).

Thanks,
-Stolee

paths changed between a commit and it's first parent. This operation can
take a while on large repositories. It provides significant performance gains
for getting history of a directory or a file with `git log -- <path>`.
for getting history of a directory or a file with `git log -- <path>`. If
this option is given, future commit-graph writes will automatically assume
that this option was intended. Use `--no-changed-paths` to stop storing this
data.
+
With the `--split` option, write the commit-graph as a chain of multiple
commit-graph files stored in `<dir>/info/commit-graphs`. The new commits
Expand Down
14 changes: 6 additions & 8 deletions bloom.c
Original file line number Diff line number Diff line change
Expand Up @@ -186,24 +186,22 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
struct diff_options diffopt;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, René Scharfe wrote (reply to this):

Am 23.06.20 um 19:47 schrieb Derrick Stolee via GitGitGadget:
> From: Derrick Stolee <[email protected]>
>
> The get_bloom_filter() method is a bit complicated in some parts where
> it does not need to be. In particular, it needs to return a NULL filter
> only when compute_if_not_present is zero AND the filter data cannot be
> loaded from a commit-graph file. This currently happens by accident
> because the commit-graph does not load changed-path Bloom filters from
> an existing commit-graph when writing a new one. This will change in a
> later patch.

So this is actually a logic fix, not just a cleanup as the subject says?

>
> Also clean up some style issues while we are here.
>
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
>  bloom.c | 15 +++++++--------
>  1 file changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/bloom.c b/bloom.c
> index c38d1cff0c..7291506564 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -186,7 +186,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  	struct diff_options diffopt;
>  	int max_changes = 512;
>
> -	if (bloom_filters.slab_size == 0)
> +	if (!bloom_filters.slab_size)
>  		return NULL;
>
>  	filter = bloom_filter_slab_at(&bloom_filters, c);
> @@ -194,16 +194,15 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  	if (!filter->data) {
>  		load_commit_graph_info(r, c);
>  		if (c->graph_pos != COMMIT_NOT_FROM_GRAPH &&
> -			r->objects->commit_graph->chunk_bloom_indexes) {
> -			if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
> -				return filter;
> -			else
> -				return NULL;

... and the fix is that this else branch should not be taken if
compute_if_not_present is set.

> -		}
> +		    r->objects->commit_graph->chunk_bloom_indexes &&
> +		    load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
> +			return filter;

You could even drop this return as well and have the check below handle the
successful load case.

>  	}
>
> -	if (filter->data || !compute_if_not_present)
> +	if (filter->data)
>  		return filter;
> +	if (!filter->data && !compute_if_not_present)
            ^^^^^^^^^^^^^
The first condition is always true, as the check two lines above makes sure.
Removing it would be cleaner IMHO.

> +		return NULL;
>
>  	repo_diff_setup(r, &diffopt);
>  	diffopt.flags.recursive = 1;
>

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Fri, Jun 26, 2020 at 12:30:29PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <[email protected]>
> 
> The get_bloom_filter() method is a bit complicated in some parts where
> it does not need to be. In particular, it needs to return a NULL filter
> only when compute_if_not_present is zero AND the filter data cannot be
> loaded from a commit-graph file. This currently happens by accident
> because the commit-graph does not load changed-path Bloom filters from
> an existing commit-graph when writing a new one. This will change in a
> later patch.
> 
> Also clean up some style issues while we are here.
> 
> Helped-by: René Scharfe <[email protected]>
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
>  bloom.c | 14 ++++++--------
>  1 file changed, 6 insertions(+), 8 deletions(-)
> 
> diff --git a/bloom.c b/bloom.c
> index c38d1cff0c..2af5389795 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -186,7 +186,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  	struct diff_options diffopt;
>  	int max_changes = 512;
>  
> -	if (bloom_filters.slab_size == 0)
> +	if (!bloom_filters.slab_size)
>  		return NULL;
>
>  	filter = bloom_filter_slab_at(&bloom_filters, c);
> @@ -194,16 +194,14 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  	if (!filter->data) {
>  		load_commit_graph_info(r, c);
>  		if (c->graph_pos != COMMIT_NOT_FROM_GRAPH &&
> -			r->objects->commit_graph->chunk_bloom_indexes) {
> -			if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
> -				return filter;
> -			else
> -				return NULL;
> -		}
> +		    r->objects->commit_graph->chunk_bloom_indexes)
> +			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
>  	}
>  
> -	if (filter->data || !compute_if_not_present)
> +	if (filter->data)
>  		return filter;
> +	if (!compute_if_not_present)
> +		return NULL;

Some callers of get_bloom_filter() invoke it with
compute_if_not_present=0, but are not prepared to handle a NULL return
value and dereference it right away:

  write_graph_chunk_bloom_indexes():

                struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
                cur_pos += filter->len;

  write_graph_chunk_bloom_data():

                struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
                display_progress(progress, ++i);
                hashwrite(f, filter->data, filter->len * sizeof(unsigned char));

I don't know whether this was an issue before, but I didn't really
tried.  Unfortunately, starting with this patch this causes
segmentation faults basically in all real repositories I use for
testing.

  expecting success of 9999.1 'test': 
          for i in $(test_seq 1 513)
          do
                  >file-$i || return 1
          done &&
          git add file-* &&
          git commit -q -m one &&
  
          git commit-graph write --reachable --changed-paths
  
  Segmentation fault
  not ok 1 - test


  Program received signal SIGSEGV, Segmentation fault.
  0x0000000000515848 in write_graph_chunk_bloom_indexes (f=0x9fe650, 
      ctx=0x9d2000) at commit-graph.c:1101
  1101                    cur_pos += filter->len;
  (gdb) print filter
  $1 = (struct bloom_filter *) 0x0



>  	repo_diff_setup(r, &diffopt);
>  	diffopt.flags.recursive = 1;
> -- 
> gitgitgadget
> 

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 6/27/2020 12:33 PM, SZEDER Gábor wrote:
> On Fri, Jun 26, 2020 at 12:30:29PM +0000, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <[email protected]>
>>
>> The get_bloom_filter() method is a bit complicated in some parts where
>> it does not need to be. In particular, it needs to return a NULL filter
>> only when compute_if_not_present is zero AND the filter data cannot be
>> loaded from a commit-graph file. This currently happens by accident
>> because the commit-graph does not load changed-path Bloom filters from
>> an existing commit-graph when writing a new one. This will change in a
>> later patch.
>>
>> Also clean up some style issues while we are here.
>>
>> Helped-by: René Scharfe <[email protected]>
>> Signed-off-by: Derrick Stolee <[email protected]>
>> ---
>>  bloom.c | 14 ++++++--------
>>  1 file changed, 6 insertions(+), 8 deletions(-)
>>
>> diff --git a/bloom.c b/bloom.c
>> index c38d1cff0c..2af5389795 100644
>> --- a/bloom.c
>> +++ b/bloom.c
>> @@ -186,7 +186,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>>  	struct diff_options diffopt;
>>  	int max_changes = 512;
>>  
>> -	if (bloom_filters.slab_size == 0)
>> +	if (!bloom_filters.slab_size)
>>  		return NULL;
>>
>>  	filter = bloom_filter_slab_at(&bloom_filters, c);
>> @@ -194,16 +194,14 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>>  	if (!filter->data) {
>>  		load_commit_graph_info(r, c);
>>  		if (c->graph_pos != COMMIT_NOT_FROM_GRAPH &&
>> -			r->objects->commit_graph->chunk_bloom_indexes) {
>> -			if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
>> -				return filter;
>> -			else
>> -				return NULL;
>> -		}
>> +		    r->objects->commit_graph->chunk_bloom_indexes)
>> +			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
>>  	}
>>  
>> -	if (filter->data || !compute_if_not_present)
>> +	if (filter->data)
>>  		return filter;
>> +	if (!compute_if_not_present)
>> +		return NULL;
> 
> Some callers of get_bloom_filter() invoke it with
> compute_if_not_present=0, but are not prepared to handle a NULL return
> value and dereference it right away:
> 
>   write_graph_chunk_bloom_indexes():
> 
>                 struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>                 cur_pos += filter->len;
> 
>   write_graph_chunk_bloom_data():
> 
>                 struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>                 display_progress(progress, ++i);
>                 hashwrite(f, filter->data, filter->len * sizeof(unsigned char));

In theory, these _should_ be safe, because we already computed
the filters in an earlier step, right? We should have generated
the filter and populated it in the slab.

> I don't know whether this was an issue before, but I didn't really
> tried.  Unfortunately, starting with this patch this causes
> segmentation faults basically in all real repositories I use for
> testing.
> 
>   expecting success of 9999.1 'test': 
>           for i in $(test_seq 1 513)
>           do
>                   >file-$i || return 1
>           done &&
>           git add file-* &&
>           git commit -q -m one &&
>   
>           git commit-graph write --reachable --changed-paths
>   
>   Segmentation fault
>   not ok 1 - test

However, you are demonstrating a failure that doesn't appear
in our test suite. I was able to reproduce it.

I can confirm that this patch causes a SIGSEGV when writing
the commit-graph in the Git repository, too.

So, what is wrong with my earlier assumption? There are
two problems.

The thing I notice is that an empty filter (no changes
with respect to the first parent) will have NULL
filter->data, so we are returning NULL instead of a
correctly-empty filter (with len zero).

But what you are hitting here is the max number of changes
limit. That also returns a NULL filter, because we mark
the filter as "TOO LARGE" to store. We store that as a
zero-length filter.

The following fixup corrects the bug and adds a test
similar to yours, but with extra care around ensuring the
revision walk still works appropriately for that large
commit.

In the next version, I will include more in the commit
message about these side-effect changes, especially around
the stats for zero-length filters. The trace2 message will
no longer differentiate between zero-length filters and
NULL filters.

Thanks,
-Stolee

-- >8 --

From f9867adc5de8a072f41b91fd6cd87edfcc92e05e Mon Sep 17 00:00:00 2001
From: Derrick Stolee <[email protected]>
Date: Mon, 29 Jun 2020 08:52:33 -0400
Subject: [PATCH] fixup! bloom: fix logic in get_bloom_filter()

Signed-off-by: Derrick Stolee <[email protected]>
---
 commit-graph.c       |  8 ++++++--
 revision.c           |  7 -------
 t/t4216-log-bloom.sh | 24 ++++++++++++++++++++++--
 3 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index a0766a86f5..6752916c1a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1108,7 +1108,8 @@ static int write_graph_chunk_bloom_indexes(struct hashfile *f,
 
 	while (list < last) {
 		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
-		cur_pos += filter->len;
+		size_t len = filter ? filter->len : 0;
+		cur_pos += len;
 		display_progress(progress, ++i);
 		hashwrite_be32(f, cur_pos);
 		list++;
@@ -1154,8 +1155,11 @@ static int write_graph_chunk_bloom_data(struct hashfile *f,
 
 	while (list < last) {
 		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
+		size_t len = filter ? filter->len : 0;
 		display_progress(progress, ++i);
-		hashwrite(f, filter->data, filter->len * sizeof(unsigned char));
+
+		if (len)
+			hashwrite(f, filter->data, len * sizeof(unsigned char));
 		list++;
 	}
 
diff --git a/revision.c b/revision.c
index b40bc5b51b..b9118001f9 100644
--- a/revision.c
+++ b/revision.c
@@ -633,7 +633,6 @@ static unsigned int count_bloom_filter_maybe;
 static unsigned int count_bloom_filter_definitely_not;
 static unsigned int count_bloom_filter_false_positive;
 static unsigned int count_bloom_filter_not_present;
-static unsigned int count_bloom_filter_length_zero;
 
 static void trace2_bloom_filter_statistics_atexit(void)
 {
@@ -641,7 +640,6 @@ static void trace2_bloom_filter_statistics_atexit(void)
 
 	jw_object_begin(&jw, 0);
 	jw_object_intmax(&jw, "filter_not_present", count_bloom_filter_not_present);
-	jw_object_intmax(&jw, "zero_length_filter", count_bloom_filter_length_zero);
 	jw_object_intmax(&jw, "maybe", count_bloom_filter_maybe);
 	jw_object_intmax(&jw, "definitely_not", count_bloom_filter_definitely_not);
 	jw_object_intmax(&jw, "false_positive", count_bloom_filter_false_positive);
@@ -765,11 +763,6 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 		return -1;
 	}
 
-	if (!filter->len) {
-		count_bloom_filter_length_zero++;
-		return -1;
-	}
-
 	for (j = 0; result && j < revs->bloom_keys_nr; j++) {
 		result = bloom_filter_contains(filter,
 					       &revs->bloom_keys[j],
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index d7dd717347..4892364e74 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -60,7 +60,7 @@ setup () {
 
 test_bloom_filters_used () {
 	log_args=$1
-	bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"zero_length_filter\":0,\"maybe\""
+	bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"maybe\""
 	setup "$log_args" &&
 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
 	test_cmp log_wo_bloom log_w_bloom &&
@@ -146,7 +146,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
 
 test_bloom_filters_used_when_some_filters_are_missing () {
 	log_args=$1
-	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":8"
+	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"maybe\":6,\"definitely_not\":8"
 	setup "$log_args" &&
 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
 	test_cmp log_wo_bloom log_w_bloom
@@ -171,4 +171,24 @@ test_expect_success 'persist filter settings' '
 	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2-auto.txt
 '
 
+test_expect_success 'correctly report changes over limit' '
+	git init 513changes &&
+	(
+		cd 513changes &&
+		for i in $(test_seq 1 513)
+		do
+			echo $i >file$i.txt || return 1
+		done &&
+		git add . &&
+		git commit -m "files" &&
+		git commit-graph write --reachable --changed-paths &&
+		for i in $(test_seq 1 513)
+		do
+			git -c core.commitGraph=false log -- file$i.txt >expect &&
+			git log -- file$i.txt >actual &&
+			test_cmp expect actual || return 1
+		done
+	)
+'
+
 test_done
\ No newline at end of file
-- 
2.27.0.203.gf402ea6816

int max_changes = 512;

if (bloom_filters.slab_size == 0)
if (!bloom_filters.slab_size)
return NULL;

filter = bloom_filter_slab_at(&bloom_filters, c);

if (!filter->data) {
load_commit_graph_info(r, c);
if (c->graph_pos != COMMIT_NOT_FROM_GRAPH &&
r->objects->commit_graph->chunk_bloom_indexes) {
if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
return filter;
else
return NULL;
}
r->objects->commit_graph->chunk_bloom_indexes)
load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
}

if (filter->data || !compute_if_not_present)
if (filter->data)
return filter;
if (!compute_if_not_present)
return NULL;

repo_diff_setup(r, &diffopt);
diffopt.flags.recursive = 1;
Expand Down
5 changes: 4 additions & 1 deletion builtin/commit-graph.c
Original file line number Diff line number Diff line change
Expand Up @@ -151,6 +151,7 @@ static int graph_write(int argc, const char **argv)
};

opts.progress = isatty(2);
opts.enable_changed_paths = -1;
split_opts.size_multiple = 2;
split_opts.max_commits = 0;
split_opts.expire_time = 0;
Expand All @@ -171,7 +172,9 @@ static int graph_write(int argc, const char **argv)
flags |= COMMIT_GRAPH_WRITE_SPLIT;
if (opts.progress)
flags |= COMMIT_GRAPH_WRITE_PROGRESS;
if (opts.enable_changed_paths ||
if (!opts.enable_changed_paths)
flags |= COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS;
if (opts.enable_changed_paths == 1 ||
git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0))
flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;

Expand Down
146 changes: 108 additions & 38 deletions commit-graph.c
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,8 @@
#include "progress.h"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, René Scharfe wrote (reply to this):

Am 23.06.20 um 19:47 schrieb SZEDER Gábor via GitGitGadget:
> From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <[email protected]>
>
> Update the write_graph_chunk_*() helper functions to have the same
> signature:
>
>   - Return an int error code from all these functions.
>     write_graph_chunk_base() already has an int error code, now the
>     others will have one, too, but since they don't indicate any
>     error, they will always return 0.
>
>   - Drop the hash size parameter of write_graph_chunk_oids() and
>     write_graph_chunk_data(); its value can be read directly from
>     'the_hash_algo' inside these functions as well.
>
> This opens up the possibility for further cleanups and foolproofing in
> the following two patches.
>
> Signed-off-by: SZEDER Gábor <[email protected]>
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
>  commit-graph.c | 42 ++++++++++++++++++++++++++----------------
>  1 file changed, 26 insertions(+), 16 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 908f094271..f33bfe49b3 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -891,8 +891,8 @@ struct write_commit_graph_context {
>  	const struct bloom_filter_settings *bloom_settings;
>  };
>
> -static void write_graph_chunk_fanout(struct hashfile *f,
> -				     struct write_commit_graph_context *ctx)
> +static int write_graph_chunk_fanout(struct hashfile *f,
> +				    struct write_commit_graph_context *ctx)
>  {
>  	int i, count = 0;
>  	struct commit **list = ctx->commits.list;
> @@ -913,17 +913,21 @@ static void write_graph_chunk_fanout(struct hashfile *f,
>
>  		hashwrite_be32(f, count);
>  	}
> +
> +	return 0;
>  }
>
> -static void write_graph_chunk_oids(struct hashfile *f, int hash_len,
> -				   struct write_commit_graph_context *ctx)
> +static int write_graph_chunk_oids(struct hashfile *f,
> +				  struct write_commit_graph_context *ctx)
>  {
>  	struct commit **list = ctx->commits.list;
>  	int count;
>  	for (count = 0; count < ctx->commits.nr; count++, list++) {
>  		display_progress(ctx->progress, ++ctx->progress_cnt);
> -		hashwrite(f, (*list)->object.oid.hash, (int)hash_len);
> +		hashwrite(f, (*list)->object.oid.hash, (int)the_hash_algo->rawsz);
                                                       ^^^^^
This cast is confusing...

>  	}
> +
> +	return 0;
>  }
>
>  static const unsigned char *commit_to_sha1(size_t index, void *table)
> @@ -932,8 +936,8 @@ static const unsigned char *commit_to_sha1(size_t index, void *table)
>  	return commits[index]->object.oid.hash;
>  }
>
> -static void write_graph_chunk_data(struct hashfile *f, int hash_len,
> -				   struct write_commit_graph_context *ctx)
> +static int write_graph_chunk_data(struct hashfile *f,
> +				  struct write_commit_graph_context *ctx)
>  {
>  	struct commit **list = ctx->commits.list;
>  	struct commit **last = ctx->commits.list + ctx->commits.nr;
> @@ -950,7 +954,7 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
>  			die(_("unable to parse commit %s"),
>  				oid_to_hex(&(*list)->object.oid));
>  		tree = get_commit_tree_oid(*list);
> -		hashwrite(f, tree->hash, hash_len);
> +		hashwrite(f, tree->hash, the_hash_algo->rawsz);

... and obviously not needed, as this example shows.

>
>  		parent = (*list)->parents;
>
> @@ -1030,10 +1034,12 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
>
>  		list++;
>  	}
> +
> +	return 0;
>  }
>
> -static void write_graph_chunk_extra_edges(struct hashfile *f,
> -					  struct write_commit_graph_context *ctx)
> +static int write_graph_chunk_extra_edges(struct hashfile *f,
> +					 struct write_commit_graph_context *ctx)
>  {
>  	struct commit **list = ctx->commits.list;
>  	struct commit **last = ctx->commits.list + ctx->commits.nr;
> @@ -1082,10 +1088,12 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
>
>  		list++;
>  	}
> +
> +	return 0;
>  }
>
> -static void write_graph_chunk_bloom_indexes(struct hashfile *f,
> -					    struct write_commit_graph_context *ctx)
> +static int write_graph_chunk_bloom_indexes(struct hashfile *f,
> +					   struct write_commit_graph_context *ctx)
>  {
>  	struct commit **list = ctx->commits.list;
>  	struct commit **last = ctx->commits.list + ctx->commits.nr;
> @@ -1107,6 +1115,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
>  	}
>
>  	stop_progress(&progress);
> +	return 0;
>  }
>
>  static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
> @@ -1124,8 +1133,8 @@ static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
>  	jw_release(&jw);
>  }
>
> -static void write_graph_chunk_bloom_data(struct hashfile *f,
> -					 struct write_commit_graph_context *ctx)
> +static int write_graph_chunk_bloom_data(struct hashfile *f,
> +					struct write_commit_graph_context *ctx)
>  {
>  	struct commit **list = ctx->commits.list;
>  	struct commit **last = ctx->commits.list + ctx->commits.nr;
> @@ -1151,6 +1160,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
>  	}
>
>  	stop_progress(&progress);
> +	return 0;
>  }
>
>  static int oid_compare(const void *_a, const void *_b)
> @@ -1662,8 +1672,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  			num_chunks * ctx->commits.nr);
>  	}
>  	write_graph_chunk_fanout(f, ctx);
> -	write_graph_chunk_oids(f, hashsz, ctx);
> -	write_graph_chunk_data(f, hashsz, ctx);
> +	write_graph_chunk_oids(f, ctx);
> +	write_graph_chunk_data(f, ctx);
>  	if (ctx->num_extra_edges)
>  		write_graph_chunk_extra_edges(f, ctx);
>  	if (ctx->changed_paths) {
>

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, René Scharfe wrote (reply to this):

Am 23.06.20 um 19:47 schrieb SZEDER Gábor via GitGitGadget:
> From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <[email protected]>
>
> In my experience while experimenting with new commit-graph chunks,
> early versions of the corresponding new write_commit_graph_my_chunk()
> functions are, sadly but not surprisingly, often buggy, and write more
> or less data than they are supposed to, especially if the chunk size
> is not directly proportional to the number of commits.  This then
> causes all kinds of issues when reading such a bogus commit-graph
> file, raising the question of whether the writing or the reading part
> happens to be buggy this time.
>
> Let's catch such issues early, already when writing the commit-graph
> file, and check that each write_graph_chunk_*() function wrote the
> amount of data that it was expected to, and what has been encoded in
> the Chunk Lookup table.  Now that all commit-graph chunks are written
> in a loop we can do this check in a single place for all chunks, and
> any chunks added in the future will get checked as well.
>
> Signed-off-by: SZEDER Gábor <[email protected]>
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
>  commit-graph.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 086fc2d070..1de6800d74 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1683,12 +1683,21 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  			num_chunks * ctx->commits.nr);
>  	}
>
> +	chunk_offset = f->total + f->offset;
>  	for (i = 0; i < num_chunks; i++) {
> +		uint64_t end_offset;
> +

Hmm, the added code looks complicated because it keeps state outside the
loop, but it could be replace by this:

		uint64_t start_offset = f->total + f->offset;

>  		if (chunks[i].write_fn(f, ctx)) {
>  			error(_("failed writing chunk with id %"PRIx32""),
>  			      chunks[i].id);
>  			return -1;
>  		}
> +
> +		end_offset = f->total + f->offset;
> +		if (end_offset - chunk_offset != chunks[i].size)
> +			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
> +			    chunks[i].size, chunks[i].id, end_offset - chunk_offset);
> +		chunk_offset = end_offset;

... and that:

		if (f->total + f->offset != start_offset + chunks[i].size)
			BUG(...);

>  	}
>
>  	stop_progress(&ctx->progress);
>

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 6/25/2020 3:25 AM, René Scharfe wrote:
> Am 23.06.20 um 19:47 schrieb SZEDER Gábor via GitGitGadget:
>> From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <[email protected]>
>>
>> In my experience while experimenting with new commit-graph chunks,
>> early versions of the corresponding new write_commit_graph_my_chunk()
>> functions are, sadly but not surprisingly, often buggy, and write more
>> or less data than they are supposed to, especially if the chunk size
>> is not directly proportional to the number of commits.  This then
>> causes all kinds of issues when reading such a bogus commit-graph
>> file, raising the question of whether the writing or the reading part
>> happens to be buggy this time.
>>
>> Let's catch such issues early, already when writing the commit-graph
>> file, and check that each write_graph_chunk_*() function wrote the
>> amount of data that it was expected to, and what has been encoded in
>> the Chunk Lookup table.  Now that all commit-graph chunks are written
>> in a loop we can do this check in a single place for all chunks, and
>> any chunks added in the future will get checked as well.
>>
>> Signed-off-by: SZEDER Gábor <[email protected]>
>> Signed-off-by: Derrick Stolee <[email protected]>
>> ---
>>  commit-graph.c | 9 +++++++++
>>  1 file changed, 9 insertions(+)
>>
>> diff --git a/commit-graph.c b/commit-graph.c
>> index 086fc2d070..1de6800d74 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -1683,12 +1683,21 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>>  			num_chunks * ctx->commits.nr);
>>  	}
>>
>> +	chunk_offset = f->total + f->offset;
>>  	for (i = 0; i < num_chunks; i++) {
>> +		uint64_t end_offset;
>> +
> 
> Hmm, the added code looks complicated because it keeps state outside the
> loop, but it could be replace by this:
> 
> 		uint64_t start_offset = f->total + f->offset;
> 
>>  		if (chunks[i].write_fn(f, ctx)) {
>>  			error(_("failed writing chunk with id %"PRIx32""),
>>  			      chunks[i].id);
>>  			return -1;
>>  		}
>> +
>> +		end_offset = f->total + f->offset;
>> +		if (end_offset - chunk_offset != chunks[i].size)
>> +			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
>> +			    chunks[i].size, chunks[i].id, end_offset - chunk_offset);
>> +		chunk_offset = end_offset;
> 
> ... and that:
> 
> 		if (f->total + f->offset != start_offset + chunks[i].size)
> 			BUG(...);

Thanks! I agree this approach is simpler and less prone to
bugs since we are using the local state.

-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, René Scharfe wrote (reply to this):

Am 23.06.20 um 19:47 schrieb SZEDER Gábor via GitGitGadget:
> From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <[email protected]>
>
> In write_commit_graph_file() we now have one block of code filling the
> array of 'struct chunk_info' with the IDs and sizes of chunks to be
> written, and an other block of code calling the functions responsible
> for writing individual chunks.  In case of optional chunks like Extra
> Edge List an Base Graphs List there is also a condition checking
> whether that chunk is necessary/desired, and that same condition is
> repeated in both blocks of code. Other, newer chunks have similar
> optional conditions.
>
> Eliminate these repeated conditions by storing the function pointers
> responsible for writing individual chunks in the 'struct chunk_info'
> array as well, and calling them in a loop to write the commit-graph
> file.  This will open up the possibility for a bit of foolproofing in
> the following patch.

You can do that without storing function pointers by selecting the
function to use based on the chunk ID -- like parse_commit_graph() does
on the read side.  Advantage: You don't need to press all write
functions into the same mold and can keep their individual signatures.

>
> Signed-off-by: SZEDER Gábor <[email protected]>
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
>  commit-graph.c | 31 +++++++++++++++++++------------
>  1 file changed, 19 insertions(+), 12 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index f33bfe49b3..086fc2d070 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1555,9 +1555,13 @@ static int write_graph_chunk_base(struct hashfile *f,
>  	return 0;
>  }
>
> +typedef int (*chunk_write_fn)(struct hashfile *f,
> +			      struct write_commit_graph_context *ctx);
> +
>  struct chunk_info {
>  	uint32_t id;
>  	uint64_t size;
> +	chunk_write_fn write_fn;
>  };
>
>  static int write_commit_graph_file(struct write_commit_graph_context *ctx)
> @@ -1615,27 +1619,34 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>
>  	chunks[0].id = GRAPH_CHUNKID_OIDFANOUT;
>  	chunks[0].size = GRAPH_FANOUT_SIZE;
> +	chunks[0].write_fn = write_graph_chunk_fanout;
>  	chunks[1].id = GRAPH_CHUNKID_OIDLOOKUP;
>  	chunks[1].size = hashsz * ctx->commits.nr;
> +	chunks[1].write_fn = write_graph_chunk_oids;
>  	chunks[2].id = GRAPH_CHUNKID_DATA;
>  	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
> +	chunks[2].write_fn = write_graph_chunk_data;
>  	if (ctx->num_extra_edges) {
>  		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
>  		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
> +		chunks[num_chunks].write_fn = write_graph_chunk_extra_edges;
>  		num_chunks++;
>  	}
>  	if (ctx->changed_paths) {
>  		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMINDEXES;
>  		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
> +		chunks[num_chunks].write_fn = write_graph_chunk_bloom_indexes;
>  		num_chunks++;
>  		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMDATA;
>  		chunks[num_chunks].size = sizeof(uint32_t) * 3
>  					  + ctx->total_bloom_filter_data_size;
> +		chunks[num_chunks].write_fn = write_graph_chunk_bloom_data;
>  		num_chunks++;
>  	}
>  	if (ctx->num_commit_graphs_after > 1) {
>  		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
>  		chunks[num_chunks].size = hashsz * (ctx->num_commit_graphs_after - 1);
> +		chunks[num_chunks].write_fn = write_graph_chunk_base;
>  		num_chunks++;
>  	}
>
> @@ -1671,19 +1682,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  			progress_title.buf,
>  			num_chunks * ctx->commits.nr);
>  	}
> -	write_graph_chunk_fanout(f, ctx);
> -	write_graph_chunk_oids(f, ctx);
> -	write_graph_chunk_data(f, ctx);
> -	if (ctx->num_extra_edges)
> -		write_graph_chunk_extra_edges(f, ctx);
> -	if (ctx->changed_paths) {
> -		write_graph_chunk_bloom_indexes(f, ctx);
> -		write_graph_chunk_bloom_data(f, ctx);
> -	}
> -	if (ctx->num_commit_graphs_after > 1 &&
> -	    write_graph_chunk_base(f, ctx)) {
> -		return -1;
> +
> +	for (i = 0; i < num_chunks; i++) {
> +		if (chunks[i].write_fn(f, ctx)) {
> +			error(_("failed writing chunk with id %"PRIx32""),
> +			      chunks[i].id);

Of all the write functions only write_graph_chunk_base() can return
non-zero and it already prints an error message in that case ("failed to
write correct number of base graph ids").  Why add this one?

> +			return -1;
> +		}
>  	}
> +
>  	stop_progress(&ctx->progress);
>  	strbuf_release(&progress_title);
>
>

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 6/25/2020 3:25 AM, René Scharfe wrote:
> Am 23.06.20 um 19:47 schrieb SZEDER Gábor via GitGitGadget:
>> From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <[email protected]>
>>
>> In write_commit_graph_file() we now have one block of code filling the
>> array of 'struct chunk_info' with the IDs and sizes of chunks to be
>> written, and an other block of code calling the functions responsible
>> for writing individual chunks.  In case of optional chunks like Extra
>> Edge List an Base Graphs List there is also a condition checking
>> whether that chunk is necessary/desired, and that same condition is
>> repeated in both blocks of code. Other, newer chunks have similar
>> optional conditions.
>>
>> Eliminate these repeated conditions by storing the function pointers
>> responsible for writing individual chunks in the 'struct chunk_info'
>> array as well, and calling them in a loop to write the commit-graph
>> file.  This will open up the possibility for a bit of foolproofing in
>> the following patch.
> 
> You can do that without storing function pointers by selecting the
> function to use based on the chunk ID -- like parse_commit_graph() does
> on the read side.  Advantage: You don't need to press all write
> functions into the same mold and can keep their individual signatures.

I do think that the loop without a switch statement is valuable.
It focuses the updates for new chunks to be localized to the
section that calculates the offset values.

>>
>> Signed-off-by: SZEDER Gábor <[email protected]>
>> Signed-off-by: Derrick Stolee <[email protected]>
>> ---
>>  commit-graph.c | 31 +++++++++++++++++++------------
>>  1 file changed, 19 insertions(+), 12 deletions(-)
>>
>> diff --git a/commit-graph.c b/commit-graph.c
>> index f33bfe49b3..086fc2d070 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -1555,9 +1555,13 @@ static int write_graph_chunk_base(struct hashfile *f,
>>  	return 0;
>>  }
>>
>> +typedef int (*chunk_write_fn)(struct hashfile *f,
>> +			      struct write_commit_graph_context *ctx);
>> +
>>  struct chunk_info {
>>  	uint32_t id;
>>  	uint64_t size;
>> +	chunk_write_fn write_fn;
>>  };
>>
>>  static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>> @@ -1615,27 +1619,34 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>>
>>  	chunks[0].id = GRAPH_CHUNKID_OIDFANOUT;
>>  	chunks[0].size = GRAPH_FANOUT_SIZE;
>> +	chunks[0].write_fn = write_graph_chunk_fanout;
>>  	chunks[1].id = GRAPH_CHUNKID_OIDLOOKUP;
>>  	chunks[1].size = hashsz * ctx->commits.nr;
>> +	chunks[1].write_fn = write_graph_chunk_oids;
>>  	chunks[2].id = GRAPH_CHUNKID_DATA;
>>  	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
>> +	chunks[2].write_fn = write_graph_chunk_data;
>>  	if (ctx->num_extra_edges) {
>>  		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
>>  		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
>> +		chunks[num_chunks].write_fn = write_graph_chunk_extra_edges;
>>  		num_chunks++;
>>  	}
>>  	if (ctx->changed_paths) {
>>  		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMINDEXES;
>>  		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
>> +		chunks[num_chunks].write_fn = write_graph_chunk_bloom_indexes;
>>  		num_chunks++;
>>  		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMDATA;
>>  		chunks[num_chunks].size = sizeof(uint32_t) * 3
>>  					  + ctx->total_bloom_filter_data_size;
>> +		chunks[num_chunks].write_fn = write_graph_chunk_bloom_data;
>>  		num_chunks++;
>>  	}
>>  	if (ctx->num_commit_graphs_after > 1) {
>>  		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
>>  		chunks[num_chunks].size = hashsz * (ctx->num_commit_graphs_after - 1);
>> +		chunks[num_chunks].write_fn = write_graph_chunk_base;
>>  		num_chunks++;
>>  	}
>>
>> @@ -1671,19 +1682,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>>  			progress_title.buf,
>>  			num_chunks * ctx->commits.nr);
>>  	}
>> -	write_graph_chunk_fanout(f, ctx);
>> -	write_graph_chunk_oids(f, ctx);
>> -	write_graph_chunk_data(f, ctx);
>> -	if (ctx->num_extra_edges)
>> -		write_graph_chunk_extra_edges(f, ctx);
>> -	if (ctx->changed_paths) {
>> -		write_graph_chunk_bloom_indexes(f, ctx);
>> -		write_graph_chunk_bloom_data(f, ctx);
>> -	}
>> -	if (ctx->num_commit_graphs_after > 1 &&
>> -	    write_graph_chunk_base(f, ctx)) {
>> -		return -1;
>> +
>> +	for (i = 0; i < num_chunks; i++) {
>> +		if (chunks[i].write_fn(f, ctx)) {
>> +			error(_("failed writing chunk with id %"PRIx32""),
>> +			      chunks[i].id);
> 
> Of all the write functions only write_graph_chunk_base() can return
> non-zero and it already prints an error message in that case ("failed to
> write correct number of base graph ids").  Why add this one?

Ok, we can require the chunk methods to write an error() message with
appropriate context and simply return -1 here.

Thanks,
-Stolee

#include "bloom.h"
#include "commit-slab.h"
#include "json-writer.h"
#include "trace2.h"

void git_test_write_commit_graph_or_die(void)
{
Expand Down Expand Up @@ -564,10 +566,6 @@ static int prepare_commit_graph(struct repository *r)
return !!r->objects->commit_graph;
r->objects->commit_graph_attempted = 1;

if (git_env_bool(GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD, 0))
die("dying as requested by the '%s' variable on commit-graph load!",
GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD);

prepare_repo_settings(r);

if (!git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) &&
Expand Down Expand Up @@ -790,6 +788,14 @@ static int parse_commit_in_graph_one(struct repository *r,

int parse_commit_in_graph(struct repository *r, struct commit *item)
{
static int checked_env = 0;

if (!checked_env &&
git_env_bool(GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE, 0))
die("dying as requested by the '%s' variable on commit-graph parse!",
GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE);
checked_env = 1;

if (!prepare_commit_graph(r))
return 0;
return parse_commit_in_graph_one(r, r->objects->commit_graph, item);
Expand Down Expand Up @@ -882,10 +888,11 @@ struct write_commit_graph_context {

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, René Scharfe wrote (reply to this):

Am 15.06.20 um 22:14 schrieb Derrick Stolee via GitGitGadget:
> From: Derrick Stolee <[email protected]>
>
> Place an instance of struct bloom_settings into the struct
> write_commit_graph_context. This allows simplifying the function
> prototype of write_graph_chunk_bloom_data(). This will allow us
> to combine the function prototypes and use function pointers to
> simplify write_commit_graph_file().
>
> Reported-by: SZEDER Gábor <[email protected]>
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
>  commit-graph.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 887837e8826..05b7035d8d5 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -882,6 +882,7 @@ struct write_commit_graph_context {
>
>  	const struct split_commit_graph_opts *split_opts;
>  	size_t total_bloom_filter_data_size;
> +	struct bloom_filter_settings bloom_settings;

That structure is quite busy already, so adding one more member wouldn't
matter much.

Passing so many things to lots of functions makes it harder to argue
about them, though, as all of them effectively become part of their
signature, and you have to look at their implementation to see which
pseudo-parameters they actually use.  It's like a God object.

>  };
>
>  static void write_graph_chunk_fanout(struct hashfile *f,
> @@ -1103,8 +1104,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
>  }
>
>  static void write_graph_chunk_bloom_data(struct hashfile *f,
> -					 struct write_commit_graph_context *ctx,
> -					 const struct bloom_filter_settings *settings)
> +					 struct write_commit_graph_context *ctx)
>  {
>  	struct commit **list = ctx->commits.list;
>  	struct commit **last = ctx->commits.list + ctx->commits.nr;
> @@ -1116,9 +1116,9 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
>  			_("Writing changed paths Bloom filters data"),
>  			ctx->commits.nr);
>
> -	hashwrite_be32(f, settings->hash_version);
> -	hashwrite_be32(f, settings->num_hashes);
> -	hashwrite_be32(f, settings->bits_per_entry);
> +	hashwrite_be32(f, ctx->bloom_settings.hash_version);
> +	hashwrite_be32(f, ctx->bloom_settings.num_hashes);
> +	hashwrite_be32(f, ctx->bloom_settings.bits_per_entry);
>
>  	while (list < last) {
>  		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
> @@ -1541,6 +1541,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  	struct object_id file_hash;
>  	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>
> +	ctx->bloom_settings = bloom_settings;

So we use the defaults, no customization?  Then you could simply move
the declaration of bloom_settings from write_commit_graph_file() to
write_graph_chunk_bloom_data().  Glancing at pu I don't see additional
uses there, so no need to put it into the context (yet?).

René

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 6/18/2020 4:30 PM, René Scharfe wrote:
> Am 15.06.20 um 22:14 schrieb Derrick Stolee via GitGitGadget:
>> From: Derrick Stolee <[email protected]>
>>
>> Place an instance of struct bloom_settings into the struct
>> write_commit_graph_context. This allows simplifying the function
>> prototype of write_graph_chunk_bloom_data(). This will allow us
>> to combine the function prototypes and use function pointers to
>> simplify write_commit_graph_file().
>>
>> Reported-by: SZEDER Gábor <[email protected]>
>> Signed-off-by: Derrick Stolee <[email protected]>
>> ---
>>  commit-graph.c | 14 ++++++++------
>>  1 file changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/commit-graph.c b/commit-graph.c
>> index 887837e8826..05b7035d8d5 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -882,6 +882,7 @@ struct write_commit_graph_context {
>>
>>  	const struct split_commit_graph_opts *split_opts;
>>  	size_t total_bloom_filter_data_size;
>> +	struct bloom_filter_settings bloom_settings;
> 
> That structure is quite busy already, so adding one more member wouldn't
> matter much.
> 
> Passing so many things to lots of functions makes it harder to argue
> about them, though, as all of them effectively become part of their
> signature, and you have to look at their implementation to see which
> pseudo-parameters they actually use.  It's like a God object.

Correct. The write_commit_graph_context _is_ a God object for the
commit-graph write. The good news is that it is limited only to
commit-graph.c and the write operations therein. Hopefully, the
code organization benefits enough from this structure to justify
the massive struct.

In contrast, it's still smaller and more contained than
"struct rev_info"!

>>  };
>>
>>  static void write_graph_chunk_fanout(struct hashfile *f,
>> @@ -1103,8 +1104,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
>>  }
>>
>>  static void write_graph_chunk_bloom_data(struct hashfile *f,
>> -					 struct write_commit_graph_context *ctx,
>> -					 const struct bloom_filter_settings *settings)
>> +					 struct write_commit_graph_context *ctx)
>>  {
>>  	struct commit **list = ctx->commits.list;
>>  	struct commit **last = ctx->commits.list + ctx->commits.nr;
>> @@ -1116,9 +1116,9 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
>>  			_("Writing changed paths Bloom filters data"),
>>  			ctx->commits.nr);
>>
>> -	hashwrite_be32(f, settings->hash_version);
>> -	hashwrite_be32(f, settings->num_hashes);
>> -	hashwrite_be32(f, settings->bits_per_entry);
>> +	hashwrite_be32(f, ctx->bloom_settings.hash_version);
>> +	hashwrite_be32(f, ctx->bloom_settings.num_hashes);
>> +	hashwrite_be32(f, ctx->bloom_settings.bits_per_entry);
>>
>>  	while (list < last) {
>>  		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>> @@ -1541,6 +1541,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>>  	struct object_id file_hash;
>>  	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>>
>> +	ctx->bloom_settings = bloom_settings;
> 
> So we use the defaults, no customization?  Then you could simply move
> the declaration of bloom_settings from write_commit_graph_file() to
> write_graph_chunk_bloom_data().  Glancing at pu I don't see additional
> uses there, so no need to put it into the context (yet?).

It certainly is not customized by a user (yet). However, you do make an
excellent point that I need to be more careful here! Patch 8
(commit-graph: persist existence of changed-paths) needs to load the
bloom_filter_settings from the existing commit-graph so we can be
future-proof from a future version customizing the settings inside the
commit-graph file!

This means that in v2 I'll move patches 7 & 8 to be after patch 1 and
add a test to verify the filter settings are preserved (after manually
changing the data in the file).

Thanks!
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, René Scharfe wrote (reply to this):

Am 15.06.20 um 22:14 schrieb SZEDER Gábor via GitGitGadget:
> -static void write_graph_chunk_oids(struct hashfile *f, int hash_len,
> -				   struct write_commit_graph_context *ctx)
> +static int write_graph_chunk_oids(struct hashfile *f,
> +				  struct write_commit_graph_context *ctx)
>  {
>  	struct commit **list = ctx->commits.list;
>  	int count;
>  	for (count = 0; count < ctx->commits.nr; count++, list++) {
>  		display_progress(ctx->progress, ++ctx->progress_cnt);
> -		hashwrite(f, (*list)->object.oid.hash, (int)hash_len);
> +		hashwrite(f, (*list)->object.oid.hash, (int)the_hash_algo->rawsz);

Before the cast was forcing an int into an int (huh?), now it forces a
size_t into an int, but hashwrite() expects an unsigned int.  Do we
really need that cast?

>  	}
> +
> +	return 0;
>  }
>
>  static const unsigned char *commit_to_sha1(size_t index, void *table)
> @@ -926,8 +930,8 @@ static const unsigned char *commit_to_sha1(size_t index, void *table)
>  	return commits[index]->object.oid.hash;
>  }
>
> -static void write_graph_chunk_data(struct hashfile *f, int hash_len,
> -				   struct write_commit_graph_context *ctx)
> +static int write_graph_chunk_data(struct hashfile *f,
> +				  struct write_commit_graph_context *ctx)
>  {
>  	struct commit **list = ctx->commits.list;
>  	struct commit **last = ctx->commits.list + ctx->commits.nr;
> @@ -944,7 +948,7 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
>  			die(_("unable to parse commit %s"),
>  				oid_to_hex(&(*list)->object.oid));
>  		tree = get_commit_tree_oid(*list);
> -		hashwrite(f, tree->hash, hash_len);
> +		hashwrite(f, tree->hash, the_hash_algo->rawsz);

... and here's the answer: No, we don't need to cast.

René

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, René Scharfe wrote (reply to this):

Am 15.06.20 um 22:14 schrieb SZEDER Gábor via GitGitGadget:
> From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <[email protected]>
>
> In write_commit_graph_file() we now have one block of code filling the
> array of 'struct chunk_info' with the IDs and sizes of chunks to be
> written, and an other block of code calling the functions responsible
> for writing individual chunks.  In case of optional chunks like Extra
> Edge List an Base Graphs List there is also a condition checking
> whether that chunk is necessary/desired, and that same condition is
> repeated in both blocks of code. Other, newer chunks have similar
> optional conditions.
>
> Eliminate these repeated conditions by storing the function pointers
> responsible for writing individual chunks in the 'struct chunk_info'
> array as well, and calling them in a loop to write the commit-graph
> file.  This will open up the possibility for a bit of foolproofing in
> the following patch.

OK.  An alternative would be a switch in the loop that calls the right
function based on the chunk id.  That would not require uniform
interfaces for all write functions; patch 2 would not be necessary.

>
> Signed-off-by: SZEDER Gábor <[email protected]>
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
>  commit-graph.c | 31 +++++++++++++++++++------------
>  1 file changed, 19 insertions(+), 12 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 3bae1e52ed0..78e023be664 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1532,9 +1532,13 @@ static int write_graph_chunk_base(struct hashfile *f,
>  	return 0;
>  }
>
> +typedef int (*chunk_write_fn)(struct hashfile *f,
> +			      struct write_commit_graph_context *ctx);
> +
>  struct chunk_info {
>  	uint32_t id;
>  	uint64_t size;
> +	chunk_write_fn write_fn;
>  };
>
>  static int write_commit_graph_file(struct write_commit_graph_context *ctx)
> @@ -1591,27 +1595,34 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>
>  	chunks[0].id = GRAPH_CHUNKID_OIDFANOUT;
>  	chunks[0].size = GRAPH_FANOUT_SIZE;
> +	chunks[0].write_fn = write_graph_chunk_fanout;
>  	chunks[1].id = GRAPH_CHUNKID_OIDLOOKUP;
>  	chunks[1].size = hashsz * ctx->commits.nr;
> +	chunks[1].write_fn = write_graph_chunk_oids;
>  	chunks[2].id = GRAPH_CHUNKID_DATA;
>  	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
> +	chunks[2].write_fn = write_graph_chunk_data;
>  	if (ctx->num_extra_edges) {
>  		chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
>  		chunks[num_chunks].size = 4 * ctx->num_extra_edges;
> +		chunks[num_chunks].write_fn = write_graph_chunk_extra_edges;
>  		num_chunks++;
>  	}
>  	if (ctx->changed_paths) {
>  		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMINDEXES;
>  		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
> +		chunks[num_chunks].write_fn = write_graph_chunk_bloom_indexes;
>  		num_chunks++;
>  		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMDATA;
>  		chunks[num_chunks].size = sizeof(uint32_t) * 3
>  					  + ctx->total_bloom_filter_data_size;
> +		chunks[num_chunks].write_fn = write_graph_chunk_bloom_data;
>  		num_chunks++;
>  	}
>  	if (ctx->num_commit_graphs_after > 1) {
>  		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
>  		chunks[num_chunks].size = hashsz * (ctx->num_commit_graphs_after - 1);
> +		chunks[num_chunks].write_fn = write_graph_chunk_base;
>  		num_chunks++;
>  	}
>
> @@ -1647,19 +1658,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  			progress_title.buf,
>  			num_chunks * ctx->commits.nr);
>  	}
> -	write_graph_chunk_fanout(f, ctx);
> -	write_graph_chunk_oids(f, ctx);
> -	write_graph_chunk_data(f, ctx);
> -	if (ctx->num_extra_edges)
> -		write_graph_chunk_extra_edges(f, ctx);
> -	if (ctx->changed_paths) {
> -		write_graph_chunk_bloom_indexes(f, ctx);
> -		write_graph_chunk_bloom_data(f, ctx);
> -	}
> -	if (ctx->num_commit_graphs_after > 1 &&
> -	    write_graph_chunk_base(f, ctx)) {
> -		return -1;
> +
> +	for (i = 0; i < num_chunks; i++) {
> +		if (chunks[i].write_fn(f, ctx)) {
> +			error(_("failed writing chunk with id %"PRIx32""),
> +			      chunks[i].id);

This error message is new and not mentioned in the commit message.
write_graph_chunk_base() seems to be the only write function that can
return something else than 0, and it already reports an error in that
case.  So do we really want the one here as well?

René

const struct split_commit_graph_opts *split_opts;
size_t total_bloom_filter_data_size;
const struct bloom_filter_settings *bloom_settings;
};

static void write_graph_chunk_fanout(struct hashfile *f,
struct write_commit_graph_context *ctx)
static int write_graph_chunk_fanout(struct hashfile *f,
struct write_commit_graph_context *ctx)
{
int i, count = 0;
struct commit **list = ctx->commits.list;
Expand All @@ -906,17 +913,21 @@ static void write_graph_chunk_fanout(struct hashfile *f,

hashwrite_be32(f, count);
}

return 0;
}

static void write_graph_chunk_oids(struct hashfile *f, int hash_len,
struct write_commit_graph_context *ctx)
static int write_graph_chunk_oids(struct hashfile *f,
struct write_commit_graph_context *ctx)
{
struct commit **list = ctx->commits.list;
int count;
for (count = 0; count < ctx->commits.nr; count++, list++) {
display_progress(ctx->progress, ++ctx->progress_cnt);
hashwrite(f, (*list)->object.oid.hash, (int)hash_len);
hashwrite(f, (*list)->object.oid.hash, the_hash_algo->rawsz);
}

return 0;
}

static const unsigned char *commit_to_sha1(size_t index, void *table)
Expand All @@ -925,8 +936,8 @@ static const unsigned char *commit_to_sha1(size_t index, void *table)
return commits[index]->object.oid.hash;
}

static void write_graph_chunk_data(struct hashfile *f, int hash_len,
struct write_commit_graph_context *ctx)
static int write_graph_chunk_data(struct hashfile *f,
struct write_commit_graph_context *ctx)
{
struct commit **list = ctx->commits.list;
struct commit **last = ctx->commits.list + ctx->commits.nr;
Expand All @@ -943,7 +954,7 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,
die(_("unable to parse commit %s"),
oid_to_hex(&(*list)->object.oid));
tree = get_commit_tree_oid(*list);
hashwrite(f, tree->hash, hash_len);
hashwrite(f, tree->hash, the_hash_algo->rawsz);

parent = (*list)->parents;

Expand Down Expand Up @@ -1023,10 +1034,12 @@ static void write_graph_chunk_data(struct hashfile *f, int hash_len,

list++;
}

return 0;
}

static void write_graph_chunk_extra_edges(struct hashfile *f,
struct write_commit_graph_context *ctx)
static int write_graph_chunk_extra_edges(struct hashfile *f,
struct write_commit_graph_context *ctx)
{
struct commit **list = ctx->commits.list;
struct commit **last = ctx->commits.list + ctx->commits.nr;
Expand Down Expand Up @@ -1075,10 +1088,12 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,

list++;
}

return 0;
}

static void write_graph_chunk_bloom_indexes(struct hashfile *f,
struct write_commit_graph_context *ctx)
static int write_graph_chunk_bloom_indexes(struct hashfile *f,
struct write_commit_graph_context *ctx)
{
struct commit **list = ctx->commits.list;
struct commit **last = ctx->commits.list + ctx->commits.nr;
Expand All @@ -1093,41 +1108,63 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,

while (list < last) {
struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
cur_pos += filter->len;
size_t len = filter ? filter->len : 0;
cur_pos += len;
display_progress(progress, ++i);
hashwrite_be32(f, cur_pos);
list++;
}

stop_progress(&progress);
return 0;
}

static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
{
struct json_writer jw = JSON_WRITER_INIT;

jw_object_begin(&jw, 0);
jw_object_intmax(&jw, "hash_version", ctx->bloom_settings->hash_version);
jw_object_intmax(&jw, "num_hashes", ctx->bloom_settings->num_hashes);
jw_object_intmax(&jw, "bits_per_entry", ctx->bloom_settings->bits_per_entry);
jw_end(&jw);

trace2_data_json("bloom", ctx->r, "settings", &jw);

jw_release(&jw);
}

static void write_graph_chunk_bloom_data(struct hashfile *f,
struct write_commit_graph_context *ctx,
const struct bloom_filter_settings *settings)
static int write_graph_chunk_bloom_data(struct hashfile *f,
struct write_commit_graph_context *ctx)
{
struct commit **list = ctx->commits.list;
struct commit **last = ctx->commits.list + ctx->commits.nr;
struct progress *progress = NULL;
int i = 0;

trace2_bloom_filter_settings(ctx);

if (ctx->report_progress)
progress = start_delayed_progress(
_("Writing changed paths Bloom filters data"),
ctx->commits.nr);

hashwrite_be32(f, settings->hash_version);
hashwrite_be32(f, settings->num_hashes);
hashwrite_be32(f, settings->bits_per_entry);
hashwrite_be32(f, ctx->bloom_settings->hash_version);
hashwrite_be32(f, ctx->bloom_settings->num_hashes);
hashwrite_be32(f, ctx->bloom_settings->bits_per_entry);

while (list < last) {
struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
size_t len = filter ? filter->len : 0;
display_progress(progress, ++i);
hashwrite(f, filter->data, filter->len * sizeof(unsigned char));

if (len)
hashwrite(f, filter->data, len * sizeof(unsigned char));
list++;
}

stop_progress(&progress);
return 0;
}

static int oid_compare(const void *_a, const void *_b)
Expand Down Expand Up @@ -1522,9 +1559,13 @@ static int write_graph_chunk_base(struct hashfile *f,
return 0;
}

typedef int (*chunk_write_fn)(struct hashfile *f,
struct write_commit_graph_context *ctx);

struct chunk_info {
uint32_t id;
uint64_t size;
chunk_write_fn write_fn;
};

static int write_commit_graph_file(struct write_commit_graph_context *ctx)
Expand All @@ -1539,7 +1580,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
int num_chunks = 3;
uint64_t chunk_offset;
struct object_id file_hash;
const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;

if (!ctx->bloom_settings) {
bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
bloom_settings.bits_per_entry);
bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
bloom_settings.num_hashes);
ctx->bloom_settings = &bloom_settings;
}

if (ctx->split) {
struct strbuf tmp_file = STRBUF_INIT;
Expand Down Expand Up @@ -1579,27 +1628,34 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)

chunks[0].id = GRAPH_CHUNKID_OIDFANOUT;
chunks[0].size = GRAPH_FANOUT_SIZE;
chunks[0].write_fn = write_graph_chunk_fanout;
chunks[1].id = GRAPH_CHUNKID_OIDLOOKUP;
chunks[1].size = hashsz * ctx->commits.nr;
chunks[1].write_fn = write_graph_chunk_oids;
chunks[2].id = GRAPH_CHUNKID_DATA;
chunks[2].size = (hashsz + 16) * ctx->commits.nr;
chunks[2].write_fn = write_graph_chunk_data;
if (ctx->num_extra_edges) {
chunks[num_chunks].id = GRAPH_CHUNKID_EXTRAEDGES;
chunks[num_chunks].size = 4 * ctx->num_extra_edges;
chunks[num_chunks].write_fn = write_graph_chunk_extra_edges;
num_chunks++;
}
if (ctx->changed_paths) {
chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMINDEXES;
chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
chunks[num_chunks].write_fn = write_graph_chunk_bloom_indexes;
num_chunks++;
chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMDATA;
chunks[num_chunks].size = sizeof(uint32_t) * 3
+ ctx->total_bloom_filter_data_size;
chunks[num_chunks].write_fn = write_graph_chunk_bloom_data;
num_chunks++;
}
if (ctx->num_commit_graphs_after > 1) {
chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
chunks[num_chunks].size = hashsz * (ctx->num_commit_graphs_after - 1);
chunks[num_chunks].write_fn = write_graph_chunk_base;
num_chunks++;
}

Expand Down Expand Up @@ -1635,19 +1691,19 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
progress_title.buf,
num_chunks * ctx->commits.nr);
}
write_graph_chunk_fanout(f, ctx);
write_graph_chunk_oids(f, hashsz, ctx);
write_graph_chunk_data(f, hashsz, ctx);
if (ctx->num_extra_edges)
write_graph_chunk_extra_edges(f, ctx);
if (ctx->changed_paths) {
write_graph_chunk_bloom_indexes(f, ctx);
write_graph_chunk_bloom_data(f, ctx, &bloom_settings);
}
if (ctx->num_commit_graphs_after > 1 &&
write_graph_chunk_base(f, ctx)) {
return -1;

for (i = 0; i < num_chunks; i++) {
uint64_t start_offset = f->total + f->offset;

if (chunks[i].write_fn(f, ctx))
return -1;

if (f->total + f->offset != start_offset + chunks[i].size)
BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
chunks[i].size, chunks[i].id,
f->total + f->offset - start_offset);
}

stop_progress(&ctx->progress);
strbuf_release(&progress_title);

Expand Down Expand Up @@ -1964,9 +2020,23 @@ int write_commit_graph(struct object_directory *odb,
ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
ctx->check_oids = flags & COMMIT_GRAPH_WRITE_CHECK_OIDS ? 1 : 0;
ctx->split_opts = split_opts;
ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
ctx->total_bloom_filter_data_size = 0;

if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
ctx->changed_paths = 1;
if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
struct commit_graph *g;
prepare_commit_graph_one(ctx->r, ctx->odb);

g = ctx->r->objects->commit_graph;

/* We have changed-paths already. Keep them in the next graph */
if (g && g->chunk_bloom_data) {
ctx->changed_paths = 1;
ctx->bloom_settings = g->bloom_filter_settings;
}
}

if (ctx->split) {
struct commit_graph *g;
prepare_commit_graph(ctx->r);
Expand Down
3 changes: 2 additions & 1 deletion commit-graph.h
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
#include "object-store.h"

#define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
#define GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD "GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD"
#define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
#define GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS "GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS"

/*
Expand Down Expand Up @@ -96,6 +96,7 @@ enum commit_graph_write_flags {
/* Make sure that each OID in the input is a valid commit OID. */
COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3),
COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4),
COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS = (1 << 5),
};

struct split_commit_graph_opts {
Expand Down
Loading