Skip to content

Commit-Graph: Verify bloom filter #687

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

sluongng
Copy link

@sluongng sluongng commented Jul 30, 2020

When I was working on git-care(1) and Gitaly(2), the need to check whether a commit-graph (split or non-split) were built with Bloom filter.
This is needed especially when a repository primary commit-graph write strategy is '--split' and the bottom chains might rarely be re-written (or never) thus Bloom filter is never applied to the graph.

Provides users with a straight forward way to validate the existence of Bloom filter chunks to save user having to read the commit-graph manually as show in (1) and (2).

References:

  1. sluongng/git-care@d0feaa3
  2. https://gitlab.com/sluongng/gitaly/-/commit/78dba8b73e720b11500482b19b755346ec853025

It's probably going to take me a bit more time to write up some tests for this, so I want to send it out first for comments.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 30, 2020

The pull request has 164 commits. The max allowed is 30. Please split the patch series into multiple pull requests. Also consider squashing related commits.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 30, 2020

There are merge commits in this Pull Request:

abb59f83a216876869f6c88d32485dc740ada78d
4cea3bfaac702014c2cd2fe46c67a879caf447e6
5a72620cbeae70e573f89cb4527f8316149df776
2901ff943eacea7443bd7252755dfc50a076bb17
697a10b0310c9c256dd94fe93bc7cbd0bf471006
167a6293598024fb4ea1aa26bd4eb174150dd359
2e63c8c0fc6ca5b33d78f20a5dc100c2f2d8f798
92944ed7f6acce88fbd4d386df4324679544c467
60144a59ed98504fce8647fb5a124fc8abe3fc46
ad3269ff400bd1296b5e78fc2807a70e020ffa9f
c72c7da667dba3465fb566ecb23457950e28893c
ee63b70dc2b3f8ce9b9876cd178bd759d779b2f3
912c337d6376c32bf531da21ca29ae4d597b05b3
5f7c822a9d08829ff220ee8724ffab7225eb70c4
bcb73516bf1f7fd16f882782d42250dd19f39b85
3521be3834efe8d15487dd79e15e6a071a6cb33d
5b31140c0ac3c780264878d71c5102b82f590355
2c4ec990a6808f39266233f1b2101fa2a16112cc
e0b54a001cf92988dc794838a6b86b23c8067272
393eff577acebb23ae393d0f7f44cdf8930caed4

Please rebase the branch and force-push.

@sluongng sluongng force-pushed the sluongngoc/verify-bloom-filter branch from 80d776b to 69a620b Compare July 30, 2020 20:30
Add '--has-changed-paths' option to 'git commit-graph verify' subcommand
to validate whether the commit-graph was written with '--changed-paths'
option.

Signed-off-by: Son Luong Ngoc <[email protected]>
@sluongng sluongng force-pushed the sluongngoc/verify-bloom-filter branch from 69a620b to 77e6e0a Compare July 31, 2020 06:21
@sluongng
Copy link
Author

/preview

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 31, 2020

Preview email sent as [email protected]

@sluongng
Copy link
Author

/preview

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 31, 2020

Preview email sent as [email protected]

@sluongng
Copy link
Author

/preview

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 31, 2020

Preview email sent as [email protected]

@sluongng
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 31, 2020

Submitted as [email protected]

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 31, 2020

On the Git mailing list, Christian Couder wrote (reply to this):

On Fri, Jul 31, 2020 at 9:52 AM Son Luong Ngoc via GitGitGadget
<[email protected]> wrote:
>
> From: Son Luong Ngoc <[email protected]>
>
> Add '--has-changed-paths' option to 'git commit-graph verify' subcommand
> to validate whether the commit-graph was written with '--changed-paths'
> option.
>
> Signed-off-by: Son Luong Ngoc <[email protected]>

[...]

>     It's probably going to take me a bit more time to write up some tests
>     for this,

It would need some documentation too.

> so I want to send it out first for comments.

[...]

> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
> index 16c9f6101a..ce8a7cbe90 100644
> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c
> @@ -18,7 +18,8 @@ static char const * const builtin_commit_graph_usage[] = {
>  };
>
>  static const char * const builtin_commit_graph_verify_usage[] = {
> -       N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
> +       N_("git commit-graph verify [--object-dir <objdir>] [--shallow] "
> +           "[--has-changed-paths] [--[no-]progress]"),
>         NULL
>  };
>
> @@ -37,6 +38,7 @@ static struct opts_commit_graph {
>         int append;
>         int split;
>         int shallow;
> +       int has_changed_paths;
>         int progress;
>         int enable_changed_paths;
>  } opts;
> @@ -71,12 +73,14 @@ static int graph_verify(int argc, const char **argv)
>         int open_ok;
>         int fd;
>         struct stat st;
> -       int flags = 0;
> +       enum commit_graph_verify_flags flags = 0;
>
>         static struct option builtin_commit_graph_verify_options[] = {
>                 OPT_STRING(0, "object-dir", &opts.obj_dir,
>                            N_("dir"),
>                            N_("The object directory to store the graph")),
> +               OPT_BOOL(0, "has-changed-paths", &opts.has_changed_paths,
> +                        N_("verify that the commit-graph includes changed paths")),
>                 OPT_BOOL(0, "shallow", &opts.shallow,
>                          N_("if the commit-graph is split, only verify the tip file")),
>                 OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
> @@ -94,8 +98,10 @@ static int graph_verify(int argc, const char **argv)
>                 opts.obj_dir = get_object_directory();
>         if (opts.shallow)
>                 flags |= COMMIT_GRAPH_VERIFY_SHALLOW;
> +       if (opts.has_changed_paths)
> +               flags |= COMMIT_GRAPH_VERIFY_CHANGED_PATHS;

I wonder if OPT_BIT() could be used instead of OPT_BOOL() above to
directly set the above flag, as the 'has_changed_paths' field in
'struct opts_commit_graph' seems to be used only for the purpose of
setting this flag.

>         if (opts.progress)
> -               flags |= COMMIT_GRAPH_WRITE_PROGRESS;
> +               flags |= COMMIT_GRAPH_VERIFY_PROGRESS;

Does this change belong to this patch? I think it would deserve an
explanation in the commit message if that's the case.

>         odb = find_odb(the_repository, opts.obj_dir);
>         graph_name = get_commit_graph_filename(odb);
> diff --git a/commit-graph.c b/commit-graph.c
> index 1af68c297d..d83f5a2325 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -250,7 +250,7 @@ struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st,
>         return ret;
>  }
>
> -static int verify_commit_graph_lite(struct commit_graph *g)
> +static int verify_commit_graph_lite(struct commit_graph *g, int verify_changed_path)

[...]

> -int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
> +int verify_commit_graph(struct repository *r,
> +                       struct commit_graph *g,
> +                       enum commit_graph_verify_flags flags)

It seems to me that it would be more coherent to have both
verify_commit_graph() and verify_commit_graph_lite() accept an 'enum
commit_graph_verify_flags flags' argument.

Right now the "has_changed_paths" option is first an int, then it's
converted to a flag and then to an int again before being passed to
verify_commit_graph_lite(). It would be simpler if it could be a flag
all along.

Thanks,
Christian.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 31, 2020

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Son Luong Ngoc via GitGitGadget" <[email protected]> writes:

> From: Son Luong Ngoc <[email protected]>
>
> Add '--has-changed-paths' option to 'git commit-graph verify' subcommand
> to validate whether the commit-graph was written with '--changed-paths'
> option.

The implementation seems to be only about "does this section exist?"
and not "does this section have healthy/uncorrupted data?", which
feels a bit strange for "verify".  Instead of setting ourselves up
to having to add "--has-this-section" and "--has-that-section" every
time a new kind of data is added to the system, how about giving the
verify command an option to list all the sections found in the file,
or a separate "git commit-graph list-sections" subcommand?

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 31, 2020

On the Git mailing list, Jeff King wrote (reply to this):

On Fri, Jul 31, 2020 at 07:49:25AM +0000, Son Luong Ngoc via GitGitGadget wrote:

> From: Son Luong Ngoc <[email protected]>
> 
> Add '--has-changed-paths' option to 'git commit-graph verify' subcommand
> to validate whether the commit-graph was written with '--changed-paths'
> option.

Is a single boolean flag sufficient? If you have incrementals, you might
have some slices with this chunk and some without. What should the
boolean be in that case?

I thought we had some way of reporting the number of commits covered by
filters, but I can't seem to find it.

Our "test-tool read-graph" can report on whether there's a bloom filter
chunk, but I think it also doesn't distinguish between different slices
(and anyway, it wouldn't be suitable for tools that don't rely on an
actual built git.git directory).

-Peff

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 31, 2020

On the Git mailing list, Taylor Blau wrote (reply to this):

On Fri, Jul 31, 2020 at 10:14:39AM -0700, Junio C Hamano wrote:
> "Son Luong Ngoc via GitGitGadget" <[email protected]> writes:
>
> > From: Son Luong Ngoc <[email protected]>
> >
> > Add '--has-changed-paths' option to 'git commit-graph verify' subcommand
> > to validate whether the commit-graph was written with '--changed-paths'
> > option.
>
> The implementation seems to be only about "does this section exist?"
> and not "does this section have healthy/uncorrupted data?", which
> feels a bit strange for "verify".  Instead of setting ourselves up
> to having to add "--has-this-section" and "--has-that-section" every
> time a new kind of data is added to the system, how about giving the
> verify command an option to list all the sections found in the file,
> or a separate "git commit-graph list-sections" subcommand?

Completely agreed. When I suggested that Son work on this, I more had in
mind something like 'git commit-graph verify --changed-paths' to mean
"verify the integrity of the commit-graph(s), including regenerating
changed-path Bloom filters and making sure they match".

If you are just curious whether or not the section exists, I'd rather
write a script to look for the 'BIDX' or 'BDAT' chunk IDs. That said, if
they're spread across incremental, maybe it makes more sense to extend
the commit-graph test tool.

I dunno.

Thanks,
Taylor

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 31, 2020

On the Git mailing list, Taylor Blau wrote (reply to this):

On Fri, Jul 31, 2020 at 02:02:35PM -0400, Jeff King wrote:
> On Fri, Jul 31, 2020 at 07:49:25AM +0000, Son Luong Ngoc via GitGitGadget wrote:
>
> > From: Son Luong Ngoc <[email protected]>
> >
> > Add '--has-changed-paths' option to 'git commit-graph verify' subcommand
> > to validate whether the commit-graph was written with '--changed-paths'
> > option.
>
> Is a single boolean flag sufficient? If you have incrementals, you might
> have some slices with this chunk and some without. What should the
> boolean be in that case?

I think you'd really want to know which layers do and don't have
filters. It might be even more interesting to have a tool like what 'git
show-index' is to '*.idx' files, maybe something like 'git show-graph'
or 'git show-commit-graph'. Its output would be one line per commit that
shows:

  - what layer in the chain it's located at
  - its graph_pos
  - its generation number
  - whether or not it has a Bloom filter
  - ???

That would be a useful tool for debugging anyway, even outside of the
test suite. It would be even better if we could replace the test-tool
with it.

On an unrelated note; this patch is broken as-is, since it will only
report that Bloom filters exist if the top-most graph has them. I have a
patch to fix this that I have been meaning to send out for most of this
week. I'll try to get to it shortly.

> I thought we had some way of reporting the number of commits covered by
> filters, but I can't seem to find it.

I don't recall having anything like that.

> Our "test-tool read-graph" can report on whether there's a bloom filter
> chunk, but I think it also doesn't distinguish between different slices
> (and anyway, it wouldn't be suitable for tools that don't rely on an
> actual built git.git directory).
>
> -Peff
Thanks,
Taylor

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 31, 2020

On the Git mailing list, Jeff King wrote (reply to this):

On Fri, Jul 31, 2020 at 02:09:56PM -0400, Taylor Blau wrote:

> > Is a single boolean flag sufficient? If you have incrementals, you might
> > have some slices with this chunk and some without. What should the
> > boolean be in that case?
> 
> I think you'd really want to know which layers do and don't have
> filters. It might be even more interesting to have a tool like what 'git
> show-index' is to '*.idx' files, maybe something like 'git show-graph'
> or 'git show-commit-graph'. Its output would be one line per commit that
> shows:
> 
>   - what layer in the chain it's located at
>   - its graph_pos
>   - its generation number
>   - whether or not it has a Bloom filter
>   - ???
> 
> That would be a useful tool for debugging anyway, even outside of the
> test suite. It would be even better if we could replace the test-tool
> with it.

Yeah, that was exactly what I had in mind, except that I'd make it a
sub-command of "git commit-graph" ("show" or perhaps "dump").

-Peff

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 31, 2020

On the Git mailing list, Son Luong Ngoc wrote (reply to this):

Note: re-send  to mailing list due to me forgot to turn on Plain Text format.
(sorry for the noise)

Hi Peff, Taylor, Junio and Christian,

Thanks a lot for the valuable feedbacks.
This is exactly what I was hoping for by sending out the patch early!

> On Jul 31, 2020, at 21:14, Jeff King <[email protected]> wrote:
> 
> On Fri, Jul 31, 2020 at 02:09:56PM -0400, Taylor Blau wrote:
> 
>>> Is a single boolean flag sufficient? If you have incrementals, you might
>>> have some slices with this chunk and some without. What should the
>>> boolean be in that case?
>> 
>> I think you'd really want to know which layers do and don't have
>> filters. It might be even more interesting to have a tool like what 'git
>> show-index' is to '*.idx' files, maybe something like 'git show-graph'
>> or 'git show-commit-graph'. Its output would be one line per commit that
>> shows:
>> 
>>  - what layer in the chain it's located at
>>  - its graph_pos
>>  - its generation number
>>  - whether or not it has a Bloom filter
>>  - ???
>> 
>> That would be a useful tool for debugging anyway, even outside of the
>> test suite. It would be even better if we could replace the test-tool
>> with it.
> 
> Yeah, that was exactly what I had in mind, except that I'd make it a
> sub-command of "git commit-graph" ("show" or perhaps "dump").

I loved Junio's initial suggestion and the follow up here.
I was thinking of something like 'git commit-graph verify --verbose' but 
now I agree that a distinct command such as 'show' might be more 
distinct and better communicate the purpose.

I will stick with my poor-man bash/golang script for now to invalidate
the commit-graph (chain or no-chain) as it does the job just fine.

Let me see if I have the capacity to implement 'show' sub-command
after. ^_^!

> 
> -Peff

Cheers,
Son Luong.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant