Skip to content

PATH WALK II: Add --path-walk option to 'git pack-objects' #1819

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

derrickstolee
Copy link

@derrickstolee derrickstolee commented Oct 29, 2024

Here is a full submission of the --path-walk feature for 'git pack-objects' and 'git repack'. It's been discussed in an RFC [1], as a future application for the path walk API [2], and is updated now that --name-hash-version=2 exists (as a replacement for the --full-name-hash option from the RFC) [3].

[1] https://lore.kernel.org/git/[email protected]/

[2] https://lore.kernel.org/git/[email protected]

[3] https://lore.kernel.org/git/[email protected]

This patch series does the following:

  1. Add a new '--path-walk' option to 'git pack-objects' that uses the path-walk API instead of the revision API to collect objects for delta compression.

  2. Add a new '--path-walk' option to 'git repack' to pass this option along to 'git pack-objects'.

  3. Add a new 'pack.usePathWalk' config option to opt into this option implicitly, such as in 'git push'.

  4. Optimize the '--path-walk' option using threading so it better competes with the existing multi-threaded delta compression mechanism.

  5. Update the path-walk API with a new 'edge_aggressive' option that pairs close to the --edge-aggressive option in the revision API. This is useful when creating thin packs inside shallow clones.

This feature works by using the path-walk API to emit groups of objects that appear at the same path. These groups are tracked so they can be tested for delta compression with each other, and then after those groups are tested a second pass using the name-hash attempts to find better (or first time) deltas across path boundaries. This second pass is much faster than a fresh pass since the existing deltas are used as a limit for the size of potentially new deltas, short-circuiting the checks when the delta size exceeds the current-best.

The benefits of the --path-walk feature first come into play when the name hash functions have many collisions, so sorting by name hash value leads to unhelpful groupings of objects. Many of these benefits are improved by --name-hash-version=2, but collisions still exist with any hash-based approach. There are also performance benefits in some cases due to the isolation of delta compression testing within path groups.

All of the benefits of the --path-walk feature are less dramatic when compared to --name-hash-version=2, but they can still exist in many cases. I have also seen some cases where --name-hash-version=2 compresses better than --path-walk with --name-hash-version=1, but these options can be combined to get the best of both worlds.

Detailed statistics are provided within patch messages, but a few are highlighted here:

The microsoft/fluentui is a public Javascript repo that suffers from many of the name hash collisions as internal repositories I've worked with. Here is a comparison of the compressed size and end-to-end time of the repack:

Repack Method    Pack Size       Time
---------------------------------------
Hash v1             439.4M      87.24s
Hash v2             161.7M      21.51s
Path Walk           142.5M      28.16s

Less dramatic, but perhaps more standardly structured is the nodejs/node repository, with these stats:

Repack Method       Pack Size       Time
------------------------------------------
Hash v1                739.9M      71.18s
Hash v2                764.6M      67.82s
Path Walk              698.0M      75.10s

Even the Linux kernel repository gains some benefits, even though the number of hash collisions is relatively low due to a preference for short filenames:

Repack Method       Pack Size       Time
------------------------------------------
Hash v1                  2.5G     554.41s
Hash v2                  2.5G     549.62s
Path Walk                2.2G     559.00s

The drawbacks of the --path-walk feature is that it will be harder to integrate it with bitmap features, specifically delta islands. This is not insurmountable, but would require more work, such as a revision walk to paint objects with reachability information before using that during delta computations.

However, there should still be significant benefits to Git clients trying to save space and improve local performance.

This feature was shipped with similar features in microsoft/git as of v2.47.0.vfs.0.3 [4]. This was used in CI machines for an internal monorepo that had significant repository growth due to constructing a batch of beachball [5] CHANGELOG.[md|json] files and pushing them to a release branch. These pushes were frequently 70-200 MB due to poor delta compression. Using the 'pack.usePathWalk=true' config, these pushes dropped in size by 100x while improving performance. Since these CI machines were working with a shallow clone, the 'edge_aggressive' changes were required to enable the path-walk option.

[4] https://github.com/microsoft/git/releases/tag/v2.47.0.vfs.0.3

[5] https://github.com/microsoft/beachball

Updates in v2

  • Re-added a dropped comment when moving code in patch 1.
  • Updated documentation to include interaction with --use-bitmap-index.
  • An UNUSED parameter is now used, reducing the use of global variables slightly.

Updates in v3

Thanks for the review, Taylor. Sorry for my delay in getting back to your feedback.

  • Documentation has been edited slightly for simplicity.
  • is_oid_interesting() was swapped to is_oid_uninteresting()
  • sub_list_size renamed to sub_list_nr
  • Several uint32_t and uint64_t variables were converted to size_t.
  • Several 'unsigned int' variables were required to stay as-is, for now, until a refactor can be done.
  • An unnecessary update of tag_objects was removed.
  • The logic and error message around incompatible options is simpler.
  • Tests are expanded, especially around config options.
  • Fixed commit message typos.
  • Extra care around ALLOC_ARRAY() to avoid a zero- or negative-length array.

Thanks,
-Stolee

cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]

@derrickstolee derrickstolee self-assigned this Oct 29, 2024
@derrickstolee derrickstolee force-pushed the api-upstream branch 3 times, most recently from 781b2ea to ef54342 Compare December 18, 2024 16:13
@derrickstolee derrickstolee changed the base branch from api-upstream to master March 3, 2025 19:40
@derrickstolee derrickstolee force-pushed the path-walk-upstream branch 3 times, most recently from 26e1afb to 2eb9250 Compare March 9, 2025 21:55
@derrickstolee
Copy link
Author

/submit

Copy link

gitgitgadget bot commented Mar 10, 2025

Submitted as [email protected]

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1819/derrickstolee/path-walk-upstream-v1

To fetch this version to local tag pr-1819/derrickstolee/path-walk-upstream-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1819/derrickstolee/path-walk-upstream-v1

Copy link

gitgitgadget bot commented Mar 10, 2025

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <[email protected]> writes:

> ... deltas across path boundaries. This second pass is much faster than a fresh
> pass since the existing deltas are used as a limit for the size of
> potentially new deltas, short-circuiting the checks when the delta size
> exceeds the current-best.

Very nice.

> The microsoft/fluentui is a public Javascript repo that suffers from many of
> the name hash collisions as internal repositories I've worked with. Here is
> a comparison of the compressed size and end-to-end time of the repack:
>
> Repack Method    Pack Size       Time
> ---------------------------------------
> Hash v1             439.4M      87.24s
> Hash v2             161.7M      21.51s
> Path Walk           142.5M      28.16s
>
>
> Less dramatic, but perhaps more standardly structured is the nodejs/node
> repository, with these stats:
>
> Repack Method       Pack Size       Time
> ------------------------------------------
> Hash v1                739.9M      71.18s
> Hash v2                764.6M      67.82s
> Path Walk              698.0M      75.10s
>
>
> Even the Linux kernel repository gains some benefits, even though the number
> of hash collisions is relatively low due to a preference for short
> filenames:
>
> Repack Method       Pack Size       Time
> ------------------------------------------
> Hash v1                  2.5G     554.41s
> Hash v2                  2.5G     549.62s
> Path Walk                2.2G     559.00s

This third one, v2 not performing much better than v1, is quite
surprising.

> The drawbacks of the --path-walk feature is that it will be harder to
> integrate it with bitmap features, specifically delta islands. This is not
> insurmountable, but would require more work, such as a revision walk to
> paint objects with reachability information before using that during delta
> computations.
>
> However, there should still be significant benefits to Git clients trying to
> save space and improve local performance.

Sure.  More experiments and more approaches will eventually give us
overall improvement.  I am hoping that we will be able to condense
the result of these different approaches and their combinations into
easy-to-choose-from canned choices (as opposed to a myriad of little
knobs the users need to futz with without really understanding what
they are tweaking).

> This feature was shipped with similar features in microsoft/git as of
> v2.47.0.vfs.0.3 [4]. This was used in CI machines for an internal monorepo
> that had significant repository growth due to constructing a batch of
> beachball [5] CHANGELOG.[md|json] files and pushing them to a release
> branch. These pushes were frequently 70-200 MB due to poor delta
> compression. Using the 'pack.usePathWalk=true' config, these pushes dropped
> in size by 100x while improving performance. Since these CI machines were
> working with a shallow clone, the 'edge_aggressive' changes were required to
> enable the path-walk option.

Nice, thanks.

Copy link

gitgitgadget bot commented Mar 10, 2025

This patch series was integrated into seen via git@e51880c.

@gitgitgadget gitgitgadget bot added the seen label Mar 10, 2025
Copy link

gitgitgadget bot commented Mar 11, 2025

This branch is now known as ds/path-walk-2.

Copy link

gitgitgadget bot commented Mar 11, 2025

This patch series was integrated into seen via git@28416f0.

Copy link

gitgitgadget bot commented Mar 11, 2025

This patch series was integrated into seen via git@4fc875f.

return 0;
}

return 1;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

On Mon, Mar 24, 2025 at 03:22:46PM +0000, Derrick Stolee via GitGitGadget wrote:
> The current implementation is not integrated with threads, but could be
> done in a future update.

I think that "in a future update" may be worth replacing with "the
following commit", as the former suggests (to me) that it may be perused
outside of this series.

> Since we do not attempt to sort objects by size until after exploring
> all trees, we can remove the previous change to t5530 due to a different
> error message appearing first.

Makes sense.

> Signed-off-by: Derrick Stolee <[email protected]>
> ---
>  builtin/pack-objects.c       | 82 +++++++++++++++++++++++++-----------
>  pack-objects.h               | 12 ++++++
>  t/t5300-pack-object.sh       |  8 +++-
>  t/t5530-upload-pack-error.sh |  6 ---
>  4 files changed, 75 insertions(+), 33 deletions(-)
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 0ea85754c52..d4e05ca4434 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -3236,6 +3236,51 @@ static int should_attempt_deltas(struct object_entry *entry)
>  	return 1;
>  }
>
> +static void find_deltas_for_region(struct object_entry *list,
> +				   struct packing_region *region,
> +				   unsigned int *processed)
> +{
> +	struct object_entry **delta_list;
> +	uint32_t delta_list_nr = 0;
> +
> +	ALLOC_ARRAY(delta_list, region->nr);
> +	for (uint32_t i = 0; i < region->nr; i++) {

I know that these values are all unsigned, and very unlikely to get near
the 32-bit maximum, but I do think we should use size_t here (and
likewise when dealing with values that deal with how many entries in a
list we've allocated and indexes into that list).

> +		struct object_entry *entry = list + region->start + i;
> +		if (should_attempt_deltas(entry))
> +			delta_list[delta_list_nr++] = entry;
> +	}
> +
> +	QSORT(delta_list, delta_list_nr, type_size_sort);
> +	find_deltas(delta_list, &delta_list_nr, window, depth, processed);
> +	free(delta_list);
> +}

The rest of this function (modulo the inline comment removal, which I
wrote about below) appear to be a straightforward copy of the previous
home of this function.

> +static void find_deltas_by_region(struct object_entry *list,
> +				  struct packing_region *regions,
> +				  uint32_t start, uint32_t nr)

I imagine that "start" here is setup for the following commit which will
parallelize this task across multiple threads, with each thread starting
at a different position.

I wonder if (as an alternative) we could get away with passing in a
(struct packing_region *, size_t) pair and drop "start". I think this
can work if you pass in the result of adding whatever the value of
"start" would be to to_pack.regions here.

That somewhat hides the fact that this code is meant to be run across
multiple threads, but in a way that I think is worth doing. It lets the
function avoid having to do things like:

  for (size_t i = start; i < start + nr; i++)

and instead do the simpler:

  for (size_t i = 0; i < nr; i++)

since the caller already adjusted the regions pointer for us. As a
side-effect, it also means that the call below in prepare_pack() can
avoid passing a literal zero.

> +{
> +	unsigned int processed = 0;
> +	uint32_t progress_nr;

This uses a mix of uint32_t and unsigned int types. Should these all be
the same (and/or size_t's)?

> -	/*
> -	 * Find delta bases among this list of objects that all match the same
> -	 * path. This causes the delta compression to be interleaved in the
> -	 * object walk, which can lead to confusing progress indicators. This is
> -	 * also incompatible with threaded delta calculations. In the future,
> -	 * consider creating a list of regions in the full to_pack.objects array
> -	 * that could be picked up by the threaded delta computation.
> -	 */

Nice, it is very satisfying to see this comment go away ;-).

> -	if (sub_list_size && window) {
> -		QSORT(delta_list, sub_list_size, type_size_sort);
> -		find_deltas(delta_list, &sub_list_size, window, depth, processed);
> -	}
> +	*processed += oids->nr;
> +	display_progress(progress_state, *processed);
>
> -	free(delta_list);
>  	return 0;
>  }
>
> diff --git a/pack-objects.h b/pack-objects.h
> index d73e3843c92..7ba9f3448fe 100644
> --- a/pack-objects.h
> +++ b/pack-objects.h
> @@ -119,11 +119,23 @@ struct object_entry {
>  	unsigned ext_base:1; /* delta_idx points outside packlist */
>  };
>
> +/**

This is an extreme nitpick, but I think that "/**" (as opposed to "/*")
is preferred, as the former is used for Doxygen-style comments. I feel
like I have seen Junio comment on this before, but searching 'f:Junio
"/**"' yields a measly 83,000+ results, so I am unlikely to find it ;-).

(Not worth rerolling on its own, but if you are rerolling anyway, which
is what I gathered from some of your earlier replies, it may be worth
picking up along the way.)

> + * A packing region is a section of the packing_data.objects array
> + * as given by a starting index and a number of elements.
> + */
> +struct packing_region {
> +	uint32_t start;
> +	uint32_t nr;
> +};

Same note here about the uint32_t versus size_t.
> +
>  struct packing_data {
>  	struct repository *repo;
>  	struct object_entry *objects;
>  	uint32_t nr_objects, nr_alloc;
>
> +	struct packing_region *regions;
> +	uint64_t nr_regions, nr_regions_alloc;

Ditto, but this one jumps to uint64_t. What differentiates these two?

Thanks,
Taylor

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 5/6/25 9:14 PM, Taylor Blau wrote:
> On Mon, Mar 24, 2025 at 03:22:46PM +0000, Derrick Stolee via GitGitGadget wrote:
>> The current implementation is not integrated with threads, but could be
>> done in a future update.
> > I think that "in a future update" may be worth replacing with "the
> following commit", as the former suggests (to me) that it may be perused
> outside of this series.

Noted and fixed.


>> +static void find_deltas_for_region(struct object_entry *list,
>> +				   struct packing_region *region,
>> +				   unsigned int *processed)
>> +{
>> +	struct object_entry **delta_list;
>> +	uint32_t delta_list_nr = 0;
>> +
>> +	ALLOC_ARRAY(delta_list, region->nr);
>> +	for (uint32_t i = 0; i < region->nr; i++) {
> > I know that these values are all unsigned, and very unlikely to get near
> the 32-bit maximum, but I do think we should use size_t here (and
> likewise when dealing with values that deal with how many entries in a
> list we've allocated and indexes into that list).

I've updated things to size_t where I could across this patch. Thanks.

>> +static void find_deltas_by_region(struct object_entry *list,
>> +				  struct packing_region *regions,
>> +				  uint32_t start, uint32_t nr)
> > I imagine that "start" here is setup for the following commit which will
> parallelize this task across multiple threads, with each thread starting
> at a different position.
> > I wonder if (as an alternative) we could get away with passing in a
> (struct packing_region *, size_t) pair and drop "start". I think this
> can work if you pass in the result of adding whatever the value of
> "start" would be to to_pack.regions here.
> > That somewhat hides the fact that this code is meant to be run across
> multiple threads, but in a way that I think is worth doing. It lets the
> function avoid having to do things like:
> >    for (size_t i = start; i < start + nr; i++)
> > and instead do the simpler:
> >    for (size_t i = 0; i < nr; i++)
> > since the caller already adjusted the regions pointer for us. As a
> side-effect, it also means that the call below in prepare_pack() can
> avoid passing a literal zero.

I generally avoid this kind of pattern, because it makes it more
difficult to understand what is going on when debugging. The cost
of adding 'start' to the index seems low relative to having a
consistent indexing scheme when accessing from the "global" list
of items.

This only gets more complicated when the threaded version iterates
on a "consecutive sublist of regions".

>> +/**
> > This is an extreme nitpick, but I think that "/**" (as opposed to "/*")
> is preferred, as the former is used for Doxygen-style comments. I feel
> like I have seen Junio comment on this before, but searching 'f:Junio
> "/**"' yields a measly 83,000+ results, so I am unlikely to find it ;-).
> > (Not worth rerolling on its own, but if you are rerolling anyway, which
> is what I gathered from some of your earlier replies, it may be worth
> picking up along the way.)

I'm confused. Are you saying I should change from "/**" to "/*" (which
is the opposite of your preferred statement) or did you misread that I
had used "/**"? I can't determine which statement is the intended one.
 Thanks,
-Stolee

@@ -2954,6 +2964,7 @@ static void find_deltas(struct object_entry **list, unsigned *list_size,
struct thread_params {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

On Mon, Mar 24, 2025 at 03:22:47PM +0000, Derrick Stolee via GitGitGadget wrote:
> Using the Git repository as a test repo, the p5313 performance test
> shows that the resulting size of the repo is the same, but the threaded
> implementation gives gains of varying degrees depending on the number of
> objects being packed. (This was tested on a 16-core machine.)
>
> Test                        HEAD~1      HEAD
> ---------------------------------------------------
> 5313.20: big pack             2.38      1.99 -16.4%
> 5313.21: big pack size       16.1M     16.0M  -0.2%
> 5313.24: repack             107.32     45.41 -57.7%
> 5313.25: repack size        213.3M    213.2M  -0.0%

Very nice indeed!

> This ~60% reduction in 'git repack --path-walk' time is typical across
> all repos I used for testing. What is interesting is to compare when the
> overall time improves enough to outperform the --name-hash-version=1
> case. These time improvements correlate with repositories with data
> shapes that significantly improve their data size as well. The

I had to read this sentence a couple of times to wrap my head around it,
but perhaps you can double-check my understanding. Are you saying that
reducing the size of the resulting packs suggests that our runtime with
the path-walk algorithm is improved enough to be competitive with (or
better than) the non-path-walk implementation with name-hash v1?

I think that's what you mean, and I admit that I don't find my own
restatement to be all that much clearer. So I think it's fine to leave
this as-is, though it is interesting to think about why the statement
is true.

I suspect (without having tested it thoroughly) that a significant
proportion of the time saved on the path-walk side is because we found
better deltas that are easier to compress, as a result of a more
well-tuned search technique. I don't think you necessarily have to get
too far into the details here, but I was curious (not for commit message
polishing, but purely for my own curiosity) if you had any thoughts or
measurements of why this might be the case.

But...

> --path-walk feature frequently takes longer than --name-hash-verison=2,
> trading some extrac computation for some additional compression. The

s/extrac/extra/

> natural place where this additional computation comes from is the two
> compression passes that --path-walk takes, though the first pass is
> naturally faster due to the path boundaries avoiding a number of delta
> compression attempts.

... I should have kept reading before commenting the above! ;-)

> Signed-off-by: Derrick Stolee <[email protected]>
> ---
>  builtin/pack-objects.c | 163 ++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 161 insertions(+), 2 deletions(-)
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index d4e05ca4434..2a6246c1e78 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -2964,6 +2964,7 @@ static void find_deltas(struct object_entry **list, unsigned *list_size,
>  struct thread_params {
>  	pthread_t thread;
>  	struct object_entry **list;
> +	struct packing_region *regions;
>  	unsigned list_size;
>  	unsigned remaining;
>  	int window;
> @@ -3281,6 +3282,164 @@ static void find_deltas_by_region(struct object_entry *list,
>  	stop_progress(&progress_state);
>  }
>
> +static void *threaded_find_deltas_by_path(void *arg)
> +{
> +	struct thread_params *me = arg;
> +
> +	progress_lock();
> +	while (me->remaining) {
> +		while (me->remaining) {

I am not quite following the double-while loop here. Could you help
clarify what's going on here? Is this due to the work-stealing below?

> +static void ll_find_deltas_by_region(struct object_entry *list,
> +				     struct packing_region *regions,
> +				     uint32_t start, uint32_t nr)
> +{
> +	struct thread_params *p;
> +	int i, ret, active_threads = 0;
> +	unsigned int processed = 0;
> +	uint32_t progress_nr;
> +	init_threaded_search();
> +
> +	if (!nr)
> +		return;
> +
> +	progress_nr =  regions[nr - 1].start + regions[nr - 1].nr;
> +	if (delta_search_threads <= 1) {
> +		find_deltas_by_region(list, regions, start, nr);
> +		cleanup_threaded_search();
> +		return;
> +	}
> +
> +	if (progress > pack_to_stdout)
> +		fprintf_ln(stderr, _("Path-based delta compression using up to %d threads"),
> +			   delta_search_threads);

This looks a copy-and-paste from the non-path-walk code, but I wonder if
it might be worth using Q_() here instead of _() to provide better
output in the case that delta_search_threads is 1.

The rest of the implementation of ll_find_deltas_by_region() looks good
to me.

Thanks,
Taylor

Copy link

gitgitgadget bot commented May 7, 2025

On the Git mailing list, Taylor Blau wrote (reply to this):

On Fri, May 02, 2025 at 07:44:30PM -0400, Taylor Blau wrote:
> OK, I was able to get through the first 8 or so patches in the series,
> and left a handful of comments throughout. I'm running out of time ATM
> to finish reviewing, but I should be able to pick it up next Tuesday (I
> am out of office on Monday).

All set, and thanks again to Stolee for their patience. I think I got
through everything I wanted to, and I think in total there is enough to
make a reroll worthwhile here.

Thanks,
Taylor

Copy link

gitgitgadget bot commented May 8, 2025

There was a status update in the "Cooking" section about the branch ds/path-walk-2 on the Git mailing list:

"git pack-objects" learns to find delta bases from blobs at the
same path, using the --path-walk API.

Will merge to 'next'?
cf. <[email protected]>
source: <[email protected]>

Copy link

gitgitgadget bot commented May 8, 2025

This patch series was integrated into seen via git@97911ea.

Copy link

gitgitgadget bot commented May 8, 2025

This patch series was integrated into seen via git@2862565.

Copy link

gitgitgadget bot commented May 9, 2025

This patch series was integrated into seen via git@4db9cc2.

Copy link

gitgitgadget bot commented May 12, 2025

This patch series was integrated into seen via git@7a36c89.

Copy link

gitgitgadget bot commented May 13, 2025

There was a status update in the "Cooking" section about the branch ds/path-walk-2 on the Git mailing list:

"git pack-objects" learns to find delta bases from blobs at the
same path, using the --path-walk API.

Waiting for review responses.
cf. <[email protected]>
cf. <[email protected]>
source: <[email protected]>

Copy link

gitgitgadget bot commented May 13, 2025

This patch series was integrated into seen via git@7461ab3.

Copy link

gitgitgadget bot commented May 13, 2025

This patch series was integrated into seen via git@d638de2.

Copy link

gitgitgadget bot commented May 14, 2025

This patch series was integrated into seen via git@2b3d27e.

Copy link

gitgitgadget bot commented May 15, 2025

This patch series was integrated into seen via git@49e32e8.

In order to more easily compute delta bases among objects that appear at
the exact same path, add a --path-walk option to 'git pack-objects'.

This option will use the path-walk API instead of the object walk given
by the revision machinery. Since objects will be provided in batches
representing a common path, those objects can be tested for delta bases
immediately instead of waiting for a sort of the full object list by
name-hash. This has multiple benefits, including avoiding collisions by
name-hash.

The objects marked as UNINTERESTING are included in these batches, so we
are guaranteeing some locality to find good delta bases.

After the individual passes are done on a per-path basis, the default
name-hash is used to find other opportunistic delta bases that did not
match exactly by the full path name.

The current implementation performs delta calculations while walking
objects, which is not ideal for a few reasons. First, this will cause
the "Enumerating objects" phase to be much longer than usual. Second, it
does not take advantage of threading during the path-scoped delta
calculations. Even with this lack of threading, the path-walk option is
sometimes faster than the usual approach. Future changes will refactor
this code to allow for threading, but that complexity is deferred until
later to keep this patch as simple as possible.

This new walk is incompatible with some features and is ignored by
others:

 * Object filters are not currently integrated with the path-walk API,
   such as sparse-checkout or tree depth. A blobless packfile could be
   integrated easily, but that is deferred for later.

 * Server-focused features such as delta islands, shallow packs, and
   using a bitmap index are incompatible with the path-walk API.

 * The path walk API is only compatible with the --revs option, not
   taking object lists or pack lists over stdin. These alternative ways
   to specify the objects currently ignores the --path-walk option
   without even a warning.

Future changes will create performance tests that demonstrate the power
of this approach.

Signed-off-by: Derrick Stolee <[email protected]>
The t0450 test script verifies that builtin usage matches the synopsis
in the documentation. Adjust the builtin to match and then remove 'git
pack-objects' from the exception list.

Signed-off-by: Derrick Stolee <[email protected]>
The previous change added a --path-walk option to 'git pack-objects'.
Create a performance test that demonstrates the time and space benefits
of the feature.

In order to get an appropriate comparison, we need to avoid reusing
deltas and recompute them from scratch.

Compare the creation of a thin pack representing a small push and the
creation of a relatively large non-thin pack.

Running on my copy of the Git repository results in this data (removing
the repack tests for --name-hash-version):

Test                                                     this tree
------------------------------------------------------------------------
5313.2: thin pack with --name-hash-version=1             0.02(0.01+0.01)
5313.3: thin pack size with --name-hash-version=1                   1.6K
5313.4: big pack with --name-hash-version=1              2.55(4.20+0.26)
5313.5: big pack size with --name-hash-version=1                   16.4M
5313.6: shallow fetch pack with --name-hash-version=1    1.24(2.03+0.08)
5313.7: shallow pack size with --name-hash-version=1               12.2M
5313.10: thin pack with --name-hash-version=2            0.03(0.01+0.01)
5313.11: thin pack size with --name-hash-version=2                  1.6K
5313.12: big pack with --name-hash-version=2             1.91(3.23+0.20)
5313.13: big pack size with --name-hash-version=2                  16.4M
5313.14: shallow fetch pack with --name-hash-version=2   1.06(1.57+0.10)
5313.15: shallow pack size with --name-hash-version=2              12.5M
5313.18: thin pack with --path-walk                      0.03(0.01+0.01)
5313.19: thin pack size with --path-walk                            1.6K
5313.20: big pack with --path-walk                       2.05(3.24+0.27)
5313.21: big pack size with --path-walk                            16.3M
5313.22: shallow fetch pack with --path-walk             1.08(1.66+0.07)
5313.23: shallow pack size with --path-walk                        12.4M

This can be reformatted as follows:

Pack Type            Hash v1   Hash v2     Path Walk
---------------------------------------------------
thin pack    (time)    0.02s      0.03s      0.03s
             (size)    1.6K       1.6K       1.6K
big pack     (time)    2.55s      1.91s      2.05s
             (size)   16.4M      16.4M      16.3M
shallow pack (time)    1.24s      1.06s      1.08s
             (size)   12.2M      12.5M      12.4M

Note that the timing is slower because there is no threading in the
--path-walk case (yet). Also, the shallow pack cases are really not
using the --path-walk logic right now because it is disabled until some
additions are made to the path walk API.

The cases where the --path-walk option really shines is when the default
name-hash is overwhelmed with unhelpful collisions. An open source
example can be found in the microsoft/fluentui repo [1] at a certain
commit [2].

[1] https://github.com/microsoft/fluentui
[2] e70848ebac1cd720875bccaa3026f4a9ed700e08

Running the tests on this repo results in the following comparison table:

Pack Type            Hash v1    Hash v2    Path Walk
---------------------------------------------------
thin pack    (time)    0.36s      0.12s      0.08s
             (size)    1.2M      22.0K      18.4K
big pack     (time)    2.00s      2.90s      2.21s
             (size)   20.4M      25.9M      19.5M
shallow pack (time)    1.41s      1.80s      1.65s
             (size)   34.4M      33.7M      33.6M

Notice in particular that in the small thin pack, the time performance
has improved from 0.36s for --name-hash-version=1 to 0.08s and this is
likely due to the improved size of the resulting pack: 18.4K instead of
1.2M.  The relatively new --name-hash-version=2 is competitive with
--path-walk (0.12s and 22.0K) but not quite as successful.

Finally, running this on a copy of the Linux kernel repository results
in these data points:

Pack Type            Hash v1    Hash v2    Path Walk
---------------------------------------------------
thin pack    (time)    0.03s      0.13s      0.03s
             (size)    4.6K       4.6K       4.6K
big pack     (time)   15.29s     12.32s     13.92s
             (size)  201.1M     159.1M     158.5M
shallow pack (time)   10.88s     22.93s     22.74s
             (size)  269.2M     273.8M     267.7M

Signed-off-by: Derrick Stolee <[email protected]>
There are many tests that validate whether 'git pack-objects' works as
expected. Instead of duplicating these tests, add a new test environment
variable, GIT_TEST_PACK_PATH_WALK, that implies --path-walk by default
when specified.

This was useful in testing the implementation of the --path-walk
implementation, helping to find tests that are overly specific to the
default object walk. These include:

 - t0411-clone-from-partial.sh : One test fetches from a repo that does
   not have the boundary objects. This causes the path-based walk to
   fail. Disable the variable for this test.

 - t5306-pack-nobase.sh : Similar to t0411, one test fetches from a repo
   without a boundary object.

 - t5310-pack-bitmaps.sh : One test compares the case when packing with
   bitmaps to the case when packing without them. Since we disable the
   test variable when writing bitmaps, this causes a difference in the
   object list (the --path-walk option adds an extra object). Specify
   --no-path-walk in both processes for the comparison. Another test
   checks for a specific delta base, but when computing dynamically
   without using bitmaps, the base object it too small to be considered
   in the delta calculations so no base is used.

 - t5316-pack-delta-depth.sh : This script cares about certain delta
   choices and their chain lengths. The --path-walk option changes how
   these chains are selected, and thus changes the results of this test.

 - t5322-pack-objects-sparse.sh : This demonstrates the effectiveness of
   the --sparse option and how it combines with --path-walk.

 - t5332-multi-pack-reuse.sh : This test verifies that the preferred
   pack is used for delta reuse when possible. The --path-walk option is
   not currently aware of the preferred pack at all, so finds a
   different delta base.

 - t7406-submodule-update.sh : When using the variable, the --depth
   option collides with the --path-walk feature, resulting in a warning
   message. Disable the variable so this warning does not appear.

I want to call out one specific test change that is only temporary:

 - t5530-upload-pack-error.sh : One test cares specifically about an
   "unable to read" error message. Since the current implementation
   performs delta calculations within the path-walk API callback, a
   different "unable to get size" error message appears. When this
   is changed in a future refactoring, this test change can be reverted.

Similar to GIT_TEST_NAME_HASH_VERSION, we do not add this option to the
linux-TEST-vars CI build as that's already an overloaded build.

Signed-off-by: Derrick Stolee <[email protected]>
It can be notoriously difficult to detect if delta bases are being
computed properly during 'git push'. Construct an example where it will
make a kilobyte worth of difference when a delta base is not found. We
can then use the progress indicators to distinguish between bytes and
KiB depending on whether the delta base is found and used.

Signed-off-by: Derrick Stolee <[email protected]>
Since 'git pack-objects' supports a --path-walk option, allow passing it
through in 'git repack'. This presents interesting testing opportunities for
comparing the different repacking strategies against each other.

Add the --path-walk option to the performance tests in p5313.

For the microsoft/fluentui repo [1] checked out at a specific commit [2],
the --path-walk tests in p5313 look like this:

Test                                                     this tree
-------------------------------------------------------------------------
5313.18: thin pack with --path-walk                      0.08(0.06+0.02)
5313.19: thin pack size with --path-walk                           18.4K
5313.20: big pack with --path-walk                       2.10(7.80+0.26)
5313.21: big pack size with --path-walk                            19.8M
5313.22: shallow fetch pack with --path-walk             1.62(3.38+0.17)
5313.23: shallow pack size with --path-walk                        33.6M
5313.24: repack with --path-walk                         81.29(96.08+0.71)
5313.25: repack size with --path-walk                             142.5M

[1] https://github.com/microsoft/fluentui
[2] e70848ebac1cd720875bccaa3026f4a9ed700e08

Along with the earlier tests in p5313, I'll instead reformat the
comparison as follows:

Repack Method    Pack Size       Time
---------------------------------------
Hash v1             439.4M      87.24s
Hash v2             161.7M      21.51s
Path Walk           142.5M      81.29s

There are a few things to notice here:

 1. The benefits of --name-hash-version=2 over --name-hash-version=1 are
    significant, but --path-walk still compresses better than that
    option.

 2. The --path-walk command is still using --name-hash-version=1 for the
    second pass of delta computation, using the increased name hash
    collisions as a potential method for opportunistic compression on
    top of the path-focused compression.

 3. The --path-walk algorithm is currently sequential and does not use
    multiple threads for delta compression. Threading will be
    implemented in a future change so the computation time will improve
    to better compete in this metric.

There are small benefits in size for my copy of the Git repository:

Repack Method    Pack Size       Time
---------------------------------------
Hash v1             248.8M      30.44s
Hash v2             249.0M      30.15s
Path Walk           213.2M     142.50s

As well as in the nodejs/node repository [3]:

Repack Method    Pack Size       Time
---------------------------------------
Hash v1             739.9M      71.18s
Hash v2             764.6M      67.82s
Path Walk           698.1M     208.10s

[3] https://github.com/nodejs/node

This benefit also repeats in my copy of the Linux kernel repository:

Repack Method    Pack Size       Time
---------------------------------------
Hash v1               2.5G     554.41s
Hash v2               2.5G     549.62s
Path Walk             2.2G    1562.36s

It is important to see that even when the repository shape does not have
many name-hash collisions, there is a slight space boost to be found
using this method.

As this repacking strategy was released in Git for Windows 2.47.0, some
users have reported cases where the --path-walk compression is slightly
worse than the --name-hash-version=2 option. In those cases, it may be
beneficial to combine the two options. However, there has not been a
released version of Git that has both options and I don't have access to
these repos for testing.

Signed-off-by: Derrick Stolee <[email protected]>
Users may want to enable the --path-walk option for 'git pack-objects' by
default, especially underneath commands like 'git push' or 'git repack'.

This should be limited to client repositories, since the --path-walk option
disables bitmap walks, so would be bad to include in Git servers when
serving fetches and clones. There is potential that it may be helpful to
consider when repacking the repository, to take advantage of improved deltas
across historical versions of the same files.

Much like how "pack.useSparse" was introduced and included in
"feature.experimental" before being enabled by default, use the repository
settings infrastructure to make the new "pack.usePathWalk" config enabled by
"feature.experimental" and "feature.manyFiles".

In order to test that this config works, add a new trace2 region around
the path walk code that can be checked by a 'git push' command.

Signed-off-by: Derrick Stolee <[email protected]>
Repositories registered with Scalar are expected to be client-only
repositories that are rather large. This means that they are more likely to
be good candidates for using the --path-walk option when running 'git
pack-objects', especially under the hood of 'git push'. Enable this config
in Scalar repositories.

Signed-off-by: Derrick Stolee <[email protected]>
Previously, the --path-walk option to 'git pack-objects' would compute
deltas inline with the path-walk logic. This would make the progress
indicator look like it is taking a long time to enumerate objects, and
then very quickly computed deltas.

Instead of computing deltas on each region of objects organized by tree,
store a list of regions corresponding to these groups. These can later
be pulled from the list for delta compression before doing the "global"
delta search.

This presents a new progress indicator that can be used in tests to
verify that this stage is happening.

The current implementation is not integrated with threads, but we are
setting it up to arrive in the next change.

Since we do not attempt to sort objects by size until after exploring
all trees, we can remove the previous change to t5530 due to a different
error message appearing first.

Signed-off-by: Derrick Stolee <[email protected]>
Adapting the implementation of ll_find_deltas(), create a threaded
version of the --path-walk compression step in 'git pack-objects'.

This involves adding a 'regions' member to the thread_params struct,
allowing each thread to own a section of paths. We can simplify the way
jobs are split because there is no value in extending the batch based on
name-hash the way sections of the object entry array are attempted to be
grouped. We re-use the 'list_size' and 'remaining' items for the purpose
of borrowing work in progress from other "victim" threads when a thread
has finished its batch of work more quickly.

Using the Git repository as a test repo, the p5313 performance test
shows that the resulting size of the repo is the same, but the threaded
implementation gives gains of varying degrees depending on the number of
objects being packed. (This was tested on a 16-core machine.)

Test                        HEAD~1      HEAD
---------------------------------------------------
5313.20: big pack             2.38      1.99 -16.4%
5313.21: big pack size       16.1M     16.0M  -0.2%
5313.24: repack             107.32     45.41 -57.7%
5313.25: repack size        213.3M    213.2M  -0.0%

(Test output is formatted to better fit in message.)

This ~60% reduction in 'git repack --path-walk' time is typical across
all repos I used for testing. What is interesting is to compare when the
overall time improves enough to outperform the --name-hash-version=1
case. These time improvements correlate with repositories with data
shapes that significantly improve their data size as well. The
--path-walk feature frequently takes longer than --name-hash-version=2,
trading some extra computation for some additional compression. The
natural place where this additional computation comes from is the two
compression passes that --path-walk takes, though the first pass is
naturally faster due to the path boundaries avoiding a number of delta
compression attempts.

For example, the microsoft/fluentui repo has significant size reduction
from --name-hash-version=1 to --name-hash-version=2 followed by further
improvements with --path-walk. The threaded computation makes
--path-walk more competitive in time compared to --name-hash-version=2,
though still ~31% more expensive in that metric.

Repack Method       Pack Size       Time
------------------------------------------
Hash v1                439.4M      87.24s
Hash v2                161.7M      21.51s
Path Walk (Before)     142.5M      81.29s
Path Walk (After)      142.5M      28.16s

Similar results hold for the Git repository:

Repack Method       Pack Size       Time
------------------------------------------
Hash v1                248.8M      30.44s
Hash v2                249.0M      30.15s
Path Walk (Before)     213.2M     142.50s
Path Walk (After)      213.3M      45.41s

...as well as the nodejs/node repository:

Repack Method       Pack Size       Time
------------------------------------------
Hash v1                739.9M      71.18s
Hash v2                764.6M      67.82s
Path Walk (Before)     698.1M     208.10s
Path Walk (After)      698.0M      75.10s

Finally, the Linux kernel repository is a good test for this repacking
time change, even though the space savings is more subtle:

Repack Method       Pack Size       Time
------------------------------------------
Hash v1                  2.5G     554.41s
Hash v2                  2.5G     549.62s
Path Walk (before)       2.2G    1562.36s
Path Walk (before)       2.2G     559.00s

Signed-off-by: Derrick Stolee <[email protected]>
In preparation for allowing both the --shallow and --path-walk options
in the 'git pack-objects' builtin, create a new 'edge_aggressive' option
in the path-walk API. This option will help walk the boundary more
thoroughly and help avoid sending extra objects during fetches and
pushes.

The only use of the 'edge_hint_aggressive' option in the revision API is
within mark_edges_uninteresting(), which is usually called before
between prepare_revision_walk() and before visiting commits with
get_revision(). In prepare_revision_walk(), the UNINTERESTING commits
are walked until a boundary is found.

Signed-off-by: Derrick Stolee <[email protected]>
There does not appear to be anything particularly incompatible about the
--shallow and --path-walk options of 'git pack-objects'. If shallow
commits are to be handled differently, then it is by the revision walk
that defines the commit set and which are interesting or uninteresting.

However, before the previous change, a trivial removal of the warning
would cause a failure in t5500-fetch-pack.sh when
GIT_TEST_PACK_PATH_WALK is enabled. The shallow fetch would provide more
objects than we desired, due to some incorrect behavior of the path-walk
API, especially around walking uninteresting objects.

The recently-added tests in t5538-push-shallow.sh help to confirm this
behavior is working with the --path-walk option if
GIT_TEST_PACK_PATH_WALK is enabled. These tests passed previously due to
the --path-walk feature being disabled in the presence of a shallow
clone.

Signed-off-by: Derrick Stolee <[email protected]>
@derrickstolee
Copy link
Author

/submit

Copy link

gitgitgadget bot commented May 16, 2025

Submitted as [email protected]

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1819/derrickstolee/path-walk-upstream-v3

To fetch this version to local tag pr-1819/derrickstolee/path-walk-upstream-v3:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1819/derrickstolee/path-walk-upstream-v3

Copy link

gitgitgadget bot commented May 16, 2025

This patch series was integrated into seen via git@2feff74.

Copy link

gitgitgadget bot commented May 17, 2025

There was a status update in the "Cooking" section about the branch ds/path-walk-2 on the Git mailing list:

"git pack-objects" learns to find delta bases from blobs at the
same path, using the --path-walk API.

Comments?
source: <[email protected]>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant