PATH WALK II: Add --path-walk option to 'git pack-objects' #1819

derrickstolee · 2024-10-29T02:36:04Z

Here is a full submission of the --path-walk feature for 'git pack-objects' and 'git repack'. It was discussed in an RFC [1] and again as a future application of the path-walk API [2], and it has been updated now that --name-hash-version=2 exists (as a replacement for the --full-name-hash option from the RFC) [3].

[1] https://lore.kernel.org/git/[email protected]/

[2] https://lore.kernel.org/git/[email protected]

[3] https://lore.kernel.org/git/[email protected]

This patch series does the following:

Add a new '--path-walk' option to 'git pack-objects' that uses the path-walk API instead of the revision API to collect objects for delta compression.
Add a new '--path-walk' option to 'git repack' to pass this option along to 'git pack-objects'.
Add a new 'pack.usePathWalk' config option to opt into this option implicitly, such as in 'git push'.
Optimize the '--path-walk' option using threading so it better competes with the existing multi-threaded delta compression mechanism.
Update the path-walk API with a new 'edge_aggressive' option that closely mirrors the --objects-edge-aggressive option in the revision API. This is useful when creating thin packs inside shallow clones.

This feature works by using the path-walk API to emit groups of objects that appear at the same path. These groups are tracked so that their objects can be tested for delta compression against each other. After those groups are tested, a second pass using the name hash attempts to find better (or first-time) deltas across path boundaries. This second pass is much faster than a fresh pass because the existing deltas limit the size of any potential new delta, short-circuiting the checks as soon as a candidate delta exceeds the current best.
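
The two-pass search described above can be sketched in a few lines of C. This is only an illustration under loud assumptions: 'struct toy_object', 'toy_delta_size()', 'try_delta()' and 'toy_find_deltas()' are hypothetical names, the "delta size" is a toy count of differing bytes rather than a real delta, and the second pass simply scans earlier objects instead of sorting by name hash inside a window as 'git pack-objects' does. What it does show is how the best size from the first pass becomes the limit that short-circuits the second.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a packed object; real objects carry far more state. */
struct toy_object {
	const char *path;  /* path at which the object appears */
	const char *data;  /* object contents (NUL-terminated for simplicity) */
	size_t best_size;  /* cost of the best representation found so far */
	int base;          /* index of the delta base, or -1 if stored whole */
};

/*
 * Toy "delta size": the number of byte positions that differ.
 * Stops early and returns 'limit' once the cost can no longer beat it.
 */
static size_t toy_delta_size(const char *target, const char *base, size_t limit)
{
	size_t tl = strlen(target), bl = strlen(base);
	size_t n = tl > bl ? tl : bl;
	size_t cost = 0;

	for (size_t i = 0; i < n; i++) {
		char t = i < tl ? target[i] : '\0';
		char b = i < bl ? base[i] : '\0';
		if (t != b && ++cost >= limit)
			return limit; /* short-circuit: cannot improve */
	}
	return cost;
}

/* Keep 'base' as the delta base only if it strictly beats the current best. */
static void try_delta(struct toy_object *objs, int target, int base)
{
	size_t size = toy_delta_size(objs[target].data, objs[base].data,
				     objs[target].best_size);
	if (size < objs[target].best_size) {
		objs[target].best_size = size;
		objs[target].base = base;
	}
}

static void toy_find_deltas(struct toy_object *objs, int nr)
{
	assert(objs || nr == 0);

	/* Storing an object whole costs its full length. */
	for (int i = 0; i < nr; i++) {
		objs[i].best_size = strlen(objs[i].data);
		objs[i].base = -1;
	}

	/* Pass 1: only compare objects that appear at the same path. */
	for (int i = 0; i < nr; i++)
		for (int j = 0; j < i; j++) /* earlier bases only: no cycles */
			if (!strcmp(objs[i].path, objs[j].path))
				try_delta(objs, i, j);

	/* Pass 2: look across paths, bounded by the pass-1 results. */
	for (int i = 0; i < nr; i++)
		for (int j = 0; j < i; j++)
			if (strcmp(objs[i].path, objs[j].path))
				try_delta(objs, i, j);
}
```

In this sketch, an object that already found a one-byte delta within its path group gives up on every cross-path candidate after the first mismatching byte, while an object with no same-path neighbor can still pick up a base from another path in the second pass.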

The benefits of the --path-walk feature first come into play when the name hash functions have many collisions, so sorting by name hash value leads to unhelpful groupings of objects. Many of these benefits are improved by --name-hash-version=2, but collisions still exist with any hash-based approach. There are also performance benefits in some cases due to the isolation of delta compression testing within path groups.
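
To make the collision problem concrete, here is a small sketch modeled on my understanding of Git's original (version 1) name hash, which folds each non-whitespace character into the top byte of a 32-bit value and shifts everything else down by two bits. The names 'sketch_name_hash_v1()' and 'sketch_paths_collide()' are hypothetical, and the function is an approximation for illustration, not a copy of the code in this series.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

/*
 * Sketch of a v1-style name hash (assumed shape, for illustration only).
 * Each character lands in the top byte and earlier characters are shifted
 * right by two bits, so characters far from the end of the path are
 * shifted almost entirely out of the 32-bit result.
 */
static uint32_t sketch_name_hash_v1(const char *name)
{
	uint32_t hash = 0;

	if (!name)
		return 0;
	for (; *name; name++) {
		unsigned char c = (unsigned char)*name;
		if (isspace(c))
			continue; /* whitespace does not contribute */
		hash = (hash >> 2) + ((uint32_t)c << 24);
	}
	return hash;
}

/* Returns nonzero when two paths receive the same hash value. */
static int sketch_paths_collide(const char *a, const char *b)
{
	assert(a && b);
	return sketch_name_hash_v1(a) == sketch_name_hash_v1(b);
}
```

Under this sketch, the hypothetical paths 'packages/badge/CHANGELOG.json' and 'packages/image/CHANGELOG.json' receive the same value: they share their final sixteen characters, and the differing directory names are too far from the end to change the result. Sorting by such a hash places unrelated files with the same basename next to each other, which is the kind of unhelpful grouping that grouping by full path avoids.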

All of the benefits of the --path-walk feature are less dramatic when compared to --name-hash-version=2, but they can still exist in many cases. I have also seen some cases where --name-hash-version=2 compresses better than --path-walk with --name-hash-version=1, but these options can be combined to get the best of both worlds.

Detailed statistics are provided within patch messages, but a few are highlighted here:

The microsoft/fluentui repository is a public JavaScript repo that suffers from many of the same name-hash collisions as internal repositories I've worked with. Here is a comparison of the compressed size and end-to-end time of the repack:

Repack Method    Pack Size       Time
---------------------------------------
Hash v1             439.4M      87.24s
Hash v2             161.7M      21.51s
Path Walk           142.5M      28.16s

Less dramatic, but perhaps more conventionally structured, is the nodejs/node repository, with these stats:

Repack Method       Pack Size       Time
------------------------------------------
Hash v1                739.9M      71.18s
Hash v2                764.6M      67.82s
Path Walk              698.0M      75.10s

The Linux kernel repository also gains some benefit, even though the number of hash collisions is relatively low due to a preference for short filenames:

Repack Method       Pack Size       Time
------------------------------------------
Hash v1                  2.5G     554.41s
Hash v2                  2.5G     549.62s
Path Walk                2.2G     559.00s

The main drawback of the --path-walk feature is that it will be harder to integrate with bitmap features, specifically delta islands. This is not insurmountable, but it would require more work, such as a revision walk to paint objects with reachability information before using that information during delta computations.

However, there should still be significant benefits to Git clients trying to save space and improve local performance.

This feature was shipped with similar features in microsoft/git as of v2.47.0.vfs.0.3 [4]. This was used in CI machines for an internal monorepo that had significant repository growth due to constructing a batch of beachball [5] CHANGELOG.[md|json] files and pushing them to a release branch. These pushes were frequently 70-200 MB due to poor delta compression. Using the 'pack.usePathWalk=true' config, these pushes dropped in size by 100x while improving performance. Since these CI machines were working with a shallow clone, the 'edge_aggressive' changes were required to enable the path-walk option.

[4] https://github.com/microsoft/git/releases/tag/v2.47.0.vfs.0.3

[5] https://github.com/microsoft/beachball

Updates in v2

Re-added a dropped comment when moving code in patch 1.
Updated documentation to include interaction with --use-bitmap-index.
An UNUSED parameter is now used, reducing the use of global variables slightly.

Updates in v3

Thanks for the review, Taylor. Sorry for my delay in getting back to your feedback.

Documentation has been edited slightly for simplicity.
is_oid_interesting() was swapped to is_oid_uninteresting().
sub_list_size was renamed to sub_list_nr.
Several uint32_t and uint64_t variables were converted to size_t.
Several 'unsigned int' variables were required to stay as-is, for now, until a refactor can be done.
An unnecessary update of tag_objects was removed.
The logic and error message around incompatible options is simpler.
Tests are expanded, especially around config options.
Fixed commit message typos.
Extra care is taken around ALLOC_ARRAY() to avoid a zero- or negative-length array.

Thanks,
-Stolee

cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]
cc: [email protected]

derrickstolee · 2025-03-10T01:49:48Z

/submit

gitgitgadget · 2025-03-10T01:51:19Z

Submitted as [email protected]

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1819/derrickstolee/path-walk-upstream-v1

To fetch this version to local tag pr-1819/derrickstolee/path-walk-upstream-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1819/derrickstolee/path-walk-upstream-v1

gitgitgadget · 2025-03-10T17:32:23Z

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <[email protected]> writes:

> ... deltas across path boundaries. This second pass is much faster than a fresh
> pass since the existing deltas are used as a limit for the size of
> potentially new deltas, short-circuiting the checks when the delta size
> exceeds the current-best.

Very nice.

> The microsoft/fluentui is a public Javascript repo that suffers from many of
> the name hash collisions as internal repositories I've worked with. Here is
> a comparison of the compressed size and end-to-end time of the repack:
>
> Repack Method    Pack Size       Time
> ---------------------------------------
> Hash v1             439.4M      87.24s
> Hash v2             161.7M      21.51s
> Path Walk           142.5M      28.16s
>
>
> Less dramatic, but perhaps more standardly structured is the nodejs/node
> repository, with these stats:
>
> Repack Method       Pack Size       Time
> ------------------------------------------
> Hash v1                739.9M      71.18s
> Hash v2                764.6M      67.82s
> Path Walk              698.0M      75.10s
>
>
> Even the Linux kernel repository gains some benefits, even though the number
> of hash collisions is relatively low due to a preference for short
> filenames:
>
> Repack Method       Pack Size       Time
> ------------------------------------------
> Hash v1                  2.5G     554.41s
> Hash v2                  2.5G     549.62s
> Path Walk                2.2G     559.00s

This third one, v2 not performing much better than v1, is quite
surprising.

> The drawbacks of the --path-walk feature is that it will be harder to
> integrate it with bitmap features, specifically delta islands. This is not
> insurmountable, but would require more work, such as a revision walk to
> paint objects with reachability information before using that during delta
> computations.
>
> However, there should still be significant benefits to Git clients trying to
> save space and improve local performance.

Sure.  More experiments and more approaches will eventually give us
overall improvement.  I am hoping that we will be able to condense
the result of these different approaches and their combinations into
easy-to-choose-from canned choices (as opposed to a myriad of little
knobs the users need to futz with without really understanding what
they are tweaking).

> This feature was shipped with similar features in microsoft/git as of
> v2.47.0.vfs.0.3 [4]. This was used in CI machines for an internal monorepo
> that had significant repository growth due to constructing a batch of
> beachball [5] CHANGELOG.[md|json] files and pushing them to a release
> branch. These pushes were frequently 70-200 MB due to poor delta
> compression. Using the 'pack.usePathWalk=true' config, these pushes dropped
> in size by 100x while improving performance. Since these CI machines were
> working with a shallow clone, the 'edge_aggressive' changes were required to
> enable the path-walk option.

Nice, thanks.

gitgitgadget · 2025-03-10T23:24:35Z

This patch series was integrated into seen via git@e51880c.

gitgitgadget · 2025-03-11T21:10:08Z

This branch is now known as ds/path-walk-2.

gitgitgadget · 2025-03-11T21:10:08Z

This patch series was integrated into seen via git@28416f0.

gitgitgadget · 2025-03-11T23:55:50Z

This patch series was integrated into seen via git@4fc875f.

gitgitgadget · 2025-05-30T21:46:33Z

This patch series was integrated into seen via git@fe28f74.

gitgitgadget · 2025-05-31T00:30:02Z

There was a status update in the "Cooking" section about the branch ds/path-walk-2 on the Git mailing list:

"git pack-objects" learns to find delta bases from blobs at the
same path, using the --path-walk API.

Comments?
source: <[email protected]>

gitgitgadget · 2025-06-02T04:39:16Z

This patch series was integrated into seen via git@ed40d39.

gitgitgadget · 2025-06-02T19:56:41Z

This patch series was integrated into seen via git@e78edc7.

gitgitgadget · 2025-06-03T16:49:56Z

This patch series was integrated into seen via git@e24b3f8.

gitgitgadget · 2025-06-04T23:15:04Z

This patch series was integrated into seen via git@4aae12c.

gitgitgadget · 2025-06-05T22:08:38Z

This patch series was integrated into seen via git@1c6c6c0.

gitgitgadget · 2025-06-05T22:08:39Z

This patch series was integrated into next via git@e59d4b1.

gitgitgadget · 2025-06-05T23:43:06Z

There was a status update in the "Cooking" section about the branch ds/path-walk-2 on the Git mailing list:

"git pack-objects" learns to find delta bases from blobs at the
same path, using the --path-walk API.

Will cook in 'next'.
source: <[email protected]>

gitgitgadget · 2025-06-06T22:17:47Z

This patch series was integrated into seen via git@0481447.

gitgitgadget · 2025-06-07T20:08:12Z

This patch series was integrated into seen via git@20b0ce2.

gitgitgadget · 2025-06-07T20:18:52Z

There was a status update in the "Cooking" section about the branch ds/path-walk-2 on the Git mailing list:

"git pack-objects" learns to find delta bases from blobs at the
same path, using the --path-walk API.

Will cook in 'next'.
source: <[email protected]>

gitgitgadget · 2025-06-08T21:12:16Z

This patch series was integrated into seen via git@258d7b6.

gitgitgadget · 2025-06-09T16:42:33Z

This patch series was integrated into seen via git@1476d75.

gitgitgadget · 2025-06-09T20:54:24Z

There was a status update in the "Cooking" section about the branch ds/path-walk-2 on the Git mailing list:

"git pack-objects" learns to find delta bases from blobs at the
same path, using the --path-walk API.

Will cook in 'next'.
source: <[email protected]>

gitgitgadget · 2025-06-10T00:06:19Z

This patch series was integrated into seen via git@4864b2c.

gitgitgadget · 2025-06-10T22:15:46Z

This patch series was integrated into seen via git@0654674.

gitgitgadget · 2025-06-11T22:26:37Z

This patch series was integrated into seen via git@b4ef194.

gitgitgadget · 2025-06-12T22:17:21Z

This patch series was integrated into seen via git@3683f76.

gitgitgadget · 2025-06-12T23:49:35Z

This patch series was integrated into seen via git@d189351.

gitgitgadget · 2025-06-13T00:19:44Z

There was a status update in the "Cooking" section about the branch ds/path-walk-2 on the Git mailing list:

"git pack-objects" learns to find delta bases from blobs at the
same path, using the --path-walk API.

Will cook in 'next'.
source: <[email protected]>

gitgitgadget · 2025-06-13T22:03:58Z

This patch series was integrated into seen via git@bd70b9d.

gitgitgadget · 2025-06-16T17:02:49Z

This patch series was integrated into seen via git@a210b57.

gitgitgadget · 2025-06-16T17:25:29Z

There was a status update in the "Cooking" section about the branch ds/path-walk-2 on the Git mailing list:

"git pack-objects" learns to find delta bases from blobs at the
same path, using the --path-walk API.

Will cook in 'next'.
source: <[email protected]>

gitgitgadget · 2025-06-17T21:27:40Z

This patch series was integrated into seen via git@88134a8.

gitgitgadget · 2025-06-17T21:27:40Z

This patch series was integrated into master via git@88134a8.

gitgitgadget · 2025-06-17T21:27:43Z

Closed via 88134a8.

derrickstolee self-assigned this Oct 29, 2024

derrickstolee force-pushed the path-walk-upstream branch from 2b762f3 to 7ae9a40 Compare October 30, 2024 20:06

derrickstolee force-pushed the api-upstream branch from 97d669a to 5252076 Compare October 30, 2024 20:07

derrickstolee force-pushed the path-walk-upstream branch from 7ae9a40 to 389c18f Compare October 30, 2024 22:20

derrickstolee force-pushed the api-upstream branch from 5252076 to 0bb607e Compare October 30, 2024 22:20

derrickstolee mentioned this pull request Oct 31, 2024

PATH WALK I: The path-walk API #1818

Closed

derrickstolee force-pushed the path-walk-upstream branch from 389c18f to bc37596 Compare November 8, 2024 16:01

derrickstolee mentioned this pull request Dec 2, 2024

pack-objects: Create an alternative name hash algorithm (recreated) #1823

Closed

derrickstolee force-pushed the path-walk-upstream branch from bc37596 to 68bc637 Compare December 6, 2024 19:41

derrickstolee force-pushed the api-upstream branch from 0bb607e to e716672 Compare December 6, 2024 19:42

derrickstolee force-pushed the path-walk-upstream branch from 68bc637 to bc37596 Compare December 18, 2024 15:14

derrickstolee force-pushed the api-upstream branch 3 times, most recently from 781b2ea to ef54342 Compare December 18, 2024 16:13

derrickstolee mentioned this pull request Feb 18, 2025

pack-objects: add --path-walk option for better deltas #1813

Closed

derrickstolee force-pushed the path-walk-upstream branch from bc37596 to 785dfb3 Compare February 22, 2025 23:59

derrickstolee force-pushed the path-walk-upstream branch from 785dfb3 to c288df6 Compare March 3, 2025 19:39

derrickstolee changed the base branch from api-upstream to master March 3, 2025 19:40

derrickstolee commented on builtin/pack-objects.c Mar 3, 2025

derrickstolee force-pushed the path-walk-upstream branch 3 times, most recently from 26e1afb to 2eb9250 Compare March 9, 2025 21:55

gitgitgadget bot added the seen label Mar 10, 2025

gitgitgadget bot added the next label Jun 5, 2025

gitgitgadget bot added the master label Jun 17, 2025

gitgitgadget bot closed this Jun 17, 2025
