PATH WALK II: Add --path-walk option to 'git pack-objects' #1819
On the Git mailing list, Junio C Hamano wrote:
"Derrick Stolee via GitGitGadget" <[email protected]> writes:
> ... deltas across path boundaries. This second pass is much faster than a fresh
> pass since the existing deltas are used as a limit for the size of
> potentially new deltas, short-circuiting the checks when the delta size
> exceeds the current-best.
Very nice.
> The microsoft/fluentui is a public Javascript repo that suffers from many of
> the name hash collisions as internal repositories I've worked with. Here is
> a comparison of the compressed size and end-to-end time of the repack:
>
> Repack Method    Pack Size       Time
> ---------------------------------------
> Hash v1             439.4M     87.24s
> Hash v2             161.7M     21.51s
> Path Walk           142.5M     28.16s
>
>
> Less dramatic, but perhaps more standardly structured is the nodejs/node
> repository, with these stats:
>
> Repack Method       Pack Size       Time
> ------------------------------------------
> Hash v1                739.9M     71.18s
> Hash v2                764.6M     67.82s
> Path Walk              698.0M     75.10s
>
>
> Even the Linux kernel repository gains some benefits, even though the number
> of hash collisions is relatively low due to a preference for short
> filenames:
>
> Repack Method       Pack Size       Time
> ------------------------------------------
> Hash v1                  2.5G    554.41s
> Hash v2                  2.5G    549.62s
> Path Walk                2.2G    559.00s
This third one, v2 not performing much better than v1, is quite
surprising.
> The drawbacks of the --path-walk feature is that it will be harder to
> integrate it with bitmap features, specifically delta islands. This is not
> insurmountable, but would require more work, such as a revision walk to
> paint objects with reachability information before using that during delta
> computations.
>
> However, there should still be significant benefits to Git clients trying to
> save space and improve local performance.
Sure. More experiments and more approaches will eventually give us
overall improvement. I am hoping that we will be able to condense
the result of these different approaches and their combinations into
easy-to-choose-from canned choices (as opposed to a myriad of little
knobs the users need to futz with without really understanding what
they are tweaking).
> This feature was shipped with similar features in microsoft/git as of
> v2.47.0.vfs.0.3 [4]. This was used in CI machines for an internal monorepo
> that had significant repository growth due to constructing a batch of
> beachball [5] CHANGELOG.[md|json] files and pushing them to a release
> branch. These pushes were frequently 70-200 MB due to poor delta
> compression. Using the 'pack.usePathWalk=true' config, these pushes dropped
> in size by 100x while improving performance. Since these CI machines were
> working with a shallow clone, the 'edge_aggressive' changes were required to
> enable the path-walk option.
Nice, thanks.
This patch series was integrated into seen via git@e51880c.
This branch is now known as
This patch series was integrated into seen via git@28416f0.
This patch series was integrated into seen via git@4fc875f.
This patch series was integrated into seen via git@fe28f74.
There was a status update in the "Cooking" section about the branch "git pack-objects" learns to find delta bases from blobs at the same path, using the --path-walk API. Comments? source: <[email protected]>
This patch series was integrated into seen via git@ed40d39.
This patch series was integrated into seen via git@e78edc7.
This patch series was integrated into seen via git@e24b3f8.
This patch series was integrated into seen via git@4aae12c.
This patch series was integrated into seen via git@1c6c6c0.
This patch series was integrated into next via git@e59d4b1.
There was a status update in the "Cooking" section about the branch "git pack-objects" learns to find delta bases from blobs at the same path, using the --path-walk API. Will cook in 'next'. source: <[email protected]>
This patch series was integrated into seen via git@0481447.
This patch series was integrated into seen via git@20b0ce2.
There was a status update in the "Cooking" section about the branch "git pack-objects" learns to find delta bases from blobs at the same path, using the --path-walk API. Will cook in 'next'. source: <[email protected]>
This patch series was integrated into seen via git@258d7b6.
This patch series was integrated into seen via git@1476d75.
There was a status update in the "Cooking" section about the branch "git pack-objects" learns to find delta bases from blobs at the same path, using the --path-walk API. Will cook in 'next'. source: <[email protected]>
This patch series was integrated into seen via git@4864b2c.
This patch series was integrated into seen via git@0654674.
This patch series was integrated into seen via git@b4ef194.
This patch series was integrated into seen via git@3683f76.
This patch series was integrated into seen via git@d189351.
There was a status update in the "Cooking" section about the branch "git pack-objects" learns to find delta bases from blobs at the same path, using the --path-walk API. Will cook in 'next'. source: <[email protected]>
This patch series was integrated into seen via git@bd70b9d.
This patch series was integrated into seen via git@a210b57.
There was a status update in the "Cooking" section about the branch "git pack-objects" learns to find delta bases from blobs at the same path, using the --path-walk API. Will cook in 'next'. source: <[email protected]>
This patch series was integrated into seen via git@88134a8.
This patch series was integrated into master via git@88134a8.
Closed via 88134a8.
Here is a full submission of the --path-walk feature for 'git pack-objects' and 'git repack'. It was discussed in an RFC [1] and as a future application of the path-walk API [2], and is updated now that --name-hash-version=2 exists (as a replacement for the --full-name-hash option from the RFC) [3].
[1] https://lore.kernel.org/git/[email protected]/
[2] https://lore.kernel.org/git/[email protected]
[3] https://lore.kernel.org/git/[email protected]
This patch series does the following:
 * Add a new '--path-walk' option to 'git pack-objects' that uses the path-walk API instead of the revision API to collect objects for delta compression.
 * Add a new '--path-walk' option to 'git repack' to pass this option along to 'git pack-objects'.
 * Add a new 'pack.usePathWalk' config option to opt into this option implicitly, such as in 'git push'.
 * Optimize the '--path-walk' option using threading so it better competes with the existing multi-threaded delta compression mechanism.
 * Update the path-walk API with a new 'edge_aggressive' option that corresponds closely to the --edge-aggressive option in the revision API. This is useful when creating thin packs inside shallow clones.
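As a rough illustration of how the options in this list fit together, here is a small Python sketch that only builds command lines. The flag and config names ('--path-walk', '--name-hash-version', 'pack.usePathWalk') come from the text of this series; the helper names and the choice of '-a -d -f' for a full repack are assumptions made for the example, and the commands need a Git build that includes this series.

```python
from typing import List, Optional


def repack_argv(path_walk: bool = True,
                name_hash_version: Optional[int] = None) -> List[str]:
    """Build an argv list for a full repack (hypothetical helper)."""
    # -a -d -f: repack everything, drop old packs, recompute deltas.
    argv = ["git", "repack", "-a", "-d", "-f"]
    if path_walk:
        argv.append("--path-walk")
    if name_hash_version is not None:
        argv.append(f"--name-hash-version={name_hash_version}")
    return argv


def enable_path_walk_config_argv() -> List[str]:
    """Build an argv list that opts a repository into path-walk implicitly."""
    return ["git", "config", "pack.usePathWalk", "true"]


if __name__ == "__main__":
    # Print the commands rather than running them. To execute one, pass the
    # list to subprocess.run(argv, check=True) inside a repository.
    print(" ".join(repack_argv(name_hash_version=2)))
    print(" ".join(enable_path_walk_config_argv()))
```

Setting the config value is what lets commands such as 'git push' pick up the behavior without an explicit flag.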
This feature works by using the path-walk API to emit groups of objects that appear at the same path. Each group is tracked so that its objects can be tested for delta compression against each other. After all groups have been tested, a second pass sorted by name hash attempts to find better (or first-time) deltas across path boundaries. This second pass is much faster than a fresh pass because the existing deltas limit the size of potential new deltas, short-circuiting the checks when a candidate delta's size exceeds the current best.
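The two passes can be sketched in Python as follows. This is a toy model under stated assumptions, not the code in this series: the delta cost function (bytes outside a shared prefix and suffix), the window size, the cycle check, and every name here are invented for illustration. It shows pass one comparing objects only within a path group, and pass two walking a name-hash ordering while using each object's current best delta size as a limit.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple, Optional


class Obj(NamedTuple):
    oid: str     # any unique label in this sketch
    path: str    # path at which the object appears
    data: bytes  # object content


def toy_delta_size(base: bytes, target: bytes, limit: int) -> Optional[int]:
    """Toy cost: target bytes outside the prefix/suffix shared with base.

    Returns None when the cost cannot beat `limit`.
    """
    # Short-circuit: at most len(base) bytes can be shared.
    if len(target) - len(base) >= limit:
        return None
    shared = min(len(base), len(target))
    prefix = 0
    while prefix < shared and base[prefix] == target[prefix]:
        prefix += 1
    suffix = 0
    while suffix < shared - prefix and base[-1 - suffix] == target[-1 - suffix]:
        suffix += 1
    size = len(target) - prefix - suffix
    return size if size < limit else None


class DeltaPlan:
    def __init__(self, objs: List[Obj]) -> None:
        self.objs = {o.oid: o for o in objs}
        self.base: Dict[str, Optional[str]] = {o.oid: None for o in objs}
        # With no delta yet, the limit is the full object size.
        self.size: Dict[str, int] = {o.oid: len(o.data) for o in objs}

    def _would_cycle(self, target: str, base: Optional[str]) -> bool:
        while base is not None:
            if base == target:
                return True
            base = self.base[base]
        return False

    def try_pair(self, target: str, base: str) -> None:
        if self._would_cycle(target, base):
            return
        size = toy_delta_size(self.objs[base].data, self.objs[target].data,
                              self.size[target])
        if size is not None:  # strictly better than the current best
            self.base[target], self.size[target] = base, size


def two_pass(objs: List[Obj], name_hash: Callable[[str], object],
             window: int = 2) -> DeltaPlan:
    plan = DeltaPlan(objs)
    # Pass 1: compare objects only within the same path group.
    groups: Dict[str, List[str]] = defaultdict(list)
    for o in objs:
        groups[o.path].append(o.oid)
    for oids in groups.values():
        for i, target in enumerate(oids):
            for base in oids[max(0, i - window):i]:
                plan.try_pair(target, base)
    # Pass 2: sort by name hash and look across path boundaries,
    # keeping a new delta only if it beats the one found in pass 1.
    ordered = sorted(objs, key=lambda o: name_hash(o.path))
    for i, o in enumerate(ordered):
        for other in ordered[max(0, i - window):i]:
            plan.try_pair(o.oid, other.oid)
    return plan


if __name__ == "__main__":
    one, two = b"alpha line one\n", b"alpha line two\n"
    objs = [Obj("a1", "a/notes.txt", one),
            Obj("a2", "a/notes.txt", one + two),
            Obj("b1", "b/notes.txt", one + two + b"extra\n")]
    # Stand-in name hash: the basename, so both paths sort together.
    plan = two_pass(objs, name_hash=lambda p: p.rsplit("/", 1)[-1])
    print(plan.base, plan.size)
```

In the demo, "b1" has no same-path neighbor, so it only gets a delta base in the second pass, while "a2" keeps the base it found within its own path group.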
The benefits of the --path-walk feature come into play first when the name hash function has many collisions, so that sorting by name hash value leads to unhelpful groupings of objects. Many of the same benefits come from --name-hash-version=2, but collisions still exist with any hash-based approach. There are also performance benefits in some cases due to the isolation of delta compression testing within path groups.
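To make the collision problem concrete, here is a Python approximation of a shift-and-add name hash in the style of version 1. It is modeled on my understanding of that hash and is not copied from this series, so treat the function and the example paths as assumptions. Because each byte is shifted out after sixteen further bytes, two ASCII paths that share their final sixteen characters end up with hash values that are equal or differ by one, however different their leading directories are.

```python
def name_hash_v1(path: str) -> int:
    """Approximate version 1 style name hash (illustrative, ASCII paths)."""
    h = 0
    for byte in path.encode("utf-8"):
        if byte in b" \t\n\v\f\r":  # whitespace is skipped
            continue
        # Older bytes lose two bits per step; the newest byte dominates.
        h = ((h >> 2) + (byte << 24)) & 0xFFFFFFFF
    return h


if __name__ == "__main__":
    # Hypothetical paths: two unrelated packages with identically named files.
    paths = [
        "packages/button/docs/CHANGELOG.json",
        "packages/card/docs/CHANGELOG.json",
        "packages/button/src/index.ts",
    ]
    for p in sorted(paths, key=name_hash_v1):
        print(f"{name_hash_v1(p):#010x}  {p}")
```

Sorting by this value places the two CHANGELOG.json files next to each other even though they belong to different packages, which is the kind of grouping that grouping by full path avoids.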
All of the benefits of the --path-walk feature are less dramatic when compared to --name-hash-version=2, but they can still exist in many cases. I have also seen some cases where --name-hash-version=2 compresses better than --path-walk with --name-hash-version=1, but these options can be combined to get the best of both worlds.
Detailed statistics are provided within patch messages, but a few are highlighted here:
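The microsoft/fluentui repository is a public JavaScript repo that suffers from many of the same name-hash collisions as internal repositories I've worked with. Here is a comparison of the compressed size and end-to-end time of the repack:

Repack Method    Pack Size       Time
---------------------------------------
Hash v1             439.4M     87.24s
Hash v2             161.7M     21.51s
Path Walk           142.5M     28.16s

Less dramatic, but perhaps more standardly structured is the nodejs/node repository, with these stats:

Repack Method       Pack Size       Time
------------------------------------------
Hash v1                739.9M     71.18s
Hash v2                764.6M     67.82s
Path Walk              698.0M     75.10s

Even the Linux kernel repository gains some benefits, even though the number of hash collisions is relatively low due to a preference for short filenames:

Repack Method       Pack Size       Time
------------------------------------------
Hash v1                  2.5G    554.41s
Hash v2                  2.5G    549.62s
Path Walk                2.2G    559.00s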
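As a quick arithmetic check on the pack sizes reported in this thread for these three repositories, the relative savings can be computed as below. The numbers are copied from the reported statistics; the helper name and the dictionary layout are only for illustration.

```python
# Pack sizes as reported in this thread. Units differ per repository
# (M for the first two, G for the last), but each row is self-consistent.
PACK_SIZES = {
    "microsoft/fluentui": {"Hash v1": 439.4, "Hash v2": 161.7, "Path Walk": 142.5},
    "nodejs/node": {"Hash v1": 739.9, "Hash v2": 764.6, "Path Walk": 698.0},
    "Linux kernel": {"Hash v1": 2.5, "Hash v2": 2.5, "Path Walk": 2.2},
}


def percent_smaller(baseline: float, candidate: float) -> float:
    """Percentage by which candidate is smaller than baseline (negative if larger)."""
    return 100.0 * (baseline - candidate) / baseline


if __name__ == "__main__":
    for repo, sizes in PACK_SIZES.items():
        v1 = sizes["Hash v1"]
        for method in ("Hash v2", "Path Walk"):
            saved = percent_smaller(v1, sizes[method])
            print(f"{repo}: {method} vs Hash v1: {saved:+.1f}%")
```

A negative result means the method produced a larger pack than Hash v1, which is the case for Hash v2 on nodejs/node.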
The drawback of the --path-walk feature is that it will be harder to integrate with bitmap features, specifically delta islands. This is not insurmountable, but would require more work, such as a revision walk to paint objects with reachability information before using that during delta computations.
However, there should still be significant benefits to Git clients trying to save space and improve local performance.
This feature was shipped with similar features in microsoft/git as of v2.47.0.vfs.0.3 [4]. This was used in CI machines for an internal monorepo that had significant repository growth due to constructing a batch of beachball [5] CHANGELOG.[md|json] files and pushing them to a release branch. These pushes were frequently 70-200 MB due to poor delta compression. Using the 'pack.usePathWalk=true' config, these pushes dropped in size by 100x while improving performance. Since these CI machines were working with a shallow clone, the 'edge_aggressive' changes were required to enable the path-walk option.
[4] https://github.com/microsoft/git/releases/tag/v2.47.0.vfs.0.3
[5] https://github.com/microsoft/beachball
Updates in v2
--use-bitmap-index.
Updates in v3
Thanks for the review, Taylor. Sorry for my delay in getting back to your feedback.
Thanks,
-Stolee
cc: [email protected]