-
Notifications
You must be signed in to change notification settings - Fork 140
Create 'expire' and 'repack' verbs for git-multi-pack-index #92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create 'expire' and 'repack' verbs for git-multi-pack-index #92
Conversation
acdb5ff
to
8679fc3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to keep accommodating for BSD's Unix tools. So much for "shell scripting is easy", eh?
d47bb42
to
41ef671
Compare
/submit |
Submitted as [email protected] |
41ef671
to
282fdb3
Compare
This branch is now known as |
This patch series was integrated into pu via git@8ea81da. |
11201f4
to
bef7aa0
Compare
/submit |
Submitted as [email protected] |
bef7aa0
to
f067cd4
Compare
0665185
to
b97fb35
Compare
/submit |
Submitted as [email protected] |
This patch series was integrated into pu via git@312f133. |
This patch series was integrated into pu via git@89ab236. |
This patch series was integrated into pu via git@aad77f7. |
97a9137
to
481b088
Compare
The multi-pack-index tracks objects in a collection of pack-files. Only one copy of each object is indexed, using the modified time of the pack-files to determine tie-breakers. It is possible to have a pack-file with no referenced objects because all objects have a duplicate in a newer pack-file. Introduce a new 'expire' subcommand to the multi-pack-index builtin. This subcommand will delete these unused pack-files and rewrite the multi-pack-index to no longer refer to those files. More details about the specifics will follow as the method is implemented. Add a test that verifies the 'expire' subcommand is correctly wired, but will still be valid when the verb is implemented. Specifically, create a set of packs that should all have referenced objects and should not be removed during an 'expire' operation. The packs are created carefully to ensure they have a specific order when sorted by size. This will be important in a later test. Signed-off-by: Derrick Stolee <[email protected]>
Before writing the multi-pack-index, we compute the length of the pack-index names concatenated together. This forms the data in the pack name chunk, and we precompute it to compute chunk offsets. The value is also modified to fit alignment needs. Previously, this computation was coupled with adding packs from the existing multi-pack-index and the remaining packs in the object dir not already covered by the multi-pack-index. In anticipation of this becoming more complicated with the 'expire' subcommand, simplify the computation by centralizing it to a single loop before writing the file. Signed-off-by: Derrick Stolee <[email protected]>
In anticipation of the expire subcommand, refactor the way we sort the packfiles by name. This will greatly simplify our approach to dropping expired packs from the list. First, create 'struct pack_info' to replace 'struct pack_pair'. This struct contains the necessary information about a pack, including its name, a pointer to its packfile struct (if not already in the multi-pack-index), and the original pack-int-id. Second, track the pack information using an array of pack_info structs in the pack_list struct. This simplifies the logic around the multiple arrays we were tracking in that struct. Finally, update get_sorted_entries() to not permute the pack-int-id and instead supply the permutation to write_midx_object_offsets(). This requires sorting the packs after get_sorted_entries(). Signed-off-by: Derrick Stolee <[email protected]>
The 'git multi-pack-index expire' subcommand looks at the existing mult-pack-index, counts the number of objects referenced in each pack-file, deletes the pack-fils with no referenced objects, and rewrites the multi-pack-index to no longer reference those packs. Refactor the write_midx_file() method to call write_midx_internal() which now takes an existing 'struct multi_pack_index' and a list of pack-files to drop (as specified by the names of their pack- indexes). As we write the new multi-pack-index, we drop those file names from the list of known pack-files. The expire_midx_packs() method removes the unreferenced pack-files after carefully closing the packs to avoid open handles. Test that a new pack-file that covers the contents of two other pack-files leads to those pack-files being deleted during the expire subcommand. Be sure to read the multi-pack-index to ensure it no longer references those packs. Signed-off-by: Derrick Stolee <[email protected]>
In an environment where the multi-pack-index is useful, it is due to many pack-files and an inability to repack the object store into a single pack-file. However, it is likely that many of these pack-files are rather small, and could be repacked into a slightly larger pack-file without too much effort. It may also be important to ensure the object store is highly available and the repack operation does not interrupt concurrent git commands. Introduce a 'repack' subcommand to 'git multi-pack-index' that takes a '--batch-size' option. The subcommand will inspect the multi-pack-index for referenced pack-files whose size is smaller than the batch size, until collecting a list of pack-files whose sizes sum to larger than the batch size. Then, a new pack-file will be created containing the objects from those pack-files that are referenced by the multi-pack-index. The resulting pack is likely to actually be smaller than the batch size due to compression and the fact that there may be objects in the pack- files that have duplicate copies in other pack-files. The current change introduces the command-line arguments, and we add a test that ensures we parse these options properly. Since we specify a small batch size, we will guarantee that future implementations do not change the list of pack-files. In addition, we hard-code the modified times of the packs in the pack directory to ensure the list of packs sorted by modified time matches the order if sorted by size (ascending). This will be important in a future test. Signed-off-by: Derrick Stolee <[email protected]>
To repack with a non-zero batch-size, first sort all pack-files by their modified time. Second, walk those pack-files from oldest to newest, compute their expected size, and add the packs to a list if they are smaller than the given batch-size. Stop when the total expected size is at least the batch size. If the batch size is zero, select all packs in the multi-pack-index. Finally, collect the objects from the multi-pack-index that are in the selected packs and send them to 'git pack-objects'. Write a new multi-pack-index that includes the new pack. Using a batch size of zero is very similar to a standard 'git repack' command, except that we do not delete the old packs and instead rely on the new multi-pack-index to prevent new processes from reading the old packs. This does not disrupt other Git processes that are currently reading the old packs based on the old multi-pack-index. While first designing a 'git multi-pack-index repack' operation, I started by collecting the batches based on the actual size of the objects instead of the size of the pack-files. This allows repacking a large pack-file that has very few referencd objects. However, this came at a significant cost of parsing pack-files instead of simply reading the multi-pack-index and getting the file information for the pack-files. The "expected size" version provides similar behavior, but could skip a pack-file if the average object size is much larger than the actual size of the referenced objects, or can create a large pack if the actual size of the referenced objects is larger than the expected size. Signed-off-by: Derrick Stolee <[email protected]>
During development of the multi-pack-index expire subcommand, a version went out that improperly computed the pack order if a new pack was introduced while other packs were being removed. Part of the subtlety of the bug involved the new pack being placed before other packs that already existed in the multi-pack-index. Add a test to t5319-multi-pack-index.sh that catches this issue. The test adds new packs that cause another pack to be expired, and creates new packs that are lexicographically sorted before and after the existing packs. Signed-off-by: Derrick Stolee <[email protected]>
The 'git multi-pack-index expire' subcommand may delete packs that are not needed from the perspective of the multi-pack-index. If a pack has a .keep file, then we should not delete that pack. Add a test that ensures we preserve a pack that would otherwise be expired. First, create a new pack that contains every object in the repo, then add it to the multi-pack-index. Then create a .keep file for a pack starting with "a-pack" that was added in the previous test. Finally, expire and verify that the pack remains and the other packs were expired. Signed-off-by: Derrick Stolee <[email protected]>
The 'git multi-pack-index repack' command can take a batch size of zero, which creates a new pack-file containing all objects in the multi-pack-index. The first 'repack' command will create one new pack-file, and an 'expire' command after that will delete the old pack-files, as they no longer contain any referenced objects in the multi-pack-index. We must remove the .keep file that was added in the previous test in order to expire that pack-file. Also test that a 'repack' will do nothing if there is only one pack-file. Signed-off-by: Derrick Stolee <[email protected]>
ca99cdd
to
3ed388f
Compare
On the Git mailing list, Derrick Stolee wrote (reply to this):
|
On the Git mailing list, Derrick Stolee wrote (reply to this):
|
On the Git mailing list, Derrick Stolee wrote (reply to this):
|
On the Git mailing list, Derrick Stolee wrote (reply to this):
|
On the Git mailing list, Derrick Stolee wrote (reply to this):
|
On the Git mailing list, Derrick Stolee wrote (reply to this):
|
On the Git mailing list, Derrick Stolee wrote (reply to this):
|
On the Git mailing list, Derrick Stolee wrote (reply to this):
|
On the Git mailing list, Derrick Stolee wrote (reply to this):
|
On the Git mailing list, Derrick Stolee wrote (reply to this):
|
On the Git mailing list, Derrick Stolee wrote (reply to this):
|
On the Git mailing list, Derrick Stolee wrote (reply to this):
|
@derrickstolee maybe close this here PR? |
1 similar comment
@derrickstolee maybe close this here PR? |
On the Git mailing list, Derrick Stolee wrote (reply to this):
|
On the Git mailing list, Junio C Hamano wrote (reply to this):
|
On the Git mailing list, Derrick Stolee wrote (reply to this):
|
The multi-pack-index provides a fast way to find an object among a large list of pack-files. It stores a single pack-reference for each object id, so duplicate objects are ignored. Among a list of pack-files storing the same object, the most-recently modified one is used.
Create new subcommands for the multi-pack-index builtin.
'git multi-pack-index expire': If we have a pack-file indexed by the multi-pack-index, but all objects in that pack are duplicated in more-recently modified packs, then delete that pack (and any others like it). Delete the reference to that pack in the multi-pack-index.
'git multi-pack-index repack --batch-size=': Starting from the oldest pack-files covered by the multi-pack-index, find those whose "expected size" is below the batch size until we have a collection of packs whose expected sizes add up to the batch size. We compute the expected size by multiplying the number of referenced objects by the pack-size and dividing by the total number of objects in the pack. If the batch-size is zero, then select all packs. Create a new pack containing all objects that the multi-pack-index references to those packs.
This allows us to create a new pattern for repacking objects: run 'repack'. After enough time has passed that all Git commands that started before the last 'repack' are finished, run 'expire' again. This approach has some advantages over the existing "repack everything" model:
Incremental. We can repack a small batch of objects at a time, instead of repacking all reachable objects. We can also limit ourselves to the objects that do not appear in newer pack-files.
Highly Available. By adding a new pack-file (and not deleting the old pack-files) we do not interrupt concurrent Git commands, and do not suffer performance degradation. By expiring only pack-files that have no referenced objects, we know that Git commands that are doing normal object lookups* will not be interrupted.
We are using this approach in VFS for Git to do background maintenance of the "shared object cache" which is a Git alternate directory filled with packfiles containing commits and trees. We currently download pack-files on an hourly basis to keep up-to-date with the central server. The cache servers supply packs on an hourly and daily basis, so most of the hourly packs become useless after a new daily pack is downloaded. The 'expire' command would clear out most of those packs, but many will still remain with fewer than 100 objects remaining. The 'repack' command (with a batch size of 1-3gb, probably) can condense the remaining packs in commands that run for 1-3 min at a time. Since the daily packs range from 100-250mb, we will also combine and condense those packs.
Updates in V2:
Added a method,
unlink_pack_path()
to remove packfiles, but with the additional check for a.keep
file. This borrows logic frombuiltin/repack.c
.Modified documentation and commit messages to replace 'verb' with 'subcommand'. Simplified the documentation. (I left 'verbs' in the title of the cover letter for consistency.)
Updates in V3:
Updates in V4:
More 'verb' and 'command' instances replaced with 'subcommand'. I grepped the patch to check these should be fixed everywhere.
Update the tests to check .keep files (in last patch).
Modify the tests to show the terminating condition of --batch-size when there are three packs that fit under the size, but the first two are large enough to stop adding packs. This required rearranging the packs slightly to get different sizes than we had before. Also, I added 'touch -t' to set the modified times so we can fix the order in which the packs are selected.
Added a comment about the purpose of pack_perm.
Updates in V5:
Fixed the error in PATCH 7 due to a missing line that existed in PATCH 8. Thanks, Josh Steadmon!
The 'repack' subcommand now computes the "expected size" of a pack instead of relying on the total size of the pack. This is actually really important to the way VFS for Git uses prefetch packs, and some packs are not being repacked because the pack size is larger than the batch size, but really there are only a few referenced objects.
The 'repack' subcommand now allows a batch size of zero to mean "create one pack containing all objects in the multi-pack-index". A new commit adds a test that hits the boundary cases here, but follows the 'expire' subcommand so we can show that cycle of repack-then-expire to safely replace the packs.
Junio: It appears that there are some conflicts with the trace2 changes in master. These are not new to the updates in this version. I'm assuming you've resolved these conflicts and replaying that resolution will work for you. If not, then here are my recommendations:
In midx.c, it's just a conflict on the last
#include
statement, so any order is fine.In builtin/multi-pack-index.c:
Place the trace2_cmd_mode(argv[0]) above the strcmp.
Thanks,
-Stolee
Cc: [email protected], [email protected], [email protected], [email protected], [email protected]