Commit d5d2e93

derrickstolee authored and gitster committed
revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git pack-objects --revs', we discover the "frontier" of commits that we care about and the boundary with commits we find uninteresting. From that point, we walk trees to discover which trees and blobs are uninteresting. Finally, we walk trees from the interesting commits to find the interesting objects that are placed in the pack.

This commit introduces a new, "sparse" way to discover the uninteresting trees. We use the perspective of a single user trying to push their topic to a large repository. That user likely changed a very small fraction of the paths in their working directory, but we spend a lot of time walking all reachable trees.

The way to switch the logic to work in this sparse way is to start caring about which paths introduce new trees. While it is not possible to generate a diff between the frontier boundary and all of the interesting commits, we can simulate that behavior by inspecting all of the root trees as a whole, then recursing down to the set of trees at each path.

We had already taken the first step by passing an oidset to mark_trees_uninteresting_sparse(). We now create a dictionary whose keys are paths and values are oidsets. We consider the set of trees that appear at each path. While we inspect a tree, we add its subtrees to the oidsets corresponding to the tree entry's path. We also mark trees as UNINTERESTING if the tree we are parsing is UNINTERESTING.

To actually improve the performance, we need to terminate our recursion. If the oidset contains only UNINTERESTING trees, then we do not continue the recursion. This avoids walking trees that are likely not reachable from interesting trees. If the oidset contains only interesting trees, then we will walk these trees in the final stage that collects the interesting objects to place in the pack. Thus, we only recurse if the oidset contains both interesting and UNINTERESTING trees.
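
The recursion rule described above can be sketched in a few lines of C. This is a minimal illustration and not code from this commit: `toy_tree`, `has_both_kinds`, and `toy_sparse_walk` are hypothetical names, trees hold only subtree entries, and small fixed-size arrays stand in for the path-keyed dictionary of oidsets.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define UNINTERESTING 1u
#define MAX_ENTRIES 4	/* assumed cap on subtree entries per toy tree */
#define MAX_SET 8	/* assumed cap on trees in one set */

/* Hypothetical stand-in for a git tree; only subtree entries are modeled. */
struct toy_tree {
	unsigned flags;
	size_t nr;
	const char *names[MAX_ENTRIES];
	struct toy_tree *children[MAX_ENTRIES];
	int walked;	/* set when the sparse walk parses this tree */
};

/* Returns 1 only when the set mixes interesting and UNINTERESTING trees. */
static int has_both_kinds(struct toy_tree **set, size_t nr)
{
	int interesting = 0, uninteresting = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (set[i]->flags & UNINTERESTING)
			uninteresting = 1;
		else
			interesting = 1;
	}
	return interesting && uninteresting;
}

static void toy_sparse_walk(struct toy_tree **set, size_t nr)
{
	const char *paths[MAX_SET * MAX_ENTRIES];
	struct toy_tree *groups[MAX_SET * MAX_ENTRIES][MAX_SET];
	size_t group_nr[MAX_SET * MAX_ENTRIES];
	size_t nr_paths = 0, i, j, k;

	assert(nr <= MAX_SET);

	/* Terminate unless both kinds of trees appear at this path. */
	if (!has_both_kinds(set, nr))
		return;

	for (i = 0; i < nr; i++) {
		struct toy_tree *t = set[i];

		t->walked = 1;
		for (j = 0; j < t->nr; j++) {
			struct toy_tree *child = t->children[j];

			/* Children of an UNINTERESTING tree inherit the flag. */
			if (t->flags & UNINTERESTING)
				child->flags |= UNINTERESTING;

			/* Find or create the group for this entry name. */
			for (k = 0; k < nr_paths; k++)
				if (!strcmp(paths[k], t->names[j]))
					break;
			if (k == nr_paths) {
				paths[nr_paths] = t->names[j];
				group_nr[nr_paths++] = 0;
			}
			groups[k][group_nr[k]++] = child;
		}
	}

	/* Recurse into the set of trees collected at each path. */
	for (k = 0; k < nr_paths; k++)
		toy_sparse_walk(groups[k], group_nr[k]);
}
```

In this sketch, a subtree that is identical under an UNINTERESTING root and an interesting root ends up in a group holding only UNINTERESTING trees, so the walk stops there without parsing it.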

There are a few ways that this is not a universally better option.

First, we can pack extra objects. If someone copies a subtree from one tree to another, the first tree will appear UNINTERESTING and we will not recurse to see that the subtree should also be UNINTERESTING. We will walk the new tree and see the subtree as a "new" object and add it to the pack. A test is modified to demonstrate this behavior and to verify that the new logic is being exercised.

Second, we can have extra memory pressure. If instead of being a single user pushing a small topic we are a server sending new objects from across the entire working directory, then we will gain very little (the recursion will rarely terminate early) but will spend extra time maintaining the path-oidset dictionaries.

Despite these potential drawbacks, the benefits of the algorithm are clear. By adding a counter to 'add_children_by_path' and 'mark_tree_contents_uninteresting', I measured the number of parsed trees for the two algorithms in a variety of repos.

For git.git, I used the following input:

	v2.19.0
	^v2.19.0~10

	 Objects to pack: 550
	Walked (old alg): 282
	Walked (new alg): 130

For the Linux repo, I used the following input:

	v4.18
	^v4.18~10

	 Objects to pack: 518
	Walked (old alg): 4,836
	Walked (new alg): 188

The two repos above are rather "wide and flat" compared to other repos that I have used in the past. As a comparison, I tested an old topic branch in the Azure DevOps repo, which has a much deeper folder structure than the Linux repo.

	 Objects to pack: 220
	Walked (old alg): 22,804
	Walked (new alg): 129

I used the number of walked trees as the main metric above because it is consistent across multiple runs. When I ran my tests, the end-to-end time of the pack-objects command with the same options could change by 10x depending on whether the file system was warm. However, by running the same test repeatedly I could get more consistent timing results.
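
To put the walked-tree counts above in relative terms, the ratio of old to new counts can be computed directly. The helper below is a hypothetical illustration (`walk_reduction` is not part of the commit); applied to the reported figures it gives roughly 2.2x for git.git, 25.7x for Linux, and 176.8x for the Azure DevOps topic.

```c
#include <assert.h>

/* Ratio of trees walked by the old algorithm to trees walked by the new one. */
static double walk_reduction(unsigned long old_walked, unsigned long new_walked)
{
	assert(new_walked > 0);
	return (double)old_walked / (double)new_walked;
}
```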

The git.git and Linux tests were too fast overall (less than 0.5s) to measure an end-to-end difference. The Azure DevOps case was slow enough to see the time improve from 15s to 1s in the warm case. The cold case was 90s to 9s in my testing.

These improvements will have even larger benefits in the super-large Windows repository. In our experiments, we see the "Enumerate objects" phase of pack-objects taking 60-80% of the end-to-end time of non-trivial pushes, taking longer than the network time to send the pack and the server time to verify the pack.

Signed-off-by: Derrick Stolee <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
1 parent 4f6d26b commit d5d2e93

2 files changed, +139 -13 lines changed

revision.c

Lines changed: 128 additions & 10 deletions
@@ -27,6 +27,7 @@
 #include "commit-reach.h"
 #include "commit-graph.h"
 #include "prio-queue.h"
+#include "hashmap.h"

 volatile show_early_output_fn_t show_early_output;

@@ -99,29 +100,146 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree)
 	mark_tree_contents_uninteresting(r, tree);
 }

+struct path_and_oids_entry {
+	struct hashmap_entry ent;
+	char *path;
+	struct oidset trees;
+};
+
+static int path_and_oids_cmp(const void *hashmap_cmp_fn_data,
+			     const struct path_and_oids_entry *e1,
+			     const struct path_and_oids_entry *e2,
+			     const void *keydata)
+{
+	return strcmp(e1->path, e2->path);
+}
+
+static void paths_and_oids_init(struct hashmap *map)
+{
+	hashmap_init(map, (hashmap_cmp_fn) path_and_oids_cmp, NULL, 0);
+}
+
+static void paths_and_oids_clear(struct hashmap *map)
+{
+	struct hashmap_iter iter;
+	struct path_and_oids_entry *entry;
+	hashmap_iter_init(map, &iter);
+
+	while ((entry = (struct path_and_oids_entry *)hashmap_iter_next(&iter))) {
+		oidset_clear(&entry->trees);
+		free(entry->path);
+	}
+
+	hashmap_free(map, 1);
+}
+
+static void paths_and_oids_insert(struct hashmap *map,
+				  const char *path,
+				  const struct object_id *oid)
+{
+	int hash = strhash(path);
+	struct path_and_oids_entry key;
+	struct path_and_oids_entry *entry;
+
+	hashmap_entry_init(&key, hash);
+
+	/* use a shallow copy for the lookup */
+	key.path = (char *)path;
+	oidset_init(&key.trees, 0);
+
+	if (!(entry = (struct path_and_oids_entry *)hashmap_get(map, &key, NULL))) {
+		entry = xcalloc(1, sizeof(struct path_and_oids_entry));
+		hashmap_entry_init(entry, hash);
+		entry->path = xstrdup(key.path);
+		oidset_init(&entry->trees, 16);
+		hashmap_put(map, entry);
+	}
+
+	oidset_insert(&entry->trees, oid);
+}
+
+static void add_children_by_path(struct repository *r,
+				 struct tree *tree,
+				 struct hashmap *map)
+{
+	struct tree_desc desc;
+	struct name_entry entry;
+
+	if (!tree)
+		return;
+
+	if (parse_tree_gently(tree, 1) < 0)
+		return;
+
+	init_tree_desc(&desc, tree->buffer, tree->size);
+	while (tree_entry(&desc, &entry)) {
+		switch (object_type(entry.mode)) {
+		case OBJ_TREE:
+			paths_and_oids_insert(map, entry.path, entry.oid);
+
+			if (tree->object.flags & UNINTERESTING) {
+				struct tree *child = lookup_tree(r, entry.oid);
+				if (child)
+					child->object.flags |= UNINTERESTING;
+			}
+			break;
+		case OBJ_BLOB:
+			if (tree->object.flags & UNINTERESTING) {
+				struct blob *child = lookup_blob(r, entry.oid);
+				if (child)
+					child->object.flags |= UNINTERESTING;
+			}
+			break;
+		default:
+			/* Subproject commit - not in this repository */
+			break;
+		}
+	}
+
+	free_tree_buffer(tree);
+}
+
 void mark_trees_uninteresting_sparse(struct repository *r,
 				     struct oidset *trees)
 {
+	unsigned has_interesting = 0, has_uninteresting = 0;
+	struct hashmap map;
+	struct hashmap_iter map_iter;
+	struct path_and_oids_entry *entry;
 	struct object_id *oid;
 	struct oidset_iter iter;

 	oidset_iter_init(trees, &iter);
-	while ((oid = oidset_iter_next(&iter))) {
+	while ((!has_interesting || !has_uninteresting) &&
+	       (oid = oidset_iter_next(&iter))) {
 		struct tree *tree = lookup_tree(r, oid);

 		if (!tree)
 			continue;

-		if (tree->object.flags & UNINTERESTING) {
-			/*
-			 * Remove the flag so the next call
-			 * is not a no-op. The flag is added
-			 * in mark_tree_unintersting().
-			 */
-			tree->object.flags ^= UNINTERESTING;
-			mark_tree_uninteresting(r, tree);
-		}
+		if (tree->object.flags & UNINTERESTING)
+			has_uninteresting = 1;
+		else
+			has_interesting = 1;
+	}
+
+	/* Do not walk unless we have both types of trees. */
+	if (!has_uninteresting || !has_interesting)
+		return;
+
+	paths_and_oids_init(&map);
+
+	oidset_iter_init(trees, &iter);
+	while ((oid = oidset_iter_next(&iter))) {
+		struct tree *tree = lookup_tree(r, oid);
+		add_children_by_path(r, tree, &map);
 	}
+
+	hashmap_iter_init(&map, &map_iter);
+	while ((entry = hashmap_iter_next(&map_iter)))
+		mark_trees_uninteresting_sparse(r, &entry->trees);
+
+	paths_and_oids_clear(&map);
 }

 struct commit_stack {
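
The find-or-create pattern used by paths_and_oids_insert() in the diff above can be tried outside of git with a small stand-alone sketch. Everything here is an assumption made for illustration: `path_map`, `path_entry`, and `toy_hash` are hypothetical names, plain integers stand in for object IDs, and a chained table with an FNV-1a hash replaces git's hashmap and oidset APIs.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NR_BUCKETS 16
#define MAX_IDS 8	/* assumed cap on ids stored per path */

struct path_entry {
	struct path_entry *next;
	char *path;
	int ids[MAX_IDS];	/* integers stand in for object IDs */
	size_t nr;
};

struct path_map {
	struct path_entry *buckets[NR_BUCKETS];
};

/* FNV-1a string hash; a swapped-in choice, not git's strhash(). */
static unsigned toy_hash(const char *s)
{
	unsigned h = 2166136261u;

	while (*s)
		h = (h ^ (unsigned char)*s++) * 16777619u;
	return h;
}

static struct path_entry *path_map_get(struct path_map *map, const char *path)
{
	struct path_entry *e = map->buckets[toy_hash(path) % NR_BUCKETS];

	while (e && strcmp(e->path, path))
		e = e->next;
	return e;
}

/* Look up the entry for the path, creating it on first use, then add the id. */
static void path_map_insert(struct path_map *map, const char *path, int id)
{
	struct path_entry *e = path_map_get(map, path);
	size_t i;

	if (!e) {
		unsigned b = toy_hash(path) % NR_BUCKETS;
		size_t len = strlen(path) + 1;

		e = calloc(1, sizeof(*e));
		if (!e)
			abort();
		e->path = malloc(len);
		if (!e->path)
			abort();
		memcpy(e->path, path, len);	/* the map owns its own copy */
		e->next = map->buckets[b];
		map->buckets[b] = e;
	}

	for (i = 0; i < e->nr; i++)
		if (e->ids[i] == id)
			return;	/* set semantics: ignore duplicates */
	assert(e->nr < MAX_IDS);
	e->ids[e->nr++] = id;
}

/* Free every entry along with the path copy it owns. */
static void path_map_clear(struct path_map *map)
{
	size_t b;

	for (b = 0; b < NR_BUCKETS; b++) {
		struct path_entry *e = map->buckets[b];

		while (e) {
			struct path_entry *next = e->next;

			free(e->path);
			free(e);
			e = next;
		}
		map->buckets[b] = NULL;
	}
}
```

As in the diff, the lookup borrows the caller's string and only a newly created entry duplicates it, so the clear routine has to release each stored path itself.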

t/t5322-pack-objects-sparse.sh

Lines changed: 11 additions & 3 deletions
@@ -79,6 +79,9 @@ test_expect_success 'sparse pack-objects' '
 	test_cmp required_objects.txt sparse_required_objects.txt
 '

+# Demonstrate that the algorithms differ when we copy a tree wholesale
+# from one folder to another.
+
 test_expect_success 'duplicate a folder from f1 into f3' '
 	mkdir f3/f4 &&
 	cp -r f1/f1/* f3/f4 &&
@@ -95,19 +98,24 @@ test_expect_success 'duplicate a folder from f1 into f3' '
 '

 test_expect_success 'non-sparse pack-objects' '
-	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
+	git pack-objects --stdout --revs --no-sparse <packinput.txt >nonsparse.pack &&
 	git index-pack -o nonsparse.idx nonsparse.pack &&
 	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
 	comm -1 -2 required_objects.txt nonsparse_objects.txt >nonsparse_required_objects.txt &&
 	test_cmp required_objects.txt nonsparse_required_objects.txt
 '

 test_expect_success 'sparse pack-objects' '
+	git rev-parse \
+		topic1 \
+		topic1^{tree} \
+		topic1:f3 \
+		topic1:f3/f4 \
+		topic1:f3/f4/data.txt | sort >expect_sparse_objects.txt &&
 	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
 	git index-pack -o sparse.idx sparse.pack &&
 	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
-	comm -1 -2 required_objects.txt sparse_objects.txt >sparse_required_objects.txt &&
-	test_cmp required_objects.txt sparse_required_objects.txt
+	test_cmp expect_sparse_objects.txt sparse_objects.txt
 '

 test_done
