
Commit c44172c

revision: implement sparse algorithm
When enumerating objects to place in a pack-file during 'git pack-objects --revs', we discover the "frontier" of commits that we care about and the boundary with commits we find uninteresting. From that point, we walk trees to discover which trees and blobs are uninteresting. Finally, we walk trees from the interesting commits to find the interesting objects that are placed in the pack.

This commit introduces a new, "sparse" way to discover the uninteresting trees. We use the perspective of a single user trying to push their topic to a large repository. That user likely changed a very small fraction of the paths in their working directory, but we spend a lot of time walking all reachable trees.

The way to switch the logic to work in this sparse way is to start caring about which paths introduce new trees. While it is not possible to generate a diff between the frontier boundary and all of the interesting commits, we can simulate that behavior by inspecting all of the root trees as a whole, then recursing down to the set of trees at each path.

We had already taken the first step by passing an oidset to mark_trees_uninteresting_sparse(). We now create a dictionary whose keys are paths and values are oidsets. We consider the set of trees that appear at each path. While we inspect a tree, we add its subtrees to the oidsets corresponding to the tree entry's path. We also mark trees as UNINTERESTING if the tree we are parsing is UNINTERESTING.

To actually improve the performance, we need to terminate our recursion. If the oidset contains only UNINTERESTING trees, then we do not continue the recursion. This avoids walking trees that are likely not reachable from interesting trees. If the oidset contains only interesting trees, then we will walk these trees in the final stage that collects the interesting objects to place in the pack. Thus, we only recurse if the oidset contains both interesting and UNINTERESTING trees.

There are a few ways that this is not a universally better option.

First, we can pack extra objects. If someone copies a subtree from one tree to another, the first tree will appear UNINTERESTING and we will not recurse to see that the subtree should also be UNINTERESTING. We will walk the new tree and see the subtree as a "new" object and add it to the pack. We add a test case that demonstrates this as a way to prove that the --sparse option is actually working.

Second, we can have extra memory pressure. If instead of being a single user pushing a small topic we are a server sending new objects from across the entire working directory, then we will gain very little (the recursion will rarely terminate early) but will spend extra time maintaining the path-oidset dictionaries.

Despite these potential drawbacks, the benefits of the algorithm are clear. By adding a counter to 'add_children_by_path' and 'mark_tree_contents_uninteresting', I measured the number of parsed trees for the two algorithms in a variety of repos.

For git.git, I used the following input:

        v2.19.0
        ^v2.19.0~10

         Objects to pack: 550
        Walked (old alg): 282
        Walked (new alg): 130

For the Linux repo, I used the following input:

        v4.18
        ^v4.18~10

         Objects to pack: 518
        Walked (old alg): 4,836
        Walked (new alg): 188

The two repos above are rather "wide and flat" compared to other repos that I have used in the past. As a comparison, I tested an old topic branch in the Azure DevOps repo, which has a much deeper folder structure than the Linux repo.

         Objects to pack: 220
        Walked (old alg): 22,804
        Walked (new alg): 129

I used the number of walked trees as the main metric above because it is consistent across multiple runs. When I ran my tests, the performance of the pack-objects command with the same options could change the end-to-end time by 10x depending on whether the file system was warm. However, by repeating the same test I could get more consistent timing results. The git.git and Linux tests were too fast overall (less than 0.5s) to measure an end-to-end difference. The Azure DevOps case was slow enough to see the time improve from 15s to 1s in the warm case. The cold case went from 90s to 9s in my testing.

These improvements will have even larger benefits in the super-large Windows repository. In our experiments, we see the "Enumerate objects" phase of pack-objects taking 60-80% of the end-to-end time of non-trivial pushes, taking longer than the network time to send the pack and the server time to verify the pack.

Signed-off-by: Derrick Stolee <[email protected]>
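
The '--sparse' option itself is exercised by the test changes below. As a rough sketch of the git.git comparison described above (not part of the commit; the file names here are placeholders, and the walked-tree counts come from temporary counters that this commit does not add):

        # Revision input in 'git pack-objects --revs' format:
        # interesting tips first, uninteresting boundaries prefixed with '^'.
        cat >input.txt <<-EOF
        v2.19.0
        ^v2.19.0~10
        EOF

        # Existing algorithm: walks every reachable tree at the boundary.
        time git pack-objects --stdout --revs <input.txt >nonsparse.pack

        # New algorithm: groups trees by path and stops recursing once a
        # path's trees are all interesting or all uninteresting.
        time git pack-objects --stdout --revs --sparse <input.txt >sparse.pack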
1 parent ab733da commit c44172c

2 files changed: +144, -16 lines

revision.c

Lines changed: 129 additions & 10 deletions
@@ -27,6 +27,7 @@
 #include "commit-reach.h"
 #include "commit-graph.h"
 #include "prio-queue.h"
+#include "hashmap.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -99,29 +100,147 @@ void mark_tree_uninteresting(struct repository *r, struct tree *tree)
 	mark_tree_contents_uninteresting(r, tree);
 }
 
+struct path_and_oids_entry {
+	struct hashmap_entry ent;
+	char *path;
+	struct oidset set;
+};
+
+static int path_and_oids_cmp(const void *hashmap_cmp_fn_data,
+			     const struct path_and_oids_entry *e1,
+			     const struct path_and_oids_entry *e2,
+			     const void *keydata)
+{
+	return strcmp(e1->path, e2->path);
+}
+
+int map_flags = 0;
+static void paths_and_oids_init(struct hashmap *map)
+{
+	hashmap_init(map, (hashmap_cmp_fn) path_and_oids_cmp, &map_flags, 0);
+}
+
+static void paths_and_oids_clear(struct hashmap *map)
+{
+	struct hashmap_iter iter;
+	struct path_and_oids_entry *entry;
+	hashmap_iter_init(map, &iter);
+
+	while ((entry = (struct path_and_oids_entry *)hashmap_iter_next(&iter))) {
+		oidset_clear(&entry->set);
+		free(entry->path);
+	}
+
+	hashmap_free(map, 1);
+}
+
+static void paths_and_oids_insert(struct hashmap *map,
+				  const char *path,
+				  const struct object_id *oid)
+{
+	int hash = strhash(path);
+	struct path_and_oids_entry key;
+	struct path_and_oids_entry *entry;
+
+	hashmap_entry_init(&key, hash);
+	key.path = xstrdup(path);
+	oidset_init(&key.set, 0);
+
+	if (!(entry = (struct path_and_oids_entry *)hashmap_get(map, &key, NULL))) {
+		entry = xcalloc(1, sizeof(struct path_and_oids_entry));
+		hashmap_entry_init(entry, hash);
+		entry->path = key.path;
+		oidset_init(&entry->set, 16);
+		hashmap_put(map, entry);
+	} else {
+		free(key.path);
+	}
+
+	oidset_insert(&entry->set, oid);
+}
+
+static void add_children_by_path(struct repository *r,
+				 struct tree *tree,
+				 struct hashmap *map)
+{
+	struct tree_desc desc;
+	struct name_entry entry;
+
+	if (!tree)
+		return;
+
+	if (parse_tree_gently(tree, 1) < 0)
+		return;
+
+	init_tree_desc(&desc, tree->buffer, tree->size);
+	while (tree_entry(&desc, &entry)) {
+		switch (object_type(entry.mode)) {
+		case OBJ_TREE:
+			paths_and_oids_insert(map, entry.path, entry.oid);
+
+			if (tree->object.flags & UNINTERESTING) {
+				struct tree *child = lookup_tree(r, entry.oid);
+				if (child)
+					child->object.flags |= UNINTERESTING;
+			}
+			break;
+		case OBJ_BLOB:
+			if (tree->object.flags & UNINTERESTING) {
+				struct blob *child = lookup_blob(r, entry.oid);
+				if (child)
+					child->object.flags |= UNINTERESTING;
+			}
+			break;
+		default:
+			/* Subproject commit - not in this repository */
+			break;
+		}
+	}
+
+	free_tree_buffer(tree);
+}
+
 void mark_trees_uninteresting_sparse(struct repository *r,
 				     struct oidset *set)
 {
+	unsigned has_interesting = 0, has_uninteresting = 0;
+	struct hashmap map;
+	struct hashmap_iter map_iter;
+	struct path_and_oids_entry *entry;
 	struct object_id *oid;
 	struct oidset_iter iter;
 
 	oidset_iter_init(set, &iter);
-	while ((oid = oidset_iter_next(&iter))) {
+	while ((!has_interesting || !has_uninteresting) &&
+	       (oid = oidset_iter_next(&iter))) {
 		struct tree *tree = lookup_tree(r, oid);
 
 		if (!tree)
 			continue;
 
-		if (tree->object.flags & UNINTERESTING) {
-			/*
-			 * Remove the flag so the next call
-			 * is not a no-op. The flag is added
-			 * in mark_tree_unintersting().
-			 */
-			tree->object.flags ^= UNINTERESTING;
-			mark_tree_uninteresting(r, tree);
-		}
+		if (tree->object.flags & UNINTERESTING)
+			has_uninteresting = 1;
+		else
+			has_interesting = 1;
+	}
+
+	/* Do not walk unless we have both types of trees. */
+	if (!has_uninteresting || !has_interesting)
+		return;
+
+	paths_and_oids_init(&map);
+
+	oidset_iter_init(set, &iter);
+	while ((oid = oidset_iter_next(&iter))) {
+		struct tree *tree = lookup_tree(r, oid);
+		add_children_by_path(r, tree, &map);
 	}
+
+	hashmap_iter_init(&map, &map_iter);
+	while ((entry = hashmap_iter_next(&map_iter)))
+		mark_trees_uninteresting_sparse(r, &entry->set);
+
+	paths_and_oids_clear(&map);
 }
 
 struct commit_stack {

t/t5322-pack-objects-sparse.sh

Lines changed: 15 additions & 6 deletions
@@ -83,33 +83,42 @@ test_expect_success 'sparse pack-objects' '
 	test_cmp expect_objects.txt sparse_objects.txt
 '
 
+# Demonstrate that the algorithms differ when we copy a tree wholesale
+# from one folder to another.
+
 test_expect_success 'duplicate a folder from f1 into f3' '
 	mkdir f3/f4 &&
 	cp -r f1/f1/* f3/f4 &&
 	git add f3/f4 &&
 	git commit -m "Copied f1/f1 to f3/f4" &&
-	cat >packinput.txt <<-EOF &&
+	cat >packinput.txt <<-EOF
 	topic1
 	^topic1~1
 	EOF
-	git rev-parse \
-		topic1 \
-		topic1^{tree} \
-		topic1:f3 | sort >expect_objects.txt
 '
 
 test_expect_success 'non-sparse pack-objects' '
+	git rev-parse \
+		topic1 \
+		topic1^{tree} \
+		topic1:f3 | sort >expect_objects.txt &&
 	git pack-objects --stdout --revs <packinput.txt >nonsparse.pack &&
 	git index-pack -o nonsparse.idx nonsparse.pack &&
 	git show-index <nonsparse.idx | awk "{print \$2}" >nonsparse_objects.txt &&
 	test_cmp expect_objects.txt nonsparse_objects.txt
 '
 
 test_expect_success 'sparse pack-objects' '
+	git rev-parse \
+		topic1 \
+		topic1^{tree} \
+		topic1:f3 \
+		topic1:f3/f4 \
+		topic1:f3/f4/data.txt | sort >expect_sparse_objects.txt &&
 	git pack-objects --stdout --revs --sparse <packinput.txt >sparse.pack &&
 	git index-pack -o sparse.idx sparse.pack &&
 	git show-index <sparse.idx | awk "{print \$2}" >sparse_objects.txt &&
-	test_cmp expect_objects.txt sparse_objects.txt
+	test_cmp expect_sparse_objects.txt sparse_objects.txt
 '
 
 test_done
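
If you want to exercise the updated script on its own, it can be run from the t/ directory of a git.git checkout in the usual way for git's test suite; the -v flag for verbose output is optional:

        cd t && ./t5322-pack-objects-sparse.sh -v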
