midx: verify: group objects by packfile to speed up object verification #124
Conversation
On a repo with 3600 packfiles, run time went from 12 minutes to 26 seconds.
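The core idea, sketched below in C: build a small (position, pack id) pair per object, sort the pairs by pack id, and verify in that order so each packfile only has to be opened once. This is an illustrative sketch of the approach, not the exact patch; names like obj_pack_pair and object_pack() are stand-ins.

	#include <stdint.h>
	#include <stdlib.h>

	/* Illustrative sketch, not the exact patch. */
	struct obj_pack_pair {
		uint32_t pos;          /* object's position in the midx */
		uint32_t pack_int_id;  /* id of the packfile containing it */
	};

	static int compare_by_pack(const void *a_, const void *b_)
	{
		const struct obj_pack_pair *a = a_, *b = b_;
		if (a->pack_int_id != b->pack_int_id)
			return a->pack_int_id < b->pack_int_id ? -1 : 1;
		return 0;
	}

	/*
	 * Build one pair per object and sort by pack id, so all objects
	 * in pack 0 are verified first, then pack 1, and so on. Only the
	 * current pack ever needs an open file descriptor.
	 */
	static void sort_objects_by_pack(struct obj_pack_pair *pairs,
					 uint32_t num_objects,
					 uint32_t (*object_pack)(uint32_t pos))
	{
		uint32_t i;

		for (i = 0; i < num_objects; i++) {
			pairs[i].pos = i;
			pairs[i].pack_int_id = object_pack(i);
		}
		qsort(pairs, num_objects, sizeof(*pairs), compare_by_pack);
	}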
Force-pushed from 7f7d840 to 81ba82c
@@ -1040,15 +1062,32 @@ int verify_midx_file(const char *object_dir)
	}

	progress = start_progress(_("Verifying object offsets"), m->num_objects);

	/*
	 * Create an array mapping each object to its packfile id. Sort it
Makes sense, and follows a pattern that exists elsewhere in this file (i.e. midx_repack()). Clearly results in a huge perf win by avoiding opening/closing pack files all the time.
		nth_midxed_object_oid(&oid, m, i);
		if (k > 0 && pairs[k-1].pack_int_id != pairs[k].pack_int_id &&
		    m->packs[pairs[k-1].pack_int_id])
			close_pack_fd(m->packs[pairs[k-1].pack_int_id]);
If I'm reading this correctly, this is an (optional) optimization that keeps the number of open pack files to a minimum. I'm assuming that without it, they would start being closed transparently once you reached some max threshold. Since you know they are sorted, it makes sense to do the optimization here.
Yeah, the max-file-descriptor problem that I fixed yesterday showed we were holding 2000+ packfiles open when we started running out. Fixing that caused us to still hold 2000+ open, but close and open packfiles as necessary to do the random access.
So yeah, this fix kinda eliminates the need for the previous fix. But I'm keeping that one in for now, since it is harmless and just seems like the correct thing to do.
@@ -309,7 +309,7 @@ void close_pack_windows(struct packed_git *p)
	}
}

-static int close_pack_fd(struct packed_git *p)
+int close_pack_fd(struct packed_git *p)
This wouldn't be needed without the optimization above, but I don't see any problem making this public.
Right. The new verify loop completely verifies all objects in one packfile before moving to the next packfile (because of the sort). But when we hit 2000+ packfiles in the directory, visiting the next packfile requires us to free up an fd, and this triggers the LRU search in close_one_pack(). So by closing the previous packfile in that loop, we'll only have one packfile open and avoid all of the LRU searching (which is O(n^2)).
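Here's a minimal sketch of the loop body being described; the boundary check mirrors the hunk quoted above, while the surrounding for loop and the verification placeholder are reconstructed, not quoted from the patch:

	for (k = 0; k < m->num_objects; k++) {
		/*
		 * Crossing a packfile boundary in the sorted pairs[] array
		 * means the previous pack is fully verified: close its fd
		 * eagerly, so close_one_pack() never has to run its O(n)
		 * LRU scan to reclaim a descriptor.
		 */
		if (k > 0 && pairs[k-1].pack_int_id != pairs[k].pack_int_id &&
		    m->packs[pairs[k-1].pack_int_id])
			close_pack_fd(m->packs[pairs[k-1].pack_int_id]);

		/* ... verify the object at pairs[k].pos ... */
	}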
Looks good to me, and a great performance win. Would it be possible to multi-thread it now that we are only going through one pack at a time, to speed it up even more?
Question about the memory usage of the parallel data structure:
	 * to group the objects by packfile. Use this ordering to visit each
	 * of the objects, but only require 1 packfile to be open at a time.
	 */
	ALLOC_ARRAY(pairs, m->num_objects);
For very large repos, this could be asking for many GB, right? Should a failure to alloc cause a failure in midx validation?
Jeff and I talked offline and I misunderstood the number of objects in play. We should ask for less than 500MB, which should be serviceable.
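For a rough sense of scale (assumed figures, not from the PR): if each entry in the parallel array is a pair of 32-bit values, the allocation costs 8 bytes per object, so even a hypothetical 50 million objects stays around 381 MiB, consistent with the sub-500MB estimate above:

	#include <stdint.h>
	#include <stdio.h>

	/* Assumed pair layout: two 32-bit fields, 8 bytes per entry. */
	struct obj_pack_pair { uint32_t pos, pack_int_id; };

	int main(void)
	{
		uint64_t num_objects = 50000000; /* hypothetical very large repo */
		uint64_t bytes = num_objects * sizeof(struct obj_pack_pair);

		/* 50M * 8 bytes = 400,000,000 bytes, about 381 MiB. */
		printf("%llu MiB\n", (unsigned long long)(bytes >> 20));
		return 0;
	}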
Approved.
@kewillford Yeah, that was my thought too. Now that we have the loop looking at a single packfile at a time, we should be able to thread that. I'll save that for another day though. :-)
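Purely as a hypothetical sketch of the deferred threading idea (nothing here is in the PR): because the sorted array groups objects by pack, it can be split at pack boundaries and each contiguous range handed to its own worker, so no two threads ever touch the same packfile:

	#include <pthread.h>
	#include <stdint.h>
	#include <stdlib.h>

	/* Hypothetical: one worker per contiguous pack-range of pairs[]. */
	struct pack_range {
		uint32_t start, end;	/* [start, end) indices into pairs[] */
	};

	static void *verify_range(void *arg)
	{
		struct pack_range *r = arg;
		uint32_t k;

		for (k = r->start; k < r->end; k++)
			; /* verify pairs[k]; the pack is private to this thread */
		return NULL;
	}

	/*
	 * Spawn one thread per range; ranges never share a packfile, so
	 * no locking is needed around pack fds or windows.
	 */
	static void verify_in_parallel(struct pack_range *ranges, int nr_ranges)
	{
		pthread_t *threads = calloc(nr_ranges, sizeof(*threads));
		int i;

		for (i = 0; i < nr_ranges; i++)
			pthread_create(&threads[i], NULL, verify_range, &ranges[i]);
		for (i = 0; i < nr_ranges; i++)
			pthread_join(threads[i], NULL);
		free(threads);
	}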
Sort the set of objects by packfile so that only one packfile needs to be open at a time.

This is a performance improvement. Previously, objects were verified in OID order. This essentially requires all packfiles to be open at the same time. If the number of packfiles exceeds the open file limit, packfiles would be closed and re-opened many times.

Signed-off-by: Jeff Hostetler <[email protected]>
Force-pushed from 81ba82c to e7fe2d4
Log multi-pack-index sub-command (cmd_mode). Log number of objects and number of packfiles.

Signed-off-by: Jeff Hostetler <[email protected]>
Force-pushed from e7fe2d4 to 4a4f2f4
midx: verify: group objects by packfile to speed up object verification
Sort the set of objects by packfile so that only one packfile needs
to be open at a time.
This is a performance improvement. Previously, objects were
verified in OID order. This essentially requires all packfiles
to be open at the same time. If the number of packfiles exceeds
the open file limit, packfiles would be closed and re-opened
many times.
Signed-off-by: Jeff Hostetler <[email protected]>