-
Notifications
You must be signed in to change notification settings - Fork 141
[GSoC] Implement Corrected Commit Date #676
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GSoC] Implement Corrected Commit Date #676
Conversation
Welcome to GitGitGadgetHi @abhishekkumar2718, and welcome to GitGitGadget, the GitHub App to send patch series to the Git mailing list from GitHub Pull Requests. Please make sure that your Pull Request has a good description, as it will be used as cover letter. Also, it is a good idea to review the commit messages one last time, as the Git project expects them in a quite specific form:
It is in general a good idea to await the automated test ("Checks") in this Pull Request before contributing the patches, e.g. to avoid trivial issues such as unportable code. Contributing the patchesBefore you can contribute the patches, your GitHub username needs to be added to the list of permitted users. Any already-permitted user can do that, by adding a comment to your PR of the form Both the person who commented An alternative is the channel
Once on the list of permitted usernames, you can contribute the patches to the Git mailing list by adding a PR comment If you want to see what email(s) would be sent for a After you submit, GitGitGadget will respond with another comment that contains the link to the cover letter mail in the Git mailing list archive. Please make sure to monitor the discussion in that thread and to address comments and suggestions (while the comments and suggestions will be mirrored into the PR by GitGitGadget, you will still want to reply via mail). If you do not want to subscribe to the Git mailing list just to be able to respond to a mail, you can download the mbox from the Git mailing list archive (click the curl -g --user "<EMailAddress>:<Password>" \
--url "imaps://imap.gmail.com/INBOX" -T /path/to/raw.txt To iterate on your change, i.e. send a revised patch or patch series, you will first want to (force-)push to the same branch. You probably also want to modify your Pull Request description (or title). It is a good idea to summarize the revision by adding something like this to the cover letter (read: by editing the first comment on the PR, i.e. the PR description):
To send a new iteration, just add another PR comment with the contents: Need help?New contributors who want advice are encouraged to join [email protected], where volunteers who regularly contribute to Git are willing to answer newbie questions, give advice, or otherwise provide mentoring to interested contributors. You must join in order to post or view messages, but anyone can join. You may also be able to find help in real time in the developer IRC channel, |
There is an issue in commit 7193ebf81003b2916fa961bfb6e8600ad51d5d59: |
There is an issue in commit a739d05b697b75bbb42fb544cec85b97cf156b88: |
There is an issue in commit 5a3f76b14423df00b6171734135f6b06fd272513: |
@abhishekkumar2718 thanks for opening this so I can see your progress! Please @-mention me when you have working builds so I can take a quick review before you submit. |
/allow |
User abhishekkumar2718 is now allowed to use GitGitGadget. WARNING: abhishekkumar2718 has no public email address set on GitHub |
@abhishekkumar2718 please fix the public email address so GitGitGadget can CC you properly. |
5a3f76b
to
dd7a19b
Compare
There is an issue in commit dd7a19b322302173f2e03921a9fcc51f85caa864: |
e151734
to
171c05e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think updating your struct to make the generation comparisons simpler will help. You will need to update a lot of code in commit-reach.c
to use a uint64_t min_generation
whenever you see a uint32_t min_generation
or similar.
b784e44
to
647290d
Compare
/submit |
Submitted as [email protected] |
This patch series was integrated into seen via git@66841f6. |
This patch series was integrated into seen via git@218525a. |
Git uses topological levels in the commit-graph file for commit-graph traversal operations like 'git log --graph'. Unfortunately, topological levels can perform worse than committer date when parents of a commit differ greatly in generation numbers [1]. For example, 'git merge-base v4.8 v4.9' on the Linux repository walks 635,579 commits using topological levels and walks 167,468 using committer date. Since 091f4cf (commit: don't use generation numbers if not needed, 2018-08-30), 'git merge-base' uses committer date heuristic unless there is a cutoff because of the performance hit. [1] https://lore.kernel.org/git/efa3720fb40638e5d61c6130b55e3348d8e4339e.1535633886.git.gitgitgadget@gmail.com/ Thus, the need for generation number v2 was born. As Git used to die when graph version understood by it and in the commit-graph file are different [2], we needed a way to distinguish between the old and new generation number without incrementing the graph version. [2] https://lore.kernel.org/git/[email protected]/ The following candidates were proposed (https://github.com/derrickstolee/gen-test, #1): - (Epoch, Date) Pairs. - Maximum Generation Numbers. - Corrected Commit Date. - FELINE Index. - Corrected Commit Date with Monotonically Increasing Offsets. Based on performance, local computability, and immutability (along with the introduction of an additional commit-graph chunk which relieved the requirement of backwards-compatibility) Corrected Commit Date was chosen as generation number v2 and is defined as follows: For a commit C, let its corrected commit date be the maximum of the commit date of C and the corrected commit dates of its parents plus 1. Then corrected commit date offset is the difference between corrected commit date of C and commit date of C. As a special case, a root commit with the timestamp zero has corrected commit date of 1 to distinguish it from GENERATION_NUMBER_ZERO (that is, an uncomputed generation number). While it was proposed initially to store corrected commit date offsets within Commit Data Chunk, storing the offsets in a new chunk did not affect the performance measurably. The new chunk is "Generation DATa (GDAT) chunk" and it stores corrected commit date offsets while CDAT chunk stores topological level. The old versions of Git would ignore GDAT chunk, using topological levels from CDAT chunk. In contrast, new versions of Git would use corrected commit dates, falling back to topological level if the generation data chunk is absent in the commit-graph file. While storing corrected commit date offsets saves us 4 bytes per commit (as compared with storing corrected commit dates directly), it's however possible for the offset to overflow the space allocated. To handle such cases, we introduce a new chunk, _Generation Data Overflow_ (GDOV) that stores the corrected commit date. For overflowing offsets, we set MSB and store the position into the GDOV chunk, in a mechanism similar to the Extra Edges list chunk. For mixed generation number environment (for example new Git on the command line, old Git used by GUI client), we can encounter a mixed-chain commit-graph (a commit-graph chain where some of split commit-graph files have GDAT chunk and others do not). As backward compatibility is one of the goals, we can define the following behavior: While reading a mixed-chain commit-graph version, we fall back on topological levels as corrected commit dates and topological levels cannot be compared directly. When adding new layer to the split commit-graph file, and when merging some or all layers (replacing them in the latter case), the new layer will have GDAT chunk if and only if in the final result there would be no layer without GDAT chunk just below it. Signed-off-by: Abhishek Kumar <[email protected]>
With most of preparations done, let's implement corrected commit date. The corrected commit date for a commit is defined as: * A commit with no parents (a root commit) has corrected commit date equal to its committer date. * A commit with at least one parent has corrected commit date equal to the maximum of its commit date and one more than the largest corrected commit date among its parents. As a special case, a root commit with timestamp of zero (01.01.1970 00:00:00Z) has corrected commit date of one, to be able to distinguish from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit date). To minimize the space required to store corrected commit date, Git stores corrected commit date offsets into the commit-graph file. The corrected commit date offset for a commit is defined as the difference between its corrected commit date and actual commit date. Storing corrected commit date requires sizeof(timestamp_t) bytes, which in most cases is 64 bits (uintmax_t). However, corrected commit date offsets can be safely stored using only 32-bits. This halves the size of GDAT chunk, which is a reduction of around 6% in the size of commit-graph file. However, using offsets be problematic if a commit is malformed but valid and has committer date of 0 Unix time, as the offset would be the same as corrected commit date and thus require 64-bits to be stored properly. While Git does not write out offsets at this stage, Git stores the corrected commit dates in member generation of struct commit_graph_data. It will begin writing commit date offsets with the introduction of generation data chunk. Signed-off-by: Abhishek Kumar <[email protected]>
As discovered by Ævar, we cannot increment graph version to distinguish between generation numbers v1 and v2 [1]. Thus, one of pre-requistes before implementing generation number v2 was to distinguish between graph versions in a backwards compatible manner. We are going to introduce a new chunk called Generation DATa chunk (or GDAT). GDAT will store corrected committer date offsets whereas CDAT will still store topological level. Old Git does not understand GDAT chunk and would ignore it, reading topological levels from CDAT. New Git can parse GDAT and take advantage of newer generation numbers, falling back to topological levels when GDAT chunk is missing (as it would happen with a commit-graph written by old Git). We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT' which forces commit-graph file to be written without generation data chunk to emulate a commit-graph file written by old Git. To minimize the space required to store corrrected commit date, Git stores corrected commit date offsets into the commit-graph file, instea of corrected commit dates. This saves us 4 bytes per commit, decreasing the GDAT chunk size by half, but it's possible for the offset to overflow the 4-bytes allocated for storage. As such overflows are and should be exceedingly rare, we use the following overflow management scheme: We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV') to store corrected commit dates for commits with offsets greater than GENERATION_NUMBER_V2_OFFSET_MAX. If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set the MSB of the offset and the other bits store the position of corrected commit date in GDOV chunk, similar to how Extra Edge List is maintained. We test the overflow-related code with the following repo history: F - N - U / \ U - N - U N \ / N - F - N Where the commits denoted by U have committer date of zero seconds since Unix epoch, the commits denoted by N have committer date of 1112354055 (default committer date for the test suite) seconds since Unix epoch and the commits denoted by F have committer date of (2 ^ 31 - 2) seconds since Unix epoch. The largest offset observed is 2 ^ 31, just large enough to overflow. [1]: https://lore.kernel.org/git/[email protected]/ Signed-off-by: Abhishek Kumar <[email protected]>
Since there are released versions of Git that understand generation numbers in the commit-graph's CDAT chunk but do not understand the GDAT chunk, the following scenario is possible: 1. "New" Git writes a commit-graph with the GDAT chunk. 2. "Old" Git writes a split commit-graph on top without a GDAT chunk. If each layer of split commit-graph is treated independently, as it was the case before this commit, with Git inspecting only the current layer for chunk_generation_data pointer, commits in the lower layer (one with GDAT) whould have corrected commit date as their generation number, while commits in the upper layer would have topological levels as their generation. Corrected commit dates usually have much larger values than topological levels. This means that if we take two commits, one from the upper layer, and one reachable from it in the lower layer, then the expectation that the generation of a parent is smaller than the generation of a child would be violated. It is difficult to expose this issue in a test. Since we _start_ with artificially low generation numbers, any commit walk that prioritizes generation numbers will walk all of the commits with high generation number before walking the commits with low generation number. In all the cases I tried, the commit-graph layers themselves "protect" any incorrect behavior since none of the commits in the lower layer can reach the commits in the upper layer. This issue would manifest itself as a performance problem in this case, especially with something like "git log --graph" since the low generation numbers would cause the in-degree queue to walk all of the commits in the lower layer before allowing the topo-order queue to write anything to output (depending on the size of the upper layer). Therefore, When writing the new layer in split commit-graph, we write a GDAT chunk only if the topmost layer has a GDAT chunk. This guarantees that if a layer has GDAT chunk, all lower layers must have a GDAT chunk as well. Rewriting layers follows similar approach: if the topmost layer below the set of layers being rewritten (in the split commit-graph chain) exists, and it does not contain GDAT chunk, then the result of rewrite does not have GDAT chunks either. Signed-off-by: Derrick Stolee <[email protected]> Signed-off-by: Abhishek Kumar <[email protected]>
091f4cf (commit: don't use generation numbers if not needed, 2018-08-30) changed paint_down_to_common() to use commit dates instead of generation numbers v1 (topological levels) as the performance regressed on certain topologies. With generation number v2 (corrected commit dates) implemented, we no longer have to rely on commit dates and can use generation numbers. For example, the command `git merge-base v4.8 v4.9` on the Linux repository walks 167468 commits, taking 0.135s for committer date and 167496 commits, taking 0.157s for corrected committer date respectively. While using corrected commit dates, Git walks nearly the same number of commits as commit date, the process is slower as for each comparision we have to access a commit-slab (for corrected committer date) instead of accessing struct member (for committer date). This change incidentally broke the fragile t6404-recursive-merge test. t6404-recursive-merge sets up a unique repository where all commits have the same committer date without a well-defined merge-base. While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer date as a heuristic in paint_down_to_common(). 6404.1 'combined merge conflicts' merges commits in the order: - Merge C with B to form an intermediate commit. - Merge the intermediate commit with A. With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently use the corrected committer date, which changes the order in which commits are merged: - Merge A with B to form an intermediate commit. - Merge the intermediate commit with C. While resulting repositories are equivalent, 6404.4 'virtual trees were processed' fails with GIT_TEST_COMMIT_GRAPH=1 as we are selecting different merge-bases and thus have different object ids for the intermediate commits. As this has already causes problems (as noted in 859fdc0 (commit-graph: define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph within t6404-recursive-merge. Signed-off-by: Abhishek Kumar <[email protected]>
9b40d5d
to
523e2d4
Compare
/submit |
Submitted as [email protected] To fetch this version into
To fetch this version to local tag
|
On the Git mailing list, Derrick Stolee wrote (reply to this):
|
Ummm. @abhishekkumar2718 did you miss what I wrote about |
On the Git mailing list, Junio C Hamano wrote (reply to this):
|
This patch series was integrated into seen via git@deae33e. |
This patch series was integrated into seen via git@7862ac8. |
This patch series was integrated into seen via git@cd2bfee. |
This patch series was integrated into seen via git@e502eab. |
This patch series was integrated into seen via git@5fb6127. |
This patch series was integrated into seen via git@deae33e. |
This patch series was integrated into seen via git@59d4dae. |
This patch series was integrated into seen via git@7943c06. |
This patch series was integrated into seen via git@cf9fbab. |
This patch series was integrated into seen via git@2e3ad28. |
This patch series was integrated into seen via git@0cf0dc9. |
This patch series was integrated into seen via git@8b4701a. |
This patch series was integrated into next via git@8b4701a. |
This patch series was integrated into master via git@8b4701a. |
Closed via 8b4701a. |
This patch series implements the corrected commit date offsets as generation number v2, along with other pre-requisites.
Git uses topological levels in the commit-graph file for commit-graph traversal operations like 'git log --graph'. Unfortunately, topological levels can perform worse than committer date when parents of a commit differ greatly in generation numbers [1]. For example, 'git merge-base v4.8 v4.9' on the Linux repository walks 635,579 commits using topological levels and walks 167,468 using committer date. Since 091f4cf (commit: don't use generation numbers if not needed, 2018-08-30), 'git merge-base' uses committer date heuristic unless there is a cutoff because of the performance hit.
[1] https://lore.kernel.org/git/efa3720fb40638e5d61c6130b55e3348d8e4339e.1535633886.git.gitgitgadget@gmail.com/
Thus, the need for generation number v2 was born. As Git used to die when graph version understood by it and in the commit-graph file are different [2], we needed a way to distinguish between the old and new generation number without incrementing the graph version.
[2] https://lore.kernel.org/git/[email protected]/
The following candidates were proposed (https://github.com/derrickstolee/gen-test, abhishekkumar2718#1):
Based on performance, local computability, and immutability (along with the introduction of an additional commit-graph chunk which relieved the requirement of backwards-compatibility) Corrected Commit Date was chosen as generation number v2 and is defined as follows:
For a commit C, let its corrected commit date be the maximum of the commit date of C and the corrected commit dates of its parents plus 1. Then corrected commit date offset is the difference between corrected commit date of C and commit date of C. As a special case, a root commit with the timestamp zero has corrected commit date of 1 to distinguish it from GENERATION_NUMBER_ZERO (that is, an uncomputed generation number).
While it was proposed initially to store corrected commit date offsets within Commit Data Chunk, storing the offsets in a new chunk did not affect the performance measurably. The new chunk is "Generation DATa (GDAT) chunk" and it stores corrected commit date offsets while CDAT chunk stores topological level. The old versions of Git would ignore GDAT chunk, using topological levels from CDAT chunk. In contrast, new versions of Git would use corrected commit dates, falling back to topological level if the generation data chunk is absent in the commit-graph file.
While storing corrected commit date offsets saves us 4 bytes per commit (as compared with storing corrected commit dates directly), it's however possible for the offset to overflow the space allocated. To handle such cases, we introduce a new chunk, Generation Data Overflow (GDOV) that stores the corrected commit date. For overflowing offsets, we set MSB and store the position into the GDOV chunk, in a mechanism similar to the Extra Edges list chunk.
For mixed generation number environment (for example new Git on the command line, old Git used by GUI client), we can encounter a mixed-chain commit-graph (a commit-graph chain where some of split commit-graph files have GDAT chunk and others do not). As backward compatibility is one of the goals, we can define the following behavior:
While reading a mixed-chain commit-graph version, we fall back on topological levels as corrected commit dates and topological levels cannot be compared directly.
When adding new layer to the split commit-graph file, and when merging some or all layers (replacing them in the latter case), the new layer will have GDAT chunk if and only if in the final result there would be no layer without GDAT chunk just below it.
Thanks to Dr. Stolee, Dr. Narębski, Taylor Blau and SZEDER Gábor for their reviews.
I look forward to everyone's reviews!
Thanks
Improvements left for a future series:
Changes in version 7:
Changes in version 6:
Changes in version 5:
GENERATION_NUMBER_INFINITY
that can never be true in "commit-graph: add a slab to store topological levels".Changes in version 4:
Documentation/technical/commit-graph-format.txt
Changes in version 3:
[3] https://lore.kernel.org/git/[email protected]/
Changes in version 2:
GIT_TEST_COMMIT_GRAPH_NO_GDAT
to control whether to write generation data chunk.cc: Derrick Stolee [email protected], Jakub Narębski [email protected]
cc: Abhishek Kumar [email protected]
cc: SZEDER Gábor [email protected]
cc: Taylor Blau [email protected]