Skip to content

[GSoC] Implement Corrected Commit Date #676

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

abhishekkumar2718
Copy link

@abhishekkumar2718 abhishekkumar2718 commented Jul 14, 2020

This patch series implements the corrected commit date offsets as generation number v2, along with other pre-requisites.

Git uses topological levels in the commit-graph file for commit-graph traversal operations like 'git log --graph'. Unfortunately, topological levels can perform worse than committer date when parents of a commit differ greatly in generation numbers [1]. For example, 'git merge-base v4.8 v4.9' on the Linux repository walks 635,579 commits using topological levels and walks 167,468 using committer date. Since 091f4cf (commit: don't use generation numbers if not needed, 2018-08-30), 'git merge-base' uses committer date heuristic unless there is a cutoff because of the performance hit.

[1] https://lore.kernel.org/git/efa3720fb40638e5d61c6130b55e3348d8e4339e.1535633886.git.gitgitgadget@gmail.com/

Thus, the need for generation number v2 was born. As Git used to die when graph version understood by it and in the commit-graph file are different [2], we needed a way to distinguish between the old and new generation number without incrementing the graph version.

[2] https://lore.kernel.org/git/[email protected]/

The following candidates were proposed (https://github.com/derrickstolee/gen-test, abhishekkumar2718#1):

  • (Epoch, Date) Pairs.
  • Maximum Generation Numbers.
  • Corrected Commit Date.
  • FELINE Index.
  • Corrected Commit Date with Monotonically Increasing Offsets.

Based on performance, local computability, and immutability (along with the introduction of an additional commit-graph chunk which relieved the requirement of backwards-compatibility) Corrected Commit Date was chosen as generation number v2 and is defined as follows:

For a commit C, let its corrected commit date be the maximum of the commit date of C and the corrected commit dates of its parents plus 1. Then corrected commit date offset is the difference between corrected commit date of C and commit date of C. As a special case, a root commit with the timestamp zero has corrected commit date of 1 to distinguish it from GENERATION_NUMBER_ZERO (that is, an uncomputed generation number).

While it was proposed initially to store corrected commit date offsets within Commit Data Chunk, storing the offsets in a new chunk did not affect the performance measurably. The new chunk is "Generation DATa (GDAT) chunk" and it stores corrected commit date offsets while CDAT chunk stores topological level. The old versions of Git would ignore GDAT chunk, using topological levels from CDAT chunk. In contrast, new versions of Git would use corrected commit dates, falling back to topological level if the generation data chunk is absent in the commit-graph file.

While storing corrected commit date offsets saves us 4 bytes per commit (as compared with storing corrected commit dates directly), it's however possible for the offset to overflow the space allocated. To handle such cases, we introduce a new chunk, Generation Data Overflow (GDOV) that stores the corrected commit date. For overflowing offsets, we set MSB and store the position into the GDOV chunk, in a mechanism similar to the Extra Edges list chunk.

For mixed generation number environment (for example new Git on the command line, old Git used by GUI client), we can encounter a mixed-chain commit-graph (a commit-graph chain where some of split commit-graph files have GDAT chunk and others do not). As backward compatibility is one of the goals, we can define the following behavior:

While reading a mixed-chain commit-graph version, we fall back on topological levels as corrected commit dates and topological levels cannot be compared directly.

When adding new layer to the split commit-graph file, and when merging some or all layers (replacing them in the latter case), the new layer will have GDAT chunk if and only if in the final result there would be no layer without GDAT chunk just below it.

Thanks to Dr. Stolee, Dr. Narębski, Taylor Blau and SZEDER Gábor for their reviews.

I look forward to everyone's reviews!

Thanks

  • Abhishek

Improvements left for a future series:

  • Save commits with generation data overflow and extra edge commits instead of looping over all commits. cf. [email protected]
  • Verify both topological levels and corrected commit dates when present. cf. [email protected]

Changes in version 7:

  • Moved the documentation patch ahead of "commit-graph: implement corrected commit date" and elaborated on the introduction of generation number v2.

Changes in version 6:

  • Fixed typos in commit message for "commit-graph: implement corrected commit date".
  • Removed an unnecessary else-block in "commit-graph: implement corrected commit date".
  • Validate mixed generation chain correctly while writing in "commit-graph: use generation v2 only if the entire chain does".
  • Die if the GDAT chunk indicates data has overflown but there are is no generation data overflow chunk.

Changes in version 5:

  • Explained a possible reason for no change in performance for "commit-graph: fix regression when computing bloom-filters"
  • Clarified about the addition of a new test for 11-digit octal implementations of ustar.
  • Fixed duplicate test names in "commit-graph: consolidate fill_commit_graph_info".
  • Swapped the order "commit-graph: return 64-bit generation number", "commit-graph: add a slab to store topological levels" to minimize lines changed.
  • Fixed the mismerge in "commit-graph: return 64-bit generation number"
  • Clarified the preparatory steps are for the larger goal of implementing generation number v2 in "commit-graph: return 64-bit generation number".
  • Moved the rename of "run_three_modes()" to "run_all_modes()" into a new patch "t6600-test-reach: generalize *_three_modes".
  • Explained and removed the checks for GENERATION_NUMBER_INFINITY that can never be true in "commit-graph: add a slab to store topological levels".
  • Fixed incorrect logic for verifying commit-graph in "commit-graph: implement corrected commit date".
  • Added minor improvements to commit message of "commit-graph: implement generation data chunk".
  • Added '--date ' option to test_commit() in 'test-lib-functions.sh' in "commit-graph: implement generation data chunk".
  • Improved coding style (also in tests) for "commit-graph: use generation v2 only if entire chain does".
  • Simplified test repository structure in "commit-graph: use generation v2 only if entire chain does" as only the number of commits in a split commit-graph layer are relevant.
  • Added a new test in "commit-graph: use generation v2 only if entire chain does" to check if the layers are merged correctly.
  • Explicitly mentioned commit "091f4cf3" in the commit-message of "commit-graph: use corrected commit dates in paint_down_to_common()".
  • Minor corrections to documentation in "doc: add corrected commit date info".
  • Minor corrections to coding style.

Changes in version 4:

  • Added GDOV to handle overflows in generation data.
  • Added a test for writing tip graph for a generation number v2 graph chain in t5324-split-commit-graph.sh
  • Added a section on how mixed generation number chains are handled in Documentation/technical/commit-graph-format.txt
  • Reverted unimportant whitespace, style changes in commit-graph.c
  • Added header comments about the order of comparision for compare_commits_by_gen_then_commit_date in commit.h, compare_commits_by_gen in commit-graph.h
  • Elaborated on why t6404 fails with corrected commit date and must be run with GIT_TEST_COMMIT_GRAPH=1in the commit "commit-reach: use corrected commit dates in paint_down_to_common()"
  • Elaborated on write behavior for mixed generation number chains in the commit "commit-graph: use generation v2 only if entire chain does"
  • Added notes about adding the topo_level slab to struct write_commit_graph_context as well as struct commit_graph.
  • Clarified commit message for "commit-graph: consolidate fill_commit_graph_info"
  • Removed the claim "GDAT can store future generation numbers" because it hasn't been tested yet.

Changes in version 3:

  • Reordered patches to implement corrected commit date before generation data chunk [3].
  • Split "implement corrected commit date" into two patches - one introducing the topo level slab and other implementing corrected commit dates.
  • Extended split-commit-graph tests to verify at the end of test.
  • Use topological levels as generation number if any of split commit-graph files do not have generation data chunk.

[3] https://lore.kernel.org/git/[email protected]/

Changes in version 2:

  • Add tests for generation data chunk.
  • Add an option GIT_TEST_COMMIT_GRAPH_NO_GDAT to control whether to write generation data chunk.
  • Compare commits with corrected commit dates if present in paint_down_to_common().
  • Update technical documentation.
  • Handle mixed generation commit chains.
  • Improve commit messages for "commit-graph: fix regression when computing bloom filter", "commit-graph: consolidate fill_commit_graph_info",
  • Revert unnecessary whitespace changes.
  • Split uint_32 -> timestamp_t change into a new commit.

cc: Derrick Stolee [email protected], Jakub Narębski [email protected]
cc: Abhishek Kumar [email protected]
cc: SZEDER Gábor [email protected]
cc: Taylor Blau [email protected]

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 14, 2020

Welcome to GitGitGadget

Hi @abhishekkumar2718, and welcome to GitGitGadget, the GitHub App to send patch series to the Git mailing list from GitHub Pull Requests.

Please make sure that your Pull Request has a good description, as it will be used as cover letter.

Also, it is a good idea to review the commit messages one last time, as the Git project expects them in a quite specific form:

  • the lines should not exceed 76 columns,
  • the first line should be like a header and typically start with a prefix like "tests:" or "commit:", and
  • the commit messages' body should be describing the "why?" of the change.
  • Finally, the commit messages should end in a Signed-off-by: line matching the commits' author.

It is in general a good idea to await the automated test ("Checks") in this Pull Request before contributing the patches, e.g. to avoid trivial issues such as unportable code.

Contributing the patches

Before you can contribute the patches, your GitHub username needs to be added to the list of permitted users. Any already-permitted user can do that, by adding a comment to your PR of the form /allow. A good way to find other contributors is to locate recent pull requests where someone has been /allowed:

Both the person who commented /allow and the PR author are able to /allow you.

An alternative is the channel #git-devel on the FreeNode IRC network:

<newcontributor> I've just created my first PR, could someone please /allow me? https://github.com/gitgitgadget/git/pull/12345
<veteran> newcontributor: it is done
<newcontributor> thanks!

Once on the list of permitted usernames, you can contribute the patches to the Git mailing list by adding a PR comment /submit.

If you want to see what email(s) would be sent for a /submit request, add a PR comment /preview to have the email(s) sent to you. You must have a public GitHub email address for this.

After you submit, GitGitGadget will respond with another comment that contains the link to the cover letter mail in the Git mailing list archive. Please make sure to monitor the discussion in that thread and to address comments and suggestions (while the comments and suggestions will be mirrored into the PR by GitGitGadget, you will still want to reply via mail).

If you do not want to subscribe to the Git mailing list just to be able to respond to a mail, you can download the mbox from the Git mailing list archive (click the (raw) link), then import it into your mail program. If you use GMail, you can do this via:

curl -g --user "<EMailAddress>:<Password>" \
    --url "imaps://imap.gmail.com/INBOX" -T /path/to/raw.txt

To iterate on your change, i.e. send a revised patch or patch series, you will first want to (force-)push to the same branch. You probably also want to modify your Pull Request description (or title). It is a good idea to summarize the revision by adding something like this to the cover letter (read: by editing the first comment on the PR, i.e. the PR description):

Changes since v1:
- Fixed a typo in the commit message (found by ...)
- Added a code comment to ... as suggested by ...
...

To send a new iteration, just add another PR comment with the contents: /submit.

Need help?

New contributors who want advice are encouraged to join [email protected], where volunteers who regularly contribute to Git are willing to answer newbie questions, give advice, or otherwise provide mentoring to interested contributors. You must join in order to post or view messages, but anyone can join.

You may also be able to find help in real time in the developer IRC channel, #git-devel on Freenode. Remember that IRC does not support offline messaging, so if you send someone a private message and log out, they cannot respond to you. The scrollback of #git-devel is archived, though.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 14, 2020

There is an issue in commit 7193ebf81003b2916fa961bfb6e8600ad51d5d59:
Commit not signed off

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 14, 2020

There is an issue in commit a739d05b697b75bbb42fb544cec85b97cf156b88:
Commit not signed off

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 14, 2020

There is an issue in commit 5a3f76b14423df00b6171734135f6b06fd272513:
Commit checks stopped - the message is too short

@derrickstolee derrickstolee self-requested a review July 14, 2020 13:28
@derrickstolee
Copy link

@abhishekkumar2718 thanks for opening this so I can see your progress! Please @-mention me when you have working builds so I can take a quick review before you submit.

@dscho
Copy link
Member

dscho commented Jul 14, 2020

/allow

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 14, 2020

User abhishekkumar2718 is now allowed to use GitGitGadget.

WARNING: abhishekkumar2718 has no public email address set on GitHub

@derrickstolee
Copy link

User abhishekkumar2718 is now allowed to use GitGitGadget.

WARNING: abhishekkumar2718 has no public email address set on GitHub

@abhishekkumar2718 please fix the public email address so GitGitGadget can CC you properly.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 16, 2020

There is an issue in commit dd7a19b322302173f2e03921a9fcc51f85caa864:
Commit not signed off

@abhishekkumar2718 abhishekkumar2718 force-pushed the corrected_commit_date branch 6 times, most recently from e151734 to 171c05e Compare July 20, 2020 17:03
Copy link

@derrickstolee derrickstolee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think updating your struct to make the generation comparisons simpler will help. You will need to update a lot of code in commit-reach.c to use a uint64_t min_generation whenever you see a uint32_t min_generation or similar.

@abhishekkumar2718 abhishekkumar2718 force-pushed the corrected_commit_date branch 7 times, most recently from b784e44 to 647290d Compare July 28, 2020 07:17
@abhishekkumar2718
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 28, 2020

Submitted as [email protected]

@abhishekkumar2718 abhishekkumar2718 marked this pull request as ready for review July 28, 2020 09:14
@gitgitgadget
Copy link

gitgitgadget bot commented Jan 31, 2021

This patch series was integrated into seen via git@66841f6.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 1, 2021

This patch series was integrated into seen via git@218525a.

Git uses topological levels in the commit-graph file for commit-graph
traversal operations like 'git log --graph'. Unfortunately, topological
levels can perform worse than committer date when parents of a commit
differ greatly in generation numbers [1]. For example, 'git merge-base
v4.8 v4.9' on the Linux repository walks 635,579 commits using
topological levels and walks 167,468 using committer date. Since
091f4cf (commit: don't use generation numbers if not needed,
2018-08-30), 'git merge-base' uses committer date heuristic unless there
is a cutoff because of the performance hit.

[1] https://lore.kernel.org/git/efa3720fb40638e5d61c6130b55e3348d8e4339e.1535633886.git.gitgitgadget@gmail.com/

Thus, the need for generation number v2 was born. As Git used to die
when graph version understood by it and in the commit-graph file are
different [2], we needed a way to distinguish between the old and new
generation number without incrementing the graph version.

[2] https://lore.kernel.org/git/[email protected]/

The following candidates were proposed (https://github.com/derrickstolee/gen-test,
#1):
- (Epoch, Date) Pairs.
- Maximum Generation Numbers.
- Corrected Commit Date.
- FELINE Index.
- Corrected Commit Date with Monotonically Increasing Offsets.

Based on performance, local computability, and immutability (along with
the introduction of an additional commit-graph chunk which relieved the
requirement of backwards-compatibility) Corrected Commit Date was chosen
as generation number v2 and is defined as follows:

For a commit C, let its corrected commit date  be the maximum of the
commit date of C and the corrected commit dates of its parents plus 1.
Then corrected commit date offset is the difference between corrected
commit date of C and commit date of C. As a special case, a root commit
with the timestamp zero has corrected commit date of 1 to distinguish it
from GENERATION_NUMBER_ZERO (that is, an uncomputed generation number).

While it was proposed initially to store corrected commit date offsets
within Commit Data Chunk, storing the offsets in a new chunk did not
affect the performance measurably. The new chunk is "Generation DATa
(GDAT) chunk" and it stores corrected commit date offsets while CDAT
chunk stores topological level. The old versions of Git would ignore
GDAT chunk, using topological levels from CDAT chunk. In contrast, new
versions of Git would use corrected commit dates, falling back to
topological level if the generation data chunk is absent in the
commit-graph file.

While storing corrected commit date offsets saves us 4 bytes per commit
(as compared with storing corrected commit dates directly), it's however
possible for the offset to overflow the space allocated. To handle such
cases, we introduce a new chunk, _Generation Data Overflow_ (GDOV) that
stores the corrected commit date. For overflowing offsets, we set MSB
and store the position into the GDOV chunk, in a mechanism similar to
the Extra Edges list chunk.

For mixed generation number environment (for example new Git on the
command line, old Git used by GUI client), we can encounter a
mixed-chain commit-graph (a commit-graph chain where some of split
commit-graph files have GDAT chunk and others do not). As backward
compatibility is one of the goals, we can define the following behavior:

While reading a mixed-chain commit-graph version, we fall back on
topological levels as corrected commit dates and topological levels
cannot be compared directly.

When adding new layer to the split commit-graph file, and when merging
some or all layers (replacing them in the latter case), the new layer
will have GDAT chunk if and only if in the final result there would be
no layer without GDAT chunk just below it.

Signed-off-by: Abhishek Kumar <[email protected]>
With most of preparations done, let's implement corrected commit date.

The corrected commit date for a commit is defined as:

* A commit with no parents (a root commit) has corrected commit date
  equal to its committer date.
* A commit with at least one parent has corrected commit date equal to
  the maximum of its commit date and one more than the largest corrected
  commit date among its parents.

As a special case, a root commit with timestamp of zero (01.01.1970
00:00:00Z) has corrected commit date of one, to be able to distinguish
from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit
date).

To minimize the space required to store corrected commit date, Git
stores corrected commit date offsets into the commit-graph file. The
corrected commit date offset for a commit is defined as the difference
between its corrected commit date and actual commit date.

Storing corrected commit date requires sizeof(timestamp_t) bytes, which
in most cases is 64 bits (uintmax_t). However, corrected commit date
offsets can be safely stored using only 32-bits. This halves the size
of GDAT chunk, which is a reduction of around 6% in the size of
commit-graph file.

However, using offsets be problematic if a commit is malformed but valid
and has committer date of 0 Unix time, as the offset would be the same
as corrected commit date and thus require 64-bits to be stored properly.

While Git does not write out offsets at this stage, Git stores the
corrected commit dates in member generation of struct commit_graph_data.
It will begin writing commit date offsets with the introduction of
generation data chunk.

Signed-off-by: Abhishek Kumar <[email protected]>
As discovered by Ævar, we cannot increment graph version to
distinguish between generation numbers v1 and v2 [1]. Thus, one of
pre-requistes before implementing generation number v2 was to
distinguish between graph versions in a backwards compatible manner.

We are going to introduce a new chunk called Generation DATa chunk (or
GDAT). GDAT will store corrected committer date offsets whereas CDAT
will still store topological level.

Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. New Git can parse GDAT and take advantage
of newer generation numbers, falling back to topological levels when
GDAT chunk is missing (as it would happen with a commit-graph written
by old Git).

We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
which forces commit-graph file to be written without generation data
chunk to emulate a commit-graph file written by old Git.

To minimize the space required to store corrrected commit date, Git
stores corrected commit date offsets into the commit-graph file, instea
of corrected commit dates. This saves us 4 bytes per commit, decreasing
the GDAT chunk size by half, but it's possible for the offset to
overflow the 4-bytes allocated for storage. As such overflows are and
should be exceedingly rare, we use the following overflow management
scheme:

We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV')
to store corrected commit dates for commits with offsets greater than
GENERATION_NUMBER_V2_OFFSET_MAX.

If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set
the MSB of the offset and the other bits store the position of corrected
commit date in GDOV chunk, similar to how Extra Edge List is maintained.

We test the overflow-related code with the following repo history:

           F - N - U
          /         \
U - N - U            N
         \          /
	  N - F - N

Where the commits denoted by U have committer date of zero seconds
since Unix epoch, the commits denoted by N have committer date of
1112354055 (default committer date for the test suite) seconds since
Unix epoch and the commits denoted by F have committer date of
(2 ^ 31 - 2) seconds since Unix epoch.

The largest offset observed is 2 ^ 31, just large enough to overflow.

[1]: https://lore.kernel.org/git/[email protected]/

Signed-off-by: Abhishek Kumar <[email protected]>
Since there are released versions of Git that understand generation
numbers in the commit-graph's CDAT chunk but do not understand the GDAT
chunk, the following scenario is possible:

1. "New" Git writes a commit-graph with the GDAT chunk.
2. "Old" Git writes a split commit-graph on top without a GDAT chunk.

If each layer of split commit-graph is treated independently, as it was
the case before this commit, with Git inspecting only the current layer
for chunk_generation_data pointer, commits in the lower layer (one with
GDAT) whould have corrected commit date as their generation number,
while commits in the upper layer would have topological levels as their
generation. Corrected commit dates usually have much larger values than
topological levels. This means that if we take two commits, one from the
upper layer, and one reachable from it in the lower layer, then the
expectation that the generation of a parent is smaller than the
generation of a child would be violated.

It is difficult to expose this issue in a test. Since we _start_ with
artificially low generation numbers, any commit walk that prioritizes
generation numbers will walk all of the commits with high generation
number before walking the commits with low generation number. In all the
cases I tried, the commit-graph layers themselves "protect" any
incorrect behavior since none of the commits in the lower layer can
reach the commits in the upper layer.

This issue would manifest itself as a performance problem in this case,
especially with something like "git log --graph" since the low
generation numbers would cause the in-degree queue to walk all of the
commits in the lower layer before allowing the topo-order queue to write
anything to output (depending on the size of the upper layer).

Therefore, When writing the new layer in split commit-graph, we write a
GDAT chunk only if the topmost layer has a GDAT chunk. This guarantees
that if a layer has GDAT chunk, all lower layers must have a GDAT chunk
as well.

Rewriting layers follows similar approach: if the topmost layer below
the set of layers being rewritten (in the split commit-graph chain)
exists, and it does not contain GDAT chunk, then the result of rewrite
does not have GDAT chunks either.

Signed-off-by: Derrick Stolee <[email protected]>
Signed-off-by: Abhishek Kumar <[email protected]>
091f4cf (commit: don't use generation numbers if not needed,
2018-08-30) changed paint_down_to_common() to use commit dates instead
of generation numbers v1 (topological levels) as the performance
regressed on certain topologies. With generation number v2 (corrected
commit dates) implemented, we no longer have to rely on commit dates and
can use generation numbers.

For example, the command `git merge-base v4.8 v4.9` on the Linux
repository walks 167468 commits, taking 0.135s for committer date and
167496 commits, taking 0.157s for corrected committer date respectively.

While using corrected commit dates, Git walks nearly the same number of
commits as commit date, the process is slower as for each comparision we
have to access a commit-slab (for corrected committer date) instead of
accessing struct member (for committer date).

This change incidentally broke the fragile t6404-recursive-merge test.
t6404-recursive-merge sets up a unique repository where all commits have
the same committer date without a well-defined merge-base.

While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer
date as a heuristic in paint_down_to_common(). 6404.1 'combined merge
conflicts' merges commits in the order:
- Merge C with B to form an intermediate commit.
- Merge the intermediate commit with A.

With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently
use the corrected committer date, which changes the order in which
commits are merged:
- Merge A with B to form an intermediate commit.
- Merge the intermediate commit with C.

While resulting repositories are equivalent, 6404.4 'virtual trees were
processed' fails with GIT_TEST_COMMIT_GRAPH=1 as we are selecting
different merge-bases and thus have different object ids for the
intermediate commits.

As this has already causes problems (as noted in 859fdc0 (commit-graph:
define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph
within t6404-recursive-merge.

Signed-off-by: Abhishek Kumar <[email protected]>
@abhishekkumar2718
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 1, 2021

Submitted as [email protected]

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-676/abhishekkumar2718/corrected_commit_date-v7

To fetch this version to local tag pr-676/abhishekkumar2718/corrected_commit_date-v7:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-676/abhishekkumar2718/corrected_commit_date-v7

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 1, 2021

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 2/1/2021 1:58 AM, Abhishek Kumar via GitGitGadget wrote:

> Changes in version 7:
> 
>  * Moved the documentation patch ahead of "commit-graph: implement corrected
>    commit date" and elaborated on the introduction of generation number v2.

The only change in this version is this commit message:

>  11:  e571f03d8bd !  7:  8647b5d2e38 doc: add corrected commit date info
>      @@ Metadata
>       Author: Abhishek Kumar <[email protected]>
>       
>        ## Commit message ##
>      -    doc: add corrected commit date info
>      +    commit-graph: document generation number v2
>       
>      -    With generation data chunk and corrected commit dates implemented, let's
>      -    update the technical documentation for commit-graph.
>      +    Git uses topological levels in the commit-graph file for commit-graph
>      +    traversal operations like 'git log --graph'. Unfortunately, topological
>      +    levels can perform worse than committer date when parents of a commit
>      +    differ greatly in generation numbers [1]. For example, 'git merge-base
>      +    v4.8 v4.9' on the Linux repository walks 635,579 commits using
>      +    topological levels and walks 167,468 using committer date. Since
>      +    091f4cf3 (commit: don't use generation numbers if not needed,
>      +    2018-08-30), 'git merge-base' uses committer date heuristic unless there
>      +    is a cutoff because of the performance hit.
>      +
>      +    [1] https://lore.kernel.org/git/efa3720fb40638e5d61c6130b55e3348d8e4339e.1535633886.git.gitgitgadget@gmail.com/
>      +
>      +    Thus, the need for generation number v2 was born. As Git used to die
>      +    when graph version understood by it and in the commit-graph file are
>      +    different [2], we needed a way to distinguish between the old and new
>      +    generation number without incrementing the graph version.
>      +
>      +    [2] https://lore.kernel.org/git/[email protected]/
>      +
>      +    The following candidates were proposed (https://github.com/derrickstolee/gen-test,
>      +    https://github.com/abhishekkumar2718/git/pull/1):
>      +    - (Epoch, Date) Pairs.
>      +    - Maximum Generation Numbers.
>      +    - Corrected Commit Date.
>      +    - FELINE Index.
>      +    - Corrected Commit Date with Monotonically Increasing Offsets.
>      +
>      +    Based on performance, local computability, and immutability (along with
>      +    the introduction of an additional commit-graph chunk which relieved the
>      +    requirement of backwards-compatibility) Corrected Commit Date was chosen
>      +    as generation number v2 and is defined as follows:
>      +
>      +    For a commit C, let its corrected commit date  be the maximum of the
>      +    commit date of C and the corrected commit dates of its parents plus 1.
>      +    Then corrected commit date offset is the difference between corrected
>      +    commit date of C and commit date of C. As a special case, a root commit
>      +    with the timestamp zero has corrected commit date of 1 to distinguish it
>      +    from GENERATION_NUMBER_ZERO (that is, an uncomputed generation number).
>      +
>      +    While it was proposed initially to store corrected commit date offsets
>      +    within Commit Data Chunk, storing the offsets in a new chunk did not
>      +    affect the performance measurably. The new chunk is "Generation DATa
>      +    (GDAT) chunk" and it stores corrected commit date offsets while CDAT
>      +    chunk stores topological level. The old versions of Git would ignore
>      +    GDAT chunk, using topological levels from CDAT chunk. In contrast, new
>      +    versions of Git would use corrected commit dates, falling back to
>      +    topological level if the generation data chunk is absent in the
>      +    commit-graph file.
>      +
>      +    While storing corrected commit date offsets saves us 4 bytes per commit
>      +    (as compared with storing corrected commit dates directly), it's however
>      +    possible for the offset to overflow the space allocated. To handle such
>      +    cases, we introduce a new chunk, _Generation Data Overflow_ (GDOV) that
>      +    stores the corrected commit date. For overflowing offsets, we set MSB
>      +    and store the position into the GDOV chunk, in a mechanism similar to
>      +    the Extra Edges list chunk.
>      +
>      +    For mixed generation number environment (for example new Git on the
>      +    command line, old Git used by GUI client), we can encounter a
>      +    mixed-chain commit-graph (a commit-graph chain where some of split
>      +    commit-graph files have GDAT chunk and others do not). As backward
>      +    compatibility is one of the goals, we can define the following behavior:
>      +
>      +    While reading a mixed-chain commit-graph version, we fall back on
>      +    topological levels as corrected commit dates and topological levels
>      +    cannot be compared directly.
>      +
>      +    When adding new layer to the split commit-graph file, and when merging
>      +    some or all layers (replacing them in the latter case), the new layer
>      +    will have GDAT chunk if and only if in the final result there would be
>      +    no layer without GDAT chunk just below it.

While that is a quality message, v6 has landed in 'next' and I've begun
working off of that version. As Taylor attempted to say [1], this topic
should be considered final and updates should be follow-ups on top.

[1] https://lore.kernel.org/git/[email protected]/

(Of course, if Junio says differently, then listen to him.)

Thanks,
-Stolee

@dscho
Copy link
Member

dscho commented Feb 1, 2021

/submit

Ummm. @abhishekkumar2718 did you miss what I wrote about next?

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 1, 2021

On the Git mailing list, Junio C Hamano wrote (reply to this):

Derrick Stolee <[email protected]> writes:

>>      +    When adding new layer to the split commit-graph file, and when merging
>>      +    some or all layers (replacing them in the latter case), the new layer
>>      +    will have GDAT chunk if and only if in the final result there would be
>>      +    no layer without GDAT chunk just below it.
>
> While that is a quality message, v6 has landed in 'next' and I've begun
> working off of that version. As Taylor attempted to say [1], this topic
> should be considered final and updates should be follow-ups on top.
>
> [1] https://lore.kernel.org/git/[email protected]/

Sounds sensible, modulo s/final/solid enough/ ;-)

I would imagine that the "quality message" has something of value to
keep to help future developers, and if that is the case, a follow-up
patch to add to the Documentation/technical/ would be appropriate.

Thanks all, for a quality series.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 2, 2021

This patch series was integrated into seen via git@deae33e.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 2, 2021

This patch series was integrated into seen via git@7862ac8.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 3, 2021

This patch series was integrated into seen via git@cd2bfee.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 6, 2021

This patch series was integrated into seen via git@e502eab.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 8, 2021

This patch series was integrated into seen via git@5fb6127.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 10, 2021

This patch series was integrated into seen via git@deae33e.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 10, 2021

This patch series was integrated into seen via git@59d4dae.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 10, 2021

This patch series was integrated into seen via git@7943c06.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 11, 2021

This patch series was integrated into seen via git@cf9fbab.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 11, 2021

This patch series was integrated into seen via git@2e3ad28.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 12, 2021

This patch series was integrated into seen via git@0cf0dc9.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 19, 2021

This patch series was integrated into seen via git@8b4701a.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 19, 2021

This patch series was integrated into next via git@8b4701a.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 19, 2021

This patch series was integrated into master via git@8b4701a.

@gitgitgadget gitgitgadget bot added the master label Feb 19, 2021
@gitgitgadget gitgitgadget bot closed this Feb 19, 2021
@gitgitgadget
Copy link

gitgitgadget bot commented Feb 19, 2021

Closed via 8b4701a.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants