Skip to content

[GSoC] Implement Corrected Commit Date #676

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
28 changes: 22 additions & 6 deletions Documentation/technical/commit-graph-format.txt
Original file line number Diff line number Diff line change
Expand Up @@ -4,11 +4,7 @@ Git commit graph format
The Git commit graph stores a list of commit OIDs and some associated
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Sat, Jan 16, 2021 at 06:11:18PM +0000, Abhishek Kumar via GitGitGadget wrote:
> With generation data chunk and corrected commit dates implemented, let's
> update the technical documentation for commit-graph.

This patch should come much earlier in this series, before patch 07/11
(commit-graph: implement corrected commit date), or perhaps even
earlier.  That way if someone were to investigate an issue in this
series and checks out one of its commits, then the specification and
the will be right there under 'Documentation/technical/'.

Furthermore, a patch introducing a new chunk format is the right place
to justify the introduction of said new chunk.  What problems does a
chunk of corrected commit dates solve?  Why does it solve them?  Why
do we need corrected commit dates instead of simple commit dates?
What alternatives were considered [1]?  Any other design considerations
worth mentioning for the benefit of future readers?

None of the patches' log messages properly explain these, and while
much of these is indeed explained in the cover letter, the cover
letter will not be part of the history.  Requiring to look up mailing
list archives for the justification puts unnecessary burden on other
developers who might get interested in this feature in the future.

You might want to take
https://public-inbox.org/git/[email protected]/
as an inspiration.


[1] Please remember the following snippet from SubmittingPatches:
    "Try to make sure your explanation can be understood without
    external resources. Instead of giving a URL to a mailing list
    archive, summarize the relevant points of the discussion."

> Signed-off-by: Abhishek Kumar <[email protected]>
> ---
>  .../technical/commit-graph-format.txt         | 28 +++++--
>  Documentation/technical/commit-graph.txt      | 77 +++++++++++++++----
>  2 files changed, 86 insertions(+), 19 deletions(-)
> 
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> index b3b58880b92..b6658eff188 100644
> --- a/Documentation/technical/commit-graph-format.txt
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -4,11 +4,7 @@ Git commit graph format
>  The Git commit graph stores a list of commit OIDs and some associated
>  metadata, including:
>  
> -- The generation number of the commit. Commits with no parents have
> -  generation number 1; commits with parents have generation number
> -  one more than the maximum generation number of its parents. We
> -  reserve zero as special, and can be used to mark a generation
> -  number invalid or as "not computed".
> +- The generation number of the commit.
>  
>  - The root tree OID.
>  
> @@ -86,13 +82,33 @@ CHUNK DATA:
>        position. If there are more than two parents, the second value
>        has its most-significant bit on and the other bits store an array
>        position into the Extra Edge List chunk.
> -    * The next 8 bytes store the generation number of the commit and
> +    * The next 8 bytes store the topological level (generation number v1)
> +      of the commit and
>        the commit time in seconds since EPOCH. The generation number
>        uses the higher 30 bits of the first 4 bytes, while the commit
>        time uses the 32 bits of the second 4 bytes, along with the lowest
>        2 bits of the lowest byte, storing the 33rd and 34th bit of the
>        commit time.
>  
> +  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
> +    * This list of 4-byte values store corrected commit date offsets for the
> +      commits, arranged in the same order as commit data chunk.
> +    * If the corrected commit date offset cannot be stored within 31 bits,
> +      the value has its most-significant bit on and the other bits store
> +      the position of corrected commit date into the Generation Data Overflow
> +      chunk.
> +    * Generation Data chunk is present only when commit-graph file is written
> +      by compatible versions of Git and in case of split commit-graph chains,
> +      the topmost layer also has Generation Data chunk.
> +
> +  Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
> +    * This list of 8-byte values stores the corrected commit date offsets
> +      for commits with corrected commit date offsets that cannot be
> +      stored within 31 bits.
> +    * Generation Data Overflow chunk is present only when Generation Data
> +      chunk is present and atleast one corrected commit date offset cannot
> +      be stored within 31 bits.
> +
>    Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
>        This list of 4-byte values store the second through nth parents for
>        all octopus merges. The second parent value in the commit data stores
> diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> index f14a7659aa8..f05e7bda1a9 100644
> --- a/Documentation/technical/commit-graph.txt
> +++ b/Documentation/technical/commit-graph.txt
> @@ -38,14 +38,31 @@ A consumer may load the following info for a commit from the graph:
>  
>  Values 1-4 satisfy the requirements of parse_commit_gently().
>  
> -Define the "generation number" of a commit recursively as follows:
> +There are two definitions of generation number:
> +1. Corrected committer dates (generation number v2)
> +2. Topological levels (generation nummber v1)
>  
> - * A commit with no parents (a root commit) has generation number one.
> +Define "corrected committer date" of a commit recursively as follows:
>  
> - * A commit with at least one parent has generation number one more than
> -   the largest generation number among its parents.
> + * A commit with no parents (a root commit) has corrected committer date
> +    equal to its committer date.
>  
> -Equivalently, the generation number of a commit A is one more than the
> + * A commit with at least one parent has corrected committer date equal to
> +    the maximum of its commiter date and one more than the largest corrected
> +    committer date among its parents.
> +
> + * As a special case, a root commit with timestamp zero has corrected commit
> +    date of 1, to be able to distinguish it from GENERATION_NUMBER_ZERO
> +    (that is, an uncomputed corrected commit date).
> +
> +Define the "topological level" of a commit recursively as follows:
> +
> + * A commit with no parents (a root commit) has topological level of one.
> +
> + * A commit with at least one parent has topological level one more than
> +   the largest topological level among its parents.
> +
> +Equivalently, the topological level of a commit A is one more than the
>  length of a longest path from A to a root commit. The recursive definition
>  is easier to use for computation and observing the following property:
>  
> @@ -60,6 +77,9 @@ is easier to use for computation and observing the following property:
>      generation numbers, then we always expand the boundary commit with highest
>      generation number and can easily detect the stopping condition.
>  
> +The property applies to both versions of generation number, that is both
> +corrected committer dates and topological levels.
> +
>  This property can be used to significantly reduce the time it takes to
>  walk commits and determine topological relationships. Without generation
>  numbers, the general heuristic is the following:
> @@ -67,7 +87,9 @@ numbers, the general heuristic is the following:
>      If A and B are commits with commit time X and Y, respectively, and
>      X < Y, then A _probably_ cannot reach B.
>  
> -This heuristic is currently used whenever the computation is allowed to
> +In absence of corrected commit dates (for example, old versions of Git or
> +mixed generation graph chains),
> +this heuristic is currently used whenever the computation is allowed to
>  violate topological relationships due to clock skew (such as "git log"
>  with default order), but is not used when the topological order is
>  required (such as merge base calculations, "git log --graph").
> @@ -77,7 +99,7 @@ in the commit graph. We can treat these commits as having "infinite"
>  generation number and walk until reaching commits with known generation
>  number.
>  
> -We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
> +We use the macro GENERATION_NUMBER_INFINITY to mark commits not
>  in the commit-graph file. If a commit-graph file was written by a version
>  of Git that did not compute generation numbers, then those commits will
>  have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
> @@ -93,12 +115,12 @@ fully-computed generation numbers. Using strict inequality may result in
>  walking a few extra commits, but the simplicity in dealing with commits
>  with generation number *_INFINITY or *_ZERO is valuable.
>  
> -We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
> -generation numbers are computed to be at least this value. We limit at
> -this value since it is the largest value that can be stored in the
> -commit-graph file using the 30 bits available to generation numbers. This
> -presents another case where a commit can have generation number equal to
> -that of a parent.
> +We use the macro GENERATION_NUMBER_V1_MAX = 0x3FFFFFFF for commits whose
> +topological levels (generation number v1) are computed to be at least
> +this value. We limit at this value since it is the largest value that
> +can be stored in the commit-graph file using the 30 bits available
> +to topological levels. This presents another case where a commit can
> +have generation number equal to that of a parent.
>  
>  Design Details
>  --------------
> @@ -267,6 +289,35 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
>  number of commits) could be extracted into config settings for full
>  flexibility.
>  
> +## Handling Mixed Generation Number Chains
> +
> +With the introduction of generation number v2 and generation data chunk, the
> +following scenario is possible:
> +
> +1. "New" Git writes a commit-graph with the corrected commit dates.
> +2. "Old" Git writes a split commit-graph on top without corrected commit dates.
> +
> +A naive approach of using the newest available generation number from
> +each layer would lead to violated expectations: the lower layer would
> +use corrected commit dates which are much larger than the topological
> +levels of the higher layer. For this reason, Git inspects the topmost
> +layer to see if the layer is missing corrected commit dates. In such a case
> +Git only uses topological level for generation numbers.
> +
> +When writing a new layer in split commit-graph, we write corrected commit
> +dates if the topmost layer has corrected commit dates written. This
> +guarantees that if a layer has corrected commit dates, all lower layers
> +must have corrected commit dates as well.
> +
> +When merging layers, we do not consider whether the merged layers had corrected
> +commit dates. Instead, the new layer will have corrected commit dates if the
> +layer below the new layer has corrected commit dates.
> +
> +While writing or merging layers, if the new layer is the only layer, it will
> +have corrected commit dates when written by compatible versions of Git. Thus,
> +rewriting split commit-graph as a single file (`--split=replace`) creates a
> +single layer with corrected commit dates.
> +
>  ## Deleting graph-{hash} files
>  
>  After a new tip file is written, some `graph-{hash}` files may no longer
> -- 
> gitgitgadget

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Abhishek Kumar wrote (reply to this):

On Wed, Jan 27, 2021 at 01:04:54AM +0100, SZEDER G�bor wrote:
> On Sat, Jan 16, 2021 at 06:11:18PM +0000, Abhishek Kumar via GitGitGadget wrote:
> > With generation data chunk and corrected commit dates implemented, let's
> > update the technical documentation for commit-graph.
> 
> This patch should come much earlier in this series, before patch 07/11
> (commit-graph: implement corrected commit date), or perhaps even
> earlier.  That way if someone were to investigate an issue in this
> series and checks out one of its commits, then the specification and
> the will be right there under 'Documentation/technical/'.
> 
> Furthermore, a patch introducing a new chunk format is the right place
> to justify the introduction of said new chunk.  What problems does a
> chunk of corrected commit dates solve?  Why does it solve them?  Why
> do we need corrected commit dates instead of simple commit dates?
> What alternatives were considered [1]?  Any other design considerations
> worth mentioning for the benefit of future readers?
> 
> None of the patches' log messages properly explain these, and while
> much of these is indeed explained in the cover letter, the cover
> letter will not be part of the history.  Requiring to look up mailing
> list archives for the justification puts unnecessary burden on other
> developers who might get interested in this feature in the future.
> 
> You might want to take
> https://public-inbox.org/git/[email protected]/
> as an inspiration.
> 

Alright, the suggestion makes a lot of sense and the patch introducing
documentation is the perfect place to justify the introduction of new
chunk format.

> 
> [1] Please remember the following snippet from SubmittingPatches:
>     "Try to make sure your explanation can be understood without
>     external resources. Instead of giving a URL to a mailing list
>     archive, summarize the relevant points of the discussion."
> 
> > Signed-off-by: Abhishek Kumar <[email protected]>
> > ---
> >  .../technical/commit-graph-format.txt         | 28 +++++--
> >  Documentation/technical/commit-graph.txt      | 77 +++++++++++++++----
> >  2 files changed, 86 insertions(+), 19 deletions(-)
> > 
> > diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> > index b3b58880b92..b6658eff188 100644
> > --- a/Documentation/technical/commit-graph-format.txt
> > +++ b/Documentation/technical/commit-graph-format.txt
> > @@ -4,11 +4,7 @@ Git commit graph format
> >  The Git commit graph stores a list of commit OIDs and some associated
> >  metadata, including:
> >  
> > -- The generation number of the commit. Commits with no parents have
> > -  generation number 1; commits with parents have generation number
> > -  one more than the maximum generation number of its parents. We
> > -  reserve zero as special, and can be used to mark a generation
> > -  number invalid or as "not computed".
> > +- The generation number of the commit.
> >  
> >  - The root tree OID.
> >  
> > @@ -86,13 +82,33 @@ CHUNK DATA:
> >        position. If there are more than two parents, the second value
> >        has its most-significant bit on and the other bits store an array
> >        position into the Extra Edge List chunk.
> > -    * The next 8 bytes store the generation number of the commit and
> > +    * The next 8 bytes store the topological level (generation number v1)
> > +      of the commit and
> >        the commit time in seconds since EPOCH. The generation number
> >        uses the higher 30 bits of the first 4 bytes, while the commit
> >        time uses the 32 bits of the second 4 bytes, along with the lowest
> >        2 bits of the lowest byte, storing the 33rd and 34th bit of the
> >        commit time.
> >  
> > +  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
> > +    * This list of 4-byte values store corrected commit date offsets for the
> > +      commits, arranged in the same order as commit data chunk.
> > +    * If the corrected commit date offset cannot be stored within 31 bits,
> > +      the value has its most-significant bit on and the other bits store
> > +      the position of corrected commit date into the Generation Data Overflow
> > +      chunk.
> > +    * Generation Data chunk is present only when commit-graph file is written
> > +      by compatible versions of Git and in case of split commit-graph chains,
> > +      the topmost layer also has Generation Data chunk.
> > +
> > +  Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
> > +    * This list of 8-byte values stores the corrected commit date offsets
> > +      for commits with corrected commit date offsets that cannot be
> > +      stored within 31 bits.
> > +    * Generation Data Overflow chunk is present only when Generation Data
> > +      chunk is present and atleast one corrected commit date offset cannot
> > +      be stored within 31 bits.
> > +
> >    Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
> >        This list of 4-byte values store the second through nth parents for
> >        all octopus merges. The second parent value in the commit data stores
> > diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> > index f14a7659aa8..f05e7bda1a9 100644
> > --- a/Documentation/technical/commit-graph.txt
> > +++ b/Documentation/technical/commit-graph.txt
> > @@ -38,14 +38,31 @@ A consumer may load the following info for a commit from the graph:
> >  
> >  Values 1-4 satisfy the requirements of parse_commit_gently().
> >  
> > -Define the "generation number" of a commit recursively as follows:
> > +There are two definitions of generation number:
> > +1. Corrected committer dates (generation number v2)
> > +2. Topological levels (generation nummber v1)
> >  
> > - * A commit with no parents (a root commit) has generation number one.
> > +Define "corrected committer date" of a commit recursively as follows:
> >  
> > - * A commit with at least one parent has generation number one more than
> > -   the largest generation number among its parents.
> > + * A commit with no parents (a root commit) has corrected committer date
> > +    equal to its committer date.
> >  
> > -Equivalently, the generation number of a commit A is one more than the
> > + * A commit with at least one parent has corrected committer date equal to
> > +    the maximum of its commiter date and one more than the largest corrected
> > +    committer date among its parents.
> > +
> > + * As a special case, a root commit with timestamp zero has corrected commit
> > +    date of 1, to be able to distinguish it from GENERATION_NUMBER_ZERO
> > +    (that is, an uncomputed corrected commit date).
> > +
> > +Define the "topological level" of a commit recursively as follows:
> > +
> > + * A commit with no parents (a root commit) has topological level of one.
> > +
> > + * A commit with at least one parent has topological level one more than
> > +   the largest topological level among its parents.
> > +
> > +Equivalently, the topological level of a commit A is one more than the
> >  length of a longest path from A to a root commit. The recursive definition
> >  is easier to use for computation and observing the following property:
> >  
> > @@ -60,6 +77,9 @@ is easier to use for computation and observing the following property:
> >      generation numbers, then we always expand the boundary commit with highest
> >      generation number and can easily detect the stopping condition.
> >  
> > +The property applies to both versions of generation number, that is both
> > +corrected committer dates and topological levels.
> > +
> >  This property can be used to significantly reduce the time it takes to
> >  walk commits and determine topological relationships. Without generation
> >  numbers, the general heuristic is the following:
> > @@ -67,7 +87,9 @@ numbers, the general heuristic is the following:
> >      If A and B are commits with commit time X and Y, respectively, and
> >      X < Y, then A _probably_ cannot reach B.
> >  
> > -This heuristic is currently used whenever the computation is allowed to
> > +In absence of corrected commit dates (for example, old versions of Git or
> > +mixed generation graph chains),
> > +this heuristic is currently used whenever the computation is allowed to
> >  violate topological relationships due to clock skew (such as "git log"
> >  with default order), but is not used when the topological order is
> >  required (such as merge base calculations, "git log --graph").
> > @@ -77,7 +99,7 @@ in the commit graph. We can treat these commits as having "infinite"
> >  generation number and walk until reaching commits with known generation
> >  number.
> >  
> > -We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
> > +We use the macro GENERATION_NUMBER_INFINITY to mark commits not
> >  in the commit-graph file. If a commit-graph file was written by a version
> >  of Git that did not compute generation numbers, then those commits will
> >  have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
> > @@ -93,12 +115,12 @@ fully-computed generation numbers. Using strict inequality may result in
> >  walking a few extra commits, but the simplicity in dealing with commits
> >  with generation number *_INFINITY or *_ZERO is valuable.
> >  
> > -We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
> > -generation numbers are computed to be at least this value. We limit at
> > -this value since it is the largest value that can be stored in the
> > -commit-graph file using the 30 bits available to generation numbers. This
> > -presents another case where a commit can have generation number equal to
> > -that of a parent.
> > +We use the macro GENERATION_NUMBER_V1_MAX = 0x3FFFFFFF for commits whose
> > +topological levels (generation number v1) are computed to be at least
> > +this value. We limit at this value since it is the largest value that
> > +can be stored in the commit-graph file using the 30 bits available
> > +to topological levels. This presents another case where a commit can
> > +have generation number equal to that of a parent.
> >  
> >  Design Details
> >  --------------
> > @@ -267,6 +289,35 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
> >  number of commits) could be extracted into config settings for full
> >  flexibility.
> >  
> > +## Handling Mixed Generation Number Chains
> > +
> > +With the introduction of generation number v2 and generation data chunk, the
> > +following scenario is possible:
> > +
> > +1. "New" Git writes a commit-graph with the corrected commit dates.
> > +2. "Old" Git writes a split commit-graph on top without corrected commit dates.
> > +
> > +A naive approach of using the newest available generation number from
> > +each layer would lead to violated expectations: the lower layer would
> > +use corrected commit dates which are much larger than the topological
> > +levels of the higher layer. For this reason, Git inspects the topmost
> > +layer to see if the layer is missing corrected commit dates. In such a case
> > +Git only uses topological level for generation numbers.
> > +
> > +When writing a new layer in split commit-graph, we write corrected commit
> > +dates if the topmost layer has corrected commit dates written. This
> > +guarantees that if a layer has corrected commit dates, all lower layers
> > +must have corrected commit dates as well.
> > +
> > +When merging layers, we do not consider whether the merged layers had corrected
> > +commit dates. Instead, the new layer will have corrected commit dates if the
> > +layer below the new layer has corrected commit dates.
> > +
> > +While writing or merging layers, if the new layer is the only layer, it will
> > +have corrected commit dates when written by compatible versions of Git. Thus,
> > +rewriting split commit-graph as a single file (`--split=replace`) creates a
> > +single layer with corrected commit dates.
> > +
> >  ## Deleting graph-{hash} files
> >  
> >  After a new tip file is written, some `graph-{hash}` files may no longer
> > -- 
> > gitgitgadget

Thanks
- Abhishek

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

On Sat, Jan 30, 2021 at 10:59:05AM +0530, Abhishek Kumar wrote:
> > You might want to take
> > https://public-inbox.org/git/[email protected]/
> > as an inspiration.
> >
> Alright, the suggestion makes a lot of sense and the patch introducing
> documentation is the perfect place to justify the introduction of new
> chunk format.

I don't have any strong feelings about Gábor's suggestion itself, but
note that there isn't any work for you to do in this series, since the
patches are on track to be merged to master.

Thanks,
Taylor

metadata, including:

- The generation number of the commit. Commits with no parents have
generation number 1; commits with parents have generation number
one more than the maximum generation number of its parents. We
reserve zero as special, and can be used to mark a generation
number invalid or as "not computed".
- The generation number of the commit.

- The root tree OID.

Expand Down Expand Up @@ -86,13 +82,33 @@ CHUNK DATA:
position. If there are more than two parents, the second value
has its most-significant bit on and the other bits store an array
position into the Extra Edge List chunk.
* The next 8 bytes store the generation number of the commit and
* The next 8 bytes store the topological level (generation number v1)
of the commit and
the commit time in seconds since EPOCH. The generation number
uses the higher 30 bits of the first 4 bytes, while the commit
time uses the 32 bits of the second 4 bytes, along with the lowest
2 bits of the lowest byte, storing the 33rd and 34th bit of the
commit time.

Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
* This list of 4-byte values store corrected commit date offsets for the
commits, arranged in the same order as commit data chunk.
* If the corrected commit date offset cannot be stored within 31 bits,
the value has its most-significant bit on and the other bits store
the position of corrected commit date into the Generation Data Overflow
chunk.
* Generation Data chunk is present only when commit-graph file is written
by compatible versions of Git and in case of split commit-graph chains,
the topmost layer also has Generation Data chunk.

Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
* This list of 8-byte values stores the corrected commit date offsets
for commits with corrected commit date offsets that cannot be
stored within 31 bits.
* Generation Data Overflow chunk is present only when Generation Data
chunk is present and atleast one corrected commit date offset cannot
be stored within 31 bits.

Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
This list of 4-byte values store the second through nth parents for
all octopus merges. The second parent value in the commit data stores
Expand Down
77 changes: 64 additions & 13 deletions Documentation/technical/commit-graph.txt
Original file line number Diff line number Diff line change
Expand Up @@ -38,14 +38,31 @@ A consumer may load the following info for a commit from the graph:

Values 1-4 satisfy the requirements of parse_commit_gently().

Define the "generation number" of a commit recursively as follows:
There are two definitions of generation number:
1. Corrected committer dates (generation number v2)
2. Topological levels (generation nummber v1)

* A commit with no parents (a root commit) has generation number one.
Define "corrected committer date" of a commit recursively as follows:

* A commit with at least one parent has generation number one more than
the largest generation number among its parents.
* A commit with no parents (a root commit) has corrected committer date
equal to its committer date.

Equivalently, the generation number of a commit A is one more than the
* A commit with at least one parent has corrected committer date equal to
the maximum of its commiter date and one more than the largest corrected
committer date among its parents.

* As a special case, a root commit with timestamp zero has corrected commit
date of 1, to be able to distinguish it from GENERATION_NUMBER_ZERO
(that is, an uncomputed corrected commit date).

Define the "topological level" of a commit recursively as follows:

* A commit with no parents (a root commit) has topological level of one.

* A commit with at least one parent has topological level one more than
the largest topological level among its parents.

Equivalently, the topological level of a commit A is one more than the
length of a longest path from A to a root commit. The recursive definition
is easier to use for computation and observing the following property:

Expand All @@ -60,14 +77,19 @@ is easier to use for computation and observing the following property:
generation numbers, then we always expand the boundary commit with highest
generation number and can easily detect the stopping condition.

The property applies to both versions of generation number, that is both
corrected committer dates and topological levels.

This property can be used to significantly reduce the time it takes to
walk commits and determine topological relationships. Without generation
numbers, the general heuristic is the following:

If A and B are commits with commit time X and Y, respectively, and
X < Y, then A _probably_ cannot reach B.

This heuristic is currently used whenever the computation is allowed to
In absence of corrected commit dates (for example, old versions of Git or
mixed generation graph chains),
this heuristic is currently used whenever the computation is allowed to
violate topological relationships due to clock skew (such as "git log"
with default order), but is not used when the topological order is
required (such as merge base calculations, "git log --graph").
Expand All @@ -77,7 +99,7 @@ in the commit graph. We can treat these commits as having "infinite"
generation number and walk until reaching commits with known generation
number.

We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
We use the macro GENERATION_NUMBER_INFINITY to mark commits not
in the commit-graph file. If a commit-graph file was written by a version
of Git that did not compute generation numbers, then those commits will
have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
Expand All @@ -93,12 +115,12 @@ fully-computed generation numbers. Using strict inequality may result in
walking a few extra commits, but the simplicity in dealing with commits
with generation number *_INFINITY or *_ZERO is valuable.

We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
generation numbers are computed to be at least this value. We limit at
this value since it is the largest value that can be stored in the
commit-graph file using the 30 bits available to generation numbers. This
presents another case where a commit can have generation number equal to
that of a parent.
We use the macro GENERATION_NUMBER_V1_MAX = 0x3FFFFFFF for commits whose
topological levels (generation number v1) are computed to be at least
this value. We limit at this value since it is the largest value that
can be stored in the commit-graph file using the 30 bits available
to topological levels. This presents another case where a commit can
have generation number equal to that of a parent.

Design Details
--------------
Expand Down Expand Up @@ -267,6 +289,35 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
number of commits) could be extracted into config settings for full
flexibility.

## Handling Mixed Generation Number Chains

With the introduction of generation number v2 and generation data chunk, the
following scenario is possible:

1. "New" Git writes a commit-graph with the corrected commit dates.
2. "Old" Git writes a split commit-graph on top without corrected commit dates.

A naive approach of using the newest available generation number from
each layer would lead to violated expectations: the lower layer would
use corrected commit dates which are much larger than the topological
levels of the higher layer. For this reason, Git inspects the topmost
layer to see if the layer is missing corrected commit dates. In such a case
Git only uses topological level for generation numbers.

When writing a new layer in split commit-graph, we write corrected commit
dates if the topmost layer has corrected commit dates written. This
guarantees that if a layer has corrected commit dates, all lower layers
must have corrected commit dates as well.

When merging layers, we do not consider whether the merged layers had corrected
commit dates. Instead, the new layer will have corrected commit dates if the
layer below the new layer has corrected commit dates.

While writing or merging layers, if the new layer is the only layer, it will
have corrected commit dates when written by compatible versions of Git. Thus,
rewriting split commit-graph as a single file (`--split=replace`) creates a
single layer with corrected commit dates.

## Deleting graph-{hash} files

After a new tip file is written, some `graph-{hash}` files may no longer
Expand Down
Loading