Commit 8647b5d

commit-graph: document generation number v2
Git uses topological levels in the commit-graph file for commit-graph
traversal operations like 'git log --graph'. Unfortunately, topological
levels can perform worse than committer date when parents of a commit
differ greatly in generation numbers [1]. For example, 'git merge-base
v4.8 v4.9' on the Linux repository walks 635,579 commits using
topological levels and walks 167,468 using committer date. Since
091f4cf (commit: don't use generation numbers if not needed,
2018-08-30), 'git merge-base' uses committer date heuristic unless there
is a cutoff because of the performance hit.

[1] https://lore.kernel.org/git/efa3720fb40638e5d61c6130b55e3348d8e4339e.1535633886.git.gitgitgadget@gmail.com/

Thus, the need for generation number v2 was born. As Git used to die
when graph version understood by it and in the commit-graph file are
different [2], we needed a way to distinguish between the old and new
generation number without incrementing the graph version.

[2] https://lore.kernel.org/git/[email protected]/

The following candidates were proposed
(https://github.com/derrickstolee/gen-test, #1):

- (Epoch, Date) Pairs.
- Maximum Generation Numbers.
- Corrected Commit Date.
- FELINE Index.
- Corrected Commit Date with Monotonically Increasing Offsets.

Based on performance, local computability, and immutability (along with
the introduction of an additional commit-graph chunk which relieved the
requirement of backwards-compatibility) Corrected Commit Date was chosen
as generation number v2 and is defined as follows:

For a commit C, let its corrected commit date be the maximum of the
commit date of C and the corrected commit dates of its parents plus 1.
Then corrected commit date offset is the difference between corrected
commit date of C and commit date of C. As a special case, a root commit
with the timestamp zero has corrected commit date of 1 to distinguish it
from GENERATION_NUMBER_ZERO (that is, an uncomputed generation number).

While it was proposed initially to store corrected commit date offsets
within Commit Data Chunk, storing the offsets in a new chunk did not
affect the performance measurably. The new chunk is "Generation DATa
(GDAT) chunk" and it stores corrected commit date offsets while CDAT
chunk stores topological level. The old versions of Git would ignore
GDAT chunk, using topological levels from CDAT chunk. In contrast, new
versions of Git would use corrected commit dates, falling back to
topological level if the generation data chunk is absent in the
commit-graph file.

While storing corrected commit date offsets saves us 4 bytes per commit
(as compared with storing corrected commit dates directly), it's however
possible for the offset to overflow the space allocated. To handle such
cases, we introduce a new chunk, _Generation Data Overflow_ (GDOV) that
stores the corrected commit date. For overflowing offsets, we set MSB
and store the position into the GDOV chunk, in a mechanism similar to
the Extra Edges list chunk.

For mixed generation number environment (for example new Git on the
command line, old Git used by GUI client), we can encounter a
mixed-chain commit-graph (a commit-graph chain where some of split
commit-graph files have GDAT chunk and others do not). As backward
compatibility is one of the goals, we can define the following behavior:

While reading a mixed-chain commit-graph version, we fall back on
topological levels as corrected commit dates and topological levels
cannot be compared directly.

When adding new layer to the split commit-graph file, and when merging
some or all layers (replacing them in the latter case), the new layer
will have GDAT chunk if and only if in the final result there would be
no layer without GDAT chunk just below it.

Signed-off-by: Abhishek Kumar <[email protected]>
1 parent ea32cba commit 8647b5d
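The definition of corrected commit date in the commit message can be tried out on a toy history. The following is only a minimal sketch, not Git's implementation: the function name `corrected_commit_dates`, the dictionary-based graph, and the sample timestamps are all made up for illustration, and the recursion is only suitable for small graphs.

```python
from typing import Dict, List


def corrected_commit_dates(parents: Dict[str, List[str]],
                           commit_date: Dict[str, int]) -> Dict[str, int]:
    """Return the corrected commit date of every commit in a small DAG.

    `parents` maps a commit name to its parent names and `commit_date`
    maps a commit name to its committer timestamp (both hypothetical).
    """
    corrected: Dict[str, int] = {}

    def visit(commit: str) -> int:
        if commit not in corrected:
            # Start from the commit's own date, then make sure the value
            # is strictly greater than every parent's corrected date.
            value = commit_date[commit]
            for parent in parents[commit]:
                value = max(value, visit(parent) + 1)
            # Special case from the commit message: a root commit with
            # timestamp zero gets 1, so it differs from "not computed".
            if not parents[commit] and value == 0:
                value = 1
            corrected[commit] = value
        return corrected[commit]

    for commit in parents:
        visit(commit)
    return corrected


# Hypothetical history: B was committed with a clock behind its parent A.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"], "E": ["C", "D"]}
commit_date = {"A": 100, "B": 90, "C": 200, "D": 95, "E": 150}

corrected = corrected_commit_dates(parents, commit_date)
# The offset is what the commit message proposes to store per commit.
offsets = {c: corrected[c] - commit_date[c] for c in parents}
print(corrected)
print(offsets)
```

In this example the skewed commit B ends up with corrected date 101 (one more than A), so its stored offset would be 11, while commits whose own date already exceeds their parents' corrected dates have offset 0.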

2 files changed, 86 insertions(+), 19 deletions(-)

Documentation/technical/commit-graph-format.txt

Lines changed: 22 additions & 6 deletions
@@ -4,11 +4,7 @@ Git commit graph format
 The Git commit graph stores a list of commit OIDs and some associated
 metadata, including:
 
-- The generation number of the commit. Commits with no parents have
-  generation number 1; commits with parents have generation number
-  one more than the maximum generation number of its parents. We
-  reserve zero as special, and can be used to mark a generation
-  number invalid or as "not computed".
+- The generation number of the commit.
 
 - The root tree OID.
 
@@ -86,13 +82,33 @@ CHUNK DATA:
       position. If there are more than two parents, the second value
       has its most-significant bit on and the other bits store an array
       position into the Extra Edge List chunk.
-    * The next 8 bytes store the generation number of the commit and
+    * The next 8 bytes store the topological level (generation number v1)
+      of the commit and
       the commit time in seconds since EPOCH. The generation number
       uses the higher 30 bits of the first 4 bytes, while the commit
       time uses the 32 bits of the second 4 bytes, along with the lowest
       2 bits of the lowest byte, storing the 33rd and 34th bit of the
       commit time.
 
+  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
+    * This list of 4-byte values stores corrected commit date offsets for the
+      commits, arranged in the same order as commit data chunk.
+    * If the corrected commit date offset cannot be stored within 31 bits,
+      the value has its most-significant bit on and the other bits store
+      the position of corrected commit date into the Generation Data Overflow
+      chunk.
+    * Generation Data chunk is present only when commit-graph file is written
+      by compatible versions of Git and in case of split commit-graph chains,
+      the topmost layer also has Generation Data chunk.
+
+  Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
+    * This list of 8-byte values stores the corrected commit date offsets
+      for commits with corrected commit date offsets that cannot be
+      stored within 31 bits.
+    * Generation Data Overflow chunk is present only when Generation Data
+      chunk is present and at least one corrected commit date offset cannot
+      be stored within 31 bits.
+
   Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
       This list of 4-byte values store the second through nth parents for
       all octopus merges. The second parent value in the commit data stores
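The bit layouts described in this hunk can be sketched in a few lines of Python. This is an illustrative decoder, not Git's reader: the function names are hypothetical, the byte strings are hand-built rather than read from a real commit-graph file, and it assumes big-endian (network byte order) integers as the wider format document specifies. It follows the format text above in treating Generation Data Overflow entries as 8-byte offsets.

```python
import struct

MSB = 0x80000000          # most-significant bit of a 4-byte GDAT value
LOW_31_BITS = 0x7FFFFFFF  # remaining bits: offset or GDOV position


def split_cdat_generation_field(field: bytes):
    """Split the 8-byte CDAT field into (topological_level, commit_time)."""
    first, second = struct.unpack(">II", field)
    level = first >> 2                            # higher 30 bits
    commit_time = ((first & 0x3) << 32) | second  # bits 33-34, then low 32
    return level, commit_time


def corrected_commit_date(index: int, commit_time: int,
                          gdat: bytes, gdov: bytes = b"") -> int:
    """Rebuild a corrected commit date from GDAT (and GDOV if needed)."""
    (value,) = struct.unpack_from(">I", gdat, 4 * index)
    if value & MSB:
        # Offset did not fit in 31 bits: the low bits are a position
        # into the overflow chunk, which holds 8-byte offsets.
        (offset,) = struct.unpack_from(">Q", gdov, 8 * (value & LOW_31_BITS))
    else:
        offset = value
    return commit_time + offset


# Hand-built sample data (hypothetical values).
cdat_field = struct.pack(">II", (5 << 2) | 1, 7)
gdat = struct.pack(">II", 11, MSB | 0)
gdov = struct.pack(">Q", 2**31 + 5)

print(split_cdat_generation_field(cdat_field))
print(corrected_commit_date(0, 100, gdat, gdov))
print(corrected_commit_date(1, 100, gdat, gdov))
```

The second GDAT entry has its most-significant bit set, so the decoder looks up position 0 in the overflow data instead of reading the offset inline, mirroring how the Extra Edge List is addressed from the commit data.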

Documentation/technical/commit-graph.txt

Lines changed: 64 additions & 13 deletions
@@ -38,14 +38,31 @@ A consumer may load the following info for a commit from the graph:
 
 Values 1-4 satisfy the requirements of parse_commit_gently().
 
-Define the "generation number" of a commit recursively as follows:
+There are two definitions of generation number:
+1. Corrected committer dates (generation number v2)
+2. Topological levels (generation number v1)
 
- * A commit with no parents (a root commit) has generation number one.
+Define "corrected committer date" of a commit recursively as follows:
 
- * A commit with at least one parent has generation number one more than
-   the largest generation number among its parents.
+ * A commit with no parents (a root commit) has corrected committer date
+   equal to its committer date.
 
-Equivalently, the generation number of a commit A is one more than the
+ * A commit with at least one parent has corrected committer date equal to
+   the maximum of its committer date and one more than the largest corrected
+   committer date among its parents.
+
+ * As a special case, a root commit with timestamp zero has corrected commit
+   date of 1, to be able to distinguish it from GENERATION_NUMBER_ZERO
+   (that is, an uncomputed corrected commit date).
+
+Define the "topological level" of a commit recursively as follows:
+
+ * A commit with no parents (a root commit) has topological level of one.
+
+ * A commit with at least one parent has topological level one more than
+   the largest topological level among its parents.
+
+Equivalently, the topological level of a commit A is one more than the
 length of a longest path from A to a root commit. The recursive definition
 is easier to use for computation and observing the following property:
 
@@ -60,14 +77,19 @@ is easier to use for computation and observing the following property:
 generation numbers, then we always expand the boundary commit with highest
 generation number and can easily detect the stopping condition.
 
+The property applies to both versions of generation number, that is both
+corrected committer dates and topological levels.
+
 This property can be used to significantly reduce the time it takes to
 walk commits and determine topological relationships. Without generation
 numbers, the general heuristic is the following:
 
     If A and B are commits with commit time X and Y, respectively, and
     X < Y, then A _probably_ cannot reach B.
 
-This heuristic is currently used whenever the computation is allowed to
+In absence of corrected commit dates (for example, old versions of Git or
+mixed generation graph chains),
+this heuristic is currently used whenever the computation is allowed to
 violate topological relationships due to clock skew (such as "git log"
 with default order), but is not used when the topological order is
 required (such as merge base calculations, "git log --graph").
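The way a generation number shortens a commit walk can be sketched as follows. The diff context elides the exact statement of the property, so this example assumes the usual form: a commit can only reach commits whose generation number is strictly smaller than its own. The function `can_reach`, the dictionary-based graph, and the sample history are hypothetical and are not Git's actual traversal code; topological levels serve as the generation number here, but corrected commit dates would work the same way.

```python
from typing import Dict, List, Tuple


def can_reach(start: str, target: str,
              parents: Dict[str, List[str]],
              generation: Dict[str, int]) -> Tuple[bool, int]:
    """Return (reachable, commits_walked) for a walk from start to target."""
    cutoff = generation[target]
    stack = [start]
    seen = {start}
    walked = 0
    while stack:
        commit = stack.pop()
        walked += 1
        if commit == target:
            return True, walked
        for parent in parents[commit]:
            if parent in seen:
                continue
            # A parent below the cutoff cannot reach the target, so the
            # walk never needs to descend into it.
            if generation[parent] < cutoff:
                continue
            seen.add(parent)
            stack.append(parent)
    return False, walked


# Hypothetical history: a chain c0 <- c1 <- ... <- c9 plus a side commit
# "s" whose only parent is c8.
parents = {"c0": []}
generation = {"c0": 1}
for i in range(1, 10):
    parents[f"c{i}"] = [f"c{i - 1}"]
    generation[f"c{i}"] = i + 1
parents["s"] = ["c8"]
generation["s"] = generation["c8"] + 1

print(can_reach("c9", "s", parents, generation))
print(can_reach("c9", "c5", parents, generation))
```

Asking whether c9 reaches the side commit stops after looking at c9 alone, because its only parent already sits below the cutoff; without generation numbers the walk would have to visit the whole chain down to the root before answering no.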
@@ -77,7 +99,7 @@ in the commit graph. We can treat these commits as having "infinite"
 generation number and walk until reaching commits with known generation
 number.
 
-We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
+We use the macro GENERATION_NUMBER_INFINITY to mark commits not
 in the commit-graph file. If a commit-graph file was written by a version
 of Git that did not compute generation numbers, then those commits will
 have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
@@ -93,12 +115,12 @@ fully-computed generation numbers. Using strict inequality may result in
 walking a few extra commits, but the simplicity in dealing with commits
 with generation number *_INFINITY or *_ZERO is valuable.
 
-We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
-generation numbers are computed to be at least this value. We limit at
-this value since it is the largest value that can be stored in the
-commit-graph file using the 30 bits available to generation numbers. This
-presents another case where a commit can have generation number equal to
-that of a parent.
+We use the macro GENERATION_NUMBER_V1_MAX = 0x3FFFFFFF for commits whose
+topological levels (generation number v1) are computed to be at least
+this value. We limit at this value since it is the largest value that
+can be stored in the commit-graph file using the 30 bits available
+to topological levels. This presents another case where a commit can
+have generation number equal to that of a parent.
 
 Design Details
 --------------
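The clamping described in the hunk above (the GENERATION_NUMBER_V1_MAX paragraph) can be illustrated with a small sketch. The constant's value comes from the diff; the function `topological_level` and its `max_level` parameter are hypothetical, and the tiny cap of 3 in the example is only there to make the effect visible without building a chain of a billion commits.

```python
from typing import List

GENERATION_NUMBER_V1_MAX = 0x3FFFFFFF  # largest value that fits in 30 bits


def topological_level(parent_levels: List[int],
                      max_level: int = GENERATION_NUMBER_V1_MAX) -> int:
    """Topological level of a commit, given the levels of its parents."""
    if not parent_levels:
        return 1  # root commit
    # One more than the largest parent level, clamped to the cap.
    return min(max(parent_levels) + 1, max_level)


# Walk a linear chain of five commits with an artificially small cap.
levels = []
for _ in range(5):
    levels.append(topological_level(levels[-1:], max_level=3))
print(levels)
```

Once the cap is reached, a child gets the same level as its parent, which is the "equal to that of a parent" case the text warns about and one reason comparisons between levels cannot always rely on a strict increase from parent to child.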
@@ -267,6 +289,35 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
 number of commits) could be extracted into config settings for full
 flexibility.
 
+## Handling Mixed Generation Number Chains
+
+With the introduction of generation number v2 and generation data chunk, the
+following scenario is possible:
+
+1. "New" Git writes a commit-graph with the corrected commit dates.
+2. "Old" Git writes a split commit-graph on top without corrected commit dates.
+
+A naive approach of using the newest available generation number from
+each layer would lead to violated expectations: the lower layer would
+use corrected commit dates which are much larger than the topological
+levels of the higher layer. For this reason, Git inspects the topmost
+layer to see if the layer is missing corrected commit dates. In such a case
+Git only uses topological level for generation numbers.
+
+When writing a new layer in split commit-graph, we write corrected commit
+dates if the topmost layer has corrected commit dates written. This
+guarantees that if a layer has corrected commit dates, all lower layers
+must have corrected commit dates as well.
+
+When merging layers, we do not consider whether the merged layers had corrected
+commit dates. Instead, the new layer will have corrected commit dates if the
+layer below the new layer has corrected commit dates.
+
+While writing or merging layers, if the new layer is the only layer, it will
+have corrected commit dates when written by compatible versions of Git. Thus,
+rewriting split commit-graph as a single file (`--split=replace`) creates a
+single layer with corrected commit dates.
+
 ## Deleting graph-{hash} files
 
 After a new tip file is written, some `graph-{hash}` files may no longer
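The rules in the "Handling Mixed Generation Number Chains" section added by this hunk can be modelled with a list of booleans, one per layer from bottom to top, where `True` means the layer carries corrected commit dates. This is a simplified sketch of the documented behaviour only: the function names and the `merge_count` parameter are hypothetical and do not correspond to Git's internal API.

```python
from typing import List


def use_corrected_commit_dates(layers: List[bool]) -> bool:
    """Read rule: inspect only the topmost layer of the chain."""
    return bool(layers) and layers[-1]


def new_git_write_layer(layers: List[bool], merge_count: int = 0) -> List[bool]:
    """Chain after a compatible Git writes a layer on top.

    The top `merge_count` existing layers are merged into (replaced by)
    the new layer; whether they had corrected dates is not considered.
    """
    merge_count = min(merge_count, len(layers))
    kept = layers[:len(layers) - merge_count]
    # The new layer gets corrected commit dates if the layer just below
    # it has them, or if it ends up being the only layer.
    new_has_dates = kept[-1] if kept else True
    return kept + [new_has_dates]


def old_git_write_layer(layers: List[bool]) -> List[bool]:
    """Chain after an old Git adds a layer without corrected dates."""
    return layers + [False]


chain = new_git_write_layer([])            # single layer with dates
chain = old_git_write_layer(chain)         # old Git adds a layer on top
print(chain, use_corrected_commit_dates(chain))

stacked = new_git_write_layer(chain)       # new layer on a mixed chain
merged = new_git_write_layer(chain, merge_count=1)  # replace old layer
print(stacked, merged, use_corrected_commit_dates(merged))
```

Writing on top of a mixed chain keeps producing layers without corrected dates, while merging away the old layer (or replacing the whole chain, the `--split=replace` case) restores a chain in which every layer has them, so checking the topmost layer is enough when reading.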
