-
Notifications
You must be signed in to change notification settings - Fork 142
Fix REF_DELTA chain bug in 'git index-pack' #1906
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -821,6 +821,7 @@ TEST_BUILTINS_OBJS += test-mergesort.o | |
TEST_BUILTINS_OBJS += test-mktemp.o | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the Git mailing list, Junio C Hamano wrote (reply to this): I needed this to make
$ SANITIZE=leak GIT_TEST_PASSING_SANITIZE_LEAK=true make
$ cd t && sh t5309-pack-delta-cycles.sh
pass.
--- >8 ------ >8 ------ >8 ---
Subject: [PATCH] fixup! test-tool: add pack-deltas helper
t/helper/test-pack-deltas.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/t/helper/test-pack-deltas.c b/t/helper/test-pack-deltas.c
index db7d1c3cd1..c8e837ea06 100644
--- a/t/helper/test-pack-deltas.c
+++ b/t/helper/test-pack-deltas.c
@@ -122,6 +122,7 @@ int cmd__pack_deltas(int argc, const char **argv)
if (get_oid_hex(base_oid_str, &base_oid))
die("invalid object: %s", base_oid_str);
}
+ string_list_clear(&items, 0);
if (!strcmp(type_str, "REF_DELTA"))
write_ref_delta(f, &content_oid, &base_oid);
--
2.49.0-555-g43235db9c8
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the Git mailing list, Derrick Stolee wrote (reply to this): On 4/24/2025 3:41 PM, Junio C Hamano wrote:
> I needed this to make
>
> $ SANITIZE=leak GIT_TEST_PASSING_SANITIZE_LEAK=true make
> $ cd t && sh t5309-pack-delta-cycles.sh
>
> pass.
> --- >8 ------ >8 ------ >8 ---
> Subject: [PATCH] fixup! test-tool: add pack-deltas helper
>
> t/helper/test-pack-deltas.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/t/helper/test-pack-deltas.c b/t/helper/test-pack-deltas.c
> index db7d1c3cd1..c8e837ea06 100644
> --- a/t/helper/test-pack-deltas.c
> +++ b/t/helper/test-pack-deltas.c
> @@ -122,6 +122,7 @@ int cmd__pack_deltas(int argc, const char **argv)
> if (get_oid_hex(base_oid_str, &base_oid))
> die("invalid object: %s", base_oid_str);
> }
> + string_list_clear(&items, 0);
Thanks. I'll make sure to apply it. My GGG PR validation was broken
top-to-bottom due to other environmental issues so I had not seen
this failure myself.
Thanks,
-Stolee
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the Git mailing list, Junio C Hamano wrote (reply to this): Derrick Stolee <[email protected]> writes:
> On 4/24/2025 3:41 PM, Junio C Hamano wrote:
>> I needed this to make
>>
>> $ SANITIZE=leak GIT_TEST_PASSING_SANITIZE_LEAK=true make
>> $ cd t && sh t5309-pack-delta-cycles.sh
>>
>> pass.
>> --- >8 ------ >8 ------ >8 ---
>> Subject: [PATCH] fixup! test-tool: add pack-deltas helper
>>
>> t/helper/test-pack-deltas.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/t/helper/test-pack-deltas.c b/t/helper/test-pack-deltas.c
>> index db7d1c3cd1..c8e837ea06 100644
>> --- a/t/helper/test-pack-deltas.c
>> +++ b/t/helper/test-pack-deltas.c
>> @@ -122,6 +122,7 @@ int cmd__pack_deltas(int argc, const char **argv)
>> if (get_oid_hex(base_oid_str, &base_oid))
>> die("invalid object: %s", base_oid_str);
>> }
>> + string_list_clear(&items, 0);
>
> Thanks. I'll make sure to apply it. My GGG PR validation was broken
> top-to-bottom due to other environmental issues so I had not seen
> this failure myself.
I squashed this in so unless there are other things you need to
change, this alone does not make it necessary to reroll the series.
Thanks. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the Git mailing list, Patrick Steinhardt wrote (reply to this): On Wed, Apr 23, 2025 at 05:40:02PM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/t/helper/test-pack-deltas.c b/t/helper/test-pack-deltas.c
> new file mode 100644
> index 00000000000..db7d1c3cd1f
> --- /dev/null
> +++ b/t/helper/test-pack-deltas.c
> @@ -0,0 +1,140 @@
[snip]
> +int cmd__pack_deltas(int argc, const char **argv)
> +{
> + int N;
> + struct hashfile *f;
> + struct strbuf line = STRBUF_INIT;
> +
> + if (argc != 2) {
> + usage(usage_str);
> + return -1;
> + }
> +
> + N = atoi(argv[1]);
Is there a reason why we don't use `parse_options()` here? It might make
this tool easier to use and extend going forward, and we wouldn't have
to care about invalid arguments. Right now, we silently accept a
non-integer argument and do the wrong thing.
Patrick There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the Git mailing list, Johannes Schindelin wrote (reply to this): Hi Patrick,
On Fri, 25 Apr 2025, Patrick Steinhardt wrote:
> On Wed, Apr 23, 2025 at 05:40:02PM +0000, Derrick Stolee via GitGitGadget wrote:
> > diff --git a/t/helper/test-pack-deltas.c b/t/helper/test-pack-deltas.c
> > new file mode 100644
> > index 00000000000..db7d1c3cd1f
> > --- /dev/null
> > +++ b/t/helper/test-pack-deltas.c
> > @@ -0,0 +1,140 @@
> [snip]
> > +int cmd__pack_deltas(int argc, const char **argv)
> > +{
> > + int N;
> > + struct hashfile *f;
> > + struct strbuf line = STRBUF_INIT;
> > +
> > + if (argc != 2) {
> > + usage(usage_str);
> > + return -1;
> > + }
> > +
> > + N = atoi(argv[1]);
>
> Is there a reason why we don't use `parse_options()` here? It might make
> this tool easier to use and extend going forward, and we wouldn't have
> to care about invalid arguments. Right now, we silently accept a
> non-integer argument and do the wrong thing.
I think that `parse_options()` would be overkill here because:
- This is a _mandatory_ argument, not an optional one.
- The required data type is `uint32_t`, and `parse_options()` has no
support for that.
But you do have a good point in that we may want to validate the data type
(even if technically, this is not a user-facing program, it's a test
helper that is used under tight control by Git's own test suite).
Consequently, I would suggest this fixup instead:
-- snipsnap --
Subject: [PATCH] fixup! test-tool: add pack-deltas helper
Let's make the command-line parsing a bit more stringent. We _could_
use `parse_options()`, but that would be overkill for a single,
non-optional argument. Besides, it would not bring any benefit, as the
parsed value needs to fit in the `uint32_t` type, and `parse_options()`
has no provision to ensure that.
Signed-off-by: Johannes Schindelin <[email protected]>
---
t/helper/test-pack-deltas.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/t/helper/test-pack-deltas.c b/t/helper/test-pack-deltas.c
index 4af69bdc05d3..f95d8ee16768 100644
--- a/t/helper/test-pack-deltas.c
+++ b/t/helper/test-pack-deltas.c
@@ -8,11 +8,12 @@
#include "hex.h"
#include "pack.h"
#include "pack-objects.h"
+#include "parse.h"
#include "setup.h"
#include "strbuf.h"
#include "string-list.h"
-static const char usage_str[] = "test-tool pack-deltas <n>";
+static const char usage_str[] = "test-tool pack-deltas <nr_entries>";
static unsigned long do_compress(void **pptr, unsigned long size)
{
@@ -79,7 +80,7 @@ static void write_ref_delta(struct hashfile *f,
int cmd__pack_deltas(int argc, const char **argv)
{
- int N;
+ unsigned long n;
struct hashfile *f;
struct strbuf line = STRBUF_INIT;
@@ -88,12 +89,13 @@ int cmd__pack_deltas(int argc, const char **argv)
return -1;
}
- N = atoi(argv[1]);
+ if (!git_parse_ulong(argv[1], &n) || n != (uint32_t)n)
+ die("invalid number of objects: %s", argv[1]);
setup_git_directory();
f = hashfd(1, "<stdout>");
- write_pack_header(f, N);
+ write_pack_header(f, n);
/* Read each line from stdin into 'line' */
while (strbuf_getline_lf(&line, stdin) != EOF) {
--
2.49.0.windows.1
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the Git mailing list, Patrick Steinhardt wrote (reply to this): On Fri, Apr 25, 2025 at 11:34:01AM +0200, Johannes Schindelin wrote:
> Hi Patrick,
>
> On Fri, 25 Apr 2025, Patrick Steinhardt wrote:
>
> > On Wed, Apr 23, 2025 at 05:40:02PM +0000, Derrick Stolee via GitGitGadget wrote:
> > > diff --git a/t/helper/test-pack-deltas.c b/t/helper/test-pack-deltas.c
> > > new file mode 100644
> > > index 00000000000..db7d1c3cd1f
> > > --- /dev/null
> > > +++ b/t/helper/test-pack-deltas.c
> > > @@ -0,0 +1,140 @@
> > [snip]
> > > +int cmd__pack_deltas(int argc, const char **argv)
> > > +{
> > > + int N;
> > > + struct hashfile *f;
> > > + struct strbuf line = STRBUF_INIT;
> > > +
> > > + if (argc != 2) {
> > > + usage(usage_str);
> > > + return -1;
> > > + }
> > > +
> > > + N = atoi(argv[1]);
> >
> > Is there a reason why we don't use `parse_options()` here? It might make
> > this tool easier to use and extend going forward, and we wouldn't have
> > to care about invalid arguments. Right now, we silently accept a
> > non-integer argument and do the wrong thing.
>
> I think that `parse_options()` would be overkill here because:
>
> - This is a _mandatory_ argument, not an optional one.
>
> - The required data type is `uint32_t`, and `parse_options()` has no
> support for that.
Support for that has been merged just this week via 2bc5414c411 (Merge
branch 'ps/parse-options-integers', 2025-04-24).
> But you do have a good point in that we may want to validate the data type
> (even if technically, this is not a user-facing program, it's a test
> helper that is used under tight control by Git's own test suite).
>
> Consequently, I would suggest this fixup instead:
But in any case, I'd be equally fine with your suggestion.
Patrick There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the Git mailing list, Johannes Schindelin wrote (reply to this): Hi Patrick,
On Fri, 25 Apr 2025, Patrick Steinhardt wrote:
> On Fri, Apr 25, 2025 at 11:34:01AM +0200, Johannes Schindelin wrote:
> >
> > On Fri, 25 Apr 2025, Patrick Steinhardt wrote:
> >
> > > On Wed, Apr 23, 2025 at 05:40:02PM +0000, Derrick Stolee via GitGitGadget wrote:
> > > > diff --git a/t/helper/test-pack-deltas.c b/t/helper/test-pack-deltas.c
> > > > new file mode 100644
> > > > index 00000000000..db7d1c3cd1f
> > > > --- /dev/null
> > > > +++ b/t/helper/test-pack-deltas.c
> > > > @@ -0,0 +1,140 @@
> > > [snip]
> > > > +int cmd__pack_deltas(int argc, const char **argv)
> > > > +{
> > > > + int N;
> > > > + struct hashfile *f;
> > > > + struct strbuf line = STRBUF_INIT;
> > > > +
> > > > + if (argc != 2) {
> > > > + usage(usage_str);
> > > > + return -1;
> > > > + }
> > > > +
> > > > + N = atoi(argv[1]);
> > >
> > > Is there a reason why we don't use `parse_options()` here? It might make
> > > this tool easier to use and extend going forward, and we wouldn't have
> > > to care about invalid arguments. Right now, we silently accept a
> > > non-integer argument and do the wrong thing.
> >
> > I think that `parse_options()` would be overkill here because:
> >
> > - This is a _mandatory_ argument, not an optional one.
> >
> > - The required data type is `uint32_t`, and `parse_options()` has no
> > support for that.
>
> Support for that has been merged just this week via 2bc5414c411 (Merge
> branch 'ps/parse-options-integers', 2025-04-24).
That's too new for me to be useful, as I have to work on top of v2.49.0
(see https://github.com/microsoft/git/pull/745).
> > But you do have a good point in that we may want to validate the data type
> > (even if technically, this is not a user-facing program, it's a test
> > helper that is used under tight control by Git's own test suite).
> >
> > Consequently, I would suggest this fixup instead:
>
> But in any case, I'd be equally fine with your suggestion.
Thanks!
Johannes There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the Git mailing list, Junio C Hamano wrote (reply to this): Patrick Steinhardt <[email protected]> writes:
>> > Is there a reason why we don't use `parse_options()` here? It might make
>> > this tool easier to use and extend going forward, and we wouldn't have
>> > to care about invalid arguments. Right now, we silently accept a
>> > non-integer argument and do the wrong thing.
>>
>> I think that `parse_options()` would be overkill here because:
>>
>> - This is a _mandatory_ argument, not an optional one.
>>
>> - The required data type is `uint32_t`, and `parse_options()` has no
>> support for that.
>
> Support for that has been merged just this week via 2bc5414c411 (Merge
> branch 'ps/parse-options-integers', 2025-04-24).
>
>> But you do have a good point in that we may want to validate the data type
>> (even if technically, this is not a user-facing program, it's a test
>> helper that is used under tight control by Git's own test suite).
>>
>> Consequently, I would suggest this fixup instead:
>
> But in any case, I'd be equally fine with your suggestion.
Yeah, I think we clearly showed our "it's just test helper, whose
callers are supposed to know what they are doing" attitude, but with
proper helpers, it is not too much additional effort to do the right
thing.
Thanks, both. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the Git mailing list, Derrick Stolee wrote (reply to this): On 4/25/2025 12:27 PM, Junio C Hamano wrote:
> Patrick Steinhardt <[email protected]> writes:
>
>>>> Is there a reason why we don't use `parse_options()` here? It might make
>>>> this tool easier to use and extend going forward, and we wouldn't have
>>>> to care about invalid arguments. Right now, we silently accept a
>>>> non-integer argument and do the wrong thing.
>>>
>>> I think that `parse_options()` would be overkill here because:
>>>
>>> - This is a _mandatory_ argument, not an optional one.
>>>
>>> - The required data type is `uint32_t`, and `parse_options()` has no
>>> support for that.
>>
>> Support for that has been merged just this week via 2bc5414c411 (Merge
>> branch 'ps/parse-options-integers', 2025-04-24).
The thing that confused me even with those changes is that this is a
_positional_ argument and we don't have a way to say "parse the 1st
positional argument into an integer".
>>> But you do have a good point in that we may want to validate the data type
>>> (even if technically, this is not a user-facing program, it's a test
>>> helper that is used under tight control by Git's own test suite).
>>>
>>> Consequently, I would suggest this fixup instead:
>>
>> But in any case, I'd be equally fine with your suggestion.
>
> Yeah, I think we clearly showed our "it's just test helper, whose
> callers are supposed to know what they are doing" attitude, but with
> proper helpers, it is not too much additional effort to do the right
> thing.
But with this philosophy in mind I can change the CLI to be of the form
"--num-objects <n>" to use the parse-options feature. This should make
things more extensible in the future.
Thanks,
-Stolee
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the Git mailing list, Junio C Hamano wrote (reply to this): Derrick Stolee <[email protected]> writes:
>> Yeah, I think we clearly showed our "it's just test helper, whose
>> callers are supposed to know what they are doing" attitude, but with
>> proper helpers, it is not too much additional effort to do the right
>> thing.
>
> But with this philosophy in mind I can change the CLI to be of the form
> "--num-objects <n>" to use the parse-options feature. This should make
> things more extensible in the future.
True. If we are aiming to deliver this to end-user's hands in some
future, I agree that we want to make it extensible, make it dtrt
without being told, and make it harder to give wrong input. If we
are going in that direction [*], I suspect this should not be a
separate and independent input---rather, shouldn't the tool already
_know_ what objects it placed in the resulting output stream, and
should be able to _count_ that number by itself? One thing it lets
us do to have this as a separate number is to create an invalid pack
stream where the header gives a wrong number, and as a test tool,
that may trump the convenience of not having to give the number
explicitly.
Another thing we may want to add to the tool is to give it a mode
that either (1) refuses to place the same object in a single pack
stream more than once, or (2) warn when it happens. The latter
would be useful to create an invalid pack stream for testing.
Thanks.
[Footnote]
* ... which I would welcome as the project manager, even though
personally I find that 'it is just test helper' attitude
attractive. Not everybody in this project is experienced enough
to understand the 'it is just test helper, let's not spend too
much effort on it' attitude and elements in the code there should
not be blindly copied and pasted to production part of the system. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the Git mailing list, Derrick Stolee wrote (reply to this): On 4/28/2025 12:37 PM, Junio C Hamano wrote:
> Derrick Stolee <[email protected]> writes:
>
>>> Yeah, I think we clearly showed our "it's just test helper, whose
>>> callers are supposed to know what they are doing" attitude, but with
>>> proper helpers, it is not too much additional effort to do the right
>>> thing.
>>
>> But with this philosophy in mind I can change the CLI to be of the form
>> "--num-objects <n>" to use the parse-options feature. This should make
>> things more extensible in the future.
>
> True. If we are aiming to deliver this to end-user's hands in some
> future, I agree that we want to make it extensible, make it dtrt
> without being told, and make it harder to give wrong input.
I'm not focused on usability, but instead on faster development cycles
if more advanced options are required in the future. I'll happily take
some extra time now to help those who come after.
> If we
> are going in that direction [*], I suspect this should not be a
> separate and independent input---rather, shouldn't the tool already
> _know_ what objects it placed in the resulting output stream, and
> should be able to _count_ that number by itself?
This makes sense as a feature, except that we need to write the number
of objects present in a packfile in its header. If we wanted to avoid
the argument, then we'd need to load the data into a list before
starting to write the packfile. By taking the count in advance, the
implementation is simpler.
> One thing it lets
> us do to have this as a separate number is to create an invalid pack
> stream where the header gives a wrong number, and as a test tool,
> that may trump the convenience of not having to give the number
> explicitly.
But also, this allows generating bad packfiles which is a bonus.
A very good point.
> Another thing we may want to add to the tool is to give it a mode
> that either (1) refuses to place the same object in a single pack
> stream more than once, or (2) warn when it happens. The latter
> would be useful to create an invalid pack stream for testing.
Noted for potential future expansion.
Thanks,
-Stolee
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the Git mailing list, Junio C Hamano wrote (reply to this): Derrick Stolee <[email protected]> writes:
> This makes sense as a feature, except that we need to write the number
> of objects present in a packfile in its header. If we wanted to avoid
> the argument, then we'd need to load the data into a list before
> starting to write the packfile. By taking the count in advance, the
> implementation is simpler.
Perhaps, but in the production code we already seek back and update
the header after writing a packfile with fixup_pack_header_footer(),
so which shoulnd't be too much work conceptually. You earlier said
something about faster development cycles; not having to figure out
how to seek back and fix the header does help faster development of
this helper ;-).
>> One thing it lets
>> us do to have this as a separate number is to create an invalid pack
>> stream where the header gives a wrong number, and as a test tool,
>> that may trump the convenience of not having to give the number
>> explicitly.
> But also, this allows generating bad packfiles which is a bonus.
> A very good point.
If this were an end-user facing program, we would most likely let it
count but let the count be overridden by the command line argument.
But until then, as I said, I am fine to just declare that it is just
a test helper ;-)
Thanks. |
||
TEST_BUILTINS_OBJS += test-name-hash.o | ||
TEST_BUILTINS_OBJS += test-online-cpus.o | ||
TEST_BUILTINS_OBJS += test-pack-deltas.o | ||
TEST_BUILTINS_OBJS += test-pack-mtimes.o | ||
TEST_BUILTINS_OBJS += test-parse-options.o | ||
TEST_BUILTINS_OBJS += test-parse-pathspec-file.o | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1109,8 +1109,8 @@ static void *threaded_second_pass(void *data) | |
set_thread_data(data); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the Git mailing list, Junio C Hamano wrote (reply to this): "Derrick Stolee via GitGitGadget" <[email protected]> writes:
> From: Derrick Stolee <[email protected]>
>
> As detailed in the previous changes to t5309-pack-delta-cycles.sh, the
> logic within 'git index-pack' to analyze an incoming thin packfile with
> REF_DELTAs is suspect. The algorithm is overly cautious around delta
> cycles, and that leads in fact to failing even when there is no cycle.
>
> This change adjusts the algorithm to no longer fail in these cases. In
> fact, these cycle cases will no longer fail but more importantly the
> valid cases will no longer fail, either. The resulting packfile from the
> --fix-thin operation will not have cycles either since REF_DELTAs are
> forbidden from the on-disk format and OFS_DELTAs are impossible to write
> as a cycle.
Loosening cycle avoidance indeed is worrysome. How do we guarantee
"since REF_DELTAs are forbidden from the on-disk format" (it is
obvious with OFS_DELTA only you cannot form a cycle)? By code
inspection and in-code comment for the code path of "--fix-thin"?
I think the most interesting question would be how, with the
loosening of the cycle check in this patch, we would still protect
against a malicious on-the-wire packdata that has a cycle in it
already, where deltas cannot be resolved. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the Git mailing list, Derrick Stolee wrote (reply to this): On 4/24/2025 5:41 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <[email protected]> writes:
>
>> From: Derrick Stolee <[email protected]>
>>
>> As detailed in the previous changes to t5309-pack-delta-cycles.sh, the
>> logic within 'git index-pack' to analyze an incoming thin packfile with
>> REF_DELTAs is suspect. The algorithm is overly cautious around delta
>> cycles, and that leads in fact to failing even when there is no cycle.
>>
>> This change adjusts the algorithm to no longer fail in these cases. In
>> fact, these cycle cases will no longer fail but more importantly the
>> valid cases will no longer fail, either. The resulting packfile from the
>> --fix-thin operation will not have cycles either since REF_DELTAs are
>> forbidden from the on-disk format and OFS_DELTAs are impossible to write
>> as a cycle.
>
> Loosening cycle avoidance indeed is worrysome. How do we guarantee
> "since REF_DELTAs are forbidden from the on-disk format" (it is
> obvious with OFS_DELTA only you cannot form a cycle)? By code
> inspection and in-code comment for the code path of "--fix-thin"?
>
> I think the most interesting question would be how, with the
> loosening of the cycle check in this patch, we would still protect
> against a malicious on-the-wire packdata that has a cycle in it
> already, where deltas cannot be resolved.
The good news is two-fold:
1. We have a test in t5309 (index-pack detects REF_DELTA cycles) that
checks this case and did not change behavior with this patch.
2. If we don't already have any of the objects that exist in the cycle,
then we can't _start_ expanding the objects to their full contents
as we don't have their bases. So what is left is a list of REF_DELTAs
that have not been processed but also a list of base object IDs that
we can't load. We'll fail without writing those objects to disk.
Does that satisfy your concerns in this space?
Thanks,
-Stolee
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the Git mailing list, Taylor Blau wrote (reply to this): On Mon, Apr 28, 2025 at 08:24:45PM +0000, Derrick Stolee via GitGitGadget wrote:
> The crux of the matter is how the algorithm works when the REF_DELTAs
> point to base objects that exist in the local repository.
Hmm. I'm having trouble squaring this with these next two paragraphs:
> Consider the case where the packfile has two REF_DELTA objects, A and B,
> and the delta chain looks like "A depends on B" and "B depends on C" for
> some third object C, where C is already in the current repository. The
> algorithm _should_ start with all objects that depend on C, finding B,
> and then moving on to all objects depending on B, finding A.
>
> However, if the repository also already has object B, then the delta
> chain can be analyzed in a different order. The deltas with base B can
> be analyzed first, finding A, and then the deltas with base C are
> analyzed, finding B. The algorithm currently continues to look for
> objects that depend on B, finding A again. This fails due to A's
> 'real_type' member already being overwritten from OBJ_REF_DELTA to the
> correct object type.
ISTM that this A->B->C chain is a problem because (in the above example)
the server sent B as a REF_DELTA base for A but also had its own
pre-existing copy of B.
But the first quoted sentence suggests that the issue is with REF_DELTAs
that point to base objects that exist in the local repository. Does
"point to" mean that the REF_DELTA's base is the local object, or that
the local object itself was sent as a REF_DELTA against some other base?
I haven't fully wrapped my head around the implications of this all yet,
but I think that it's the former, though admittedly even typing this I
am not quite sure of myself. I *think* that doing this is OK if the only
path from base objects to their corresponding deltas is unique, and/or
there were no such paths at all.
I'm trying to think through the implications of this against my
series[1] from a while ago that converts OFS_DELTAs that weren't usable
as part of verbatim pack-reuse into REF_DELTAs. There are two cases
there that I was considering:
- For some (delta, base) pair in a pack, there was an additional copy
of 'base' in some other pack, and the MIDX chose the copy from that
pack. That forms what I'm calling a "cross-pack" delta. This isn't
currently reusable as an OFS_DELTA for a variety of reasons, but is
OK as a REF_DELTA provided we know that the client either already
has the base object or we are sending it as part of the pack anyway.
- The other case is that we the client wants the delta-half of a
delta/base-pair, but not the base object. In this case, we can't
currently reuse the OFS_DELTA verbatim, but could if we converted it
to a REF_DELTA based on the knowledge that the client already has
the base object.
The latter is doable based on the information in the wants/haves bitmap.
The process there looks like: determine that the client doesn't want the
base object, realize that its bit is set in the 'haves' bitmap, and then
convert the delta object from a OFS_DELTA to a REF_DELTA.
But I think that all breaks for older clients that don't meet the unique
paths condition above. Does that sound right to you?
I think the cross-pack case is fine, provided we know ahead of time that
the client doesn't have the (converted-to-REF_DELTA) delta object in its
local copy.
Unfortunately, I think this means that [1] is a bit of a dead-end for
serves that have older clients (running a version that does not include
this patch). At least, I think that's true if we can construct a
situation where the server sends a REF_DELTA that it thinks the client
doesn't have but actually does. I'm not immediately sure what such a
situation would look like beyond cases like: "the client verbatim asked
for an object it already has, but isn't reachable from the set of
provided 'haves'".
Thanks,
Taylor
[1]: https://lore.kernel.org/git/[email protected]/ There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the Git mailing list, Derrick Stolee wrote (reply to this): On 5/6/2025 10:08 PM, Taylor Blau wrote:
> On Mon, Apr 28, 2025 at 08:24:45PM +0000, Derrick Stolee via GitGitGadget wrote:
>> The crux of the matter is how the algorithm works when the REF_DELTAs
>> point to base objects that exist in the local repository.
>
> Hmm. I'm having trouble squaring this with these next two paragraphs:
>
>> Consider the case where the packfile has two REF_DELTA objects, A and B,
>> and the delta chain looks like "A depends on B" and "B depends on C" for
>> some third object C, where C is already in the current repository. The
>> algorithm _should_ start with all objects that depend on C, finding B,
>> and then moving on to all objects depending on B, finding A.
>>
>> However, if the repository also already has object B, then the delta
>> chain can be analyzed in a different order. The deltas with base B can
>> be analyzed first, finding A, and then the deltas with base C are
>> analyzed, finding B. The algorithm currently continues to look for
>> objects that depend on B, finding A again. This fails due to A's
>> 'real_type' member already being overwritten from OBJ_REF_DELTA to the
>> correct object type.
>
> ISTM that this A->B->C chain is a problem because (in the above example)
> the server sent B as a REF_DELTA base for A but also had its own
> pre-existing copy of B.
>
> But the first quoted sentence suggests that the issue is with REF_DELTAs
> that point to base objects that exist in the local repository. Does
> "point to" mean that the REF_DELTA's base is the local object, or that
> the local object itself was sent as a REF_DELTA against some other base?
The issue is that based on the negotiation, the server knows the client
wants A and has C but doesn't think the client has B despite the client
having B.
If the server sends these two objects as a delta chain A->B->C using
REF_DELTAs (and not the typical OFS_DELTA from A to B) then the client
may inflate A first and B second (and then try to inflate A again).
> I haven't fully wrapped my head around the implications of this all yet,
> but I think that it's the former, though admittedly even typing this I
> am not quite sure of myself. I *think* that doing this is OK if the only
> path from base objects to their corresponding deltas is unique, and/or
> there were no such paths at all.
>
> I'm trying to think through the implications of this against my
> series[1] from a while ago that converts OFS_DELTAs that weren't usable
> as part of verbatim pack-reuse into REF_DELTAs. There are two cases
> there that I was considering:
>
> - For some (delta, base) pair in a pack, there was an additional copy
> of 'base' in some other pack, and the MIDX chose the copy from that
> pack. That forms what I'm calling a "cross-pack" delta. This isn't
> currently reusable as an OFS_DELTA for a variety of reasons, but is
> OK as a REF_DELTA provided we know that the client either already
> has the base object or we are sending it as part of the pack anyway.
These are the kinds of deltas that could hit this problem.
> - The other case is that we the client wants the delta-half of a
> delta/base-pair, but not the base object. In this case, we can't
> currently reuse the OFS_DELTA verbatim, but could if we converted it
> to a REF_DELTA based on the knowledge that the client already has
> the base object.
If we know the client has the base object, then not including the base in
the pack and referencing it as a REF_DELTA will always be OK.
> The latter is doable based on the information in the wants/haves bitmap.
> The process there looks like: determine that the client doesn't want the
> base object, realize that its bit is set in the 'haves' bitmap, and then
> convert the delta object from a OFS_DELTA to a REF_DELTA.
>
> But I think that all breaks for older clients that don't meet the unique
> paths condition above. Does that sound right to you?
>
> I think the cross-pack case is fine, provided we know ahead of time that
> the client doesn't have the (converted-to-REF_DELTA) delta object in its
> local copy.
>
> Unfortunately, I think this means that [1] is a bit of a dead-end for
> serves that have older clients (running a version that does not include
> this patch). At least, I think that's true if we can construct a
> situation where the server sends a REF_DELTA that it thinks the client
> doesn't have but actually does. I'm not immediately sure what such a
> situation would look like beyond cases like: "the client verbatim asked
> for an object it already has, but isn't reachable from the set of
> provided 'haves'".
I agree that [1] is going to present more opportunities for this bug to
be hit by older clients, though I will admit that the frequency of the
problem seems to have odd frequencies. It relies upon the fact that the
fetch negotiation fails to identify "B" as a common object. This sort of
thing can happen more frequently when using the commit boundary between
the haves and wants (instead of walking from all haves like a bitmap walk
would do). If the client fails to advertise all of its haves, then that
could also lead to this problem even if the server is walking all objects
reachable from the haves.
The other thing that you'll have going for you is that cross-pack deltas
should be rare. The case that helped me discover this issue was where the
server did not send any OFS_DELTAs at all but used REF_DELTAs for all
delta relationships. (This is required if the client doesn't advertise
the "ofs-delta" capability, but we'd need to go _way_ back for a client
to not include that capability.)
Thanks,
-Stolee
|
||
for (;;) { | ||
struct base_data *parent = NULL; | ||
struct object_entry *child_obj; | ||
struct base_data *child; | ||
struct object_entry *child_obj = NULL; | ||
struct base_data *child = NULL; | ||
|
||
counter_lock(); | ||
display_progress(progress, nr_resolved_deltas); | ||
|
@@ -1137,15 +1137,18 @@ static void *threaded_second_pass(void *data) | |
parent = list_first_entry(&work_head, struct base_data, | ||
list); | ||
|
||
if (parent->ref_first <= parent->ref_last) { | ||
while (parent->ref_first <= parent->ref_last) { | ||
int offset = ref_deltas[parent->ref_first++].obj_no; | ||
child_obj = objects + offset; | ||
if (child_obj->real_type != OBJ_REF_DELTA) | ||
die("REF_DELTA at offset %"PRIuMAX" already resolved (duplicate base %s?)", | ||
(uintmax_t) child_obj->idx.offset, | ||
oid_to_hex(&parent->obj->idx.oid)); | ||
if (child_obj->real_type != OBJ_REF_DELTA) { | ||
child_obj = NULL; | ||
continue; | ||
} | ||
child_obj->real_type = parent->obj->real_type; | ||
} else { | ||
break; | ||
} | ||
|
||
if (!child_obj && parent->ofs_first <= parent->ofs_last) { | ||
child_obj = objects + | ||
ofs_deltas[parent->ofs_first++].obj_no; | ||
assert(child_obj->real_type == OBJ_OFS_DELTA); | ||
|
@@ -1178,29 +1181,32 @@ static void *threaded_second_pass(void *data) | |
} | ||
work_unlock(); | ||
|
||
if (parent) { | ||
child = resolve_delta(child_obj, parent); | ||
if (!child->children_remaining) | ||
FREE_AND_NULL(child->data); | ||
} else { | ||
child = make_base(child_obj, NULL); | ||
if (child->children_remaining) { | ||
/* | ||
* Since this child has its own delta children, | ||
* we will need this data in the future. | ||
* Inflate now so that future iterations will | ||
* have access to this object's data while | ||
* outside the work mutex. | ||
*/ | ||
child->data = get_data_from_pack(child_obj); | ||
child->size = child_obj->size; | ||
if (child_obj) { | ||
if (parent) { | ||
child = resolve_delta(child_obj, parent); | ||
if (!child->children_remaining) | ||
FREE_AND_NULL(child->data); | ||
} else{ | ||
child = make_base(child_obj, NULL); | ||
if (child->children_remaining) { | ||
/* | ||
* Since this child has its own delta children, | ||
* we will need this data in the future. | ||
* Inflate now so that future iterations will | ||
* have access to this object's data while | ||
* outside the work mutex. | ||
*/ | ||
child->data = get_data_from_pack(child_obj); | ||
child->size = child_obj->size; | ||
} | ||
} | ||
} | ||
|
||
work_lock(); | ||
if (parent) | ||
parent->retain_data--; | ||
if (child->data) { | ||
|
||
if (child && child->data) { | ||
/* | ||
* This child has its own children, so add it to | ||
* work_head. | ||
|
@@ -1209,7 +1215,7 @@ static void *threaded_second_pass(void *data) | |
base_cache_used += child->size; | ||
prune_base_data(NULL); | ||
free_base_data(child); | ||
} else { | ||
} else if (child) { | ||
/* | ||
* This child does not have its own children. It may be | ||
* the last descendant of its ancestors; free those | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,148 @@ | ||
#define USE_THE_REPOSITORY_VARIABLE | ||
|
||
#include "test-tool.h" | ||
#include "git-compat-util.h" | ||
#include "delta.h" | ||
#include "git-zlib.h" | ||
#include "hash.h" | ||
#include "hex.h" | ||
#include "pack.h" | ||
#include "pack-objects.h" | ||
#include "parse-options.h" | ||
#include "setup.h" | ||
#include "strbuf.h" | ||
#include "string-list.h" | ||
|
||
static const char *usage_str[] = { | ||
"test-tool pack-deltas --num-objects <num-objects>", | ||
NULL | ||
}; | ||
|
||
static unsigned long do_compress(void **pptr, unsigned long size) | ||
{ | ||
git_zstream stream; | ||
void *in, *out; | ||
unsigned long maxsize; | ||
|
||
git_deflate_init(&stream, 1); | ||
maxsize = git_deflate_bound(&stream, size); | ||
|
||
in = *pptr; | ||
out = xmalloc(maxsize); | ||
*pptr = out; | ||
|
||
stream.next_in = in; | ||
stream.avail_in = size; | ||
stream.next_out = out; | ||
stream.avail_out = maxsize; | ||
while (git_deflate(&stream, Z_FINISH) == Z_OK) | ||
; /* nothing */ | ||
git_deflate_end(&stream); | ||
|
||
free(in); | ||
return stream.total_out; | ||
} | ||
|
||
static void write_ref_delta(struct hashfile *f, | ||
struct object_id *oid, | ||
struct object_id *base) | ||
{ | ||
unsigned char header[MAX_PACK_OBJECT_HEADER]; | ||
unsigned long size, base_size, delta_size, compressed_size, hdrlen; | ||
enum object_type type; | ||
void *base_buf, *delta_buf; | ||
void *buf = repo_read_object_file(the_repository, | ||
oid, &type, | ||
&size); | ||
|
||
if (!buf) | ||
die("unable to read %s", oid_to_hex(oid)); | ||
|
||
base_buf = repo_read_object_file(the_repository, | ||
base, &type, | ||
&base_size); | ||
|
||
if (!base_buf) | ||
die("unable to read %s", oid_to_hex(base)); | ||
|
||
delta_buf = diff_delta(base_buf, base_size, | ||
buf, size, &delta_size, 0); | ||
|
||
compressed_size = do_compress(&delta_buf, delta_size); | ||
|
||
hdrlen = encode_in_pack_object_header(header, sizeof(header), | ||
OBJ_REF_DELTA, delta_size); | ||
hashwrite(f, header, hdrlen); | ||
hashwrite(f, base->hash, the_repository->hash_algo->rawsz); | ||
hashwrite(f, delta_buf, compressed_size); | ||
|
||
free(buf); | ||
free(base_buf); | ||
free(delta_buf); | ||
} | ||
|
||
int cmd__pack_deltas(int argc, const char **argv) | ||
{ | ||
int num_objects = -1; | ||
struct hashfile *f; | ||
struct strbuf line = STRBUF_INIT; | ||
struct option options[] = { | ||
OPT_INTEGER('n', "num-objects", &num_objects, N_("the number of objects to write")), | ||
OPT_END() | ||
}; | ||
|
||
argc = parse_options(argc, argv, NULL, | ||
options, usage_str, 0); | ||
|
||
if (argc || num_objects < 0) | ||
usage_with_options(usage_str, options); | ||
|
||
setup_git_directory(); | ||
|
||
f = hashfd(the_repository->hash_algo, 1, "<stdout>"); | ||
write_pack_header(f, num_objects); | ||
|
||
/* Read each line from stdin into 'line' */ | ||
while (strbuf_getline_lf(&line, stdin) != EOF) { | ||
const char *type_str, *content_oid_str, *base_oid_str = NULL; | ||
struct object_id content_oid, base_oid; | ||
struct string_list items = STRING_LIST_INIT_NODUP; | ||
/* | ||
* Tokenize into two or three parts: | ||
* 1. REF_DELTA, OFS_DELTA, or FULL. | ||
* 2. The object ID for the content object. | ||
* 3. The object ID for the base object (optional). | ||
*/ | ||
if (string_list_split_in_place(&items, line.buf, " ", 3) < 0) | ||
die("invalid input format: %s", line.buf); | ||
|
||
if (items.nr < 2) | ||
die("invalid input format: %s", line.buf); | ||
|
||
type_str = items.items[0].string; | ||
content_oid_str = items.items[1].string; | ||
|
||
if (get_oid_hex(content_oid_str, &content_oid)) | ||
die("invalid object: %s", content_oid_str); | ||
if (items.nr >= 3) { | ||
base_oid_str = items.items[2].string; | ||
if (get_oid_hex(base_oid_str, &base_oid)) | ||
die("invalid object: %s", base_oid_str); | ||
} | ||
string_list_clear(&items, 0); | ||
|
||
if (!strcmp(type_str, "REF_DELTA")) | ||
write_ref_delta(f, &content_oid, &base_oid); | ||
else if (!strcmp(type_str, "OFS_DELTA")) | ||
die("OFS_DELTA not implemented"); | ||
else if (!strcmp(type_str, "FULL")) | ||
die("FULL not implemented"); | ||
else | ||
die("unknown pack type: %s", type_str); | ||
} | ||
|
||
finalize_hashfile(f, NULL, FSYNC_COMPONENT_PACK, | ||
CSUM_HASH_IN_STREAM | CSUM_FSYNC | CSUM_CLOSE); | ||
strbuf_release(&line); | ||
return 0; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Junio C Hamano wrote (reply to this):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Derrick Stolee wrote (reply to this):