Skip to content

Don't write code during size computation #547

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 7, 2024

Conversation

zherczeg
Copy link
Collaborator

@zherczeg zherczeg commented Nov 2, 2024

This patch is in the top of #546

When I worked on the other patch, I was thinking that writing unknown amount of data when the length is computed can be an issue. It looks like fuzzers already caught this issue, so here is a patch. Can be combined with #546, but can be independent form it.

Adding a test is difficult because COMPILE_WORK_SIZE is (3000*LINK_SIZE) and a really huge test is needed. What shall we do? Add a really huge test? E.g. /[\x{200}&&\x{200}&&....\x{200}]/alt_extended_class

Copy link
Contributor

@carenas carenas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be better to rebase the second patch on top of master to expedite the fix though, since the first patch is unlikely to be merged AS-IS.

BTW, I didn't notice this fatal error when reviewing first, but we might need to clarify the expectations of that "review" or risk bigger issues later.

@zherczeg zherczeg marked this pull request as draft November 3, 2024 03:24
@zherczeg
Copy link
Collaborator Author

zherczeg commented Nov 3, 2024

I think the first patch is closer to landing, since all requests were done. This one, however, is not necessary in its best form. Maybe we could do it in a nicer way, and testing is also a question. The new asserts do some testing, but is it enough?

@NWilson
Copy link
Member

NWilson commented Nov 4, 2024

Ouch. During the lengthptr != NUL phase, is in necessary to completely avoid writing any *code++ = ... assignments?

It appeared that other places in the compiler were unconditionally writing data out, and I'd uneasily just accepted that. Does it allocate an up-front amount of space equal to the size of the parsed input, so that if you have 10 META codes, there'll be space for 10 OP codes, and you only need to make the OP write conditional for cases where the compiler inflates the output?

PCRE2 triggers tons of lint-rule violations in Microsoft's in-house linters, due to widespread use of writes to pointers that don't have attached length information (for example, not checking each and every write to code against a code_end terminator). All writable buffers do have a maximum legal write, so the linter is not happy to see any writes at all to code pointers that are passed in without a code_size or code_end limit. Many of these are, of course, false positives (grr linters), because of the system of checking the required length with lengthptr, then allocating a buffer of known sufficient size.

Adding a test is difficult because COMPILE_WORK_SIZE is (3000*LINK_SIZE) and a really huge test is needed. What shall we do? Add a really huge test? E.g. /[\x{200}&&\x{200}&&....\x{200}]/alt_extended_class

Sounds good to just write a big test. 3000 characters isn't going to break the test suite, I hope?

@PhilipHazel
Copy link
Collaborator

Ouch. During the lengthptr != NUL phase, is in necessary to completely avoid writing any *code++ = ... assignments?

Luckily, no. There is a block of storage available, but the the idea is that after each "item" is written and its length ascertained, the pointer is reset to the start of the block. I forget the details offhand, but it is supposed to be clever enough to stay within the allotted memory. (It's over a decade since I implemented this, probably nearer two, so the details are no longer in mind. I'd have to read the code to refresh....)

@zherczeg
Copy link
Collaborator Author

zherczeg commented Nov 5, 2024

The idea is that the majority of opcodes have a fixed size. PCRE2 uses a stack allocated buffer, where you can write them during length computation. Then code-buffer_start is added to length, and code=buffer_start resets the buffer. So far only xclass was the only exception. It writes its beginning and end to the buffer, but there can be any number of internal tokens. These are not written to the buffer during length computation. And now eclass. I hope it explains how the system works, I think it is quite smart. The 3000*LINK_SIZE stack buffer is too much for me though. We could also add more asserts as well.

https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_compile.c#L10210

@zherczeg
Copy link
Collaborator Author

zherczeg commented Nov 5, 2024

Patch is updated. I still don't like it. It has no test, and maybe LINK_SIZE checks needed as well. The PCRE2_ASSERT(lengthptr == NULL || code == *pcode); could be incorrect, maybe the original value of pcode should be saved in a local in debug mode.

@NWilson
Copy link
Member

NWilson commented Nov 5, 2024

Thank you @zherczeg and @PhilipHazel for explaining. That makes good sense, and I see the need to fix it for ECLASS, which does indeed write an unbounded amount of data to its opcode.

@zherczeg zherczeg marked this pull request as ready for review November 5, 2024 12:23
@zherczeg
Copy link
Collaborator Author

zherczeg commented Nov 5, 2024

I am thinking about a possible test. It would be a huge test with a repeated character sequence. Since the asserts check the reset mechanic, it would not add more to the existing tests. The compile_class_nested is recursive, so I cannot really simplify the checks there as well.

I have no better idea to fix this, but I would be happy if future code changes would make this code nicer as a side effect.

@NWilson
Copy link
Member

NWilson commented Nov 5, 2024

Agreed. The fix looks good, but I'll try and reduce the complexity when I optimize away the recursive descent.

We should have tests of the form: [a--a--a--a--a ...] and [^[^a]--[^a]--...], and also of the form [^[^a--a][^a--a]...].

Copy link
Member

@NWilson NWilson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve, but ideally should add a test before merging

first = FALSE;
}

DONE:
PCRE2_ASSERT(!first); /* Confirm that we found something. */
PCRE2_ASSERT(lengthptr == NULL ||
(code == code_start && *pcode == code_start));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You added an extra condition to the assertion here - *pcode == code_start, which the other assertion doesn't have. I don't think it's needed, since there's literally no code in the function which touches pcode.

@zherczeg
Copy link
Collaborator Author

zherczeg commented Nov 6, 2024

Test added. Without hte patch, it fails with:
Failed: error 152 at offset 5486: internal error: overran compiling workspace

@zherczeg
Copy link
Collaborator Author

zherczeg commented Nov 7, 2024

Is there any changes needed for this patch?

@carenas
Copy link
Contributor

carenas commented Nov 7, 2024

Is there any changes needed for this patch?

I think is ready and doubly "approved"

@zherczeg zherczeg merged commit 037fb17 into PCRE2Project:master Nov 7, 2024
15 checks passed
@zherczeg zherczeg deleted the stack_fix branch November 7, 2024 10:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants