Fix exploitable regexes in Nougat and GPTSan/GPTJNeoXJapanese #36121
Conversation
Force-pushed from 949dde2 to 75f90bb
|
cc @ArthurZucker @Cyrilvallez for core maintainer review (but I realize that reading regexes is very hard, lol - let me know if you want me to add equivalence tests or something instead) |
|
Review update: I pinged @McPatate, but this is quite a hard PR to review. I have done some local testing with scripts to generate demo examples and the results seem equivalent for both the old and new tokenizer, and I ran slow tests locally and they all still pass. I can add something like an "equivalence test" for the old vs. new regex to the tests, but this is a weird thing to add to the CI because it shouldn't regress - so we really only need to test it once, when adding the PR! |
|
cc @NielsRogge in case he has any ideas on how to test the nougat changes 😅 |
|
With Claude's help I wrote this script to compare the old and new Japanese price regexes:

```python
import re


def generate_test_cases():
    # Build parts for the candidate currency strings
    numbers = ["0", "1", "12", "123", "1,234", "12,345", "123,456"]
    jp_units = ["", "億", "万", "千"]
    # Build components by concatenating numbers with possible units
    components = [num + unit for num in numbers for unit in jp_units if num or unit]
    # Generate prefixes (empty, single component, or two components)
    prefixes = set()
    prefixes.add("")
    for comp in components:
        prefixes.add(comp)
    for comp1 in components:
        for comp2 in components:
            prefixes.add(comp1 + comp2)
    currency_units = [
        "千円", "万円", "千万円", "円",
        "千ドル", "万ドル", "千万ドル", "ドル",
        "千ユーロ", "万ユーロ", "千万ユーロ", "ユーロ",
    ]
    tax_suffixes = [
        "", "(税込)", "(税抜)", "+tax",
        "(税込)(税抜)", "(税込)+tax", "(税抜)+tax", "+tax(税込)",
    ]
    test_cases = []
    # Build valid test cases
    for prefix in sorted(prefixes):
        for cur in currency_units:
            for tax in tax_suffixes:
                test_cases.append(prefix + cur + tax)
    # Add some invalid test cases
    neg_tests = [
        "abc",
        "123abc",
        "1,23,456円",  # wrong comma grouping
        "12,34円",
        "1,234,56円",  # invalid comma grouping
        "123,45,678円",  # invalid grouping
        "1,2345円",  # missing proper comma placement
    ]
    test_cases.extend(neg_tests)
    return test_cases


def main():
    regex1_str = r"((0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)*億)*((0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)*万)*((0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)*千)*(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)*(千円|万円|千万円|円|千ドル|万ドル|千万ドル|ドル|千ユーロ|万ユーロ|千万ユーロ|ユーロ)+(\(税込\)|\(税抜\)|\+tax)*"
    regex2_str = (
        r"(?:\d,\d{3}|[\d億万千])*"
        r"(?:千円|万円|千万円|円|千ドル|万ドル|千万ドル|ドル|千ユーロ|万ユーロ|千万ユーロ|ユーロ)+"
        r"(?:\(税込\)|\(税抜\)|\+tax)*"
    )
    re1 = re.compile(regex1_str)
    re2 = re.compile(regex2_str)
    test_cases = generate_test_cases()
    total = 0
    mismatches = 0
    print(f"Testing {len(test_cases)} cases...\n")
    for test_str in test_cases:
        total += 1
        m1 = re1.fullmatch(test_str)
        m2 = re2.fullmatch(test_str)
        if (m1 is None) != (m2 is None):
            print(f"Mismatch in matching result for test case: {test_str!r}")
            print("  regex1 matched:", m1 is not None)
            print("  regex2 matched:", m2 is not None)
            mismatches += 1
        elif m1 is not None and m2 is not None:
            if m1.span() != m2.span():
                print(f"Mismatch in match span for test case: {test_str!r}")
                print("  regex1 span:", m1.span())
                print("  regex2 span:", m2.span())
                mismatches += 1
            else:
                print(f"OK: {test_str!r}")
        else:
            print(f"OK (no match): {test_str!r}")
    print(f"\nTotal test cases: {total}, mismatches: {mismatches}")
    if mismatches == 0:
        print("All tests passed: both regexes match the same spans!")
    else:
        print("Some mismatches were found; please review the differences.")


if __name__ == "__main__":
    main()
```

The conclusion is that both regexes match on most real price strings, but the new regex also matches some unusual/incorrect orderings. However, I think this is still acceptable - for almost all "real" strings, both regexes match, and I don't think it's wrong to match these "unusual" prices if they actually appear in the text. |
|
I got it to write a similar comparison script for the old/new versions of the Nougat regex. In that case, there were no differences at all in the function output, so I think we're good 👍 |
|
cc @ArthurZucker @Cyrilvallez whenever you get a chance - I've done the testing and I'm happy that the outputs are good enough at this point! |
The original regex pattern in tokenization_nougat_fast.py was vulnerable to catastrophic backtracking due to greedy quantifiers and nested alternations. This commit replaces it with a more efficient pattern that:
1. Uses explicit character classes instead of dot (.)
2. Handles whitespace more precisely
3. Avoids unnecessary backtracking
4. Supports both lowercase and uppercase roman numerals
5. Maintains the same functionality while being more robust
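As background on the failure mode the commit message describes: catastrophic backtracking arises when nested quantifiers give the regex engine exponentially many ways to partition the same input. A toy illustration (not the actual Nougat pattern):

```python
import re
import time

# Toy pattern with a nested quantifier: (a+)+ followed by an anchor.
# Against "aaa...ab" it can never match, but the engine tries every way
# of splitting the run of "a"s between the inner and outer "+" first.
bad = re.compile(r"(a+)+$")


def time_failed_match(n: int) -> float:
    s = "a" * n + "b"  # the trailing "b" forces exhaustive backtracking
    t0 = time.perf_counter()
    assert bad.match(s) is None
    return time.perf_counter() - t0


# Work roughly doubles per extra "a": about 2^n candidate partitions.
t_small = time_failed_match(8)
t_big = time_failed_match(20)
```

The new patterns avoid this shape by removing the nested quantifiers, so match time stays linear in the input length.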
Force-pushed from 404b72d to 8c2d71a
|
Oh wow, this is indeed extremely hard to read - thanks for working on it, it must not have been an easy one! 😅 You mentioned that for 3.11 and later we could get a fully matching regex. IMO it would be best to dynamically choose between one regex or the other based on the running Python version, as most people are using 3.11 or later (thus for those users there are no regressions at all). Plus it is future-proof that way. |
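The version gating suggested here could look something like the sketch below. This is hypothetical - the pattern bodies are placeholders, not the PR's actual regexes; the relevant point is that Python 3.11's `re` module gained atomic groups (`(?>...)`) and possessive quantifiers, which forbid backtracking into a group and make a fully equivalent, backtracking-safe pattern expressible:

```python
import re
import sys

# Hypothetical version-gated pattern selection; SAFE_DIGITS is a
# placeholder name and both regex bodies are illustrative only.
if sys.version_info >= (3, 11):
    # Atomic group: once "0" or "[1-9][0-9]*" matches, the engine
    # never backtracks into it, so no exponential blowup is possible.
    SAFE_DIGITS = r"(?>0|[1-9][0-9]*)"
else:
    # Older interpreters: fall back to a simpler backtracking-safe shape.
    SAFE_DIGITS = r"[0-9]+"

digits = re.compile(SAFE_DIGITS)
```

Both branches compile on their respective interpreters, so no runtime feature detection beyond the version check is needed.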
McPatate left a comment
Imo the code looks good - easier to read than the regexes!
|
cc @Cyrilvallez done! There are now two versions of the GPTSan/GPTJNeoXJapanese regex: |
|
Cyrilvallez left a comment
All right, LGTM! Thanks a lot for taking the time to tackle such an unwelcoming one!
I tried OpenHands on these but the result was generally a non-equivalent regex, even after a lot of loops of testing/iteration, so I rewrote these myself!
The Nougat regex is designed to detect lines that look like flattened bullet-point lists. To avoid blowup, I rewrote the whole function containing it to stop trying to do everything inside one regex, as that was unnecessary - the regex always matched an entire line from `^` to `$` regardless, so we could replace it by splitting the input into lines and then testing each line with a couple of simple tests, then extracting relevant patterns if needed.

The GPTJNeoXJapanese/GPTSan regex is designed to detect prices in Japanese text and replace them with a
`<PRICE>` token. It uses complex matching to detect numbers that might contain commas (like `100,000`) followed by Japanese count markers or characters that indicate yen pricing, added tax, etc. However, because many components of the regex were optional (due to `?` or `*`), in reality this block matched just about any combination of digits and commas followed by price-related characters. The blowup was avoided by replacing the dangerous block with a much simpler regex that makes that explicit!
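To make the `<PRICE>` substitution concrete, here is a small sketch using the simplified pattern quoted in the comparison script earlier in this thread (the merged code may differ in detail; `mask_prices` is an illustrative helper name, not the library API):

```python
import re

# Simplified price pattern from the comparison script above; the merged
# tokenizer code may differ in detail.
PRICE = re.compile(
    r"(?:\d,\d{3}|[\d億万千])*"
    r"(?:千円|万円|千万円|円|千ドル|万ドル|千万ドル|ドル|千ユーロ|万ユーロ|千万ユーロ|ユーロ)+"
    r"(?:\(税込\)|\(税抜\)|\+tax)*"
)


def mask_prices(text: str) -> str:
    # Replace every detected price span with a <PRICE> token.
    return PRICE.sub("<PRICE>", text)
```

Because the currency-unit group uses `+` rather than `*`, the pattern can never match the empty string, so `sub` only fires on spans that actually end in a currency marker.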