Conversation

@Rocketknight1 Rocketknight1 commented Feb 10, 2025

I tried OpenHands on these but the result was generally a non-equivalent regex, even after a lot of loops of testing/iteration, so I rewrote these myself!

The Nougat regex is designed to detect lines that look like flattened bullet-point lists. To avoid the blowup, I rewrote the whole function containing it so that it stops trying to do everything inside one regex, which was unnecessary anyway: the regex always matched an entire line from ^ to $, so we can replace it by splitting the input into lines, testing each line with a couple of simple checks, and then extracting the relevant patterns if needed.
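
For illustration, here's a minimal sketch of that line-splitting approach. The helper name and the per-line bullet test are placeholders (the real check also handles roman-numeral markers), not the merged code:

import re

# Placeholder per-line test standing in for the real bullet-list check.
BULLET_LINE = re.compile(r"^\s*(?:[-*•]|[ivxIVX]+[.)])\s+\S")

def find_bullet_lines(text):
    matches = []
    for line in text.split("\n"):
        # Each line is tested on its own, so no single regex ever scans
        # (or backtracks across) the whole document.
        if BULLET_LINE.match(line):
            matches.append(line)
    return matches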

The GPTJNeoXJapanese/GPTSan regex is designed to detect prices in Japanese text and replace them with a <PRICE> token. It uses complex matching to detect numbers that may contain commas (like 100,000) followed by Japanese count markers or characters that indicate yen pricing, added tax, and so on. However, because so many components of the regex were optional (thanks to ? or *), in practice the block matched just about any combination of digits and commas followed by price-related characters. The blowup was avoided by replacing the dangerous block with a much simpler regex that makes that explicit!
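
To make the failure mode concrete, here's a self-contained demo of this kind of blowup. The pattern below is just the classic nested-quantifier shape dressed up as a toy price matcher - it is not the actual tokenizer regex:

import re
import time

# Nested quantifiers over the same characters: every way of splitting the
# digit run between iterations of the group is a separate backtracking path.
evil = re.compile(r"(\d+,?)+円")

payload = "1" * 25 + "!"  # all digits, then a character that makes 円 fail

start = time.perf_counter()
evil.match(payload)  # returns None, but only after ~2**24 backtracking paths
print(f"took {time.perf_counter() - start:.1f}s")  # roughly doubles per extra digit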

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Rocketknight1 Rocketknight1 marked this pull request as ready for review February 11, 2025 18:26
@Rocketknight1 Rocketknight1 changed the title from "Fix potential regex catastrophic backtracking in NougatTokenizerFast" to "Fix exploitable regexes in Nougat and GPTSan/GPTJNeoXJapanese" on Feb 11, 2025

Rocketknight1 commented Feb 12, 2025

cc @ArthurZucker @Cyrilvallez for core maintainer review (but I realize that reading regexes is very hard, lol - let me know if you want me to add equivalence tests or something instead)

@Rocketknight1

Review update: I pinged @McPatate, but this is quite a hard PR to review. I have done some local testing with scripts to generate demo examples and the results seem equivalent for both the old and new tokenizer, and I ran slow tests locally and they all still pass. I can add something like an "equivalence test" for the old vs. new regex to the tests, but this is a weird thing to add to the CI because it shouldn't regress - so we really only need to test it once, when adding the PR!

@Rocketknight1

cc @NielsRogge in case he has any ideas on how to test the nougat changes 😅

Rocketknight1 commented Feb 13, 2025

With Claude's help I wrote this script to compare the old and new Japanese price regexes:

import re

def generate_test_cases():
    # Build parts for the candidate currency strings
    numbers = ["0", "1", "12", "123", "1,234", "12,345", "123,456"]
    jp_units = ["", "億", "万", "千"]
    
    # Build components by concatenating numbers with possible units
    components = [num + unit for num in numbers for unit in jp_units if num or unit]
    
    # Generate prefixes (empty, single component, or two components)
    prefixes = set()
    prefixes.add("")
    for comp in components:
        prefixes.add(comp)
    for comp1 in components:
        for comp2 in components:
            prefixes.add(comp1 + comp2)
    
    currency_units = [
        "千円", "万円", "千万円", "円",
        "千ドル", "万ドル", "千万ドル", "ドル",
        "千ユーロ", "万ユーロ", "千万ユーロ", "ユーロ"
    ]
    
    tax_suffixes = [
        "", "(税込)", "(税抜)", "+tax",
        "(税込)(税抜)", "(税込)+tax", "(税抜)+tax", "+tax(税込)"
    ]
    
    test_cases = []
    
    # Build valid test cases
    for prefix in sorted(prefixes):
        for cur in currency_units:
            for tax in tax_suffixes:
                test_cases.append(prefix + cur + tax)
    
    # Add some invalid test cases
    neg_tests = [
        "abc",
        "123abc",
        "1,23,456円",   # wrong comma grouping
        "12,34円",
        "1,234,56円",   # invalid comma grouping
        "123,45,678円", # invalid grouping
        "1,2345円",     # missing proper comma placement
    ]
    test_cases.extend(neg_tests)
    
    return test_cases

def main():
    # Original (exploitable) regex from the tokenizer
    regex1_str = r"((0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)*億)*((0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)*万)*((0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)*千)*(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)*(千円|万円|千万円|円|千ドル|万ドル|千万ドル|ドル|千ユーロ|万ユーロ|千万ユーロ|ユーロ)+(\(税込\)|\(税抜\)|\+tax)*"

    # New, simplified replacement
    regex2_str = (
            r"(?:\d,\d{3}|[\d億万千])*"
            r"(?:千円|万円|千万円|円|千ドル|万ドル|千万ドル|ドル|千ユーロ|万ユーロ|千万ユーロ|ユーロ)+"
            r"(?:\(税込\)|\(税抜\)|\+tax)*"
    )

    re1 = re.compile(regex1_str)
    re2 = re.compile(regex2_str)

    test_cases = generate_test_cases()
    
    total = 0
    mismatches = 0
    print(f"Testing {len(test_cases)} cases...\n")
    
    for test_str in test_cases:
        total += 1
        m1 = re1.fullmatch(test_str)
        m2 = re2.fullmatch(test_str)
        
        if (m1 is None) != (m2 is None):
            print(f"Mismatch in matching result for test case: {test_str!r}")
            print("   regex1 matched:" , m1 is not None)
            print("   regex2 matched:" , m2 is not None)
            mismatches += 1
        elif m1 is not None and m2 is not None:
            if m1.span() != m2.span():
                print(f"Mismatch in match span for test case: {test_str!r}")
                print("   regex1 span:", m1.span())
                print("   regex2 span:", m2.span())
                mismatches += 1
            else:
                print(f"OK: {test_str!r}")
        else:
            print(f"OK (no match): {test_str!r}")
    
    print(f"\nTotal test cases: {total}, mismatches: {mismatches}")
    if mismatches == 0:
        print("All tests passed: both regexes match the same spans!")
    else:
        print("Some mismatches were found; please review the differences.")

if __name__ == '__main__':
    main()

The conclusion is that both regexes match on most real price strings, but the new regex also matches unusual/incorrect orderings like 1万12億千円 (this is equivalent to saying something like "sixty and one hundred" in English instead of "one hundred and sixty" - it's comprehensible but the ordering is definitely weird). I couldn't find a way to fix it without using atomic/possessive regex features that were only added in Python 3.11.

However, I think this is still acceptable - for almost all "real" strings, both regexes match, and I don't think it's wrong to match these "unusual" prices if they actually appear in the text.
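
For reference, this is the Python >= 3.11 feature in question: a possessive quantifier never gives back what it has consumed, which removes the backtracking paths entirely. A toy demo on the same shape as above, not the merged pattern:

import re  # possessive quantifiers require Python >= 3.11

# ++ is possessive: once the digits are consumed they are never re-split,
# so a failing input is rejected in linear time instead of exponential.
safe = re.compile(r"(\d++,?)++円")
assert safe.match("1" * 25 + "!") is None  # returns essentially instantly
assert safe.match("1,234円")               # real prices still match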

@Rocketknight1

I got Claude to write a similar comparison script for the old/new versions of the Nougat regex. In that case, there were no differences at all in the function output, so I think we're good 👍
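
That script isn't reproduced here, but the harness amounts to fuzzing both versions of the function with the same inputs and asserting identical outputs - roughly this sketch (the function arguments are placeholders):

import random
import string

def compare_implementations(old_fn, new_fn, n_cases=10_000):
    # Feed both versions identical random inputs and demand identical outputs.
    for _ in range(n_cases):
        text = "".join(random.choices(string.printable, k=random.randint(0, 200)))
        assert old_fn(text) == new_fn(text), f"divergence on input: {text!r}"
    print(f"{n_cases} random inputs, outputs identical")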

@Rocketknight1

cc @ArthurZucker @Cyrilvallez whenever you get a chance - I've done the testing and I'm happy that the outputs are good enough at this point!

openhands-agent and others added 9 commits February 13, 2025 18:19

The original regex pattern in tokenization_nougat_fast.py was vulnerable to
catastrophic backtracking due to greedy quantifiers and nested alternations.
This commit replaces it with a more efficient pattern that:

1. Uses explicit character classes instead of dot (.)
2. Handles whitespace more precisely
3. Avoids unnecessary backtracking
4. Supports both lowercase and uppercase roman numerals
5. Maintains the same functionality while being more robust
@Cyrilvallez

Oh wow, this is indeed extremely hard to read - thanks for working on it, must not have been an easy one! 😅 You mentioned that for 3.11 and later we could get a fully matching regex. IMO it would be best to dynamically choose between one regex or the other based on the current running version, as most people are using 3.11 and later (thus for those users there are no regressions at all). Plus it is future-proof that way

@McPatate McPatate left a comment

Imo the code looks good, easier to read than the regexes!

Rocketknight1 commented Feb 20, 2025

cc @Cyrilvallez done! There are now two versions of the GPTSan/GPTJNeoXJapanese regex, with the version check sketched after the list:

  1. A Python >= 3.11 version that is exactly equivalent to the original, but not exploitable with malicious input. It's also much faster.
  2. A Python < 3.11 version that has very slightly different behaviour that accepts "weird" price strings that usually do not occur in real text, but is otherwise identical. This version is also not exploitable.

@Cyrilvallez Cyrilvallez left a comment

All right, LGTM! Thanks a lot for taking the time to tackle such an unwelcoming one!

@Rocketknight1 Rocketknight1 merged commit 92c5ca9 into main Feb 21, 2025
19 checks passed
@Rocketknight1 Rocketknight1 deleted the fix-nougat-regex branch February 21, 2025 19:49