Conversation

@Rocketknight1 Rocketknight1 commented Feb 10, 2025

I tried OpenHands on these but the result was generally a non-equivalent regex, even after a lot of loops of testing/iteration, so I rewrote these myself!

The Nougat regex is designed to detect lines that look like flattened bullet-point lists. To avoid the blowup, I rewrote the whole function containing it so that it stops trying to do everything inside one regex, which was unnecessary anyway: the regex always matched an entire line from ^ to $, so we can replace it by splitting the input into lines, testing each line with a couple of simple checks, and then extracting the relevant patterns if needed.
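
For illustration, here's a minimal sketch of that line-splitting approach. The helper name and the per-line bullet test are placeholders (the real check also handles roman-numeral markers), not the merged code:

import re

# Placeholder per-line test standing in for the real bullet-list check.
BULLET_LINE = re.compile(r"^\s*(?:[-*•]|[ivxIVX]+[.)])\s+\S")

def find_bullet_lines(text):
    matches = []
    for line in text.split("\n"):
        # Each line is tested on its own, so no single regex ever scans
        # (or backtracks across) the whole document.
        if BULLET_LINE.match(line):
            matches.append(line)
    return matches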

The GPTJNeoXJapanese/GPTSan regex is designed to detect prices in Japanese text and replace them with a <PRICE> token. It uses complex matching to detect numbers that may contain commas (like 100,000) followed by Japanese count markers or characters that indicate yen pricing, added tax, and so on. However, because so many components of the regex were optional (thanks to ? or *), in practice the block matched just about any combination of digits and commas followed by price-related characters. The blowup was avoided by replacing the dangerous block with a much simpler regex that makes that explicit!
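
To make the failure mode concrete, here's a self-contained demo of this kind of blowup. The pattern below is just the classic nested-quantifier shape dressed up as a toy price matcher - it is not the actual tokenizer regex:

import re
import time

# Nested quantifiers over the same characters: every way of splitting the
# digit run between iterations of the group is a separate backtracking path.
evil = re.compile(r"(\d+,?)+円")

payload = "1" * 25 + "!"  # all digits, then a character that makes 円 fail

start = time.perf_counter()
evil.match(payload)  # returns None, but only after ~2**24 backtracking paths
print(f"took {time.perf_counter() - start:.1f}s")  # roughly doubles per extra digit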

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Rocketknight1 Rocketknight1 marked this pull request as ready for review February 11, 2025 18:26
@Rocketknight1 Rocketknight1 changed the title from "Fix potential regex catastrophic backtracking in NougatTokenizerFast" to "Fix exploitable regexes in Nougat and GPTSan/GPTJNeoXJapanese" on Feb 11, 2025

Rocketknight1 commented Feb 12, 2025

cc @ArthurZucker @Cyrilvallez for core maintainer review (but I realize that reading regexes is very hard, lol - let me know if you want me to add equivalence tests or something instead)

@Rocketknight1

Review update: I pinged @McPatate, but this is quite a hard PR to review. I have done some local testing with scripts to generate demo examples and the results seem equivalent for both the old and new tokenizer, and I ran slow tests locally and they all still pass. I can add something like an "equivalence test" for the old vs. new regex to the tests, but this is a weird thing to add to the CI because it shouldn't regress - so we really only need to test it once, when adding the PR!

@Rocketknight1

cc @NielsRogge in case he has any ideas on how to test the nougat changes 😅

Rocketknight1 commented Feb 13, 2025

With Claude's help I wrote this script to compare the old and new Japanese price regexes:

import re

def generate_test_cases():
    # Build parts for the candidate currency strings
    numbers = ["0", "1", "12", "123", "1,234", "12,345", "123,456"]
    jp_units = ["", "億", "万", "千"]
    
    # Build components by concatenating numbers with possible units
    components = [num + unit for num in numbers for unit in jp_units if num or unit]
    
    # Generate prefixes (empty, single component, or two components)
    prefixes = set()
    prefixes.add("")
    for comp in components:
        prefixes.add(comp)
    for comp1 in components:
        for comp2 in components:
            prefixes.add(comp1 + comp2)
    
    currency_units = [
        "千円", "万円", "千万円", "円",
        "千ドル", "万ドル", "千万ドル", "ドル",
        "千ユーロ", "万ユーロ", "千万ユーロ", "ユーロ"
    ]
    
    tax_suffixes = [
        "", "(税込)", "(税抜)", "+tax",
        "(税込)(税抜)", "(税込)+tax", "(税抜)+tax", "+tax(税込)"
    ]
    
    test_cases = []
    
    # Build valid test cases
    for prefix in sorted(prefixes):
        for cur in currency_units:
            for tax in tax_suffixes:
                test_cases.append(prefix + cur + tax)
    
    # Add some invalid test cases
    neg_tests = [
        "abc",
        "123abc",
        "1,23,456円",   # wrong comma grouping
        "12,34円",
        "1,234,56円",   # invalid comma grouping
        "123,45,678円", # invalid grouping
        "1,2345円",     # missing proper comma placement
    ]
    test_cases.extend(neg_tests)
    
    return test_cases

def main():
    # Original (exploitable) regex from the tokenizer
    regex1_str = r"((0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)*億)*((0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)*万)*((0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)*千)*(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)*(千円|万円|千万円|円|千ドル|万ドル|千万ドル|ドル|千ユーロ|万ユーロ|千万ユーロ|ユーロ)+(\(税込\)|\(税抜\)|\+tax)*"

    # New, simplified replacement
    regex2_str = (
            r"(?:\d,\d{3}|[\d億万千])*"
            r"(?:千円|万円|千万円|円|千ドル|万ドル|千万ドル|ドル|千ユーロ|万ユーロ|千万ユーロ|ユーロ)+"
            r"(?:\(税込\)|\(税抜\)|\+tax)*"
    )

    re1 = re.compile(regex1_str)
    re2 = re.compile(regex2_str)

    test_cases = generate_test_cases()
    
    total = 0
    mismatches = 0
    print(f"Testing {len(test_cases)} cases...\n")
    
    for test_str in test_cases:
        total += 1
        m1 = re1.fullmatch(test_str)
        m2 = re2.fullmatch(test_str)
        
        if (m1 is None) != (m2 is None):
            print(f"Mismatch in matching result for test case: {test_str!r}")
            print("   regex1 matched:" , m1 is not None)
            print("   regex2 matched:" , m2 is not None)
            mismatches += 1
        elif m1 is not None and m2 is not None:
            if m1.span() != m2.span():
                print(f"Mismatch in match span for test case: {test_str!r}")
                print("   regex1 span:", m1.span())
                print("   regex2 span:", m2.span())
                mismatches += 1
            else:
                print(f"OK: {test_str!r}")
        else:
            print(f"OK (no match): {test_str!r}")
    
    print(f"\nTotal test cases: {total}, mismatches: {mismatches}")
    if mismatches == 0:
        print("All tests passed: both regexes match the same spans!")
    else:
        print("Some mismatches were found; please review the differences.")

if __name__ == '__main__':
    main()

The conclusion is that both regexes match on most real price strings, but the new regex also matches unusual/incorrect orderings like 1万12億千円 (this is equivalent to saying something like "sixty and one hundred" in English instead of "one hundred and sixty" - it's comprehensible but the ordering is definitely weird). I couldn't find a way to fix it without using atomic/possessive regex features that were only added in Python 3.11.

However, I think this is still acceptable - for almost all "real" strings, both regexes match, and I don't think it's wrong to match these "unusual" prices if they actually appear in the text.
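
For reference, this is the Python >= 3.11 feature in question: a possessive quantifier never gives back what it has consumed, which removes the backtracking paths entirely. A toy demo on the same shape as above, not the merged pattern:

import re  # possessive quantifiers require Python >= 3.11

# ++ is possessive: once the digits are consumed they are never re-split,
# so a failing input is rejected in linear time instead of exponential.
safe = re.compile(r"(\d++,?)++円")
assert safe.match("1" * 25 + "!") is None  # returns essentially instantly
assert safe.match("1,234円")               # real prices still match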

@Rocketknight1

I got Claude to write a similar comparison script for the old/new versions of the Nougat regex. In that case, there were no differences at all in the function output, so I think we're good 👍
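
That script isn't reproduced here, but the harness amounts to fuzzing both versions of the function with the same inputs and asserting identical outputs - roughly this sketch (the function arguments are placeholders):

import random
import string

def compare_implementations(old_fn, new_fn, n_cases=10_000):
    # Feed both versions identical random inputs and demand identical outputs.
    for _ in range(n_cases):
        text = "".join(random.choices(string.printable, k=random.randint(0, 200)))
        assert old_fn(text) == new_fn(text), f"divergence on input: {text!r}"
    print(f"{n_cases} random inputs, outputs identical")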

@Rocketknight1

cc @ArthurZucker @Cyrilvallez whenever you get a chance - I've done the testing and I'm happy that the outputs are good enough at this point!

openhands-agent and others added 9 commits February 13, 2025 18:19

The original regex pattern in tokenization_nougat_fast.py was vulnerable to
catastrophic backtracking due to greedy quantifiers and nested alternations.
This commit replaces it with a more efficient pattern that:

1. Uses explicit character classes instead of dot (.)
2. Handles whitespace more precisely
3. Avoids unnecessary backtracking
4. Supports both lowercase and uppercase roman numerals
5. Maintains the same functionality while being more robust
@Cyrilvallez

Oh wow, this is indeed extremely hard to read - thanks for working on it, must not have been an easy one! 😅 You mentioned that for 3.11 and later we could get a fully matching regex. IMO it would be best to dynamically choose between one regex or the other based on the current running version, as most people are using 3.11 and later (thus for those users there are no regressions at all). Plus it is future-proof that way

@McPatate McPatate left a comment

Imo the code looks good, easier to read than the regexes!

Rocketknight1 commented Feb 20, 2025

cc @Cyrilvallez done! There are now two versions of the GPTSan/GPTJNeoXJapanese regex, with the version check sketched after the list:

  1. A Python >= 3.11 version that is exactly equivalent to the original, but not exploitable with malicious input. It's also much faster.
  2. A Python < 3.11 version that has very slightly different behaviour that accepts "weird" price strings that usually do not occur in real text, but is otherwise identical. This version is also not exploitable.

@Cyrilvallez Cyrilvallez left a comment

All right, LGTM! Thanks a lot for taking the time to tackle such an unwelcoming one!

@Rocketknight1 Rocketknight1 merged commit 92c5ca9 into main Feb 21, 2025
19 checks passed
@Rocketknight1 Rocketknight1 deleted the fix-nougat-regex branch February 21, 2025 19:49