Skip to content

Fix two auto-atomicity Regex bugs #74726

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Aug 30, 2022

Conversation

stephentoub
Copy link
Member

We walk concatenations in order to combine adjacent loops, e.g. a+a+a+ becomes a{3,}. We also combine loops with individual items that are compatible, e.g. a+ab becomes a{2,}b. However, we're doing these operations on atomic loops as well, which is sometimes wrong. Since an atomic loop consumes as much as possible and never gives anything back, combining it with a subsequent loop will end up essentially ignoring any minimum specified in the latter loop. We thus can't combine atomic loops if the second loop has a minimum; this includes the case where the second "loop" is just an individual item.
Fixes #74686

We currently consider \w and \b non-overlapping, which allows a \w loop followed by a \b to be made atomic. The problem with this is that \b is zero-width, and it could be followed by something that does overlap with the \w. When matching at a location that is a word boundary, it is possible the first loop could give up something that matches the subsequent construct, and thus it can't be made atomic. (We could probably restrict this further to still allow atomicity when the first loop has a non-0 lower bound, but it doesn't appear to be worth the complication.)
Fixes #74722

We should backport these correctness fixes to both release/7.0 and release/6.0.

We walk concatenations in order to combine adjacent loops, e.g. `a+a+a+` becomes `a{3,}`.  We also combine loops with individual items that are compatible, e.g. `a+ab` becomes `a{2,}b`. However, we're doing these operations on atomic loops as well, which is sometimes wrong. Since an atomic loop consumes as much as possible and never gives anything back, combining it with a subsequent loop will end up essentially ignoring any minimum specified in the latter loop. We thus can't combine atomic loops if the second loop has a minimum; this includes the case where the second "loop" is just an individual item.
We currently consider \w and \b non-overlapping, which allows a \w loop followed by a \b to be made atomic.  The problem with this is that \b is zero-width, and it could be followed by something that does overlap with the \w. When matching at a location that is a word boundary, it is possible the first loop could give up something that matches the subsequent construct, and thus it can't be made atomic. (We could probably restrict this further to still allow atomicity when the first loop has a non-0 lower bound, but it doesn't appear to be worth the complication.)
@ghost
Copy link

ghost commented Aug 28, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

We walk concatenations in order to combine adjacent loops, e.g. a+a+a+ becomes a{3,}. We also combine loops with individual items that are compatible, e.g. a+ab becomes a{2,}b. However, we're doing these operations on atomic loops as well, which is sometimes wrong. Since an atomic loop consumes as much as possible and never gives anything back, combining it with a subsequent loop will end up essentially ignoring any minimum specified in the latter loop. We thus can't combine atomic loops if the second loop has a minimum; this includes the case where the second "loop" is just an individual item.
Fixes #74686

We currently consider \w and \b non-overlapping, which allows a \w loop followed by a \b to be made atomic. The problem with this is that \b is zero-width, and it could be followed by something that does overlap with the \w. When matching at a location that is a word boundary, it is possible the first loop could give up something that matches the subsequent construct, and thus it can't be made atomic. (We could probably restrict this further to still allow atomicity when the first loop has a non-0 lower bound, but it doesn't appear to be worth the complication.)
Fixes #74722

We should backport these correctness fixes to both release/7.0 and release/6.0.

Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: -

Copy link
Member

@danmoseley danmoseley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joperezr should review of course, as perhaps he can think of other variations

@stephentoub
Copy link
Member Author

(We could probably restrict this further to still allow atomicity when the first loop has a non-0 lower bound, but it doesn't appear to be worth the complication.)

For context, I tried this PR out on the ~19K regexes in our corpus. A grand total of 9 were impacted by the \w\b fix (meaning 9 of the patterns had a loop that was being made atomic that with this fix is no longer being made atomic). When I changed that to only consider a non-0 lower bound, the 9 only dropped to 8.

Copy link
Member

@joperezr joperezr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me, thanks for fixing and thanks to @mathy-plutoflume and @Mankarse for the reports!

Just curious, did you happen to run the perf tests after the change? I believe I saw some cases that might be impacted so I was just wondering how much does it affect not doing auto-atomicity for those cases. Of course, correctness is more important than perf 😄

@stephentoub
Copy link
Member Author

did you happen to run the perf tests after the change? I believe I saw some cases that might be impacted

I didn't. Can you point to which you saw? I see one's that do \b(\w+)\b, but those shouldn't be negatively impacted. I just added two more test cases to validate that.

@joperezr
Copy link
Member

I saw this one but I guess it's not exactly the same: https://github.com/dotnet/performance/blob/e18fa8744878207f770c54295b03bbdd38b71168/src/benchmarks/micro/libraries/System.Text.RegularExpressions/Perf.Regex.Industry.cs#L165

I also recall one test I saw which you wrote, (but not sure if it was a perf test) where for a given input you would create Regexes for each word found, and then count how many times that given word showed up in the text. I can't find that one but that was one that I thought might be something that could be impacted by this.

@stephentoub
Copy link
Member Author

stephentoub commented Aug 29, 2022

I saw this one but I guess it's not exactly the same:

Yeah, that one has \w+n, so it wouldn't have been made atomic anyway, since n is part of \w.

where for a given input you would create Regexes for each word found

Yup, that's here:
https://github.com/dotnet/performance/blob/e18fa8744878207f770c54295b03bbdd38b71168/src/benchmarks/micro/libraries/System.Text.RegularExpressions/Perf.Regex.Industry.cs#L108
That's using @"\b(\w+)\b", in which that loop will still be made atomic (and I added a test for that).

@stephentoub stephentoub merged commit d5e3a5c into dotnet:main Aug 30, 2022
@stephentoub stephentoub deleted the fixregextransforms branch August 30, 2022 18:17
@stephentoub
Copy link
Member Author

/backport to release/7.0

@github-actions
Copy link
Contributor

Started backporting to release/7.0: https://github.com/dotnet/runtime/actions/runs/2958561219

@craig-plutoflume
Copy link

@stephentoub - Mathy from my company found this during our upgrade from .NetCore 3.1 to .Net 6.0. Is there any chance this could also get ported to release/6.0? Without this fix we're blocked for migration and will have to figure something else out.

@stephentoub
Copy link
Member Author

@craig-plutoflume, can you share more details about the exact patterns you're using? Are they the exact ones from filed issue?

This fix built upon some additional work that was done in .NET 7, namely the ability to look beyond the next construct in a pattern to determine whether a loop can be made atomic. Previously we could only ask "can this be made atomic based on just the next construct, yes or no", and now we have three buckets "yes, no, and it depends on what comes next after that". This fix depends on that third category. \w loops followed by \b are not uncommon, but \w loops followed by a \b followed by something else that overlaps with \w is rare, so this fix enables us to keep the optimization and be correct. Backporting just this fix to 6.0 would mean giving up the optimization entirely, which makes me nervous, as it could significantly regress real use. And backporting pieces of the previous .NET 7 improvements to be able to look beyond the next character is riskier.

@stephentoub
Copy link
Member Author

Note that as a workaround, you can defeat the faulty auto-atomicity operation by adding an empty positive lookahead after the loop in question, e.g.

var stringToMatch = "sydney bogota berlin tokyo nairobi denver rio";
Console.WriteLine($"Input: {stringToMatch}");

GetMatches(@"(\b(?!bogo|nai)\w*(?=)\b)\w+", stringToMatch);
GetMatches(@"\w*(?=)\b\w+", stringToMatch);

void GetMatches(string expression, string stringToMatch)
{
    var reg = new System.Text.RegularExpressions.Regex(expression);

    Console.WriteLine($"\nOutput:");
    foreach (var match in reg.Matches(stringToMatch))
    {
        Console.WriteLine(match);
    }
}

@craig-plutoflume
Copy link

Thanks @stephentoub. For now we've worked around it by compiling the library that does our regex work in netstandard2 while we investigate the performance of your workaround above.

@stephentoub
Copy link
Member Author

For now we've worked around it by compiling the library that does our regex work in netstandard2

But still running on .NET 6? The surface area you target at build time won't affect this run-time behavior.

@craig-plutoflume
Copy link

@stephentoub - Thank you for that. It uncovered an issue in our test harness with the .net6 migration as our test harness showed this approach works, but after you comment we dug a little deeper and saw the error in the harness itself. You saved us a production incident!

@ghost ghost locked as resolved and limited conversation to collaborators Oct 9, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect behaviour of regex atomic groups Regex auto-atomicity too aggressive about \w and \b
4 participants