-
Notifications
You must be signed in to change notification settings - Fork 5.1k
Fix two auto-atomicity Regex bugs #74726
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
We walk concatenations in order to combine adjacent loops, e.g. `a+a+a+` becomes `a{3,}`. We also combine loops with individual items that are compatible, e.g. `a+ab` becomes `a{2,}b`. However, we're doing these operations on atomic loops as well, which is sometimes wrong. Since an atomic loop consumes as much as possible and never gives anything back, combining it with a subsequent loop will end up essentially ignoring any minimum specified in the latter loop. We thus can't combine atomic loops if the second loop has a minimum; this includes the case where the second "loop" is just an individual item.
We currently consider \w and \b non-overlapping, which allows a \w loop followed by a \b to be made atomic. The problem with this is that \b is zero-width, and it could be followed by something that does overlap with the \w. When matching at a location that is a word boundary, it is possible the first loop could give up something that matches the subsequent construct, and thus it can't be made atomic. (We could probably restrict this further to still allow atomicity when the first loop has a non-0 lower bound, but it doesn't appear to be worth the complication.)
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions Issue DetailsWe walk concatenations in order to combine adjacent loops, e.g. We currently consider \w and \b non-overlapping, which allows a \w loop followed by a \b to be made atomic. The problem with this is that \b is zero-width, and it could be followed by something that does overlap with the \w. When matching at a location that is a word boundary, it is possible the first loop could give up something that matches the subsequent construct, and thus it can't be made atomic. (We could probably restrict this further to still allow atomicity when the first loop has a non-0 lower bound, but it doesn't appear to be worth the complication.) We should backport these correctness fixes to both release/7.0 and release/6.0.
|
src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/Regex.Match.Tests.cs
Show resolved
Hide resolved
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs
Show resolved
Hide resolved
src/libraries/System.Text.RegularExpressions/tests/UnitTests/RegexReductionTests.cs
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@joperezr should review of course, as perhaps he can think of other variations
For context, I tried this PR out on the ~19K regexes in our corpus. A grand total of 9 were impacted by the \w\b fix (meaning 9 of the patterns had a loop that was being made atomic that with this fix is no longer being made atomic). When I changed that to only consider a non-0 lower bound, the 9 only dropped to 8. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good to me, thanks for fixing and thanks to @mathy-plutoflume and @Mankarse for the reports!
Just curious, did you happen to run the perf tests after the change? I believe I saw some cases that might be impacted so I was just wondering how much does it affect not doing auto-atomicity for those cases. Of course, correctness is more important than perf 😄
I didn't. Can you point to which you saw? I see one's that do |
I saw this one but I guess it's not exactly the same: https://github.com/dotnet/performance/blob/e18fa8744878207f770c54295b03bbdd38b71168/src/benchmarks/micro/libraries/System.Text.RegularExpressions/Perf.Regex.Industry.cs#L165 I also recall one test I saw which you wrote, (but not sure if it was a perf test) where for a given input you would create Regexes for each word found, and then count how many times that given word showed up in the text. I can't find that one but that was one that I thought might be something that could be impacted by this. |
Yeah, that one has
Yup, that's here: |
/backport to release/7.0 |
Started backporting to release/7.0: https://github.com/dotnet/runtime/actions/runs/2958561219 |
@stephentoub - Mathy from my company found this during our upgrade from .NetCore 3.1 to .Net 6.0. Is there any chance this could also get ported to |
@craig-plutoflume, can you share more details about the exact patterns you're using? Are they the exact ones from filed issue? This fix built upon some additional work that was done in .NET 7, namely the ability to look beyond the next construct in a pattern to determine whether a loop can be made atomic. Previously we could only ask "can this be made atomic based on just the next construct, yes or no", and now we have three buckets "yes, no, and it depends on what comes next after that". This fix depends on that third category. \w loops followed by \b are not uncommon, but \w loops followed by a \b followed by something else that overlaps with \w is rare, so this fix enables us to keep the optimization and be correct. Backporting just this fix to 6.0 would mean giving up the optimization entirely, which makes me nervous, as it could significantly regress real use. And backporting pieces of the previous .NET 7 improvements to be able to look beyond the next character is riskier. |
Note that as a workaround, you can defeat the faulty auto-atomicity operation by adding an empty positive lookahead after the loop in question, e.g. var stringToMatch = "sydney bogota berlin tokyo nairobi denver rio";
Console.WriteLine($"Input: {stringToMatch}");
GetMatches(@"(\b(?!bogo|nai)\w*(?=)\b)\w+", stringToMatch);
GetMatches(@"\w*(?=)\b\w+", stringToMatch);
void GetMatches(string expression, string stringToMatch)
{
var reg = new System.Text.RegularExpressions.Regex(expression);
Console.WriteLine($"\nOutput:");
foreach (var match in reg.Matches(stringToMatch))
{
Console.WriteLine(match);
}
} |
Thanks @stephentoub. For now we've worked around it by compiling the library that does our regex work in |
But still running on .NET 6? The surface area you target at build time won't affect this run-time behavior. |
@stephentoub - Thank you for that. It uncovered an issue in our test harness with the .net6 migration as our test harness showed this approach works, but after you comment we dug a little deeper and saw the error in the harness itself. You saved us a production incident! |
We walk concatenations in order to combine adjacent loops, e.g.
a+a+a+
becomesa{3,}
. We also combine loops with individual items that are compatible, e.g.a+ab
becomesa{2,}b
. However, we're doing these operations on atomic loops as well, which is sometimes wrong. Since an atomic loop consumes as much as possible and never gives anything back, combining it with a subsequent loop will end up essentially ignoring any minimum specified in the latter loop. We thus can't combine atomic loops if the second loop has a minimum; this includes the case where the second "loop" is just an individual item.Fixes #74686
We currently consider \w and \b non-overlapping, which allows a \w loop followed by a \b to be made atomic. The problem with this is that \b is zero-width, and it could be followed by something that does overlap with the \w. When matching at a location that is a word boundary, it is possible the first loop could give up something that matches the subsequent construct, and thus it can't be made atomic. (We could probably restrict this further to still allow atomicity when the first loop has a non-0 lower bound, but it doesn't appear to be worth the complication.)
Fixes #74722
We should backport these correctness fixes to both release/7.0 and release/6.0.