[release/7.0] Fix two auto-atomicity Regex bugs (#74834)
* Stop coalescing some adjacent Regex atomic loops
We walk concatenations in order to combine adjacent loops, e.g. `a+a+a+` becomes `a{3,}`. We also combine loops with individual items that are compatible, e.g. `a+ab` becomes `a{2,}b`. However, we're doing these operations on atomic loops as well, which is sometimes wrong. Since an atomic loop consumes as much as possible and never gives anything back, combining it with a subsequent loop will end up essentially ignoring any minimum specified in the latter loop. We thus can't combine atomic loops if the second loop has a minimum; this includes the case where the second "loop" is just an individual item.
* Fix auto-atomicity handling of \w and \b
We currently consider \w and \b non-overlapping, which allows a \w loop followed by a \b to be made atomic. The problem with this is that \b is zero-width, and it could be followed by something that does overlap with the \w. When matching at a location that is a word boundary, it is possible the first loop could give up something that matches the subsequent construct, and thus it can't be made atomic. (We could probably restrict this further to still allow atomicity when the first loop has a non-0 lower bound, but it doesn't appear to be worth the complication.)
* Add a few more tests
Co-authored-by: Stephen Toub <[email protected]>
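The first fix can be reproduced from user code by writing the atomic group explicitly, since `(?>...)` has the same never-give-back behavior the optimizer assigns to an atomic loop. The following is only a minimal sketch: the class name `AtomicCoalescingDemo` and its `Matches` helper are hypothetical and are not part of RegexNode.cs. It shows why rewriting an atomic `a+` followed by `a` into an atomic `a{2,}` changes the result.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class for illustration; not part of the commit.
public static class AtomicCoalescingDemo
{
    // Thin wrapper over Regex.IsMatch so the pattern reads first at the call site.
    public static bool Matches(string pattern, string input) => Regex.IsMatch(input, pattern);

    public static void Main()
    {
        const string input = "aaa";

        // Backtracking loop followed by a single item: the loop can give back one 'a'.
        Console.WriteLine(Matches("a+a", input));       // True

        // Atomic loop followed by the same item: the loop takes every 'a' and never
        // gives one back, so the trailing 'a' has nothing left to match.
        Console.WriteLine(Matches("(?>a+)a", input));   // False

        // What coalescing the atomic loop with the trailing item would produce.
        // The minimum of the second construct is folded into the loop, so this matches.
        Console.WriteLine(Matches("(?>a{2,})", input)); // True
    }
}
```

The second and third patterns disagree on the same input, which is why the coalescing cases for a loop followed by an individual item or string no longer list the atomic loop kinds.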
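The `\w` and `\b` fix can be illustrated the same way. This sketch assumes the pattern `\w*\b\w+` as a representative case (it is not taken from the commit's tests), and the class name `WordBoundaryAtomicityDemo` is hypothetical. It compares the backtracking loop with an explicitly atomic one to show what automatic atomicity would have cost here.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class for illustration; not part of the commit.
public static class WordBoundaryAtomicityDemo
{
    // Thin wrapper over Regex.IsMatch so the pattern reads first at the call site.
    public static bool Matches(string pattern, string input) => Regex.IsMatch(input, pattern);

    public static void Main()
    {
        const string input = "abc";

        // Backtracking loop: \w* first takes "abc", \b succeeds at the end of the word,
        // and \w+ fails. The loop then gives everything back, \b succeeds at the start
        // of the word, and \w+ consumes "abc".
        Console.WriteLine(Matches(@"\w*\b\w+", input));     // True

        // Atomic loop: \w* keeps whatever it consumed, so \w+ never has a word
        // character left to match, regardless of the starting position.
        Console.WriteLine(Matches(@"(?>\w*)\b\w+", input)); // False
    }
}
```

Because `\b` is zero-width, the construct after it (`\w+` here) can still overlap with the loop, so the loop has to remain able to backtrack.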
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs
32 additions & 23 deletions
@@ -1606,22 +1606,33 @@ static bool CanCombineCounts(int nodeMin, int nodeMax, int nextMin, int nextMax)
  // Coalescing a loop with its same type
  case RegexNodeKind.Oneloop or RegexNodeKind.Oneloopatomic or RegexNodeKind.Onelazy or RegexNodeKind.Notoneloop or RegexNodeKind.Notoneloopatomic or RegexNodeKind.Notonelazy when nextNode.Kind == currentNode.Kind && currentNode.Ch == nextNode.Ch:
  case RegexNodeKind.Setloop or RegexNodeKind.Setloopatomic or RegexNodeKind.Setlazy when nextNode.Kind == currentNode.Kind && currentNode.Str == nextNode.Str:
  // Coalescing a loop with an additional item of the same type
- case RegexNodeKind.Oneloop or RegexNodeKind.Oneloopatomic or RegexNodeKind.Onelazy when nextNode.Kind == RegexNodeKind.One && currentNode.Ch == nextNode.Ch:
- case RegexNodeKind.Notoneloop or RegexNodeKind.Notoneloopatomic or RegexNodeKind.Notonelazy when nextNode.Kind == RegexNodeKind.Notone && currentNode.Ch == nextNode.Ch:
- case RegexNodeKind.Setloop or RegexNodeKind.Setloopatomic or RegexNodeKind.Setlazy when nextNode.Kind == RegexNodeKind.Set && currentNode.Str == nextNode.Str:
+ case RegexNodeKind.Oneloop or RegexNodeKind.Onelazy when nextNode.Kind == RegexNodeKind.One && currentNode.Ch == nextNode.Ch:
+ case RegexNodeKind.Notoneloop or RegexNodeKind.Notonelazy when nextNode.Kind == RegexNodeKind.Notone && currentNode.Ch == nextNode.Ch:
+ case RegexNodeKind.Setloop or RegexNodeKind.Setlazy when nextNode.Kind == RegexNodeKind.Set && currentNode.Str == nextNode.Str:
@@ -1635,7 +1646,7 @@ static bool CanCombineCounts(int nodeMin, int nodeMax, int nextMin, int nextMax)
  break;

  // Coalescing a loop with a subsequent string
- case RegexNodeKind.Oneloop or RegexNodeKind.Oneloopatomic or RegexNodeKind.Onelazy when nextNode.Kind == RegexNodeKind.Multi && currentNode.Ch == nextNode.Str![0]:
+ case RegexNodeKind.Oneloop or RegexNodeKind.Onelazy when nextNode.Kind == RegexNodeKind.Multi && currentNode.Ch == nextNode.Str![0]:
  {
      // Determine how many of the multi's characters can be combined.
      // We already checked for the first, so we know it's at least one.
  case RegexNodeKind.Onelazy or RegexNodeKind.Oneloop or RegexNodeKind.Oneloopatomic when subsequent.M == 0 && node.Ch != subsequent.Ch:
  case RegexNodeKind.Notonelazy or RegexNodeKind.Notoneloop or RegexNodeKind.Notoneloopatomic when subsequent.M == 0 && node.Ch == subsequent.Ch:
  case RegexNodeKind.Setlazy or RegexNodeKind.Setloop or RegexNodeKind.Setloopatomic when subsequent.M == 0 && !RegexCharClass.CharInClass(node.Ch, subsequent.Str!):
  case RegexNodeKind.EndZ or RegexNodeKind.Eol when !RegexCharClass.CharInClass('\n', node.Str!):
- case RegexNodeKind.Boundary when node.Str is RegexCharClass.WordClass or RegexCharClass.DigitClass:
- case RegexNodeKind.NonBoundary when node.Str is RegexCharClass.NotWordClass or RegexCharClass.NotDigitClass:
- case RegexNodeKind.ECMABoundary when node.Str is RegexCharClass.ECMAWordClass or RegexCharClass.ECMADigitClass:
- case RegexNodeKind.NonECMABoundary when node.Str is RegexCharClass.NotECMAWordClass or RegexCharClass.NotDigitClass:
  return true;

  case RegexNodeKind.Onelazy or RegexNodeKind.Oneloop or RegexNodeKind.Oneloopatomic when subsequent.M == 0 && !RegexCharClass.CharInClass(subsequent.Ch, node.Str!):
  case RegexNodeKind.Setlazy or RegexNodeKind.Setloop or RegexNodeKind.Setloopatomic when subsequent.M == 0 && !RegexCharClass.MayOverlap(node.Str!, subsequent.Str!):
+ case RegexNodeKind.Boundary when node.Str is RegexCharClass.WordClass or RegexCharClass.DigitClass:
+ case RegexNodeKind.NonBoundary when node.Str is RegexCharClass.NotWordClass or RegexCharClass.NotDigitClass:
+ case RegexNodeKind.ECMABoundary when node.Str is RegexCharClass.ECMAWordClass or RegexCharClass.ECMADigitClass:
+ case RegexNodeKind.NonECMABoundary when node.Str is RegexCharClass.NotECMAWordClass or RegexCharClass.NotDigitClass:
  // The loop can be made atomic based on this subsequent node, but we'll need to evaluate the next one as well.
  // We only get here if the node could be made atomic based on subsequent but subsequent has a lower bound of zero
  // and thus we need to move subsequent to be the next node in sequence and loop around to try again.
- Debug.Assert(subsequent.Kind is RegexNodeKind.Oneloop or RegexNodeKind.Oneloopatomic or RegexNodeKind.Onelazy or RegexNodeKind.Notoneloop or RegexNodeKind.Notoneloopatomic or RegexNodeKind.Notonelazy or RegexNodeKind.Setloop or RegexNodeKind.Setloopatomic or RegexNodeKind.Setlazy);