Description
@BurntSushi noticed that in Rust regex (rust-lang/regex#779):
(|a)*
matchingaaa
matches the entire text.(|a)+
matchingaaa
matches only the empty string at the start of the text.
This is a bit of an unlikely corner case, but it's certainly a significant violation of the rules of regexps for *
to match more than +
. The correct answer is for (|a)*
to match an empty string too, because each iteration prefers the empty string, so even an infinite number of those matches will never fall back to matching an a
. (This behavior is what Perl and PCRE do.)
Go's regexp package and RE2 have the same bug, which ultimately derives from a mismatch between the e*
picture in my first regexp article and the backtracking prioritization matching in the second article. The implementation of e*
needs to be a little different to get the match prioritization correct in the case where e
has a preferred empty match.
I found this mismatch between RE2 and Perl in the e*
case when running Glenn Fowler's regexp test cases against RE2 long ago, but at the time I thought the mismatch was unavoidable in the proper NFA approach. @BurntSushi's observation that e+
handles the empty match correctly proves otherwise. The correct fix is to leave the compilation of e+
alone and then compile e*
as (e+)?
. (I will add a note to the article as well.)
This is a tiny bug fix but may of course result in different matches for certain programs, causing problems for programs that rely on the current buggy behavior. That is always the case for any bug fix, of course. Fixing the bug only in Go would also make Go diverge from RE2 and Rust.
@junyer, @BurntSushi, and I discussed both the potential for breakage and the goal of keeping Go, RE2, and Rust aligned. We decided the potential breakage is minimal and outweighed by the benefits of fixing the bug, to better match Perl and PCRE, and to reestablish the properties that e*
never matches more than e+
and that e+
and ee*
are always the same. We agreed to make the change in all of our implementations.
/cc @charlesmunger and @adonovan for RE2/J as well.