proposal: regexp: Optimize fixed-length patterns #21463
Comments
CC @rsc |
It would be interesting to examine a large corpus of Go code (perhaps https://github.com/rsc/corpus), extract all the regular expressions (at least the constant ones), and categorize them as to whether this matcher would apply. |
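A minimal sketch of how such an extraction could look for a single Go source file, using the standard `go/parser` and `go/ast` packages. The helper name `constantPatterns` and the list of matched function names are assumptions made for illustration. It matches on the selector text `regexp.X` only, so it does no type checking and would miss renamed imports.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strconv"
)

// constantPatterns (hypothetical helper) returns the string-literal first
// arguments of calls such as regexp.MustCompile("...") found in src.
// It looks at syntax only: no type information, no constant folding.
func constantPatterns(src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "input.go", src, 0)
	if err != nil {
		return nil, err
	}
	// Functions whose first argument is the pattern (assumed list, not exhaustive).
	funcs := map[string]bool{"MustCompile": true, "Compile": true, "MatchString": true, "Match": true}

	var patterns []string
	ast.Inspect(file, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok || len(call.Args) == 0 {
			return true
		}
		sel, ok := call.Fun.(*ast.SelectorExpr)
		if !ok {
			return true
		}
		pkg, ok := sel.X.(*ast.Ident)
		if !ok || pkg.Name != "regexp" || !funcs[sel.Sel.Name] {
			return true
		}
		// Keep only plain string literals; "^" + prefix and similar are skipped.
		lit, ok := call.Args[0].(*ast.BasicLit)
		if !ok || lit.Kind != token.STRING {
			return true
		}
		if s, err := strconv.Unquote(lit.Value); err == nil {
			patterns = append(patterns, s)
		}
		return true
	})
	return patterns, nil
}

func main() {
	src := "package p\n" +
		"import \"regexp\"\n" +
		"var a = regexp.MustCompile(`a.ab$`)\n" +
		"var b = regexp.MustCompile(\"^\" + prefix)\n"
	patterns, err := constantPatterns(src)
	if err != nil {
		panic(err)
	}
	fmt.Println(patterns)
}
```

Only the first call is reported here, because the second argument is an expression rather than a literal.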
I looked in the BigQuery GitHub public dataset for constant regular expressions used with
13% of the time, the matcher applies. 65% of the time, it does not. Feel free to add comments or suggestions to the query, I can re-run it and update the results here. |
Great idea & thanks for the stats! Regarding BigQuery, I think we should also match calls to I'm going to update the repository with a more exhaustive list of supported regexps to make things clearer. |
I'm not aware of any theory making that possible. Can you please explain? |
I have updated the repository with a clearer table of supported patterns. @cznic I meant constant-time like |
I don't think strings.HasSuffix is O(1) either.
Nice chart, but the More importantly, theory needs to talk. Can you refer to a theory enabling/explaining constant time regexp matching? Don't get me wrong, none of the above means your contribution does not improve performance. |
I'm not sure if it is in theory, but I just added a benchmark for it and it does seem constant even with a 1 GB string. The actual function being used to read the input string for
Sorry if I missed something. I'm using the strings returned by I'm happy to change "constant" to something else if it's not actually the case but so far I haven't seen any linear behaviour (relative to the size of the input string) in the benchmarks for these simple patterns. |
The Normally, two parameters should be considered, length of the pattern (let's call it N) and length of the string (M). If the computation can take advantage of knowing the location of the end of the string in O(1) and perform the match from the end, it reduces to considering N only. But I'm not aware of any way to perform the match in O(1) wrt N only. Therefore, regardless of the complexity wrt M, the overall one cannot be constant time. Known theory says O(NM), which is O(N) for strings.HasSuffix. |
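To make the N versus M distinction concrete, here is a small sketch of a suffix check that counts its byte comparisons. The function `hasSuffixCount` is a hypothetical illustration, not how `strings.HasSuffix` is implemented. It relies on Go strings carrying their length, so locating the tail is O(1) and the loop runs at most N times.

```go
package main

import (
	"fmt"
	"strings"
)

// hasSuffixCount (hypothetical) reports whether s ends with suffix and how
// many byte comparisons were performed. The work is bounded by len(suffix),
// which is N, and does not depend on len(s), which is M.
func hasSuffixCount(s, suffix string) (ok bool, comparisons int) {
	if len(suffix) > len(s) {
		return false, 0
	}
	tail := s[len(s)-len(suffix):] // slicing is O(1); no scan of s
	for i := 0; i < len(suffix); i++ {
		comparisons++
		if tail[i] != suffix[i] {
			return false, comparisons
		}
	}
	return true, comparisons
}

func main() {
	for _, m := range []int{10, 1000, 1000000} {
		s := strings.Repeat("a", m) + "b"
		ok, n := hasSuffixCount(s, "aab")
		fmt.Println(m, ok, n)
	}
}
```

The comparison count stays at 3 for every input size, which is "constant relative to M" but still linear in N.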
@cznic OK I see what you mean. I did mean "constant" relative to the size of the input string only. I will add that clarification where needed and change |
I used this query to extract constant regular expressions used with Feel free to run an analysis on it. |
Thanks a lot @steren! I just added the dump to the tests, here are the stats:
33% is much higher than I expected! |
I have updated my repository to add support for all fixed-length unanchored patterns. Support on the GitHub dataset increased from 33.78% to 36.51% with overall performance improvements. It would even go up to 55% if/when we support captures. I am pretty excited by the fact that on some patterns, this implementation beats all the engines I have been able to benchmark, including Rust which usually performs best. I think implementing both this proposal and a proper DFA would make Go's regexp engine a very strong performer. |
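One way to decide whether a pattern is "fixed-length" is to walk the tree produced by the standard `regexp/syntax` parser. The sketch below is an assumption about how such a classifier could look, not the eligibility test used in the repository. The names `fixedLen` and `patternFixedLen` are hypothetical, lengths are counted in runes, and captures are treated as transparent for length purposes only.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// fixedLen (hypothetical) returns the number of runes every match of re
// must consume, and false if that number can vary.
func fixedLen(re *syntax.Regexp) (int, bool) {
	switch re.Op {
	case syntax.OpEmptyMatch, syntax.OpBeginLine, syntax.OpEndLine,
		syntax.OpBeginText, syntax.OpEndText,
		syntax.OpWordBoundary, syntax.OpNoWordBoundary:
		return 0, true // zero-width
	case syntax.OpLiteral:
		return len(re.Rune), true
	case syntax.OpCharClass, syntax.OpAnyCharNotNL, syntax.OpAnyChar:
		return 1, true
	case syntax.OpCapture:
		return fixedLen(re.Sub[0])
	case syntax.OpConcat:
		total := 0
		for _, sub := range re.Sub {
			n, ok := fixedLen(sub)
			if !ok {
				return 0, false
			}
			total += n
		}
		return total, true
	case syntax.OpAlternate:
		// Fixed only if every branch has the same length.
		first, ok := fixedLen(re.Sub[0])
		if !ok {
			return 0, false
		}
		for _, sub := range re.Sub[1:] {
			if n, ok := fixedLen(sub); !ok || n != first {
				return 0, false
			}
		}
		return first, true
	case syntax.OpRepeat:
		// x{n} is fixed; x{n,m} and x{n,} are not.
		if re.Min != re.Max {
			return 0, false
		}
		n, ok := fixedLen(re.Sub[0])
		if !ok {
			return 0, false
		}
		return n * re.Min, true
	}
	return 0, false // OpStar, OpPlus, OpQuest, OpNoMatch
}

// patternFixedLen parses pattern with Perl flags and classifies it.
func patternFixedLen(pattern string) (int, bool, error) {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return 0, false, err
	}
	n, ok := fixedLen(re)
	return n, ok, nil
}

func main() {
	for _, p := range []string{`a.ab$`, `(?:png|jpg)$`, `a+b`, `ab|c`} {
		n, ok, err := patternFixedLen(p)
		fmt.Println(p, n, ok, err)
	}
}
```

A classifier like this could be run over an extracted pattern dump to estimate how many patterns a fixed-length matcher might cover.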
Hi. Thanks for looking into the performance here. I would like to understand better how much of the improved performance can be had by appropriate adjustments to the existing matchers versus how much requires a whole new matcher. Every new matcher adds a cost, and honestly I think we have too many already. I don't really like how overengineered RE2 is, and I don't want Go's regexp package to end up the same way one matcher at a time. So I want to make sure we're doing all we can with existing matchers before adding a new one. I observe the following about the standard Go regexp package:
I think those would recover the bulk of the improvements shown in the benchmarks, without a new matcher. If you'd like to look into CLs making those changes, please go ahead. |
Thanks for your feedback @rsc! I agree it's fair to try improving the existing matchers before considering adding a new one. I still believe fixed-length patterns can benefit from optimizations that will require new matching methods, but let's indeed find out what the minimal amount of new code would be.
I don't see any difference in the benchmark in std between In your points 1 & 4 you mention optimizations for I will try to make a CL for backward matching of anchored patterns first. Should I open a separate issue for it? I have a couple questions for further CLs:
Thanks! |
re difference, looks like maybe I misread the benchmark output.
re .*, yes, what I described would only apply when . matches \n. That's kind of sad.
re backward matching, sure feel free to open a separate issue.
re InstOps, it would be nice to avoid them by default, since new InstOps invalidate other code using the Prog representation (that don't know how to handle them), but if they make a big difference it is probably OK.
re LiteralPrefix, the nice thing about the prefix is that it tells you where to start too, so the scanning is not wasted effort. Scanning for the other required literals might be OK but they're harder to extract from the Prog, and I try to avoid doing analyses directly on the Regexp because of all the syntax complications. Worth trying though. |
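For readers unfamiliar with the literal prefix mentioned above, the standard library exposes it through `(*regexp.Regexp).LiteralPrefix`. The short sketch below only prints what that method reports for a few patterns. The wrapper `literalPrefix` is a hypothetical helper added so the call is easy to reuse.

```go
package main

import (
	"fmt"
	"regexp"
)

// literalPrefix (hypothetical wrapper) compiles pattern and returns the
// literal string every match must start with, plus whether that literal
// is the entire pattern.
func literalPrefix(pattern string) (prefix string, complete bool, err error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return "", false, err
	}
	prefix, complete = re.LiteralPrefix()
	return prefix, complete, nil
}

func main() {
	for _, p := range []string{`abc.*def`, `abc`, `a.ab$`} {
		prefix, complete, err := literalPrefix(p)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%q prefix=%q complete=%v\n", p, prefix, complete)
	}
}
```

For `a.ab$` the reported prefix is just `a`, so the trailing literal `ab` is one of the "other required literals" that a prefix scan does not use.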
On hold for response (no hurry). |
Change https://golang.org/cl/353711 mentions this issue: |
Change https://golang.org/cl/377294 mentions this issue: |
Check whether a regex has any 'alt' instructions before rejecting it as one-pass. Previously `^abc` would run the backtrack matcher.

I tried to make the comment match what the code does now.

Updates #21463

```
name                            old time/op  new time/op  delta
Find-8                           167ns ± 1%   170ns ± 3%     ~     (p=0.500 n=5+5)
FindAllNoMatches-8              88.8ns ± 5%  87.3ns ± 0%     ~     (p=0.095 n=5+5)
FindString-8                     166ns ± 3%   164ns ± 0%     ~     (p=0.063 n=5+5)
FindSubmatch-8                   191ns ± 1%   191ns ± 0%     ~     (p=0.556 n=4+5)
FindStringSubmatch-8             183ns ± 0%   182ns ± 0%   -0.43%  (p=0.048 n=5+5)
Literal-8                       50.3ns ± 0%  50.1ns ± 0%   -0.40%  (p=0.016 n=5+4)
NotLiteral-8                     914ns ± 0%   927ns ± 7%     ~     (p=0.730 n=5+5)
MatchClass-8                    1.20µs ± 1%  1.22µs ± 6%     ~     (p=0.738 n=5+5)
MatchClass_InRange-8            1.20µs ± 6%  1.21µs ± 6%     ~     (p=0.548 n=5+5)
ReplaceAll-8                     796ns ± 0%   792ns ± 0%   -0.51%  (p=0.032 n=5+5)
AnchoredLiteralShortNonMatch-8  41.0ns ± 2%  34.2ns ± 2%  -16.47%  (p=0.008 n=5+5)
AnchoredLiteralLongNonMatch-8   53.3ns ± 0%  34.3ns ± 3%  -35.74%  (p=0.008 n=5+5)
AnchoredShortMatch-8            74.0ns ± 2%  75.8ns ± 0%   +2.46%  (p=0.032 n=5+4)
AnchoredLongMatch-8              146ns ± 3%    76ns ± 1%  -48.12%  (p=0.008 n=5+5)
OnePassShortA-8                  424ns ± 0%   423ns ± 0%     ~     (p=0.222 n=5+4)
NotOnePassShortA-8               373ns ± 1%   375ns ± 2%     ~     (p=0.690 n=5+5)
OnePassShortB-8                  315ns ± 2%   308ns ± 0%   -2.12%  (p=0.008 n=5+5)
NotOnePassShortB-8               244ns ± 3%   239ns ± 0%     ~     (p=0.476 n=5+5)
OnePassLongPrefix-8             61.6ns ± 2%  60.9ns ± 0%   -1.13%  (p=0.016 n=5+4)
OnePassLongNotPrefix-8           236ns ± 3%   230ns ± 0%     ~     (p=0.143 n=5+5)
```

Change-Id: I8a94b53bc761cd7ec89923c905ec8baaaa58a5fd
GitHub-Last-Rev: e9e0c29
GitHub-Pull-Request: #48748
Reviewed-on: https://go-review.googlesource.com/c/go/+/353711
Reviewed-by: Daniel Martí <[email protected]>
Reviewed-by: Russ Cox <[email protected]>
Auto-Submit: Ian Lance Taylor <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Auto-Submit: Ian Lance Taylor <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Commit-Queue: Ian Lance Taylor <[email protected]>
Reviewed-by: Brad Fitzpatrick <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
The `regexp` package has 3 different matchers (NFA, onepass, backtrack). One of them is selected depending on the pattern.

I suggest adding a 4th matcher optimized for fixed-length patterns like `a.ab$` on strings.

I wrote a proof-of-concept implementation which provides constant-time matching relative to the size of the input string in many cases, whereas the current matchers from `regexp` perform linearly. Performance becomes close to what can be achieved with methods like `strings.Index`.

Here is a sample benchmark result of `regexp.MatchString("a.ab$", strings.Repeat("a", M)+"b")`:

I've added more details, benchmarks and tests (with fuzzing!) to the repository. Not sure how much should be inlined here.
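To illustrate the idea (this is not the proof-of-concept implementation itself), here is a minimal sketch of matching an end-anchored fixed-length pattern by inspecting only the tail of the input. The function `matchFixedSuffix` is hypothetical and deliberately narrow: it assumes ASCII input, works on bytes rather than runes, and supports only literal bytes and `.` followed by an implicit `$`.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// matchFixedSuffix (hypothetical) reports whether s matches the pattern
// atoms+"$", where each byte of atoms is either a literal or '.', meaning
// any byte except '\n'. Only the last len(atoms) bytes of s are read.
func matchFixedSuffix(atoms, s string) bool {
	n := len(atoms)
	if len(s) < n {
		return false
	}
	tail := s[len(s)-n:]
	for i := 0; i < n; i++ {
		if atoms[i] == '.' {
			if tail[i] == '\n' {
				return false
			}
			continue
		}
		if atoms[i] != tail[i] {
			return false
		}
	}
	return true
}

func main() {
	re := regexp.MustCompile(`a.ab$`)
	inputs := []string{
		strings.Repeat("a", 1000) + "b",
		"axab",
		"aab",
		"a\nab",
	}
	for _, s := range inputs {
		// Print the sketch's answer next to the standard library's.
		fmt.Println(len(s), matchFixedSuffix("a.ab", s), re.MatchString(s))
	}
}
```

The loop runs four times for `a.ab$` whatever the input length, which is the behaviour the benchmark above is meant to show.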
The obvious cons of this proposal are:
The pros are:
`regexp` package is usually considered immature performance-wise. This proposal plays a small role in fixing that by adding optimizations that can reasonably be expected from the end-user.
`regexp.go`
`regexp.MatchString("(?:png|jpg)$")` could obviously be rewritten as `strings.HasSuffix("png") or strings.HasSuffix("jpg")` but sometimes it is not practical because the pattern to be matched is user-supplied or part of a long list of patterns. Examples include interactive log search or lists of paths in HTTP routers.

Feedback would be highly appreciated. Thanks!!