Regex part deux - INTERPOLATION_SYNTAX#669
Merged
radar merged 3 commits into ruby-i18n:master on Jun 21, 2023
From what I can see, this is done in linear time: 4*O(n)

This tokenizer change converts that to something a little quicker: 3*O(n)

Seems that not using a capture group and something other than split would be a big win. Other than that, the changes were meager.

I used https://regex101.com/ (and pcre2) to evaluate the cost of the TOKENIZER. I verified with cruby 3.0.6

```
/(%%\{[^\}]+\}|%\{[^\}]+\})/ =~ ('%{{'*9999)+'}'
/(%%\{[^\}]+\}|%\{[^\}]+\})/ ==> 129,990 steps
/(%?%\{[^\}]+\})/ ==> 129,990 steps
/(%%?\{[^\}]+\})/ ==> 99,992 steps (simple savings of 25%) <===
/(%%?\{[^%}{]+\})/ ==> 89,993 steps (limiting variable contents has minimal gains)
```

Also of note are the null/simple cases:

```
/x/ =~ ('%{{'*9999)+'}'
/x/ ==> 29,998 steps
/(x)/ ==> 59,996 steps
/%{x/ ==> 49,998 steps
/(%%?{x)/ ==> 89,993 steps
```

And comparing against a plain string of the same length:

```
/x/ =~ 'abb'*9999+'c'
/x/ ==> 29,999
/(%%?{x)/ ==> 59,998
/(%%?\{[^\}]+\})/ ==> 59,998
/(%%\{[^\}]+\}|%\{[^\}]+\})/ ==> 89,997
```

per ruby-i18n#667
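As a rough sanity check on the tokenizer change, the sketch below compares the original alternation with the shorter `%%?` form by splitting a few sample strings with each. The constant names `OLD_TOKENIZER` and `NEW_TOKENIZER`, the sample strings, and the helper `tokens_agree?` are made up for illustration; only the two patterns come from the measurements above.

```ruby
# Hypothetical constant names; the patterns are the ones measured above.
OLD_TOKENIZER = /(%%\{[^\}]+\}|%\{[^\}]+\})/
NEW_TOKENIZER = /(%%?\{[^\}]+\})/

# Illustrative inputs, not taken from the i18n test suite.
SAMPLES = [
  "Hello %{name}!",
  "Escaped %%{name} stays literal",
  "%{a}%{b} and %%{c}",
  "no placeholders here",
  "unterminated %{name",
  "100%% sure, %{who}"
].freeze

# True when both patterns split the string into the same token list.
def tokens_agree?(string)
  string.split(OLD_TOKENIZER) == string.split(NEW_TOKENIZER)
end

SAMPLES.each do |sample|
  verdict = tokens_agree?(sample) ? "same tokens" : "DIFFERENT tokens"
  puts "#{sample.inspect}: #{verdict}"
end
```

Agreement on a handful of samples is not a proof, but it is a cheap check that the cheaper pattern does not change what `split` returns.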
same as tokenizer change:
from regex101.com pcre2 debugger:
```ruby
/(%)?(%\{([^\}]+)\})/ =~ ('%{{'*9999)+'}'
/(%)?(%\{([^\}]+)\})/ ==> 199,984 steps
/(%%?)\{([^\}]+)\}/ ==> 129,989 steps
/(%%?\{[^\}]+\})/ ==> 99,992 steps
```
So the extra capture group is the main hit.
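The information carried by the extra capture groups (whether the token is escaped, and what the key is) can be recovered from a single captured token with plain string operations. The following is a minimal sketch of that idea, not necessarily what the commits do; `SINGLE_GROUP` and `describe_token` are hypothetical names.

```ruby
# Single-group pattern from the measurements above (hypothetical constant name).
SINGLE_GROUP = /(%%?\{[^\}]+\})/

# Hypothetical helper: classify one piece returned by String#split.
def describe_token(token)
  # Anything that is not a whole %{...} or %%{...} token is plain text.
  return { literal: token } unless token.match?(/\A%%?\{[^\}]+\}\z/)

  if token.start_with?("%%")
    { literal: token[1..-1] }     # "%%{name}" escapes the literal "%{name}"
  else
    { key: token[2..-2].to_sym }  # drop the leading "%{" and the trailing "}"
  end
end

pieces = "Hi %{name}, %%{raw}".split(SINGLE_GROUP).map { |t| describe_token(t) }
puts pieces.inspect
```

The trade-off is that the regex engine does less bookkeeping per match, while the Ruby side does a `start_with?` check and a slice for each token.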
Contributor
Author
Come to think of it, we may be able to skip using this regular expression at all or, if we keep it, skip the capture group altogether. But I feel this is way overkill, especially since we are in linear time.
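One way to drop the capture group, sketched here only as an illustration and not as the gem's implementation: `String#gsub` with a block receives the whole match, so nothing needs to be captured. `NO_CAPTURE` and `interpolate_sketch` are hypothetical names, and leaving unknown keys untouched is an assumption made for this example.

```ruby
# Same pattern as the tokenizer, without the capture group (hypothetical name).
NO_CAPTURE = /%%?\{[^}]+\}/

# Hypothetical helper: substitute %{key} from a hash, unescape %%{key}.
def interpolate_sketch(string, values)
  string.gsub(NO_CAPTURE) do |match|
    if match.start_with?("%%")
      match[1..-1]                  # "%%{raw}" becomes the literal "%{raw}"
    else
      key = match[2..-2].to_sym     # text between "%{" and "}"
      values.fetch(key) { match }   # assumption: unknown keys are left as-is
    end
  end
end

puts interpolate_sketch("Hi %{name}, %%{raw}, %{missing}", { name: "Ada" })
```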
Collaborator
Endless ranges such as `str[1..]` were introduced in Ruby 2.6. This is currently the earliest version of Ruby that i18n supports, so I think it is safe.
Collaborator
I like it! Simpler regular expressions will always get my vote.
Thanks for the great gem.
I was curious what I could do with `INTERPOLATION_SYNTAX`.

This has 3 commits.

- `INTERPOLATION_SYNTAX` with minor ruby changes.
- `INTERPOLATION_SYNTAX` and just using `TOKENIZER`.

I ran tests with ruby 2.6.9 and 3.0.6

Not sure when the syntax change for the substring, `str[1..]`, was introduced. rubocop suggested I change my `str[1..-1]` over to that.

They also said the backslash in `[^\}]` was not necessary.

Let me know if you would like to keep `INTERPOLATION_SYNTAX` and I can throw away the second commit. Or if you like it, I can squash the two.

Something was nice about the multiple capture groups in the regular expression, but I didn't feel the complexity (from pcre's perspective) bought too much. But since this is your project, it is your call.
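The two rubocop suggestions can be checked in isolation. This is a small sketch with made-up variable names; it only shows that the endless range and the unescaped brace behave the same as the forms they replace on Ruby 2.6 or newer.

```ruby
token = "%%{name}"

# Both forms drop the first character; the endless range needs Ruby 2.6+.
explicit_end = token[1..-1]
endless      = token[1..]

# Inside a character class "}" has no special meaning, so the backslash is optional.
escaped_class   = /%\{[^\}]+\}/
unescaped_class = /%\{[^}]+\}/

sample = "Hi %{name}!"
puts explicit_end == endless
puts sample[escaped_class] == sample[unescaped_class]
```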
Also in reference to #667
As I started to run numbers, I'm feeling less and less like this is a DoS. So maybe I'm not the right person to state an opinion on whether these changes are necessary.
From the commit messages
But that hasn't reached the `TOKENIZER` performance, so the second commit went with that one: