-
Notifications
You must be signed in to change notification settings - Fork 123
Description
This is perhaps an interesting case, but not a common one, to optimize.
An "empty-matching regex pattern" permits zero-length input to be matched. For example, a?b?c?
and a*b*c*
match a
, b
and c
in various forms in the input, but also matches "nothingness" (empty space between characters).
In fact, I bet that most grep users won't use these patterns much, if ever, because every line in the input is returned as a match anyway. With option -o
it has some use to find specific matches though, so there is a use case to optimize.
Ugrep has options to control the matching behavior when empty-matching patterns are used. Option -Y
permits empty matches, like GNU grep does. So when ugrep is used as a grep replacement (hard/soft link or alias), then option -Y
is enabled to emulate GNU grep. This is not efficient at all yet (at this time of writing) with ugrep to match such patterns.
Actually, GNU grep does kind-of strange things with empty-matching patterns. Again, it is not something that most people would worry about, but it's interesting nevertheless:
ggrep 'j?a?v?' Hello.java
<nothing>
but
ggrep 'j*a*v*' Hello.java
<everything>
And with option -o
:
ggrep -o 'j?a?v?' tests/Hello.java
<nothing>
but
ggrep -o 'j*a*v*' Hello.java
jav
a
a
a
a
a
v
a
a
For option -o
, GNU grep behaves like ugrep (without -Y
) to return matches, not everything.
To optimize the default ugrep behavior to reject empty matches for e.g. a*b*c*
to get actual matches, such as a
, b
, c
, bbccc
, aab
and so on, the pattern matching engine can be modified as if it is matching a pattern of length 1, not 0. This will capture the pattern when an a
, b
, or c
is found in the input while skipping over everything else.
The modification is possible in ugrep without too much effort and should improve matching speed for these kind of "empty-matching patterns". I will post a timing comparison using my dev version to demonstrate what we're talking about.