Skip to content

Faster matching of "empty-matching" patterns (preview) #287

@genivia-inc

Description

@genivia-inc

This is perhaps an interesting case, but not a common one, to optimize.

An "empty-matching regex pattern" permits zero-length input to be matched. For example, a?b?c? and a*b*c* match a, b and c in various forms in the input, but also matches "nothingness" (empty space between characters).

In fact, I bet that most grep users won't use these patterns much, if ever, because every line in the input is returned as a match anyway. With option -o it has some use to find specific matches though, so there is a use case to optimize.

Ugrep has options to control the matching behavior when empty-matching patterns are used. Option -Y permits empty matches, like GNU grep does. So when ugrep is used as a grep replacement (hard/soft link or alias), then option -Y is enabled to emulate GNU grep. This is not efficient at all yet (at this time of writing) with ugrep to match such patterns.

Actually, GNU grep does kind-of strange things with empty-matching patterns. Again, it is not something that most people would worry about, but it's interesting nevertheless:

ggrep 'j?a?v?' Hello.java
<nothing>

but

ggrep 'j*a*v*' Hello.java
<everything>

And with option -o:

ggrep -o 'j?a?v?' tests/Hello.java
<nothing>

but

ggrep -o 'j*a*v*' Hello.java
jav
a
a
a
a
a
v
a
a

For option -o, GNU grep behaves like ugrep (without -Y) to return matches, not everything.

To optimize the default ugrep behavior to reject empty matches for e.g. a*b*c* to get actual matches, such as a, b, c, bbccc, aab and so on, the pattern matching engine can be modified as if it is matching a pattern of length 1, not 0. This will capture the pattern when an a, b, or c is found in the input while skipping over everything else.

The modification is possible in ugrep without too much effort and should improve matching speed for these kind of "empty-matching patterns". I will post a timing comparison using my dev version to demonstrate what we're talking about.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions