Commit 50133cb

syntax: factor out common prefixes of alternations

It is generally quite subtle to reason clearly about how this actually helps things in a finite automata based regex engine, but this sort of factoring can lead to lots of improvements:

* We do use a bounded backtracker, so "pushing branches" down will help things there, just like it would with a classical backtracker.
* It may lead to better literal extraction due to the simpler regex. Whether prefix factoring is really to blame here is somewhat unclear, but some downstream optimizations are more brittle than others. For example, the "reverse inner" optimization requires examining a "top level" concatenation to find literals to search for. By factoring out a common prefix, we potentially expand the number of regexes that have a top-level concat. For example, `\wfoo|\wbar` has no top-level concat but `\w(?:foo|bar)` does.
* It should lead to faster matching even in finite automata oriented engines like the PikeVM, and also faster construction of DFAs (lazy or not). Namely, by pushing the branches down, we make it so they are visited less frequently, and thus the constant state shuffling caused by branches is reduced.

The prefix extraction could be better, as mentioned in the comments, but this is a good start.
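
To make the `\wfoo|\wbar` to `\w(?:foo|bar)` rewrite concrete, here is a minimal, self-contained sketch of the same idea on a toy expression type. The `Expr` enum, the `concat` helper, and `common_prefix_len` are hypothetical stand-ins invented for illustration; they are not the `Hir` API from regex-syntax, and the real routine is the `lift_common_prefix` function in the diff for this commit.

```rust
/// Toy stand-in for regex-syntax's `Hir`. Hypothetical, for illustration only.
#[derive(Clone, Debug, PartialEq, Eq)]
enum Expr {
    Empty,
    /// An opaque sub-expression such as `\w` or `foo`.
    Atom(&'static str),
    Concat(Vec<Expr>),
    Alt(Vec<Expr>),
}

/// Builds a concatenation, collapsing the empty and single-element cases.
fn concat(mut xs: Vec<Expr>) -> Expr {
    match xs.len() {
        0 => Expr::Empty,
        1 => xs.pop().unwrap(),
        _ => Expr::Concat(xs),
    }
}

/// Returns the length of the prefix shared by every branch, or `None` if
/// there are fewer than two branches, some branch is not a `Concat`, or the
/// shared prefix is empty.
fn common_prefix_len(branches: &[Expr]) -> Option<usize> {
    if branches.len() < 2 {
        return None;
    }
    let mut prefix: &[Expr] = match &branches[0] {
        Expr::Concat(xs) => xs.as_slice(),
        _ => return None,
    };
    for branch in &branches[1..] {
        let xs = match branch {
            Expr::Concat(xs) => xs,
            _ => return None,
        };
        let n = prefix.iter().zip(xs.iter()).take_while(|(x, y)| x == y).count();
        prefix = &prefix[..n];
    }
    if prefix.is_empty() { None } else { Some(prefix.len()) }
}

/// Rewrites `P A | P B` as `P (A | B)`. On failure, the branches are handed
/// back unchanged in `Err`, mirroring the shape of the function in the diff.
fn lift_common_prefix(branches: Vec<Expr>) -> Result<Expr, Vec<Expr>> {
    let len = match common_prefix_len(&branches) {
        Some(len) => len,
        None => return Err(branches),
    };
    let mut prefix = Vec::new();
    let mut suffixes = Vec::new();
    for branch in branches {
        let mut xs = match branch {
            Expr::Concat(xs) => xs,
            _ => unreachable!("common_prefix_len only accepts Concat branches"),
        };
        // `split_off` leaves the first `len` items in `xs` and returns the rest.
        suffixes.push(concat(xs.split_off(len)));
        if prefix.is_empty() {
            prefix = xs;
        }
    }
    prefix.push(Expr::Alt(suffixes));
    Ok(Expr::Concat(prefix))
}
```

In this toy model, the two branches `[\w, foo]` and `[\w, bar]` come back as a single top-level `Concat` holding `\w` followed by an `Alt` of `foo` and `bar`, which is the shape the "reverse inner" optimization is described as needing.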
1 parent 800e119 commit 50133cb

2 files changed: +105 -6 lines changed


regex-syntax/src/hir/mod.rs

+75-6
@@ -429,12 +429,14 @@ impl Hir {
             return new.pop().unwrap();
         }
         // Now that it's completely flattened, look for the special case of
-        // 'char1|char2|...|charN' and collapse that into a class. Note that we
-        // look for 'char' first and then bytes. The issue here is that if we
-        // find both non-ASCII codepoints and non-ASCII singleton bytes, then
-        // it isn't actually possible to smush them into a single class. So we
-        // look for all chars and then all bytes, and don't handle anything
-        // else.
+        // 'char1|char2|...|charN' and collapse that into a class. Note that
+        // we look for 'char' first and then bytes. The issue here is that if
+        // we find both non-ASCII codepoints and non-ASCII singleton bytes,
+        // then it isn't actually possible to smush them into a single class.
+        // (Because classes are either "all codepoints" or "all bytes." You
+        // cannot have a class that both matches non-ASCII but valid UTF-8 and
+        // invalid UTF-8.) So we look for all chars and then all bytes, and
+        // don't handle anything else.
         if let Some(singletons) = singleton_chars(&new) {
             let it = singletons
                 .into_iter()
@@ -455,6 +457,14 @@ impl Hir {
         if let Some(cls) = class_bytes(&new) {
             return Hir::class(cls);
         }
+        // Factor out a common prefix if we can, which might potentially
+        // simplify the expression and unlock other optimizations downstream.
+        // It also might generally make NFA matching and DFA construction
+        // faster by reducing the scope of branching in the regex.
+        new = match lift_common_prefix(new) {
+            Ok(hir) => return hir,
+            Err(unchanged) => unchanged,
+        };
         let props = Properties::alternation(&new);
         Hir { kind: HirKind::Alternation(new), props }
     }
@@ -2251,6 +2261,65 @@ fn singleton_bytes(hirs: &[Hir]) -> Option<Vec<u8>> {
     Some(singletons)
 }

+/// Looks for a common prefix in the list of alternation branches given. If one
+/// is found, then an equivalent but (hopefully) simplified Hir is returned.
+/// Otherwise, the original given list of branches is returned unmodified.
+///
+/// This is not quite as good as it could be. Right now, it requires that
+/// all branches are 'Concat' expressions. It also doesn't do well with
+/// literals. For example, given 'foofoo|foobar', it will not refactor it to
+/// 'foo(?:foo|bar)' because literals are flattened into their own special
+/// concatenation. (One wonders if perhaps 'Literal' should be a single atom
+/// instead of a string of bytes because of this. Otherwise, handling the
+/// current representation in this routine will be pretty gnarly. Sigh.)
+fn lift_common_prefix(hirs: Vec<Hir>) -> Result<Hir, Vec<Hir>> {
+    if hirs.len() <= 1 {
+        return Err(hirs);
+    }
+    let mut prefix = match hirs[0].kind() {
+        HirKind::Concat(ref xs) => &**xs,
+        _ => return Err(hirs),
+    };
+    if prefix.is_empty() {
+        return Err(hirs);
+    }
+    for h in hirs.iter().skip(1) {
+        let concat = match h.kind() {
+            HirKind::Concat(ref xs) => xs,
+            _ => return Err(hirs),
+        };
+        let common_len = prefix
+            .iter()
+            .zip(concat.iter())
+            .take_while(|(x, y)| x == y)
+            .count();
+        prefix = &prefix[..common_len];
+        if prefix.is_empty() {
+            return Err(hirs);
+        }
+    }
+    let len = prefix.len();
+    assert_ne!(0, len);
+    let mut prefix_concat = vec![];
+    let mut suffix_alts = vec![];
+    for h in hirs {
+        let mut concat = match h.into_kind() {
+            HirKind::Concat(xs) => xs,
+            // We required all sub-expressions to be
+            // concats above, so we're only here if we
+            // have a concat.
+            _ => unreachable!(),
+        };
+        suffix_alts.push(Hir::concat(concat.split_off(len)));
+        if prefix_concat.is_empty() {
+            prefix_concat = concat;
+        }
+    }
+    let mut concat = prefix_concat;
+    concat.push(Hir::alternation(suffix_alts));
+    Ok(Hir::concat(concat))
+}
+
 #[cfg(test)]
 mod tests {
     use super::*;
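
The doc comment on `lift_common_prefix` notes that `foofoo|foobar` is not rewritten to `foo(?:foo|bar)` because each literal is one flattened unit. As a rough illustration of what a literal-aware step might compute, the sketch below finds the longest common byte prefix of a list of literals and splits each one around it. This is not part of the commit: `split_literal_prefix` is a hypothetical helper working on plain byte slices, and because it works on bytes, the split point may fall in the middle of a multi-byte UTF-8 sequence, which a real implementation would have to account for.

```rust
/// Hypothetical helper, not from regex-syntax: returns the longest byte
/// prefix shared by all literals together with each literal's remaining
/// suffix, or `None` if fewer than two literals are given or they share
/// no prefix.
fn split_literal_prefix<'a>(
    lits: &[&'a [u8]],
) -> Option<(&'a [u8], Vec<&'a [u8]>)> {
    if lits.len() < 2 {
        return None;
    }
    let first: &'a [u8] = lits[0];
    let mut len = first.len();
    for &lit in &lits[1..] {
        // Shrink the candidate prefix to what this literal also starts with.
        len = first[..len]
            .iter()
            .zip(lit.iter())
            .take_while(|(a, b)| a == b)
            .count();
        if len == 0 {
            return None;
        }
    }
    let suffixes = lits.iter().map(|&lit| &lit[len..]).collect();
    Some((&first[..len], suffixes))
}
```

Given the literals `foofoo` and `foobar`, this yields the prefix `foo` and the suffixes `foo` and `bar`, which is the split that the rewrite to `foo(?:foo|bar)` would need.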

regex-syntax/src/hir/translate.rs

+30
@@ -3428,5 +3428,35 @@ mod tests {
             t("a|b|c|d|e|f|x|y|z"),
             hir_uclass(&[('a', 'f'), ('x', 'z')]),
         );
+        // Tests that we lift common prefixes out of an alternation.
+        assert_eq!(
+            t("[A-Z]foo|[A-Z]quux"),
+            hir_cat(vec![
+                hir_uclass(&[('A', 'Z')]),
+                hir_alt(vec![hir_lit("foo"), hir_lit("quux")]),
+            ]),
+        );
+        assert_eq!(
+            t("[A-Z][A-Z]|[A-Z]quux"),
+            hir_cat(vec![
+                hir_uclass(&[('A', 'Z')]),
+                hir_alt(vec![hir_uclass(&[('A', 'Z')]), hir_lit("quux")]),
+            ]),
+        );
+        assert_eq!(
+            t("[A-Z][A-Z]|[A-Z][A-Z]quux"),
+            hir_cat(vec![
+                hir_uclass(&[('A', 'Z')]),
+                hir_uclass(&[('A', 'Z')]),
+                hir_alt(vec![Hir::empty(), hir_lit("quux")]),
+            ]),
+        );
+        assert_eq!(
+            t("[A-Z]foo|[A-Z]foobar"),
+            hir_cat(vec![
+                hir_uclass(&[('A', 'Z')]),
+                hir_alt(vec![hir_lit("foo"), hir_lit("foobar")]),
+            ]),
+        );
     }
 }
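
The third test expects an empty alternative (`Hir::empty()`) because the common prefix consumes the whole first branch. The sketch below reproduces just that step with plain vectors of strings standing in for sub-expressions. `common_len` and `split_suffixes` are hypothetical names for illustration; only `Vec::split_off` and the `zip`/`take_while` idiom are taken from the code in this commit.

```rust
/// Length of the longest common prefix of two slices.
fn common_len<T: PartialEq>(a: &[T], b: &[T]) -> usize {
    a.iter().zip(b.iter()).take_while(|(x, y)| x == y).count()
}

/// Drops the first `len` items of every branch and returns what is left.
/// Panics if `len` exceeds a branch's length, as `Vec::split_off` does.
fn split_suffixes<T>(branches: Vec<Vec<T>>, len: usize) -> Vec<Vec<T>> {
    branches
        .into_iter()
        // `split_off(len)` keeps `[0, len)` in place and returns `[len, ..)`.
        .map(|mut branch| branch.split_off(len))
        .collect()
}
```

With the branches `["[A-Z]", "[A-Z]"]` and `["[A-Z]", "[A-Z]", "quux"]`, the common length is 2, so the first suffix is an empty vector and the second is `["quux"]`. Judging by the expected value in the test, `Hir::concat` turns that empty vector into `Hir::empty()`.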
