Commit c09d9e0

syntax: make Unicode completely optional
This commit refactors the way this library handles Unicode data by making it completely optional. Several features are introduced which permit callers to select only the Unicode data they need (up to a point of granularity).

An important property of these changes is that the presence or absence of crate features will never change the match semantics of a regular expression. Instead, the presence or absence of a crate feature can only add to or subtract from the set of all possible valid regular expressions. So for example, if the `unicode-case` feature is disabled, then attempting to produce `Hir` for the regex `(?i)a` will fail. Instead, callers must use `(?i-u)a` (or enable the `unicode-case` feature).

This partially addresses #583 since it permits callers to decrease binary size.
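The reason for failing rather than silently falling back to ASCII-only folding can be shown with a small self-contained model. This is a sketch, not the crate's code: `TranslateError`, `case_insensitive_literal`, and the `have_unicode_case` flag (a runtime stand-in for the compile-time `unicode-case` feature) are hypothetical names, and the "table" covers only one Unicode simple case folding entry, the Kelvin sign U+212A, which folds to `k`.

```rust
/// Hypothetical error, standing in for a failed `Ast` -> `Hir` translation.
#[derive(Debug, PartialEq)]
enum TranslateError {
    UnicodeCaseUnavailable,
}

/// Returns the sorted set of characters that a case-insensitive literal
/// would match. `unicode` models the `u` flag; `have_unicode_case` models
/// whether the `unicode-case` feature was compiled in (an assumption of
/// this sketch; the real check happens at compile time).
fn case_insensitive_literal(
    c: char,
    unicode: bool,
    have_unicode_case: bool,
) -> Result<Vec<char>, TranslateError> {
    // ASCII folding needs no data tables, so `(?i-u)` always succeeds.
    let mut out = vec![c.to_ascii_uppercase(), c.to_ascii_lowercase()];
    if unicode {
        if !have_unicode_case {
            // Falling back to ASCII folding here would silently change
            // what the pattern matches, so the sketch reports an error.
            return Err(TranslateError::UnicodeCaseUnavailable);
        }
        // Toy folding table: only the Kelvin sign is modeled.
        if c.eq_ignore_ascii_case(&'k') {
            out.push('\u{212A}');
        }
    }
    out.sort();
    out.dedup();
    Ok(out)
}
```

In this model, `(?i)k` with the tables present matches three characters, `(?i-u)k` matches two, and `(?i)k` without the tables is rejected instead of quietly behaving like `(?i-u)k`.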
1 parent 98a7337 commit c09d9e0

15 files changed, +1381 -246 lines changed
README.md

+1-1
@@ -7,7 +7,7 @@ linear time with respect to the size of the regular expression and search text.
 Much of the syntax and implementation is inspired
 by [RE2](https://github.com/google/re2).
 
-[![Build Status](https://travis-ci.com/rust-lang/regex.svg?branch=master)](https://travis-ci.com/rust-lang/regex)
+[![Build status](https://travis-ci.com/rust-lang/regex.svg?branch=master)](https://travis-ci.com/rust-lang/regex)
 [![Build status](https://ci.appveyor.com/api/projects/status/github/rust-lang/regex?svg=true)](https://ci.appveyor.com/project/rust-lang-libs/regex)
 [![Coverage Status](https://coveralls.io/repos/github/rust-lang/regex/badge.svg?branch=master)](https://coveralls.io/github/rust-lang/regex?branch=master)
 [![](https://meritbadge.herokuapp.com/regex)](https://crates.io/crates/regex)

ci/script.sh

+8-1
@@ -1,5 +1,7 @@
 #!/bin/sh
 
+# vim: tabstop=2 shiftwidth=2 softtabstop=2
+
 # This is the main CI script for testing the regex crate and its sub-crates.
 
 set -ex
@@ -42,8 +44,13 @@ RUST_REGEX_RANDOM_TEST=1 \
   ci/run-shootout-test
 
 # Run tests on regex-syntax crate.
-cargo test --verbose --manifest-path regex-syntax/Cargo.toml
 cargo doc --verbose --manifest-path regex-syntax/Cargo.toml
+# Only run the full test suite on one job, to conserve resources.
+if [ "$TRAVIS_RUST_VERSION" = "stable" ]; then
+  (cd regex-syntax && ./test)
+else
+  cargo test --verbose --manifest-path regex-syntax/Cargo.toml
+fi
 
 # Run tests on regex-capi crate.
 ci/test-regex-capi

regex-syntax/Cargo.toml

+22
@@ -8,3 +8,25 @@ documentation = "https://docs.rs/regex-syntax"
 homepage = "https://github.com/rust-lang/regex"
 description = "A regular expression parser."
 workspace = ".."
+
+# Features are documented in the "Crate features" section of the crate docs:
+# https://docs.rs/regex-syntax/*/#crate-features
+[features]
+default = ["unicode"]
+
+unicode = [
+  "unicode-age",
+  "unicode-bool",
+  "unicode-case",
+  "unicode-gencat",
+  "unicode-perl",
+  "unicode-script",
+  "unicode-segment",
+]
+unicode-age = []
+unicode-bool = []
+unicode-case = []
+unicode-gencat = []
+unicode-perl = []
+unicode-script = []
+unicode-segment = []

regex-syntax/README.md

+82
@@ -0,0 +1,82 @@
+regex-syntax
+============
+This crate provides a robust regular expression parser.
+
+[![Build status](https://travis-ci.com/rust-lang/regex.svg?branch=master)](https://travis-ci.com/rust-lang/regex)
+[![Build status](https://ci.appveyor.com/api/projects/status/github/rust-lang/regex?svg=true)](https://ci.appveyor.com/project/rust-lang-libs/regex)
+[![](https://meritbadge.herokuapp.com/regex-syntax)](https://crates.io/crates/regex-syntax)
+[![Rust](https://img.shields.io/badge/rust-1.28.0%2B-blue.svg?maxAge=3600)](https://github.com/rust-lang/regex)
+
+
+### Documentation
+
+https://docs.rs/regex-syntax
+
+
+### Overview
+
+There are two primary types exported by this crate: `Ast` and `Hir`. The former
+is a faithful abstract syntax of a regular expression, and can convert regular
+expressions back to their concrete syntax while mostly preserving its original
+form. The latter type is a high level intermediate representation of a regular
+expression that is amenable to analysis and compilation into byte codes or
+automata. An `Hir` achieves this by drastically simplifying the syntactic
+structure of the regular expression. While an `Hir` can be converted back to
+its equivalent concrete syntax, the result is unlikely to resemble the original
+concrete syntax that produced the `Hir`.
+
+
+### Example
+
+This example shows how to parse a pattern string into its HIR:
+
+```rust
+use regex_syntax::Parser;
+use regex_syntax::hir::{self, Hir};
+
+let hir = Parser::new().parse("a|b").unwrap();
+assert_eq!(hir, Hir::alternation(vec![
+    Hir::literal(hir::Literal::Unicode('a')),
+    Hir::literal(hir::Literal::Unicode('b')),
+]));
+```
+
+
+### Crate features
+
+By default, this crate bundles a fairly large amount of Unicode data tables
+(a source size of ~750KB). Because of their large size, one can disable some
+or all of these data tables. If a regular expression attempts to use Unicode
+data that is not available, then an error will occur when translating the `Ast`
+to the `Hir`.
+
+The full set of features one can disable are
+[in the "Crate features" section of the documentation](https://docs.rs/regex-syntax/*/#crate-features).
+
+
+### Testing
+
+Simply running `cargo test` will give you very good coverage. However, because
+of the large number of features exposed by this crate, a `test` script is
+included in this directory which will test several feature combinations. This
+is the same script that is run in CI.
+
+
+### Motivation
+
+The primary purpose of this crate is to provide the parser used by `regex`.
+Specifically, this crate is treated as an implementation detail of the `regex`,
+and is primarily developed for the needs of `regex`.
+
+Since this crate is an implementation detail of `regex`, it may experience
+breaking change releases at a different cadence from `regex`. This is only
+possible because this crate is _not_ a public dependency of `regex`.
+
+Another consequence of this de-coupling is that there is no direct way to
+compile a `regex::Regex` from a `regex_syntax::hir::Hir`. Instead, one must
+first convert the `Hir` to a string (via its `std::fmt::Display`) and then
+compile that via `Regex::new`. While this does repeat some work, compilation
+typically takes much longer than parsing.
+
+Stated differently, the coupling between `regex` and `regex-syntax` exists only
+at the level of the concrete syntax.

regex-syntax/src/hir/interval.rs

+17-4
@@ -4,6 +4,8 @@ use std::fmt::Debug;
 use std::slice;
 use std::u8;
 
+use unicode;
+
 // This module contains an *internal* implementation of interval sets.
 //
 // The primary invariant that interval sets guards is canonical ordering. That
@@ -14,7 +16,8 @@ use std::u8;
 //
 // Since case folding (as implemented below) breaks that invariant, we roll
 // that into this API even though it is a little out of place in an otherwise
-// generic interval set.
+// generic interval set. (Hence the reason why the `unicode` module is imported
+// here.)
 //
 // Some of the implementation complexity here is a result of me wanting to
 // preserve the sequential representation without using additional memory.
@@ -72,13 +75,20 @@ impl<I: Interval> IntervalSet<I> {
     /// characters. For example, if this class consists of the range `a-z`,
     /// then applying case folding will result in the class containing both the
     /// ranges `a-z` and `A-Z`.
-    pub fn case_fold_simple(&mut self) {
+    ///
+    /// This returns an error if the necessary case mapping data is not
+    /// available.
+    pub fn case_fold_simple(&mut self) -> Result<(), unicode::CaseFoldError> {
         let len = self.ranges.len();
         for i in 0..len {
             let range = self.ranges[i];
-            range.case_fold_simple(&mut self.ranges);
+            if let Err(err) = range.case_fold_simple(&mut self.ranges) {
+                self.canonicalize();
+                return Err(err);
+            }
         }
         self.canonicalize();
+        Ok(())
     }
 
     /// Union this set with the given set, in place.
@@ -331,7 +341,10 @@ pub trait Interval:
     fn upper(&self) -> Self::Bound;
     fn set_lower(&mut self, bound: Self::Bound);
     fn set_upper(&mut self, bound: Self::Bound);
-    fn case_fold_simple(&self, intervals: &mut Vec<Self>);
+    fn case_fold_simple(
+        &self,
+        intervals: &mut Vec<Self>,
+    ) -> Result<(), unicode::CaseFoldError>;
 
     /// Create a new interval.
     fn create(lower: Self::Bound, upper: Self::Bound) -> Self {
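
The `interval.rs` change calls `canonicalize()` before returning the error, because ranges appended by earlier iterations would otherwise leave the set unsorted. The following std-only sketch models that pattern. It is an illustration under stated assumptions, not the crate's implementation: `Range`, the `tables_available` flag, and the rule that only ranges touching bytes `>= 0x80` need tables are all hypothetical, chosen so that a failure can happen partway through the loop.

```rust
/// Hypothetical stand-in for `unicode::CaseFoldError`.
#[derive(Debug, PartialEq)]
struct CaseFoldError;

/// A closed byte range; ordering is by `lo`, then `hi`.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Range {
    lo: u8,
    hi: u8,
}

impl Range {
    /// Appends the other-case ASCII counterpart of this range to `out`.
    /// Toy rule (an assumption of this sketch): ranges reaching past ASCII
    /// need tables, and fail when `tables_available` is false.
    fn case_fold_simple(
        &self,
        out: &mut Vec<Range>,
        tables_available: bool,
    ) -> Result<(), CaseFoldError> {
        if self.hi >= 0x80 && !tables_available {
            return Err(CaseFoldError);
        }
        // Overlap with a-z maps to upper case (subtract 32).
        let (lo, hi) = (self.lo.max(b'a'), self.hi.min(b'z'));
        if lo <= hi {
            out.push(Range { lo: lo - 32, hi: hi - 32 });
        }
        // Overlap with A-Z maps to lower case (add 32).
        let (lo, hi) = (self.lo.max(b'A'), self.hi.min(b'Z'));
        if lo <= hi {
            out.push(Range { lo: lo + 32, hi: hi + 32 });
        }
        Ok(())
    }
}

struct IntervalSet {
    ranges: Vec<Range>,
}

impl IntervalSet {
    fn new(ranges: Vec<Range>) -> IntervalSet {
        let mut set = IntervalSet { ranges };
        set.canonicalize();
        set
    }

    /// Sorts the ranges and merges overlapping or adjacent ones.
    fn canonicalize(&mut self) {
        self.ranges.sort();
        let mut merged: Vec<Range> = Vec::with_capacity(self.ranges.len());
        for r in std::mem::take(&mut self.ranges) {
            if let Some(last) = merged.last_mut() {
                if u16::from(r.lo) <= u16::from(last.hi) + 1 {
                    last.hi = last.hi.max(r.hi);
                    continue;
                }
            }
            merged.push(r);
        }
        self.ranges = merged;
    }

    /// Mirrors the shape of the patched method: on failure, restore the
    /// ordering invariant first, then propagate the error.
    fn case_fold_simple(&mut self, tables_available: bool) -> Result<(), CaseFoldError> {
        let len = self.ranges.len();
        for i in 0..len {
            let range = self.ranges[i];
            if let Err(err) = range.case_fold_simple(&mut self.ranges, tables_available) {
                self.canonicalize();
                return Err(err);
            }
        }
        self.canonicalize();
        Ok(())
    }
}
```

In this model, folding the set `[a-c, 0xE0-0xE5]` without tables appends `A-C` and then fails on the second range; the caller still sees a sorted set alongside the error.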

regex-syntax/src/hir/literal/mod.rs

+12-8
@@ -1105,6 +1105,7 @@ mod tests {
     test_lit!(pfx_one_lit1, prefixes, "a", M("a"));
     test_lit!(pfx_one_lit2, prefixes, "abc", M("abc"));
     test_lit!(pfx_one_lit3, prefixes, "(?u)☃", M("\\xe2\\x98\\x83"));
+    #[cfg(feature = "unicode-case")]
     test_lit!(pfx_one_lit4, prefixes, "(?ui)☃", M("\\xe2\\x98\\x83"));
     test_lit!(pfx_class1, prefixes, "[1-4]", M("1"), M("2"), M("3"), M("4"));
     test_lit!(
@@ -1114,6 +1115,7 @@ mod tests {
         M("\\xe2\\x85\\xa0"),
         M("\\xe2\\x98\\x83")
     );
+    #[cfg(feature = "unicode-case")]
     test_lit!(
         pfx_class3,
         prefixes,
@@ -1122,11 +1124,11 @@ mod tests {
         M("\\xe2\\x85\\xb0"),
         M("\\xe2\\x98\\x83")
     );
-    test_lit!(pfx_one_lit_casei1, prefixes, "(?i)a", M("A"), M("a"));
+    test_lit!(pfx_one_lit_casei1, prefixes, "(?i-u)a", M("A"), M("a"));
     test_lit!(
         pfx_one_lit_casei2,
         prefixes,
-        "(?i)abc",
+        "(?i-u)abc",
         M("ABC"),
         M("aBC"),
         M("AbC"),
@@ -1158,7 +1160,7 @@ mod tests {
     test_lit!(
         pfx_cat3,
         prefixes,
-        "(?i)[ab]z",
+        "(?i-u)[ab]z",
         M("AZ"),
         M("BZ"),
         M("aZ"),
@@ -1295,7 +1297,7 @@ mod tests {
     test_exhausted!(
         pfx_exhausted4,
         prefixes,
-        "(?i)foobar",
+        "(?i-u)foobar",
         C("FO"),
         C("fO"),
         C("Fo"),
@@ -1336,6 +1338,7 @@ mod tests {
     test_lit!(sfx_one_lit1, suffixes, "a", M("a"));
     test_lit!(sfx_one_lit2, suffixes, "abc", M("abc"));
     test_lit!(sfx_one_lit3, suffixes, "(?u)☃", M("\\xe2\\x98\\x83"));
+    #[cfg(feature = "unicode-case")]
     test_lit!(sfx_one_lit4, suffixes, "(?ui)☃", M("\\xe2\\x98\\x83"));
     test_lit!(sfx_class1, suffixes, "[1-4]", M("1"), M("2"), M("3"), M("4"));
     test_lit!(
@@ -1345,6 +1348,7 @@ mod tests {
         M("\\xe2\\x85\\xa0"),
         M("\\xe2\\x98\\x83")
     );
+    #[cfg(feature = "unicode-case")]
     test_lit!(
         sfx_class3,
         suffixes,
@@ -1353,11 +1357,11 @@ mod tests {
         M("\\xe2\\x85\\xb0"),
         M("\\xe2\\x98\\x83")
     );
-    test_lit!(sfx_one_lit_casei1, suffixes, "(?i)a", M("A"), M("a"));
+    test_lit!(sfx_one_lit_casei1, suffixes, "(?i-u)a", M("A"), M("a"));
     test_lit!(
         sfx_one_lit_casei2,
         suffixes,
-        "(?i)abc",
+        "(?i-u)abc",
         M("ABC"),
         M("ABc"),
         M("AbC"),
@@ -1389,7 +1393,7 @@ mod tests {
     test_lit!(
         sfx_cat3,
         suffixes,
-        "(?i)[ab]z",
+        "(?i-u)[ab]z",
         M("AZ"),
         M("Az"),
         M("BZ"),
@@ -1480,7 +1484,7 @@ mod tests {
     test_exhausted!(
         sfx_exhausted4,
         suffixes,
-        "(?i)foobar",
+        "(?i-u)foobar",
         C("AR"),
         C("Ar"),
         C("aR"),
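
The tests switched to `(?i-u)` expect the literal set to contain every ASCII case variant of the pattern, for example `M("A"), M("a")` for `(?i-u)a`. A minimal std-only sketch of that expansion follows. The function name `ascii_case_variants` is hypothetical, the output order is simply sorted rather than the crate's order, and the size limits that make the `test_exhausted!` cases cut literals short are deliberately not modeled.

```rust
/// Returns every ASCII upper/lower case variant of `lit`, sorted.
/// Characters without an ASCII case counterpart are kept as-is.
fn ascii_case_variants(lit: &str) -> Vec<String> {
    let mut variants = vec![String::new()];
    for c in lit.chars() {
        let upper = c.to_ascii_uppercase();
        let lower = c.to_ascii_lowercase();
        let mut next = Vec::with_capacity(variants.len() * 2);
        for v in &variants {
            let mut with_upper = v.clone();
            with_upper.push(upper);
            next.push(with_upper);
            // Only branch when the character actually has two cases.
            if lower != upper {
                let mut with_lower = v.clone();
                with_lower.push(lower);
                next.push(with_lower);
            }
        }
        variants = next;
    }
    variants.sort();
    variants
}
```

A three-letter literal such as `abc` yields eight variants, which matches the number of `M(...)` entries the `casei2` tests list for `(?i-u)abc`.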
