Commit 5385503

Implement Perl extended character classes
1 parent 6f36e8a commit 5385503

13 files changed, +644 -209 lines changed

HACKING

Lines changed: 5 additions & 3 deletions
@@ -199,9 +199,11 @@ META_RANGE_ESCAPED hyphen in class range with at least one escape
 META_RANGE_LITERAL hyphen in class range defined literally
 META_SKIP (*SKIP) - no argument (see below for with argument)
 META_THEN (*THEN) - no argument (see below for with argument)
-META_ECLASS_OR || in an extended character class
-META_ECLASS_AND && in an extended character class
-META_ECLASS_SUB -- in an extended character class
+META_ECLASS_AND && (or &) in an extended character class
+META_ECLASS_OR || (or |, +) in an extended character class
+META_ECLASS_SUB -- (or -) in an extended character class
+META_ECLASS_XOR ~~ (or ^) in an extended character class
+META_ECLASS_NOT ! in an extended character class

 The two RANGE values occur only in character classes. They are positioned
 between two literals that define the start and end of the range. In an EBCDIC

doc/html/pcre2pattern.html

Lines changed: 96 additions & 58 deletions
Large diffs are not rendered by default.

doc/pcre2pattern.3

Lines changed: 42 additions & 8 deletions
@@ -1547,6 +1547,39 @@ the next two sections), and the terminating closing square bracket. However,
 escaping other non-alphanumeric characters does no harm.
 .
 .
+.SH "PERL EXTENDED CHARACTER CLASSES"
+.rs
+PCRE2 supports Perl's "(?[...])" extended character class syntax. This can
+be used to perform set operations, such as intersection.
+.P
+The syntax permitted within "(?[...])" is quite different to ordinary character
+classes. Inside the extended class, there is an expression syntax consisting of
+"atoms", operators, and ordinary parentheses "()" used for grouping. The allowed
+atoms are any escaped characters or sets such as "\en" or "\ed", POSIX classes
+such as "[:alpha:]", and any ordinary character class may be nested as an atom
+within an extended class. For example, in "(?[\ed & [...]])" the nested ordinary
+class "[...]" follows the ordinary rules for character classes, in which
+parentheses are not metacharacters, and character literals and ranges are
+permitted. However, when outside an ordinary character class (such as in "(?[...
++ (...)])") character literals and ranges may not be used, as they are not atoms
+in the extended syntax. The extended syntax does not introduce any additional
+escape sequences, so "(?[\ey])" is an unknown escape, as it would be inside
+"[\ey]".
+.P
+In the extended syntax, ^ does not negate a class (except within an
+ordinary class nested inside an extended class); it is instead a binary
+operator.
+.P
+The binary operators are "&" (intersection), "|" or "+" (union), "-"
+(subtraction) and "^" (symmetric difference). These are left-associative and
+"&" has higher (tighter) precedence, while the others have equal lower
+precedence. The one prefix unary operator is "!" (complement), with highest
+precedence.
+.P
+A Perl extended character class always has the /xx modifier turned on within
+it.
+.
+.
 .SH "UTS#18 EXTENDED CHARACTER CLASSES"
 .rs
 The PCRE2_ALT_EXTENDED_CLASS option enables an alternative to Perl's "(?[...])"
@@ -1560,18 +1593,19 @@ character becomes an additional metacharacter within classes, denoting the start
 of a nested class, so a literal "[" must be escaped as "\e[".
 .P
 Secondly, within the UTS#18 extended syntax, there are additional operators
-"||", "&&" and "--" which denote character class union, intersection, and
-subtraction respectively. In standard Perl syntax, these would simply be
-needlessly-repeated literals (except for "-" which can denote a range). These
-operators can be used in constructs such as "[\ep{L}--[QW]]" for "Unicode
-letters, other than Q and W". A literal "-" at the end of a range must be
-escaped (so while "[--1]" in Perl syntax is the range from hyphen to "1", it
-must be escaped as "[\e--1]" in UTS#18 extended classes).
+"||", "&&", "--" and "~~" which denote character class union, intersection,
+subtraction, and symmetric difference respectively. In standard Perl syntax,
+these would simply be needlessly-repeated literals (except for "-" which can
+denote a range). These operators can be used in constructs such as
+"[\ep{L}--[QW]]" for "Unicode letters, other than Q and W". A literal "-" at
+the end of a range must be escaped (so while "[--1]" in Perl syntax is the
+range from hyphen to "1", it must be escaped as "[\e--1]" in UTS#18 extended
+classes).
 .P
 The specific rules in PCRE2 are that classes can be nested:
 "[...[B]...[^C]...]". The individual class items (literal characters, literal
 ranges, properties such as \ed or \ep{...}, and nested classes) can be
-combined by juxtaposition or by an operator "||", "&&", or "--".
+combined by juxtaposition or by an operator "||", "&&", "--", or "~~".
 Juxtaposition is the implicit union operator, and binds more tightly than any
 explicit operator. Precedence between the explicit operators is not defined,
 so mixing operators is a syntax error (thus "[A&&B--C]" is an error, but

src/pcre2.h.generic

Lines changed: 4 additions & 0 deletions
@@ -339,6 +339,10 @@ pcre2_pattern_convert(). */
 #define PCRE2_ERROR_ECLASS_EXPECTED_OPERAND 210
 #define PCRE2_ERROR_ECLASS_MIXED_OPERATORS 211
 #define PCRE2_ERROR_ECLASS_HINT_SQUARE_BRACKET 212
+#define PCRE2_ERROR_PERL_ECLASS_UNEXPECTED_EXPR 213
+#define PCRE2_ERROR_PERL_ECLASS_EMPTY_EXPR 214
+#define PCRE2_ERROR_PERL_ECLASS_MISSING_CLOSE 215
+#define PCRE2_ERROR_PERL_ECLASS_UNEXPECTED_CHAR 216

 /* "Expected" matching error codes: no match and partial match. */

src/pcre2.h.in

Lines changed: 4 additions & 0 deletions
@@ -339,6 +339,10 @@ pcre2_pattern_convert(). */
 #define PCRE2_ERROR_ECLASS_EXPECTED_OPERAND 210
 #define PCRE2_ERROR_ECLASS_MIXED_OPERATORS 211
 #define PCRE2_ERROR_ECLASS_HINT_SQUARE_BRACKET 212
+#define PCRE2_ERROR_PERL_ECLASS_UNEXPECTED_EXPR 213
+#define PCRE2_ERROR_PERL_ECLASS_EMPTY_EXPR 214
+#define PCRE2_ERROR_PERL_ECLASS_MISSING_CLOSE 215
+#define PCRE2_ERROR_PERL_ECLASS_UNEXPECTED_CHAR 216

 /* "Expected" matching error codes: no match and partial match. */
