Commit 2f43049

add PCRE2_ASCII (RFC)
As suggested in PCRE2Project#185, and as Perl does with the '/aa' modifier, it is preferable for performance and security reasons[1] to keep \d from matching characters outside the commonly expected digits.

Add that functionality, building on the foundations of what was suggested in PCRE2Project#11.

[1] https://perldoc.perl.org/perlre#/a-(and-/aa)
1 parent 0746b3d commit 2f43049

7 files changed: 37 additions and 18 deletions

doc/pcre2_compile.3

Lines changed: 1 addition & 0 deletions
@@ -44,6 +44,7 @@ The option bits are:
   PCRE2_ALT_BSUX           Alternative handling of \eu, \eU, and \ex
   PCRE2_ALT_CIRCUMFLEX     Alternative handling of ^ in multiline mode
   PCRE2_ALT_VERBNAMES      Process backslashes in verb names
+  PCRE2_ASCII              Prefer ASCII in conflicting UTF classes
   PCRE2_AUTO_CALLOUT       Compile automatic callouts
   PCRE2_CASELESS           Do caseless matching
   PCRE2_DOLLAR_ENDONLY     $ not to match newline at end

doc/pcre2api.3

Lines changed: 6 additions & 0 deletions
@@ -1446,6 +1446,12 @@ included in a name either as \e) or between \eQ and \eE. If the PCRE2_EXTENDED
 or PCRE2_EXTENDED_MORE option is set with PCRE2_ALT_VERBNAMES, unescaped
 whitespace in verb names is skipped and #-comments are recognized, exactly as
 in the rest of the pattern.
+.sp
+PCRE2_ASCII
+.sp
+When PCRE2_UTF and PCRE2_UCP are both being used, some classes are changed in
+ways that conflict between UTF and ASCII characters. This option can be set
+to restrict \ed to only match the non UTF digits.
 .sp
 PCRE2_AUTO_CALLOUT
 .sp
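
To make the documented difference concrete, here is a minimal sketch of the two digit definitions the option chooses between. It does not call PCRE2: the helper names are hypothetical, and the table holds only three sample ranges of the Unicode Nd category (ASCII, Arabic-Indic and Devanagari digits), not the library's full property data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: the "original" \d set, i.e. ASCII 0-9 only. */
static bool is_ascii_digit(uint32_t cp)
{
  return cp >= 0x30 && cp <= 0x39;
}

/* Illustrative subset of Unicode general category Nd (decimal digits).
   The real property tables cover many more scripts. */
typedef struct { uint32_t first, last; } cp_range;

static const cp_range sample_nd[] = {
  { 0x0030, 0x0039 },  /* ASCII digits */
  { 0x0660, 0x0669 },  /* Arabic-Indic digits */
  { 0x0966, 0x096F },  /* Devanagari digits */
};

static bool is_sample_nd_digit(uint32_t cp)
{
  for (size_t i = 0; i < sizeof sample_nd / sizeof sample_nd[0]; i++)
    if (cp >= sample_nd[i].first && cp <= sample_nd[i].last) return true;
  return false;
}

/* Models which definition \d uses: Unicode Nd when UCP is on and the
   ASCII restriction is off, the ASCII set otherwise. This is an assumed
   reading of the documentation text, not library code. */
static bool backslash_d_matches(uint32_t cp, bool ucp, bool ascii)
{
  if (ucp && !ascii) return is_sample_nd_digit(cp);
  return is_ascii_digit(cp);
}
```

Under this model, U+0663 (ARABIC-INDIC DIGIT THREE) is accepted with UCP alone and rejected once the ASCII restriction is added, while '5' is accepted in both cases.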

doc/pcre2pattern.3

Lines changed: 7 additions & 5 deletions
@@ -73,10 +73,11 @@ appearance in a pattern causes an error.
 .sp
 Another special sequence that may appear at the start of a pattern is (*UCP).
 This has the same effect as setting the PCRE2_UCP option: it causes sequences
-such as \ed and \ew to use Unicode properties to determine character types,
-instead of recognizing only characters with codes less than 256 via a lookup
-table. If also causes upper/lower casing operations to use Unicode properties
-for characters with code points greater than 127, even when UTF is not set.
+such as \ed (unless PCRE2_ASCII was set) and \ew to use Unicode properties
+to determine character types, instead of recognizing only characters with
+codes less than 256 via a lookup table. It also causes upper/lower casing
+operations to use Unicode properties for characters with code points greater
+than 127, even when UTF is not set.
 .P
 Some applications that allow their users to supply patterns may wish to
 restrict them for security reasons. If the PCRE2_NEVER_UCP option is passed to
@@ -670,7 +671,8 @@ determine character types, as follows:
   \ew  any character that matches \ep{L} or \ep{N}, plus underscore
 .sp
 The upper case escapes match the inverse sets of characters. Note that \ed
-matches only decimal digits, whereas \ew matches any Unicode digit, as well as
+matches only decimal digits and could be forced to match only the original
+set with PCRE2_ASCII, whereas \ew matches any Unicode digit, as well as
 any Unicode letter, and underscore. Note also that PCRE2_UCP affects \eb, and
 \eB because they are defined in terms of \ew and \eW. Matching these sequences
 is noticeably slower when PCRE2_UCP is set.

doc/pcre2syntax.3

Lines changed: 2 additions & 1 deletion
@@ -101,7 +101,8 @@ or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
 happening, \es and \ew may also match characters with code points in the range
 128-255. If the PCRE2_UCP option is set, the behaviour of these escape
 sequences is changed to use Unicode properties and they match many more
-characters.
+characters. Alternatively if the PCRE2_ASCII option is also set \ed original
+definition is preserved.
 .P
 Property descriptions in \ep and \eP are matched caselessly; hyphens,
 underscores, and white space are ignored, in accordance with Unicode's "loose

src/pcre2.h.in

Lines changed: 1 addition & 0 deletions
@@ -143,6 +143,7 @@ D is inspected during pcre2_dfa_match() execution
 #define PCRE2_EXTENDED_MORE       0x01000000u  /* C */
 #define PCRE2_LITERAL             0x02000000u  /* C */
 #define PCRE2_MATCH_INVALID_UTF   0x04000000u  /* J M D */
+#define PCRE2_ASCII               0x08000000u  /* C */
 
 /* An additional compile options word is available in the compile context. */
 
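
As a quick sanity check on the new value, the sketch below copies the four definitions from this hunk into a standalone file and confirms that the new flag occupies a single bit that none of its neighbours use. The two helper functions are hypothetical and exist only for this illustration; the option test follows the `(options & FLAG) != 0` idiom that the rest of the commit uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the hunk above; nothing here includes pcre2.h. */
#define PCRE2_EXTENDED_MORE      0x01000000u
#define PCRE2_LITERAL            0x02000000u
#define PCRE2_MATCH_INVALID_UTF  0x04000000u
#define PCRE2_ASCII              0x08000000u  /* added by this commit */

/* Hypothetical helper: true when exactly one bit is set in v. */
static bool has_single_bit(uint32_t v)
{
  return v != 0 && (v & (v - 1)) == 0;
}

/* Hypothetical helper: tests a flag within an options word. */
static bool option_is_set(uint32_t options, uint32_t flag)
{
  return (options & flag) != 0;
}
```

Because every flag is a distinct power of two, the options word can carry any combination of them and each can be tested independently.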

src/pcre2_compile.c

Lines changed: 15 additions & 9 deletions
@@ -776,7 +776,7 @@ are allowed. */
    PCRE2_EXTENDED|PCRE2_EXTENDED_MORE|PCRE2_MATCH_UNSET_BACKREF| \
    PCRE2_MULTILINE|PCRE2_NEVER_BACKSLASH_C|PCRE2_NEVER_UCP| \
    PCRE2_NEVER_UTF|PCRE2_NO_AUTO_CAPTURE|PCRE2_NO_AUTO_POSSESS| \
-   PCRE2_NO_DOTSTAR_ANCHOR|PCRE2_UCP|PCRE2_UNGREEDY)
+   PCRE2_NO_DOTSTAR_ANCHOR|PCRE2_UCP|PCRE2_UNGREEDY|PCRE2_ASCII)
 
 #define PUBLIC_LITERAL_COMPILE_EXTRA_OPTIONS \
    (PCRE2_EXTRA_MATCH_LINE|PCRE2_EXTRA_MATCH_WORD)
@@ -3124,14 +3124,17 @@ while (ptr < ptrend)
         }
       else
         {
-        *parsed_pattern++ = META_ESCAPE +
-          ((escape == ESC_d || escape == ESC_s || escape == ESC_w)?
-            ESC_p : ESC_P);
+        if ((options & PCRE2_ASCII) == 0)
+          *parsed_pattern++ = META_ESCAPE +
+            ((escape == ESC_s || escape == ESC_w)? ESC_p : ESC_P);
+        else
+          *parsed_pattern++ = META_ESCAPE + escape;
         switch(escape)
           {
           case ESC_d:
           case ESC_D:
-          *parsed_pattern++ = (PT_PC << 16) | ucp_Nd;
+          if ((options & PCRE2_ASCII) == 0)
+            *parsed_pattern++ = (PT_PC << 16) | ucp_Nd;
           break;
 
           case ESC_s:
@@ -3671,14 +3674,17 @@ while (ptr < ptrend)
         }
       else
         {
-        *parsed_pattern++ = META_ESCAPE +
-          ((escape == ESC_d || escape == ESC_s || escape == ESC_w)?
-            ESC_p : ESC_P);
+        if ((options & PCRE2_ASCII) == 0)
+          *parsed_pattern++ = META_ESCAPE +
+            ((escape == ESC_s || escape == ESC_w)? ESC_p : ESC_P);
+        else
+          *parsed_pattern++ = META_ESCAPE + escape;
         switch(escape)
           {
           case ESC_d:
           case ESC_D:
-          *parsed_pattern++ = (PT_PC << 16) | ucp_Nd;
+          if ((options & PCRE2_ASCII) == 0)
+            *parsed_pattern++ = (PT_PC << 16) | ucp_Nd;
           break;
 
           case ESC_s:
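
For readers following the two hunks above, here is a small standalone model of the selection the documentation describes: in UCP mode, \d, \s and \w become positive property tests and \D, \S and \W negative ones, except that \d and \D keep their plain escape when the ASCII restriction is requested. It is a sketch of the documented behaviour rather than a transcription of the hunks. The enum is a hypothetical stand-in for the escape codes in pcre2_compile.c (its values are arbitrary), and `ucp_escape_for` is an invented name.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the escape codes named in the hunks above.
   The numeric values are arbitrary and do not match the library's. */
typedef enum {
  ESC_d, ESC_D, ESC_s, ESC_S, ESC_w, ESC_W, ESC_p, ESC_P
} esc_code;

/* Returns the escape a UCP-mode compile would emit for a class escape.
   With ascii set, \d and \D are left alone so that they keep the
   table-driven ASCII definition; \s and \w are unaffected. */
static esc_code ucp_escape_for(esc_code escape, bool ascii)
{
  /* Only the six class escapes are valid inputs. */
  assert(escape != ESC_p && escape != ESC_P);

  if (ascii && (escape == ESC_d || escape == ESC_D))
    return escape;

  /* Lower-case escapes map to \p, upper-case ones to \P. */
  return (escape == ESC_d || escape == ESC_s || escape == ESC_w)
    ? ESC_p : ESC_P;
}
```

Keeping the ASCII check ahead of the property mapping means the \s and \w paths are identical whether or not the option is set, which matches the documentation's statement that only \d is restricted.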

src/pcre2test.c

Lines changed: 5 additions & 3 deletions
@@ -640,6 +640,7 @@ static modstruct modlist[] = {
   { "alt_verbnames", MOD_PAT, MOD_OPT, PCRE2_ALT_VERBNAMES, PO(options) },
   { "altglobal", MOD_PND, MOD_CTL, CTL_ALTGLOBAL, PO(control) },
   { "anchored", MOD_PD, MOD_OPT, PCRE2_ANCHORED, PD(options) },
+  { "ascii", MOD_PATP, MOD_OPT, PCRE2_ASCII, PO(options) },
   { "auto_callout", MOD_PAT, MOD_OPT, PCRE2_AUTO_CALLOUT, PO(options) },
   { "bad_escape_is_literal", MOD_CTC, MOD_OPT, PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL, CO(extra_options) },
   { "bincode", MOD_PAT, MOD_CTL, CTL_BINCODE, PO(control) },
@@ -762,8 +763,8 @@ static modstruct modlist[] = {
 /* Controls and options that are supported for use with the POSIX interface. */
 
 #define POSIX_SUPPORTED_COMPILE_OPTIONS ( \
-  PCRE2_CASELESS|PCRE2_DOTALL|PCRE2_LITERAL|PCRE2_MULTILINE|PCRE2_UCP| \
-  PCRE2_UTF|PCRE2_UNGREEDY)
+  PCRE2_ASCII|PCRE2_CASELESS|PCRE2_DOTALL|PCRE2_LITERAL|PCRE2_MULTILINE| \
+  PCRE2_UCP| PCRE2_UTF|PCRE2_UNGREEDY)
 
 #define POSIX_SUPPORTED_COMPILE_EXTRA_OPTIONS (0)
 
@@ -4202,12 +4203,13 @@ static void
 show_compile_options(uint32_t options, const char *before, const char *after)
 {
 if (options == 0) fprintf(outfile, "%s <none>%s", before, after);
-else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
+else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
   before,
   ((options & PCRE2_ALT_BSUX) != 0)? " alt_bsux" : "",
   ((options & PCRE2_ALT_CIRCUMFLEX) != 0)? " alt_circumflex" : "",
   ((options & PCRE2_ALT_VERBNAMES) != 0)? " alt_verbnames" : "",
   ((options & PCRE2_ALLOW_EMPTY_CLASS) != 0)? " allow_empty_class" : "",
+  ((options & PCRE2_ASCII) != 0)? " ascii" : "",
   ((options & PCRE2_ANCHORED) != 0)? " anchored" : "",
   ((options & PCRE2_AUTO_CALLOUT) != 0)? " auto_callout" : "",
   ((options & PCRE2_CASELESS) != 0)? " caseless" : "",
