pcre2grep does not set PCRE2_UCP #185

klpn · 2023-01-06T11:23:59Z

When pcre2grep is called with -u, the flag PCRE2_UCP is not set, see:

Line 3607 in 0746b3d

case 'u': options |= PCRE2_UTF; utf = TRUE; break;

This leads to incorrect behavior with boundary matches and non-ASCII letters, such as:

$  echo 'Öm' | pcre2grep -u '\bm'
Öm

The output should be nothing. Adding the PCRE2_UCPflag fixes the problem:

  case 'u': options |= PCRE2_UTF|PCRE2_UCP; utf = TRUE; break;

GNU grep with the -P option also behaves incorrectly in this case, unlike perl (see this Twitter thread for more examples and discussion).

The text was updated successfully, but these errors were encountered:

zherczeg · 2023-01-06T19:03:43Z

I think this change is reasonable, However, we don't know who depends on the current behaviour. Anyway pcre2grep is just an example code, not intended for use in production.

klpn · 2023-01-06T19:48:55Z

Yes. And if the flag added for the -u option, the -U option should probably also be updated accordingly.

klpn · 2023-01-07T09:27:55Z

This issue was noticed by GNU grep team, and has been patched there.

zherczeg · 2023-01-07T09:39:23Z

I would not call it an issue, more of a feature request. The proper change probably needs an option to go back to the old mode as well.

PhilipHazel · 2023-01-07T16:07:00Z

Yes indeed. I am always wary about making incompatible changes, however "obvious" they may be. In this case, I agree that it's probably best to make it the default - especially as that's what others do - but I will not be surprised if it catches out at least one person. I will look at this in due course and include something like --utf-no-ucp.

This fixes a serious bug affecting word-boundary and word-constituent regular expressions when the desired match involves non-ASCII UTF8 characters. * src/pcresearch.c: Set PCRE2_UCP together with PCRE2_UTF * tests/pcre-utf8-w: New file. * tests/Makefile.am (TESTS): Add it. * NEWS (Bug fixes): Mention this. * THANKS.in: Add Gro-Tsen and Karl Petterson. Reported by Gro-Tsen https://twitter.com/gro_tsen/status/1610972356972875777 via Karl Pettersson in PCRE2Project/pcre2#185 This bug was present from grep-2.5, when --perl-regexp (-P) support was added.

carenas · 2023-01-08T10:40:04Z

Looking at the performance impact of this change, was wondering if we would need to at least introduce an equivalent to the /aa modifier as well as suggested in #11.

I would suspect that most people would still prefer to have \d matching only [0-9] even when PCRE2_UCP is enabled and so a flag to avoid UTF/ASCII mixing would be needed to allow signaling that preference.

As suggested in PCRE2Project#185 and as done with Perl with the '/aa' modifier it is preferably for performance/security[1] reasons to avoid including in \d characters that are outside the commonly expected digits. Add that functionality with the foundations of what was suggested in PCRE2Project#11 [1] https://perldoc.perl.org/perlre#/a-(and-/aa)

PhilipHazel · 2023-02-04T17:30:22Z

There are now (in HEAD) some new options for restricting various things to ASCII and disabling Unicode/ASCII case folding. I will now consider how best to use these in pcre2grep.

PhilipHazel · 2023-02-08T15:21:23Z

I have now updated pcre2grep in HEAD. Both -u and -U set PCRE2_UCP by default, but there is a --no-ucp option to disable this. There is also a --case-restrict option which sets PCRE2_EXTRA_CASELESS_RESTRICT. Finer control is now available via the new options settings within patterns, e.g. (?aD) to force \d to ASCII while leaving \s and \w as Unicode.

PhilipHazel closed this as completed Feb 8, 2023

SolitaryGrass mentioned this issue May 31, 2023

internal_dfa_match, a stack overflow occurred due to recursive calls. #258

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

pcre2grep does not set PCRE2_UCP #185

pcre2grep does not set PCRE2_UCP #185

klpn commented Jan 6, 2023 •

edited

Loading

zherczeg commented Jan 6, 2023

Uh oh!

klpn commented Jan 6, 2023

Uh oh!

klpn commented Jan 7, 2023

Uh oh!

zherczeg commented Jan 7, 2023

Uh oh!

PhilipHazel commented Jan 7, 2023

Uh oh!

carenas commented Jan 8, 2023 •

edited

Loading

Uh oh!

PhilipHazel commented Feb 4, 2023

Uh oh!

PhilipHazel commented Feb 8, 2023

Uh oh!

pcre2grep does not set PCRE2_UCP #185

pcre2grep does not set PCRE2_UCP #185

Comments

klpn commented Jan 6, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

zherczeg commented Jan 6, 2023

Uh oh!

klpn commented Jan 6, 2023

Uh oh!

klpn commented Jan 7, 2023

Uh oh!

zherczeg commented Jan 7, 2023

Uh oh!

PhilipHazel commented Jan 7, 2023

Uh oh!

carenas commented Jan 8, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

PhilipHazel commented Feb 4, 2023

Uh oh!

PhilipHazel commented Feb 8, 2023

Uh oh!

klpn commented Jan 6, 2023 •

edited

Loading

carenas commented Jan 8, 2023 •

edited

Loading