-
Notifications
You must be signed in to change notification settings - Fork 210
pcre2grep does not set PCRE2_UCP #185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I think this change is reasonable, However, we don't know who depends on the current behaviour. Anyway pcre2grep is just an example code, not intended for use in production. |
Yes. And if the flag added for the |
This issue was noticed by GNU grep team, and has been patched there. |
I would not call it an issue, more of a feature request. The proper change probably needs an option to go back to the old mode as well. |
Yes indeed. I am always wary about making incompatible changes, however "obvious" they may be. In this case, I agree that it's probably best to make it the default - especially as that's what others do - but I will not be surprised if it catches out at least one person. I will look at this in due course and include something like --utf-no-ucp. |
This fixes a serious bug affecting word-boundary and word-constituent regular expressions when the desired match involves non-ASCII UTF8 characters. * src/pcresearch.c: Set PCRE2_UCP together with PCRE2_UTF * tests/pcre-utf8-w: New file. * tests/Makefile.am (TESTS): Add it. * NEWS (Bug fixes): Mention this. * THANKS.in: Add Gro-Tsen and Karl Petterson. Reported by Gro-Tsen https://twitter.com/gro_tsen/status/1610972356972875777 via Karl Pettersson in PCRE2Project/pcre2#185 This bug was present from grep-2.5, when --perl-regexp (-P) support was added.
Looking at the performance impact of this change, was wondering if we would need to at least introduce an equivalent to the I would suspect that most people would still prefer to have \d matching only [0-9] even when PCRE2_UCP is enabled and so a flag to avoid UTF/ASCII mixing would be needed to allow signaling that preference. |
As suggested in PCRE2Project#185 and as done with Perl with the '/aa' modifier it is preferably for performance/security[1] reasons to avoid including in \d characters that are outside the commonly expected digits. Add that functionality with the foundations of what was suggested in PCRE2Project#11 [1] https://perldoc.perl.org/perlre#/a-(and-/aa)
As suggested in PCRE2Project#185 and as done with Perl with the '/aa' modifier it is preferably for performance/security[1] reasons to avoid including in \d characters that are outside the commonly expected digits. Add that functionality with the foundations of what was suggested in PCRE2Project#11 [1] https://perldoc.perl.org/perlre#/a-(and-/aa)
There are now (in HEAD) some new options for restricting various things to ASCII and disabling Unicode/ASCII case folding. I will now consider how best to use these in pcre2grep. |
I have now updated pcre2grep in HEAD. Both -u and -U set PCRE2_UCP by default, but there is a --no-ucp option to disable this. There is also a --case-restrict option which sets PCRE2_EXTRA_CASELESS_RESTRICT. Finer control is now available via the new options settings within patterns, e.g. (?aD) to force \d to ASCII while leaving \s and \w as Unicode. |
Uh oh!
There was an error while loading. Please reload this page.
When
pcre2grep
is called with-u
, the flagPCRE2_UCP
is not set, see:pcre2/src/pcre2grep.c
Line 3607 in 0746b3d
This leads to incorrect behavior with boundary matches and non-ASCII letters, such as:
The output should be nothing. Adding the
PCRE2_UCP
flag fixes the problem:GNU grep with the
-P
option also behaves incorrectly in this case, unlike perl (see this Twitter thread for more examples and discussion).The text was updated successfully, but these errors were encountered: