Skip to content

pcre2grep does not set PCRE2_UCP #185

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
klpn opened this issue Jan 6, 2023 · 8 comments
Closed

pcre2grep does not set PCRE2_UCP #185

klpn opened this issue Jan 6, 2023 · 8 comments

Comments

@klpn
Copy link

klpn commented Jan 6, 2023

When pcre2grep is called with -u, the flag PCRE2_UCP is not set, see:

case 'u': options |= PCRE2_UTF; utf = TRUE; break;

This leads to incorrect behavior with boundary matches and non-ASCII letters, such as:

$  echo 'Öm' | pcre2grep -u '\bm'
Öm

The output should be nothing. Adding the PCRE2_UCPflag fixes the problem:

  case 'u': options |= PCRE2_UTF|PCRE2_UCP; utf = TRUE; break;

GNU grep with the -P option also behaves incorrectly in this case, unlike perl (see this Twitter thread for more examples and discussion).

@zherczeg
Copy link
Collaborator

zherczeg commented Jan 6, 2023

I think this change is reasonable, However, we don't know who depends on the current behaviour. Anyway pcre2grep is just an example code, not intended for use in production.

@klpn
Copy link
Author

klpn commented Jan 6, 2023

Yes. And if the flag added for the -u option, the -U option should probably also be updated accordingly.

@klpn
Copy link
Author

klpn commented Jan 7, 2023

This issue was noticed by GNU grep team, and has been patched there.

@zherczeg
Copy link
Collaborator

zherczeg commented Jan 7, 2023

I would not call it an issue, more of a feature request. The proper change probably needs an option to go back to the old mode as well.

@PhilipHazel
Copy link
Collaborator

Yes indeed. I am always wary about making incompatible changes, however "obvious" they may be. In this case, I agree that it's probably best to make it the default - especially as that's what others do - but I will not be surprised if it catches out at least one person. I will look at this in due course and include something like --utf-no-ucp.

jollaitbot pushed a commit to sailfishos-mirror/grep that referenced this issue Jan 8, 2023
This fixes a serious bug affecting word-boundary and word-constituent regular
expressions when the desired match involves non-ASCII UTF8 characters.
* src/pcresearch.c: Set PCRE2_UCP together with PCRE2_UTF
* tests/pcre-utf8-w: New file.
* tests/Makefile.am (TESTS): Add it.
* NEWS (Bug fixes): Mention this.
* THANKS.in: Add Gro-Tsen and Karl Petterson.
Reported by Gro-Tsen https://twitter.com/gro_tsen/status/1610972356972875777
via Karl Pettersson in PCRE2Project/pcre2#185
This bug was present from grep-2.5, when --perl-regexp (-P) support was added.
@carenas
Copy link
Contributor

carenas commented Jan 8, 2023

Looking at the performance impact of this change, was wondering if we would need to at least introduce an equivalent to the /aa modifier as well as suggested in #11.

I would suspect that most people would still prefer to have \d matching only [0-9] even when PCRE2_UCP is enabled and so a flag to avoid UTF/ASCII mixing would be needed to allow signaling that preference.

carenas added a commit to carenas/pcre2 that referenced this issue Jan 8, 2023
As suggested in PCRE2Project#185 and as done with Perl with the '/aa' modifier
it is preferably for performance/security[1] reasons to avoid
including in \d characters that are outside the commonly expected
digits.

Add that functionality with the foundations of what was suggested
in PCRE2Project#11

[1] https://perldoc.perl.org/perlre#/a-(and-/aa)
carenas added a commit to carenas/pcre2 that referenced this issue Jan 9, 2023
As suggested in PCRE2Project#185 and as done with Perl with the '/aa' modifier
it is preferably for performance/security[1] reasons to avoid
including in \d characters that are outside the commonly expected
digits.

Add that functionality with the foundations of what was suggested
in PCRE2Project#11

[1] https://perldoc.perl.org/perlre#/a-(and-/aa)
@PhilipHazel
Copy link
Collaborator

There are now (in HEAD) some new options for restricting various things to ASCII and disabling Unicode/ASCII case folding. I will now consider how best to use these in pcre2grep.

@PhilipHazel
Copy link
Collaborator

I have now updated pcre2grep in HEAD. Both -u and -U set PCRE2_UCP by default, but there is a --no-ucp option to disable this. There is also a --case-restrict option which sets PCRE2_EXTRA_CASELESS_RESTRICT. Finer control is now available via the new options settings within patterns, e.g. (?aD) to force \d to ASCII while leaving \s and \w as Unicode.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants