The `\X` matcher doesn't catch all symbols #361
Comments
…(alexandre-daubois)

This PR was merged into the 5.4 branch.

Discussion
----------

[String] Skip a test when an issue is detected in PCRE2

| Q | A |
| ------------- | --- |
| Branch? | 5.4 |
| Bug fix? | yes |
| New feature? | no |
| Deprecations? | no |
| Issues | Part of #52206 |
| License | MIT |

I propose to ignore this test when [this issue of PCRE2](PCRE2Project/pcre2#361) is detected until it's resolved and the polyfill updated.

Commits
-------

bf66274 [String] Skip a test when an issue is detected in PCRE2
Your reproducer seems to be testing PCRE1 versions - the 8.xx series. PCRE1 is obsolete and no longer maintained. PCRE2 is now nearly 9 years old and is the version maintained in this repository.

Current PCRE2 matches your three characters as one cluster. This is the output from pcre2test:

PCRE2 version 10.42 2022-12-11

However, I see that it differs from Perl v5.38.1, which matches only the first character (as does PCRE1, so maybe you are using PCRE2 after all).

In PCRE2, the documentation in pcre2pattern lists the cluster breaking rules. This seems to be the relevant one:
All three of your characters have the extended pictographic property in Unicode 15.0.0, which is what current PCRE2 supports. The rules came from Unicode documentation https://unicode.org/reports/tr29/. PCRE1 does not have rule 6, and indeed, before 8.31 used even simpler rules.

In TR29 there is this sentence: "Each emoji sequence is a single grapheme cluster. See definition ED-17 in Unicode Technical Standard #51, "Unicode Emoji" [UAX51]." So it seems to me that PCRE2 is correctly following the rules.
Indeed, the problem seems to come from php-intl. Will report the bug, thanks for your explanations 👍
Hi! 👋
We're facing an issue in the Symfony repository in the CI: https://ci.appveyor.com/project/fabpot/symfony/builds/48712798
The problem comes from the Grapheme Cluster polyfill when the `php-intl` extension is not available. This polyfill uses the `\X` matcher of PCRE to get the length of a Unicode string. However, it seems that it doesn't work with symbols. Indeed, for a sequence of three symbols, the `\X` matcher returns a length of 1, where 3 is expected. Here's the reproducer: https://3v4l.org/C0UuO. As you can see, the `$matches` array only returns 1 result containing all three symbols.
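
For readers who cannot open the reproducer, here is a minimal sketch of the general technique described above: counting `\X` matches to get a string's length in grapheme clusters. The helper name `graphemeLengthViaPcre` is hypothetical and this is not the polyfill's actual code; it only assumes PHP's standard `preg_match_all` with the `/u` modifier. The result for emoji sequences depends on the PCRE version PHP is built against, so only inputs with uncontroversial results are shown.

```php
<?php

// Hypothetical helper for illustration; not the Symfony polyfill's real implementation.
function graphemeLengthViaPcre(string $s): int
{
    // /u makes PCRE treat the subject as UTF-8.
    // \X matches one extended grapheme cluster per match,
    // so the number of matches is the length in clusters.
    $count = preg_match_all('/\X/u', $s, $matches);

    if ($count === false) {
        // preg_match_all returns false on error, e.g. malformed UTF-8 input.
        throw new InvalidArgumentException('Invalid UTF-8 input or PCRE error.');
    }

    return $count;
}

// Three ASCII letters are three clusters.
echo graphemeLengthViaPcre('abc'), "\n";

// "e" followed by U+0301 COMBINING ACUTE ACCENT forms a single cluster.
echo graphemeLengthViaPcre("e\u{0301}"), "\n";
```

Calling the same helper on the three symbols from the reproducer would show whether the installed PCRE groups them into one cluster or three.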