Inconsistent behaviour of character classes + ucp in 16- and 32-bit mode #360
Comments
Seemingly related:
|
Note that similar issues seem to appear in 32-bit mode as well. |
@addisoncrump: could you explain more clearly what the issue is from your point of view? It is also worth mentioning that it is at least not a regression. |
This seems to be a straightforward JIT bug. The interpreter gives the same answer as Perl. |
I guess the part I am missing is what "this" is. It seems that JIT returns bad results in both the 16-bit and 32-bit libraries but not in the 8-bit one, and it ONLY fails when there is an implicit "inverse union". (I am not even sure about this last part, as the description in the issue doesn't make sense and examples that don't have "^" in the class definition are also provided.) FWIW, PCRE2 is missing the whole implementation of Unicode class logical operations as suggested by TR#18, and implementing that might also "fix" this, IMHO. |
Sorry, I thought it was obvious. /[^[:print:]\x{f6f6}]/ucp should match a character that is not printing and not 0xf6f6. Clearly this should not match 0xf6f6, but in JIT 16/32 bit modes, it does. Also in 8-bit mode with UTF set. Similarly, /[[:xdigit:]\x{6500}]a/ should match a hex digit or 0x6500, followed by "a", but JIT doesn't. As far as class operations go, see #13. |
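To make the expected behaviour of the two patterns concrete, here is a minimal reference model in Python. It does not call PCRE2 at all; the helper names are hypothetical, and `is_print_approx` and `is_xdigit` are rough stand-ins for `[:print:]` and `[:xdigit:]`, not PCRE2's exact definitions. It only illustrates the set logic that the interpreter follows and that JIT should follow too.

```python
import string
import unicodedata


def is_print_approx(ch: str) -> bool:
    # Assumption: approximate [:print:] under ucp as "not in a Unicode
    # 'Other' category (Cc, Cf, Cs, Co, Cn)". PCRE2's real definition differs
    # in details, but U+F6F6 (private use, Co) is non-printing either way.
    return not unicodedata.category(ch).startswith("C")


def is_xdigit(ch: str) -> bool:
    # Assumption: [:xdigit:] modelled as the ASCII hex digits only.
    return ch in string.hexdigits


def neg_class_matches(ch: str) -> bool:
    # Model of [^[:print:]\x{f6f6}]: the character must be neither
    # printing nor U+F6F6.
    return not (is_print_approx(ch) or ch == "\uf6f6")


def xdigit_class_then_a(s: str) -> bool:
    # Model of [[:xdigit:]\x{6500}]a, anchored at the start of s:
    # a hex digit or U+6500, followed by "a".
    return len(s) >= 2 and (is_xdigit(s[0]) or s[0] == "\u6500") and s[1] == "a"
```

In this model, U+F6F6 is not printing, so it would match a bare `[^[:print:]]`, but the explicit `\x{f6f6}` inside the negated class excludes it. That exclusion is what the JIT result loses in the reported case.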
Apologies for the late response. The holidays are a busy time, ironically! I am using terminology based on how I learned regex (in formal automata class, not in PCRE2!) and so there's a disconnect there. "Inverse union" is a negatively-matching character set (i.e. "if it's in this set it should not match") and the "union" here is just the set operation that pulls together |
Could you change the subject of this ticket to indicate that it affects all libraries but the 8-bit one? And could you note in the description (as Phillip mentioned) that it affects JIT only, that it has nothing to do with the use of "^" (as shown by the example with xdigit), and that it is not a regression, at least for 10.42? |
Very well 🙂 |
542cb11 should fix this |
Seems correct -- I will reopen if there are new corner cases discovered. |
Seems that adding the dictionary was good for #322.
JIT seems to perform incorrectly here: \x{f6f6} should not be matched. The behaviour disappears when the ucp flag is not set.