support Unicode grapheme clusters #54
Comments
Yup. The regex crate assumes a single character is a Unicode scalar value. Do you know of other regex libraries that handle graphemes correctly? |
@tbu- one way to handle this is to apply a normalization to strings before matching. For example, converting a string to NFC form (or was it NFKC? See UAX#15 and the |
@kwantam I believe that this still does not work for the |
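The normalization idea and its limit can be sketched in a few lines. This is only an illustration: it uses Python's standard `re` and `unicodedata` modules as a stand-in for an engine whose `.` matches one code point, and the helper name `matches_single_dot` is made up for this example. NFC composes a base letter and a combining mark only when Unicode defines a precomposed character for the pair, so sequences without one stay multi-code-point.

```python
import re
import unicodedata

precomposed = "\u00e4"   # 'ä' as a single code point
decomposed = "a\u0308"   # 'a' followed by COMBINING DIAERESIS

def matches_single_dot(s):
    """Hypothetical helper: does the whole string match a single '.'?"""
    return re.fullmatch(".", s) is not None

# NFC composes 'a' + diaeresis into the precomposed 'ä'.
nfc = unicodedata.normalize("NFC", decomposed)

# Assumption for illustration: 'q' + diaeresis has no precomposed form,
# so NFC leaves it as two code points and '.' still cannot match it.
no_precomposed = unicodedata.normalize("NFC", "q\u0308")

print(matches_single_dot(precomposed))     # single code point
print(matches_single_dot(decomposed))      # two code points
print(matches_single_dot(nfc))             # composed by NFC
print(matches_single_dot(no_precomposed))  # still two code points
```

Normalizing the haystack up front helps for common accented letters, but it is not a substitute for grapheme-aware matching.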
@tbu- UAX#18§2.2 discusses matching grapheme clusters. The rule here, RL2.2, does not say that
I agree that it would be nice to have As regards other languages, the perlunicode man page has a nice summary of perl's support for UAX#18. One assumes if perl's implementation is missing it, most everyone else's is, too, and indeed they do not support RL2.2 at least as of perl5.20.1. |
Oh, I incorrectly assumed it would be described in "UAX#18§2.2.1 Grapheme Cluster Mode". |
Nothing incorrect about that; it would be nice to have that mode available, too! :) (And, I think, once RL2.2 is implemented, it would be pretty easy to add a switch for grapheme cluster mode.) |
I changed the title of this issue because this crate doesn't support grapheme clusters, so what you're seeing is intended behavior. I actually don't know much about graphemes and how they're encoded, but if they can be made to fit into the |
While I'm not against potentially adding this, I do think it would be a mighty big undertaking and it's not quite clear it could be feasibly done in the DFA. If someone wanted to move forward on working on this, then we can re-open this issue. Before writing any code, I would like to see a plan for how it would be added to the DFA though and at least some analysis on the memory required. |
Any chance for grapheme cluster support in the near future? One of our projects requires that we match them, and I would really like to stick with Rust rather than switch to Python, which unfortunately can do it:
import regex
haystack = "🏳️🌈F́́̂꣩⃧꣩ᷭ⳱꣪L⃒̲̀̀̀ᷛᷩ⃰᷃À̀̀̀꣢᪱̅̾ᷥG︪̯̰̀̀̀̀ͭ̇"
needle = "🏳️🌈\\X\\X\\X\\X"
foo = regex.search(needle, haystack)
print(foo)
prints
|
No, I don't see it happening. You can write the regex yourself if you want, which is here: https://github.com/BurntSushi/bstr/blob/master/scripts/regex/grapheme.sh The regex is quite gnarly though, but that's exactly what this crate would do if it supported If you describe the high level problem you're trying to solve, then there are almost certainly other approaches you could take here. It would be pretty surprising to me to switch your entire programming language based on this one thing. You could also use the |
I'm also interested in having this. For me the use case is finding emojis in text. Sounds like a simple problem, but consider these strings found in the wild:
The best algorithm I found to find these (and all other emoji) is this one:
import emoji
import regex
def split_count(text):
    emoji_list = []
    data = regex.findall(r'\X', text)
    for word in data:
        if any(char in emoji.UNICODE_EMOJI for char in word):
            emoji_list.append(word)
    return emoji_list
Everything else I've seen basically uses a whitelist of known emojis, but that misses variants that are supported in different devices by combining other emoji as shown above. So for each grapheme cluster, if at least one of the characters is a known emoji character, the whole cluster is counted as an emoji. This python code uses the The algorithm PCRE2 uses is described in detail in the Extended grapheme clusters section of the docs. I guess it's pretty similar to the .sh file @BurntSushi linked above, though it looks somewhat simpler |
@phiresky I don't see why you need a regex for that. Just iterate over all graphemes in the text and you can use the same detection logic. I'm not an expert on emojis but I would consult Unicode's specification for them. They likely have an algorithm for detecting them already written. And are you sure the |
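As a rough sketch of the "iterate over clusters, then apply the detection logic" approach without any regex: the segmentation below is a deliberately simplified approximation, not the UAX #29 algorithm. It only glues combining marks, ZWJ-joined characters, and skin-tone modifiers onto the preceding character. The names `approx_clusters`, `find_emoji_clusters`, and the tiny `KNOWN_EMOJI` set are hypothetical and exist only for this example; real code would use a proper grapheme segmentation library and a full emoji table.

```python
import unicodedata

ZWJ = "\u200d"

# Hypothetical, tiny stand-in for a real table of emoji code points.
KNOWN_EMOJI = {"\U0001F3F3", "\U0001F308", "\U0001F44D"}

def is_extender(ch):
    # Combining marks (Mn, Mc, Me; this includes variation selectors),
    # the zero-width joiner, and emoji skin-tone modifiers.
    return (
        unicodedata.category(ch).startswith("M")
        or ch == ZWJ
        or "\U0001F3FB" <= ch <= "\U0001F3FF"
    )

def approx_clusters(text):
    """Simplified approximation of grapheme clusters (NOT full UAX #29)."""
    clusters = []
    for ch in text:
        # Attach extenders, and whatever follows a ZWJ, to the previous cluster.
        if clusters and (is_extender(ch) or clusters[-1].endswith(ZWJ)):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def find_emoji_clusters(text):
    # A cluster counts as an emoji if any of its code points is a known emoji.
    return [c for c in approx_clusters(text)
            if any(ch in KNOWN_EMOJI for ch in c)]

rainbow_flag = "\U0001F3F3\ufe0f\u200d\U0001F308"
thumbs_up_dark = "\U0001F44D\U0001F3FF"
found = find_emoji_clusters("ok " + rainbow_flag + " a\u0308 " + thumbs_up_dark)
```

The detection step is the same "any known emoji code point in the cluster" rule as in the `split_count` function above; only the source of the clusters changes.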
The
That's true. It seems like the unicode_segmentation Rust crate does implement the same algorithm:
I guess that's off topic here then, but as far as I have found they only have a list of base emojis, and a list of "known" zero-width-joined sequences, as well as the general grapheme splitting algorithm. But sequences not in this list are not invalid, and they already don't have all combinations of skin tones etc in there I think. I think phone manufacturers etc just add emojis when they want so it's not really possible to create a comprehensive list. https://unicode.org/emoji/charts/index.html Edit: You're right, the regex pypi package does implement \X, but they have their own implementation separate from PCRE2. |
Right. As does Also, it's worth mentioning that there are two reasonable interpretations to what "support Unicode grapheme clusters" actually means in the context of a regex engine. The simplest is support for The other interpretation is more pervasive, whereby the regex engine itself becomes "grapheme aware." That is, things like |
The regex engine doesn't consider characters (graphemes) that consist of multiple code points correctly.
For example the letter 'ä' has two representations, that should both be matched by the regex
.
, however only the latter is.