-
-
Notifications
You must be signed in to change notification settings - Fork 556
reverse-string: Add unicode tests #2367
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I like this test case, but feel as though we should also be including cases for scripts/words that do not already have normalized combined forms. This case can be passed in Python without any extra processing, because these two ideographs are a pre-combined form and Python 'sees' them each as a single code point.
On the other hand, this string has to be normalized to
That being said, I don't have objections to adding the first test case. 😄 |
@BethanyG I'm not sure I understand your concern. What is a "normalized combined form" and an "ideograph"? The vocabulary around unicode is sometimes confusing to me. To add some context the test case The second test case For exmale, a simple solution with
|
@senekor - apologies, I am probably too far down a personal rabbit hole of Unicode documentation to communicate effectively at the moment! 😆 I used 'ideograph' for lack of a better term. 'Character' felt to loaded to me. What I meant to convey is that the string contains two 'characters' that are made up of individual strokes, but they have pre-composed equivalents that are represented by single codepoints in Unicode. So in a language like Python (which represents strings as a collection of Unicode codepoints), they behave no differently from plain ASCII -- there are no surprises in reversal, and no conversion or checking needed. Contrast that with this example in Hangul: "기운찰만하다"[::-1]
'ᅡ다ᄒᆫᅡᄆᆯᅡᄎᆫᅮ이ᄀ' However, this can be easily normalized to use the pre-combined Unicode codepoints: import unicodedata #This is part of Pythons std lib
print(unicodedata.normalize('NFC', "기운찰만하다")[::-1])
다하만찰운기
Your example can also be normalized into combined codepoints: import unicodedata
unicodedata.normalize('NFC', "Würstchenstand")[::-1])
dnatsnehctsrüW However, this word in Thai cannot simply be normalized. This word has to be iterated through and checked for combining marks, and steps then need to be taken to keep the combining marks together as a grapheme cluster: import unicode data
punicodedata.normalize('NFC', "ผู้เขียนโปรแกรม")[::-1]
มรกแรปโนยีขเู้ผ #diacritical marks are now off by 1 As does this word in Bangla (which uses zero-width joiners): import unicode data
unicodedata.normalize('NFC', "শিক্ষা দেওয়া")[::-1]
া়যওেদ াষ্কিশ So these particular cases would force additional (and probably complex) code .... or the use of Unicode regex or another third-party library that implements a grapheme cluster break algorithm. ...which is all a long-winded way of saying Unicode is complicated, as are graphemes. 😄 |
@BethanyG thank you, I think I understand it better now :) What is it exaclty you suggest we do? Should I replace |
@senekor Maybe rather than replace, we can add
There are more cases (and let us not consider flags or emoji) - but as you have said, to handle everything correctly requires an expert written third-party library. And even then, language-specific knowledge is also required for anything complicated. But I think these cases are enough to prompt students into thinking about the complexities and encouraging them to learn more. |
Alright, I added a third case with the Thai word. @BethanyG is the |
Apologies for the delay, @senekor -- Yes! That is a correct reversal. 😄 Many thanks for your work and your patience. |
Thank you! 🙂 |
forum post