Skip to content

reverse-string: Add unicode tests #2367

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jan 25, 2024
Merged

reverse-string: Add unicode tests #2367

merged 2 commits into from
Jan 25, 2024

Conversation

senekor
Copy link
Contributor

@senekor senekor commented Jan 24, 2024

@BethanyG
Copy link
Member

I like this test case, but feel as though we should also be including cases for scripts/words that do not already have normalized combined forms. This case can be passed in Python without any extra processing, because these two ideographs are a pre-combined form and Python 'sees' them each as a single code point.

猫" --> U+29483
'子' --> U+23376

On the other hand, this string has to be normalized to NFC (forcing the pre-combined forms) or checked for combining marks:

aînée

That being said, I don't have objections to adding the first test case. 😄

@senekor
Copy link
Contributor Author

senekor commented Jan 25, 2024

@BethanyG I'm not sure I understand your concern. What is a "normalized combined form" and an "ideograph"? The vocabulary around unicode is sometimes confusing to me.

To add some context the test case "子猫" is there to prevent the string from being treated as a byte array, because these characters are multiple bytes in UTF-8. I expect this test case to be easy in any language that has a notion of unicode built-in.

The second test case "Würstchenstand" contains what I usually see referred to as a "grapheme cluster" where two unicode scalar values combine to one human-readable character. Note that ü can be represented as a single scalar value (UTF-8 bytes: [195, 188]) or the regular ascii u followed by a diaeresis ◌̈ (UTF-8 bytes: [117, 204, 136]).

For exmale, a simple solution with [::-1] results in the diaeresis being on the r:

>>> "Würstchenstand"[::-1]
'dnatsnehctsr̈uW'

@BethanyG
Copy link
Member

@senekor - apologies, I am probably too far down a personal rabbit hole of Unicode documentation to communicate effectively at the moment! 😆

I used 'ideograph' for lack of a better term. 'Character' felt to loaded to me.

What I meant to convey is that the string contains two 'characters' that are made up of individual strokes, but they have pre-composed equivalents that are represented by single codepoints in Unicode. So in a language like Python (which represents strings as a collection of Unicode codepoints), they behave no differently from plain ASCII -- there are no surprises in reversal, and no conversion or checking needed.

Contrast that with this example in Hangul:

"기운찰만하다"[::-1]
'ᅡ다ᄒᆫᅡᄆᆯᅡᄎᆫᅮ이ᄀ'

However, this can be easily normalized to use the pre-combined Unicode codepoints:

import unicodedata #This is part of Pythons std lib

print(unicodedata.normalize('NFC', "기운찰만하다")[::-1])
다하만찰운기

The second test case "Würstchenstand" contains what I usually see referred to as a "grapheme cluster" where two unicode scalar values combine to one human-readable character. Note that ü can be represented as a single scalar value (UTF-8 bytes: [195, 188]) or the regular ascii u followed by a diaeresis ◌̈ (UTF-8 bytes: [117, 204, 136]).

For exmale, a simple solution with [::-1] results in the diaeresis being on the r:

>>> "Würstchenstand"[::-1]
'dnatsnehctsr̈uW'

Your example can also be normalized into combined codepoints:

import unicodedata

unicodedata.normalize('NFC', "Würstchenstand")[::-1])
dnatsnehctsrüW

However, this word in Thai cannot simply be normalized. This word has to be iterated through and checked for combining marks, and steps then need to be taken to keep the combining marks together as a grapheme cluster:

import unicode data

punicodedata.normalize('NFC', "ผู้เขียนโปรแกรม")[::-1]
มรกแรปโนยีขเู้ผ   #diacritical marks are now off by 1

As does this word in Bangla (which uses zero-width joiners):

import unicode data

unicodedata.normalize('NFC', "শিক্ষা দেওয়া")[::-1]
া়যওেদষ্কিশ

So these particular cases would force additional (and probably complex) code .... or the use of Unicode regex or another third-party library that implements a grapheme cluster break algorithm.

...which is all a long-winded way of saying Unicode is complicated, as are graphemes. 😄

@senekor
Copy link
Contributor Author

senekor commented Jan 25, 2024

@BethanyG thank you, I think I understand it better now :) What is it exaclty you suggest we do? Should I replace Würstchenstand with ผู้เขียนโปรแกรม? That would force people to have logic that cosiders grapheme clusters. (on the Rust track, we encourage the use of a third-party library for this)

@BethanyG
Copy link
Member

@senekor Maybe rather than replace, we can add ผู้เขียนโปรแกรม as a third case? That way we've covered at least three major categories:

  1. Single Unicode codepoints that are multi-byte
  2. Unicode codepoints with combining characters that have combined Unicode equivalents
  3. Unicode codepoints that are meant to combine together into extended grapheme clusters without any combined equivalent.

There are more cases (and let us not consider flags or emoji) - but as you have said, to handle everything correctly requires an expert written third-party library. And even then, language-specific knowledge is also required for anything complicated. But I think these cases are enough to prompt students into thinking about the complexities and encouraging them to learn more.

@senekor
Copy link
Contributor Author

senekor commented Jan 25, 2024

Alright, I added a third case with the Thai word. @BethanyG is the expected version correct?

@BethanyG
Copy link
Member

Apologies for the delay, @senekor -- Yes! That is a correct reversal. 😄 Many thanks for your work and your patience.

@senekor
Copy link
Contributor Author

senekor commented Jan 25, 2024

Thank you! 🙂

@senekor senekor merged commit 94cae6e into main Jan 25, 2024
@senekor senekor deleted the reverse-string-unicode branch January 25, 2024 19:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants