to_lowercase only uses unconditional parts of unicode.org's special-casing #51362
Labels
A-Unicode
Area: Unicode
C-discussion
Category: Discussion or questions that doesn't represent real issues.
T-libs-api
Relevant to the library API team, which will review and decide on the PR/issue.
https://github.com/rust-lang/rust/blob/f9157f5b869fdb14308eaf6778d01ee3d0e1268a/src/libcore/unicode/unicode.py#L168-169
Since #25800, to_lowercase uses unicode.org's SpecialCasing.txt.
However, it only follows unconditional rules from this file. One "interesting" case is:
I think that (2) only makes sense when accompanied by (3): they are in the same file and touch the same character, with (3) applying to tr/az and (2) to all other languages.
But because only unconditional rules are handled, we end up with a hybrid: a rule that was intended for non-tr/az languages, as a contrast to tr/az, is applied everywhere, while the default language-independent mapping from UnicodeData.txt is ignored.
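To make the behaviour concrete, here is a minimal sketch of what the standard library currently returns. It assumes the character under discussion is U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE), which is the character with tr/az rules in SpecialCasing.txt; `lower_of` is a helper name made up for this example.

```rust
/// Hypothetical helper: collect the result of `char::to_lowercase` into a String.
fn lower_of(c: char) -> String {
    c.to_lowercase().collect()
}

fn main() {
    // U+0130 lowercases to two code points: 'i' followed by U+0307
    // (COMBINING DOT ABOVE), as in the unconditional SpecialCasing.txt rule.
    let lowered = lower_of('\u{0130}');
    assert_eq!(lowered, "i\u{0307}");
    assert_eq!(lowered.chars().count(), 2);

    // An ordinary letter uses the simple one-to-one mapping.
    assert_eq!(lower_of('A'), "a");

    println!("{:?}", lowered);
}
```

The simple mapping in UnicodeData.txt would instead give a plain `i` (U+0069) for this character.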
I realize that it's quite a corner case, and open to interpretation.
Also, SpecialCasing.txt contains other useful unconditional rules that are worth having, so it would be unfortunate to lose those.
And (2) does have the advantage of making lowercasing reversible, though that is not a goal of Unicode AFAIK.
So in the end, I'm not sure whether or how this should be fixed, other than by implementing conditions, which would require handling languages.
A compromise would be to ignore this one rule, by hard-coding an exception.
This restriction could be made less arbitrary by stating it as a rule: unconditional rules are only accepted for characters that do not also have conditional rules.
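A rough sketch of that filter, written in Rust for illustration only (the actual table generator is the Python script linked above, and this is not its code). The names `Rule`, `parse_rules`, `accepted_lowercase`, and `SAMPLE` are made up here, and `SAMPLE` is a small excerpt in the SpecialCasing.txt field layout (`code; lower; title; upper; [conditions;] # comment`), assumed for the example rather than the full file.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Illustrative excerpt in SpecialCasing.txt format (assumed, not the full file).
const SAMPLE: &str = "\
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE
";

/// Hypothetical parsed rule: only the fields needed for lowercasing.
struct Rule {
    code: u32,
    lower: Vec<u32>,
    conditional: bool,
}

fn parse_rules(text: &str) -> Vec<Rule> {
    let mut rules = Vec::new();
    for line in text.lines() {
        // Drop the trailing comment, then split the data part on ';'.
        let data = line.split('#').next().unwrap_or("").trim();
        let fields: Vec<&str> = data.split(';').map(str::trim).collect();
        if fields.len() < 4 {
            continue; // blank or malformed line
        }
        let code = match u32::from_str_radix(fields[0], 16) {
            Ok(c) => c,
            Err(_) => continue,
        };
        let lower = fields[1]
            .split_whitespace()
            .filter_map(|h| u32::from_str_radix(h, 16).ok())
            .collect();
        // A non-empty fifth field is a condition list (language or context).
        let conditional = fields.get(4).map_or(false, |f| !f.is_empty());
        rules.push(Rule { code, lower, conditional });
    }
    rules
}

/// Keep unconditional lowercase rules only for code points that have
/// no conditional rule anywhere in the input.
fn accepted_lowercase(text: &str) -> BTreeMap<u32, Vec<u32>> {
    let rules = parse_rules(text);
    let with_conditions: BTreeSet<u32> = rules
        .iter()
        .filter(|r| r.conditional)
        .map(|r| r.code)
        .collect();
    rules
        .into_iter()
        .filter(|r| !r.conditional && !with_conditions.contains(&r.code))
        .map(|r| (r.code, r.lower))
        .collect()
}

fn main() {
    let accepted = accepted_lowercase(SAMPLE);
    // U+0130 has tr/az rules, so its unconditional rule is skipped.
    assert!(!accepted.contains_key(&0x0130));
    // U+00DF has no conditional rule, so its rule is kept.
    assert!(accepted.contains_key(&0x00DF));
}
```

With this filter, a skipped character such as U+0130 would presumably fall back to the simple mapping from UnicodeData.txt, while the other unconditional rules stay in effect.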
I understand if this won't be fixed; I at least wanted to bring attention to this case.