to_lowercase only uses unconditional parts of unicode.org's special-casing #51362

Open
squelart opened this issue Jun 5, 2018 · 1 comment
Labels
A-Unicode Area: Unicode C-discussion Category: Discussion or questions that doesn't represent real issues. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

squelart commented Jun 5, 2018

https://github.com/rust-lang/rust/blob/f9157f5b869fdb14308eaf6778d01ee3d0e1268a/src/libcore/unicode/unicode.py#L168-169

Since #25800, to_lowercase uses unicode.org's SpecialCasing.txt.
However, it only follows unconditional rules from this file. One "interesting" case is:

  1. The main UnicodeData.txt file says that the lowercase of 'İ' (0130, Latin capital letter I with dot above) should be 'i' (0069, good-old boring ASCII Latin small letter i).
  2. SpecialCasing.txt adds an unconditional rule that 'İ' (0130) should in fact be lowercased to 'i̇' (0069 Latin small letter i + 0307 combining dot above).
  3. SpecialCasing.txt then adds a conditional rule for tr (Turkish) and az (Azerbaijani) under which 'İ' (0130) is lowercased to just 'i' (0069 Latin small letter i). Related rules in the same file make sure that dotted capital and small i's map to each other, and likewise the dotless ones.
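
The current behavior described in (2) can be observed with std alone. This is a small sketch; `code_points` is a helper name introduced here for illustration, not anything from std:

```rust
/// Illustrative helper (not part of std): returns the Unicode scalar values
/// of `s`, so that combining marks are visible instead of being rendered
/// on top of the previous character.
fn code_points(s: &str) -> Vec<u32> {
    s.chars().map(|c| c as u32).collect()
}

/// Lowercases U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) with std
/// and returns the resulting scalar values.
fn dotted_capital_i_lowercased() -> Vec<u32> {
    code_points(&"\u{130}".to_lowercase())
}
```

`dotted_capital_i_lowercased()` returns `[0x69, 0x307]`, i.e. the unconditional SpecialCasing.txt mapping, rather than the single `0x69` from UnicodeData.txt.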

I think that (2) only makes sense when accompanied by (3): They are in the same file, touching the same character; (3) for tr/az and (2) for other languages.
But because only unconditional rules are handled, we end up with a hybrid: a mapping that was intended for non-tr/az languages, as a contrast to the tr/az rules, gets applied to everything, while the default language-independent mapping from UnicodeData.txt is ignored.

I realize that it's quite a corner case, and open to interpretation.
Also, SpecialCasing.txt contains other useful unconditional rules that are worth having, so it would be unfortunate to lose those.
And (2) does have the advantage of making lowercasing reversible, though that's not a goal of Unicode AFAIK.

So in the end, I'm not sure whether or how this should be fixed -- other than implementing conditions, which would require handling languages.

A compromise would be to ignore this one rule, by hard-coding an exception.
This restriction could be made less arbitrary, by saying: Unconditional rules are only accepted for characters that do not also have conditional rules.
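
To make the hard-coded exception concrete, here is a minimal user-side sketch. `to_lowercase_plain_dotted_i` is a hypothetical helper, not an existing or proposed std API, and it only covers the single character discussed here:

```rust
/// Hypothetical helper (not part of std): lowercases `s` like
/// `str::to_lowercase`, except that U+0130 is mapped to plain U+0069,
/// as in UnicodeData.txt, instead of U+0069 U+0307.
///
/// Simplification: the input is split at every U+0130, so context-sensitive
/// handling inside `str::to_lowercase` (such as Greek final sigma) sees each
/// piece in isolation.
fn to_lowercase_plain_dotted_i(s: &str) -> String {
    s.split('\u{130}')
        .map(|piece| piece.to_lowercase())
        .collect::<Vec<_>>()
        .join("i")
}
```

Splitting on U+0130 rather than post-processing the output keeps an `i` + U+0307 sequence that was already present in the input untouched.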

I understand if this won't be fixed, I at least wanted to bring attention to this case.

@estebank estebank added A-Unicode Area: Unicode T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Jun 5, 2018
@Enselic Enselic added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Oct 16, 2023
@crlf0710 crlf0710 added C-discussion Category: Discussion or questions that doesn't represent real issues. and removed C-enhancement Category: An issue proposing an enhancement or a PR with one. labels Mar 2, 2025
crlf0710 (Member) commented Mar 2, 2025

Unicode Standard Clause 3.13.1 Definitions says:

The full case mappings for Unicode characters are obtained by using the mappings
from SpecialCasing.txt plus the mappings from UnicodeData.txt, excluding any of the
latter mappings that would conflict.

So of the items you mention, (2) and (3) take priority over (1). Rust stdlib's current behavior is reasonable and doesn't really need to be fixed. It's true that there is no way to assert that a specific String value contains tr/az language text, but I believe this kind of language annotation is out of scope for the stdlib String type. One may refer to APIs like https://docs.rs/icu/latest/icu/casemap/struct.CaseMapper.html#method.lowercase to access the expected low-level behavior.
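
For illustration only, a language-aware lowercasing could look roughly like the std-only sketch below. The `Lang` enum and `lowercase_for` function are hypothetical names, and the tr/az conditions from SpecialCasing.txt are simplified (only an immediately following U+0307 is checked); it is not a substitute for a real implementation such as the icu crate:

```rust
/// Hypothetical language tag; std's `String` carries no such information.
#[derive(Clone, Copy, PartialEq)]
enum Lang {
    /// tr (Turkish) or az (Azerbaijani).
    Turkic,
    /// Any other language: defer to std's unconditional mappings.
    Other,
}

/// Simplified sketch of language-aware lowercasing (an assumption for
/// illustration, not the algorithm of any existing library).
fn lowercase_for(s: &str, lang: Lang) -> String {
    if lang != Lang::Turkic {
        return s.to_lowercase();
    }
    let mut out = String::with_capacity(s.len());
    let mut chars = s.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            // tr/az rule: U+0130 lowercases to plain U+0069.
            '\u{130}' => out.push('i'),
            'I' => {
                if chars.peek() == Some(&'\u{307}') {
                    // 'I' directly followed by a combining dot above:
                    // emit 'i' and drop the dot.
                    chars.next();
                    out.push('i');
                } else {
                    // Otherwise 'I' lowercases to dotless U+0131.
                    out.push('\u{131}');
                }
            }
            // Per-char lowercasing; unlike `str::to_lowercase`, this does
            // not apply any context-sensitive rules such as final sigma.
            _ => out.extend(c.to_lowercase()),
        }
    }
    out
}
```

With this sketch, `lowercase_for("İ", Lang::Turkic)` gives `"i"`, while `lowercase_for("İ", Lang::Other)` keeps the std result of `i` followed by U+0307.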

It would be very nice if a community crate rose to become the go-to solution for multilingual string values in the future, but this has not happened yet.
