to_lowercase only uses unconditional parts of unicode.org's special-casing #51362

squelart · 2018-06-05T08:05:55Z

https://github.com/rust-lang/rust/blob/f9157f5b869fdb14308eaf6778d01ee3d0e1268a/src/libcore/unicode/unicode.py#L168-169

Since #25800, to_lowercase uses unicode.org's SpecialCasing.txt.
However, it only follows unconditional rules from this file. One "interesting" case is:

(1) The main UnicodeData.txt file says that the lowercase of 'İ' (0130, Latin capital letter I with dot above) should be 'i' (0069, good-old boring ASCII Latin small letter i).
(2) SpecialCasing.txt adds an unconditional rule that 'İ' (0130) should in fact be lowercased to 'i̇' (0069 Latin small letter i + 0307 combining dot above).
(3) SpecialCasing.txt then adds a rule for tr (Turkish) and az (Azerbaijani) where 'İ' (0130) should now be lowercased to just 'i' (0069 Latin small letter i) -- There are other related rules, dotted-i's match and non-dotted-i's match too.

I think that (2) only makes sense when accompanied by (3): They are in the same file, touching the same character; (3) for tr/az and (2) for other languages.
But because only unconditional rules are handled, we end up with a hybrid: rule (2), which was intended for non-tr/az languages in contrast with the tr/az rule (3), is applied to all text, while the default language-independent mapping (1) from UnicodeData.txt is ignored.
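
To make the current behavior concrete, here is a small check using only std. The helper names `code_points` and `demo` are mine, purely for illustration:

```rust
/// Lists the code points of a string as hex, e.g. "0069 0307".
fn code_points(s: &str) -> String {
    s.chars()
        .map(|c| format!("{:04X}", c as u32))
        .collect::<Vec<_>>()
        .join(" ")
}

/// Call this from `main` to check the behavior described above.
fn demo() {
    // U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
    let lowered = "\u{130}".to_lowercase();

    // The unconditional SpecialCasing.txt rule wins:
    // small letter i followed by combining dot above.
    assert_eq!(code_points(&lowered), "0069 0307");

    // It is not the plain UnicodeData.txt mapping, which would be a bare "i".
    assert_ne!(lowered, "i");
}
```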

I realize that it's quite a corner case, and open to interpretation.
Also, SpecialCasing.txt contains other useful unconditional rules that are worth having, so it would be unfortunate to lose those.
And (2) does have the advantage of making lowercasing reversible, though that's not a goal of Unicode AFAIK.
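
A rough illustration of that reversibility, std only (`round_trip` is a made-up helper name). Note that the round trip yields 'I' + U+0307 rather than the precomposed U+0130; the two are canonically equivalent, but std does no normalization, so they do not compare equal as strings:

```rust
/// Lowercases and then uppercases again, using only std.
fn round_trip(s: &str) -> String {
    s.to_lowercase().to_uppercase()
}

/// Call this from `main` to check the round trip.
fn demo_round_trip() {
    // With the unconditional rule, the dot survives as U+0307.
    assert_eq!(round_trip("\u{130}"), "I\u{307}");

    // Had the lowercase been a bare "i", uppercasing would give a bare "I"
    // and the dot would be lost.
    assert_eq!("i".to_uppercase(), "I");

    // The result is not byte-for-byte the original precomposed character.
    assert_ne!(round_trip("\u{130}"), "\u{130}");
}
```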

So in the end, I'm not sure if and how this should be fixed -- other than implementing conditions, which would require handling languages.
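
To give an idea of what handling languages would involve, here is a deliberately simplified sketch of a tr/az tailoring. `to_lowercase_tr_az` is a hypothetical helper, not an existing std API, and it ignores the context conditions that SpecialCasing.txt attaches to these rules (for example a capital I followed by a combining dot above):

```rust
/// Hypothetical helper: lowercases `s` with a simplified tr/az tailoring.
/// Assumption: every 'I' is a standalone letter, so the contextual
/// conditions from SpecialCasing.txt are not checked here.
fn to_lowercase_tr_az(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            // Dotted capital I lowercases to the plain dotted small i.
            '\u{130}' => out.push('i'),
            // Dotless capital I lowercases to dotless small i (U+0131).
            'I' => out.push('\u{131}'),
            // Everything else uses the per-character std mapping
            // (which has no context-sensitive handling such as final sigma).
            _ => out.extend(c.to_lowercase()),
        }
    }
    out
}
```

For example, `to_lowercase_tr_az("\u{130}STANBUL")` gives "istanbul", and a capital 'I' elsewhere becomes 'ı'. A real implementation would also need a way to know the language of the text, which is exactly the missing piece.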

A compromise would be to ignore this one rule, by hard-coding an exception.
This exception could be made less arbitrary by stating it as a general rule: unconditional rules are only accepted for characters that do not also have conditional rules.
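
As a user-side approximation of that compromise, something like the following could work today. `to_lowercase_plain_dotted_i` is a hypothetical name, and the real fix would live in the table generation script rather than in a wrapper like this:

```rust
/// Hypothetical helper (not a std API): lowercases `s` character by
/// character, except that U+0130 maps to a bare 'i' as in UnicodeData.txt.
/// Caveat: this works per character, so it does not reproduce the
/// context-sensitive final-sigma handling of `str::to_lowercase`.
fn to_lowercase_plain_dotted_i(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        if c == '\u{130}' {
            // The hard-coded exception: drop the combining dot above.
            out.push('i');
        } else {
            // All other characters keep their usual std mapping.
            out.extend(c.to_lowercase());
        }
    }
    out
}
```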

I understand if this won't be fixed; I at least wanted to bring attention to this case.

crlf0710 · 2025-03-02T12:27:46Z

Unicode Standard Clause 3.13.1 Definitions says:

The full case mappings for Unicode characters are obtained by using the mappings
from SpecialCasing.txt plus the mappings from UnicodeData.txt, excluding any of the
latter mappings that would conflict.

So among the items you mention, (2) and (3) take priority over (1). Rust stdlib's current behavior is reasonable and doesn't really need to be fixed. It's true that there's no way to assert that a specific String value contains tr/az language text, but I believe this kind of language annotation is out of scope for the stdlib String type. One may refer to an API like https://docs.rs/icu/latest/icu/casemap/struct.CaseMapper.html#method.lowercase to access the expected lower-level behaviors.

It would be very nice if a community crate rose up to become the go-to solution for multilingual string values in the future, but this has not happened yet.

estebank added A-Unicode Area: Unicode T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Jun 5, 2018

Enselic added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Oct 16, 2023

crlf0710 added C-discussion Category: Discussion or questions that doesn't represent real issues. and removed C-enhancement Category: An issue proposing an enhancement or a PR with one. labels Mar 2, 2025
