x/text/collate: ignores case and diacritics by default #67296
Comments
For reference, the comparisons in the play example:
The sort order seems correct to me, according to TR10's collation algorithm. The expected outputs you listed are what would happen with an ASCII-like comparison of code points, but the Unicode collation algorithm does a lot more. https://www.unicode.org/reports/tr10/#Scope is the crucial part: each character is broken up into different "levels", with different weights in each level. Then the entire string is compared one level at a time, and the comparison stops as soon as it finds any difference.

The default collation behavior mimics the cultural default of many Latin languages: level 1 encodes the "base" character with no diacritics (accents) and no upper/lowercase distinction. Level 2 encodes only the diacritics, and level 3 encodes upper/lowercase. So, for the example strings you provided, all the sort orders are correct:
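To make the level-by-level idea concrete, here is a minimal toy sketch using only the standard library. It is not how x/text/collate is implemented: the `weights` type, `toyTable`, and every weight value are invented for illustration, and real weights come from the Unicode collation tables. It only shows why a difference at an earlier level hides any difference at a later one.

```go
package main

import "fmt"

// weights holds made-up collation weights for one character.
// Hypothetical values for illustration only, not taken from any real table.
type weights struct {
	primary   int // base letter
	secondary int // diacritic
	tertiary  int // upper/lowercase
}

// toyTable covers only the few characters used in this sketch.
// Characters missing from the table get all-zero weights.
var toyTable = map[rune]weights{
	'a':      {1, 0, 0},
	'A':      {1, 0, 1},
	'\u00e1': {1, 1, 0}, // a with acute accent
	'b':      {2, 0, 0},
	'B':      {2, 0, 1},
}

// levelWeights returns the weights of every character in s at one level
// (0 = primary, 1 = secondary, anything else = tertiary).
func levelWeights(s string, lvl int) []int {
	var out []int
	for _, r := range s {
		w := toyTable[r]
		switch lvl {
		case 0:
			out = append(out, w.primary)
		case 1:
			out = append(out, w.secondary)
		default:
			out = append(out, w.tertiary)
		}
	}
	return out
}

// compareInts compares two weight sequences lexicographically.
func compareInts(a, b []int) int {
	for i := 0; i < len(a) && i < len(b); i++ {
		if a[i] < b[i] {
			return -1
		}
		if a[i] > b[i] {
			return 1
		}
	}
	switch {
	case len(a) < len(b):
		return -1
	case len(a) > len(b):
		return 1
	}
	return 0
}

// toyCompare compares the whole strings at the primary level first and
// only moves on to the next level when the current one is a tie.
func toyCompare(a, b string) int {
	for lvl := 0; lvl < 3; lvl++ {
		if c := compareInts(levelWeights(a, lvl), levelWeights(b, lvl)); c != 0 {
			return c
		}
	}
	return 0
}

func main() {
	fmt.Println(toyCompare("\u00e1a", "ab")) // decided at the primary level, the accent is never consulted
	fmt.Println(toyCompare("\u00e1b", "ab")) // primary tie, decided by the accent
	fmt.Println(toyCompare("Ab", "ab"))      // primary and secondary tie, decided by case
}
```

In this toy model the accent or the capital letter only matters once the base letters of the whole string are identical, which is the behavior described above.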
Similarly
In both cases, the strings compare unequal at the primary level, so the diacritics and upper/lowercase do not contribute to the collation result. OTOH if you look at
Different locales can override and adjust these collation levels to match the local culture, but the default English collation is as described above. If you really want the expected output you described, you can use |
I see! Thanks a lot for the exhaustive writeup, I would have never guessed such a reason behind this. Of course, the incompatibility of complex collation with case/diacritics sensitivity might also be considered an issue in itself, but my use case is fully handled by your solution.
Well, uh, the inverse problem is now preventing the use of this solution: https://go.dev/play/p/c2fKP4d59c3
Yeah, more documentation would be good. In general, Unicode has to deal with the complexity of all human languages, and so a lot of the configuration options and behaviors are not necessarily intuitive :( x/text/collate is a relatively low-level package that closely follows TR10, and TR10 assumes you already know quite a lot about Unicode.

And yes, because of the way

In the

If you want your two options to be "ignore case completely" and "treat case as more important than anything else", then I don't know of any collations that will do a good job of the second one, because it's not how English is sorted :( If you're okay with the weaker behavior "treat ASCII case as more important than anything else, and do whatever for non-ASCII", then you can get the behavior you want by switching completely between different collations:
However, be warned that you might get complaints that the sort order doesn't make sense to people. The posix ordering is quite unintuitive if you're not writing pure ASCII on a mainframe. Consider:
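As a rough illustration of that warning, the sketch below sorts a made-up list of names by raw byte order with the standard library's `sort.Strings`. For valid UTF-8 that is the same as code point order, which I am assuming is close to what a C/POSIX-style ordering gives. The names and the `sortByCodePoint` helper are hypothetical and are not the list from the original comment.

```go
package main

import (
	"fmt"
	"sort"
)

// sortByCodePoint returns a sorted copy of names using plain byte order.
// For valid UTF-8 this is the same as Unicode code point order.
func sortByCodePoint(names []string) []string {
	out := append([]string(nil), names...)
	sort.Strings(out)
	return out
}

func main() {
	// Hypothetical sample data; "\u00e1" is a with acute accent.
	names := []string{"javier", "Zoe", "J\u00e1vier", "adam", "JAVIER"}
	for _, n := range sortByCodePoint(names) {
		fmt.Println(n)
	}
	// Prints JAVIER, Jávier, Zoe, adam, javier (one per line):
	// every capitalized name sorts before every lowercase one.
}
```

Note how "Zoe" lands before "adam", and how the three spellings of the same name end up scattered across the list.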
Pretend you don't know the ASCII table for a moment: if you're looking in this list for someone called Jávier, who maybe had capslock enabled the day he filled out the form, which list makes intuitive sense?

Looks like your application is a terminal-based file manager. For that, ideally you should read the

Unfortunately, if I'm reading #25340 correctly, parsing language tags from these env vars has some sharp edges:
That will make the sort order match what your users expect, no matter what language/culture they are in. For example, if I set |
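For what it's worth, here is a hedged, stdlib-only sketch of the first half of that job: picking the locale environment variable that governs collation and trimming it into something that looks like a language tag. The precedence LC_ALL, then LC_COLLATE, then LANG is the usual POSIX rule. The helper names `collationLocale` and `toTagCandidate` are my own, and this is not the snippet from the comment above. The result would still have to go through a real tag parser (for example `language.Parse` from golang.org/x/text/language), which is not shown here.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// collationLocale returns the raw POSIX locale name that governs collation,
// checking LC_ALL, then LC_COLLATE, then LANG. The getenv function is
// passed in so the lookup can be exercised without the real environment.
func collationLocale(getenv func(string) string) string {
	for _, name := range []string{"LC_ALL", "LC_COLLATE", "LANG"} {
		if v := getenv(name); v != "" {
			return v
		}
	}
	return ""
}

// toTagCandidate turns a POSIX locale name such as "de_DE.UTF-8@euro" into
// a string shaped like a language tag, such as "de-DE". The special locales
// "C" and "POSIX" (and an empty value) map to "", meaning "no language".
func toTagCandidate(locale string) string {
	// Drop the codeset (".UTF-8") and modifier ("@euro") suffixes.
	if i := strings.IndexAny(locale, ".@"); i >= 0 {
		locale = locale[:i]
	}
	if locale == "" || locale == "C" || locale == "POSIX" {
		return ""
	}
	return strings.ReplaceAll(locale, "_", "-")
}

func main() {
	raw := collationLocale(os.Getenv)
	fmt.Printf("raw locale %q, tag candidate %q\n", raw, toTagCandidate(raw))
}
```

An empty result is a signal to fall back to whatever default ordering the application prefers.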
Thanks a lot again for your effortposting. Seeing the current lf configuration knobs, and with my guess of how the target userbase (mostly POSIX OS programmers) sees string comparison, I think trying to use collation was a mistake on my part. Working directly with Unicode without locale interpretation (via the https://pkg.go.dev/strings package for EqualFold, and NFD Unicode decomposition to handle diacritics) seems closer to what's wanted.
Go version
go version go1.22.2 linux/amd64
Output of go env in your module/workspace:
What did you do?
https://go.dev/play/p/hkm5UYyP1Sx
What did you see happen?
Output is -1 -1 -1 1. It seems the options collate.IgnoreCase and collate.IgnoreDiacritics are set by default with no way for the user to unset them.
What did you expect to see?
1 -1 1 -1 or at least a different value for the first two.