x/text/collate: ignores case and diacritics by default #67296

Open
q3cpma opened this issue May 10, 2024 · 6 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@q3cpma

q3cpma commented May 10, 2024

Go version

go version go1.22.2 linux/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/q3cpma/.cache/go-build'
GOENV='/home/q3cpma/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/q3cpma/.go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/q3cpma/.go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/lib/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/usr/lib/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.22.2'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='x86_64-pc-linux-gnu-gcc'
CXX='x86_64-pc-linux-gnu-g++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build4078013636=/tmp/go-build -gno-record-gcc-switches'

What did you do?

https://go.dev/play/p/hkm5UYyP1Sx

What did you see happen?

Output is -1 -1 -1 1. It seems the options collate.IgnoreCase and collate.IgnoreDiacritics are set by default, with no way for the user to unset them.

What did you expect to see?

1 -1 1 -1 or at least a different value for the first two.

@dmitshur
Member

CC @mpvl per owners.

@dmitshur dmitshur added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label May 13, 2024
@danderson
Contributor

For reference, the comparisons in the play example:

c.CompareString("a", "B"),
c.CompareString("A", "b"),
c.CompareString("é", "f"),
c.CompareString("2", "10"),

The sort order seems correct to me, according to TR10's collation algorithm. The expected outputs you listed are what would happen with an ASCII-like comparison of code points, but the Unicode collation algorithm does a lot more.

https://www.unicode.org/reports/tr10/#Scope is the crucial part: each character is broken up into different "levels", with different weights in each level. Then the entire string is compared one level at a time, and the comparison stops as soon as it finds any difference.

The default collation behavior mimics the cultural default of many Latin languages: level 1 encodes the "base" character with no diacritics (accents) and no upper/lowercase distinction. Level 2 encodes only the diacritics, and level 3 encodes upper/lowercase.

So, for the example strings you provided, all the sort orders are correct: c.CompareString("a", "B") is transformed into the equivalent of

if res := cmp.Compare("a", "b"); res != 0 {
    return res
}

// nothing in level 2 for this input

return cmp.Compare(2, 8) // level 3 values that mean "is lowercase" and "is uppercase"

Similarly c.CompareString("é", "f") becomes:

if res := cmp.Compare("e", "f"); res != 0 {
    return res
}

// that first string is "Combining Acute Accent", just the accent
if res := cmp.Compare("\u0301", ""); res != 0 {
    return res
}

return cmp.Compare(2, 2) // magic numbers for "is lowercase"

In both cases, the strings compare unequal at the primary level, so the diacritics and upper/lowercase do not contribute to the collation result. OTOH if you look at CompareString("a", "A") and CompareString("A", "a"), the only distinction between the inputs is upper vs. lowercase, the algorithm reaches the 3rd collation level, and correctly returns -1 and +1.

collate.IgnoreCase and collate.IgnoreDiacritics do something different: they delete the corresponding level entirely. If you IgnoreCase, the collator will not check the 3rd level when comparing, and so CompareString("a", "A") == 0. IgnoreDiacritics deletes the 2nd level, which also makes CompareString("é", "e") == 0.
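
To make the level-by-level comparison concrete, here is a toy sketch in plain Go. Everything in it is invented for illustration: the `weights` type, the tiny `table`, and the `compare` function are hypothetical, the weight values are made up, and this is not how x/text/collate is implemented (real TR10 weights come from the Unicode collation tables). The two boolean flags only mimic the effect described above, where ignoring case or diacritics drops a whole level from the comparison.

```go
package main

import (
	"fmt"
	"slices"
)

// weights holds toy collation weights for one character.
// All values are made up for this sketch.
type weights struct {
	primary   int // base letter, ignoring accents and case
	secondary int // diacritic: 0 = none, 1 = acute accent
	tertiary  int // case: 2 = lowercase, 8 = uppercase
}

// table is a hypothetical collation table covering only a few letters.
// Characters missing from it get all-zero weights.
var table = map[rune]weights{
	'a': {1, 0, 2}, 'A': {1, 0, 8},
	'b': {2, 0, 2}, 'B': {2, 0, 8},
	'e': {5, 0, 2}, '\u00e9': {5, 1, 2}, // U+00E9 is precomposed é
	'f': {6, 0, 2},
}

// levelKey extracts the weights of one level for every character in s.
func levelKey(s string, pick func(weights) int) []int {
	key := make([]int, 0, len(s))
	for _, r := range s {
		key = append(key, pick(table[r]))
	}
	return key
}

// compare checks the whole string at level 1, then level 2, then level 3,
// and stops at the first level that differs. Setting a flag skips that
// level entirely, so differences in it can no longer break ties.
func compare(a, b string, ignoreDiacritics, ignoreCase bool) int {
	primary := func(w weights) int { return w.primary }
	secondary := func(w weights) int { return w.secondary }
	tertiary := func(w weights) int { return w.tertiary }

	if r := slices.Compare(levelKey(a, primary), levelKey(b, primary)); r != 0 {
		return r
	}
	if !ignoreDiacritics {
		if r := slices.Compare(levelKey(a, secondary), levelKey(b, secondary)); r != 0 {
			return r
		}
	}
	if !ignoreCase {
		return slices.Compare(levelKey(a, tertiary), levelKey(b, tertiary))
	}
	return 0
}

func main() {
	fmt.Println(compare("a", "B", false, false))      // decided at level 1
	fmt.Println(compare("a", "A", false, false))      // decided at level 3
	fmt.Println(compare("a", "A", false, true))       // level 3 skipped
	fmt.Println(compare("\u00e9", "e", true, false))  // level 2 skipped
}
```

In this toy model, case only matters when the base letters and the accents are already identical, which is why "a" vs "B" never reaches the case level.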

Different locales can override and adjust these collation levels to match the local culture, but the default English collation is as described above. If you really want the expected output you described, you can use collate.New(language.MustParse("en-us-posix")) which tries to emulate the old POSIX-specified collation as much as it can. That includes uppercase-before-lowercase in the primary level, so A < B < a < b instead of the standard sort order a < A < b < B. But if you're showing these strings to humans, the default language.English collation really is the more correct ordering.

@q3cpma
Author

q3cpma commented Jun 25, 2024

I see! Thanks a lot for the exhaustive writeup, I would have never guessed such a reason behind this.
Would you agree that some additional documentation is a good idea? It is especially confusing that language.Und doesn't display the same behaviour as language.MustParse("en-us-posix") either.

Of course, the incompatibility of complex collation with case/diacritics sensitivity might also be considered an issue in itself, but my use case is fully handled by your solution.

@q3cpma
Author

q3cpma commented Jun 25, 2024

Well, uh, the inverse problem is now preventing the use of this solution: https://go.dev/play/p/c2fKP4d59c3
Can't ignore the case now.

@danderson
Contributor

Yeah, more documentation would be good. In general, Unicode has to deal with the complexity of all human languages, and so a lot of the configuration options and behaviors are not necessarily intuitive :( x/text/collate is a relatively low level package that closely follows TR10, and TR10 assumes you already know quite a lot about Unicode.

And yes, because of the way en-us-posix is defined, it's impossible to ignore ASCII character case in that collation. It's literally hardcoded into the comparison. Really the bug is that collate.IgnoreCase is a confusing name: all it really means in the code is "ignore tertiary weights", and what that means exactly depends on the chosen collation.

In the en and und collations, ignoring the tertiary level means "do not use case and other minor variations as tie-breakers, just say the two strings are equivalent". In the en-us-posix collation, it means the same thing, except that the collation forces the primary level to treat ASCII a-z and A-Z as different, with A-Z < a-z. So it's kind of "ignore case, except for ASCII letters where case matters more than anything else". The en-us-posix collation is very confusing, because it's emulating a sort order that only makes sense to programmers who know the numeric layout of the ASCII table 🙃

If you want your two options to be "ignore case completely" and "treat case as more important than anything else", then I don't know of any collations that will do a good job of the second one, because it's not how English is sorted :( If you're okay with the weaker behavior "treat ASCII case as more important than anything else, and do whatever for non-ASCII", then you can get the behavior you want by switching completely between different collations:

  • Ignore case completely: use collate.New(language.English, collate.IgnoreCase). Alternatively use the language tag en-u-ks-level2, which is exactly the same thing (limit comparison to levels 1 and 2, aka don't use case information in level 3).
  • Case sensitive, with forced POSIX compat so that A-Z < a-z: use collate.New(language.MustParse("en-us-posix")).

However, be warned that you might get complaints that the sort order doesn't make sense to people. The posix ordering is quite unintuitive if you're not writing pure ASCII on a mainframe. Consider:

Input     English   POSIX
==========================
jack      jack      JÁVIER
John      jávier    JORDAN
JÁVIER    Jávier    Jávier
june      JÁVIER    John
Jávier    jayden    jack
jayden    John      jávier
JORDAN    jordan    jayden
jordan    JORDAN    jordan
jávier    june      june

Pretend you don't know the ASCII table for a moment: if you're looking in this list for someone called Jávier, who maybe had capslock enabled the day he filled out the form, which list makes intuitive sense?

Looks like your application is a terminal-based file manager. For that, ideally you should read the LANG and LC_COLLATE environment variables (LC_COLLATE overrides LANG), and use the collation order they specify.

Unfortunately if I'm reading #25340 correctly, parsing language tags from these env vars has some sharp edges:

  • the language package cannot parse common locale names that include a nonstandard encoding specifier (e.g. en_US.UTF-8), so you will have to cut the string at the ., and then the first part (e.g. en_US) should be a valid language tag that language.Parse can understand.
  • language.Parse does not recognize the special legacy values C and POSIX, so you have to recognize those yourself and transform them to en-us-posix manually, or possibly en-us-posix-u-ka-posix, which is a non-standard undocumented behavior that the collate package offers to "emulate posix" (no details given). I haven't traced the source code to figure out exactly what it does though, YMMV.

That will make the sort order match what your users expect, no matter what language/culture they are in. For example, if I set LANG=fr_CA and run your program, files containing accents will sort according to Canadian French dictionary rules, which is not the same as the default English/POSIX collations.
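
The env var handling described above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not a vetted implementation: `collationLocale` and `toTagString` are hypothetical helper names, checking LC_ALL first is an addition that is not mentioned above (POSIX gives it precedence over LC_COLLATE and LANG), cutting at `@` as well as `.` is an assumption to drop modifiers such as `@euro`, and treating an unset locale like C is also an assumption. The returned string is meant to be handed to language.Parse, which is not called here.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// collationLocale returns the raw locale name that governs collation.
// getenv is injected so the lookup order can be exercised without
// touching the real environment.
func collationLocale(getenv func(string) string) string {
	// LC_ALL is an assumption added on top of the LANG/LC_COLLATE
	// rule discussed in the thread.
	for _, name := range []string{"LC_ALL", "LC_COLLATE", "LANG"} {
		if v := getenv(name); v != "" {
			return v
		}
	}
	return ""
}

// toTagString converts a POSIX locale name such as "en_US.UTF-8" into
// a string intended for language.Parse.
func toTagString(locale string) string {
	// Cut the encoding specifier (".UTF-8") and any modifier ("@euro").
	if i := strings.IndexAny(locale, ".@"); i >= 0 {
		locale = locale[:i]
	}
	switch locale {
	case "", "C", "POSIX":
		// The legacy values have to be mapped by hand. Mapping the
		// empty (unset) case the same way is an assumption.
		return "en-us-posix"
	}
	return locale
}

func main() {
	fmt.Println(toTagString(collationLocale(os.Getenv)))
}
```

With this, a locale like `en_US.UTF-8` becomes `en_US`, while `C`, `POSIX`, and `C.UTF-8` all become `en-us-posix`.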

@q3cpma
Author

q3cpma commented Jun 25, 2024

Thanks a lot again for your effortposting. Seeing the current lf configuration knobs and with my guess of how the target userbase (mostly POSIX OS programmers) sees string comparison, I think trying to use collation was a mistake on my part. Working directly with Unicode without locale interpretation (EqualFold from the https://pkg.go.dev/strings package for case, and NFD Unicode decomposition to handle diacritics) seems closer to what's wanted.
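
For reference, a locale-free, case-insensitive ordering along these lines could look like the sketch below. It is only one possible reading of the approach: `foldRune`, `foldKey`, and `compareFold` are hypothetical names, and the ordering is by code point of a case-folded key, using the same simple case folding that strings.EqualFold is documented to use. Diacritic handling is not shown, since NFD decomposition lives outside the standard library (golang.org/x/text/unicode/norm).

```go
package main

import (
	"fmt"
	"slices"
	"strings"
	"unicode"
)

// foldRune maps r to the smallest rune in its simple case-folding
// orbit, so that all case variants of a letter share one key.
func foldRune(r rune) rune {
	lowest := r
	// unicode.SimpleFold walks the orbit and eventually wraps back to r.
	for f := unicode.SimpleFold(r); f != r; f = unicode.SimpleFold(f) {
		if f < lowest {
			lowest = f
		}
	}
	return lowest
}

// foldKey builds the case-folded key for a whole string.
func foldKey(s string) []rune {
	key := make([]rune, 0, len(s))
	for _, r := range s {
		key = append(key, foldRune(r))
	}
	return key
}

// compareFold orders strings by code point after case folding.
// No locale or collation table is involved.
func compareFold(a, b string) int {
	return slices.Compare(foldKey(a), foldKey(b))
}

func main() {
	// Plain code point comparison versus the case-folded variant.
	fmt.Println(strings.Compare("a", "B"), compareFold("a", "B"))
	fmt.Println(strings.EqualFold("a", "A"), compareFold("a", "A"))
}
```

Unlike collation, this keeps accented letters after all unaccented ASCII letters, so "é" sorts after "f" here.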

4 participants