String.ToLower uses Turkish casing rules with en-US-POSIX #4894

Closed
zhichengzhu opened this issue Jan 5, 2016 · 6 comments

@zhichengzhu
Contributor

Hi Guys,
Here is the repro code:

using System;
using System.IO;

namespace ConsoleApplication
{
    public class Program
    {
        public static void Main(string[] args)
        {
            string originStr = "#IF";
            string lowerStr = originStr.ToLower();
            Console.WriteLine("origin string:{0}", originStr);
            Console.WriteLine("lower string:{0}", lowerStr);
            Console.WriteLine("\"{0} == #if\" == {1}", lowerStr, lowerStr.Equals("#if"));
        }
    }
}
  1. Start an SSH session (using putty.exe) from Windows 10 to Mac OS X (Yosemite, version 10.10.5).
  2. Use the CLI tool (https://github.com/dotnet/cli) to create a project.
  3. Run "dotnet restore".
  4. Run "dotnet run".

Then you will see this result:
origin string:#IF
lower string:#ıf
"#ıf == #if" == False

But if you run it directly on the Mac, you will see the correct output:
origin string:#IF
lower string:#if
"#if == #if" == True

@stephentoub
Member

In both cases, what do you see if you output:

Console.WriteLine(CultureInfo.CurrentCulture);

?

The lower-casing being done in the False case is Turkish; I'm guessing that your environment in the two cases is causing you to have different cultures set up.

You can see the Turkish casing with code like:

using System;
using System.Globalization;

class Program
{
    static void Main(string[] args)
    {
        char I = 'I';
        Console.WriteLine("0x{0:X2}", (int)I);

        CultureInfo.CurrentCulture = new CultureInfo("tr-TR");
        Console.WriteLine("0x{0:X2}", (int)char.ToLower(I));

        CultureInfo.CurrentCulture = new CultureInfo("en-US");
        Console.WriteLine("0x{0:X2}", (int)char.ToLower(I));
    }
}

which outputs:

0x49
0x131
0x69
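
For code that has to match a fixed token such as "#if" regardless of the current culture, a culture-independent variant of the repro might look like the sketch below. The class name and the IsIfDirective helper are made up for illustration; ToLowerInvariant and StringComparison.OrdinalIgnoreCase are the standard APIs. It assumes culture data for tr-TR is available on the machine (it would not be in invariant globalization mode).

```csharp
using System;
using System.Globalization;

public static class InvariantCasingSketch
{
    // Hypothetical helper: compares without consulting CultureInfo.CurrentCulture.
    public static bool IsIfDirective(string token)
    {
        return string.Equals(token, "#if", StringComparison.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        // Force a culture with Turkish casing so the difference is visible.
        CultureInfo.CurrentCulture = new CultureInfo("tr-TR");

        string originStr = "#IF";

        // Culture-sensitive: depends on CultureInfo.CurrentCulture.
        Console.WriteLine("ToLower:            {0}", originStr.ToLower() == "#if");

        // Culture-insensitive: always uses the invariant casing rules.
        Console.WriteLine("ToLowerInvariant:   {0}", originStr.ToLowerInvariant() == "#if");

        // Ordinal comparison that ignores case, with no lower-casing step at all.
        Console.WriteLine("OrdinalIgnoreCase:  {0}", IsIfDirective(originStr));
    }
}
```

The last two lines do not depend on the culture, so they should report True whatever the environment sets.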

@zhichengzhu
Contributor Author

That's interesting. I didn't do anything special to set up the environment.
But Console.WriteLine(CultureInfo.CurrentCulture); does give me different results:

  1. For SSH: en-US-POSIX
  2. For Mac: en-US

Even though the culture over SSH is en-US-POSIX, the "i" shouldn't be the Turkish "i".
Thanks

@natemcmaster
Contributor

@stephentoub we are seeing the same issue in @aspnet testing. On OSX, somehow the culture is set to "en-US-POSIX" and this causes string.ToLower() not to work correctly.

@stephentoub
Member

@natemcmaster, what the culture gets set to is based on the environment variables you have set in your environment, e.g. LC_ALL, LANG, LC_COLLATE, etc. That's by design. Check your environment variables (e.g. locale from your shell).
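
To compare the two environments from inside the process, something like the following sketch could be run in both the SSH session and the local terminal. The class and the Describe helper are hypothetical names; the variable list only covers the ones named above.

```csharp
using System;
using System.Globalization;

public static class LocaleEnvironmentDump
{
    // Hypothetical helper: formats one environment variable as NAME=value.
    public static string Describe(string name)
    {
        var value = Environment.GetEnvironmentVariable(name);
        return name + "=" + (value ?? "(unset)");
    }

    public static void Main()
    {
        // Locale-related variables mentioned in this thread.
        foreach (var name in new[] { "LC_ALL", "LANG", "LC_COLLATE" })
        {
            Console.WriteLine(Describe(name));
        }

        // The cultures the runtime actually picked up at startup.
        Console.WriteLine("CurrentCulture:   {0}", CultureInfo.CurrentCulture);
        Console.WriteLine("CurrentUICulture: {0}", CultureInfo.CurrentUICulture);
    }
}
```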

However, that en-US-POSIX is using Turkish casing rules is a bug. @ellismg, before I look into it more deeply, any idea why we're setting m_needsTurkishCasing to true for en-US-POSIX? e.g. on Ubuntu this:

using System;
using System.Reflection;
using System.Globalization;

class Program
{
    static void Main()
    {
        var cultures = new[] {
            new CultureInfo("en-US"),
            new CultureInfo("fr-FR"),
            new CultureInfo("tr-TR"),
            new CultureInfo("blah"),
            new CultureInfo("blah-BLAH"),
            new CultureInfo("blah-BLAH-BLAH"),
            new CultureInfo("en-US-blah"),
            new CultureInfo("en-US-POSIX"),
            new CultureInfo("zz-ZZ-POSIX")
        };

        var f = typeof(TextInfo).GetTypeInfo().GetDeclaredField("m_needsTurkishCasing");
        foreach (var c in cultures)
        {
            Console.WriteLine($"Culture: {c}\tTurkish: {f.GetValue(c.TextInfo)}");
        }
    }
}

outputs this:

Culture: en-US  Turkish: False
Culture: fr-FR  Turkish: False
Culture: tr-TR  Turkish: True
Culture: blah   Turkish: False
Culture: blah-Blah  Turkish: False
Culture: blah-Blah--BLAH    Turkish: False
Culture: en-US-BLAH Turkish: False
Culture: en-US-POSIX    Turkish: True
Culture: zz-ZZ-POSIX    Turkish: False

@natemcmaster
Contributor

@stephentoub yup, you were right. Our locale is set incorrectly on build agents.

LANG=
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=

@ellismg
Contributor

ellismg commented Jan 5, 2016

before I look into it more deeply, any idea why we're setting m_needsTurkishCasing to true for en-US-POSIX

I have a hunch:

The source does the following:

        private bool NeedsTurkishCasing(string localeName)
        {
            Contract.Assert(localeName != null);
            return CultureInfo.GetCultureInfo(localeName).CompareInfo.Compare("i", "I", CompareOptions.IgnoreCase) != 0;
        }

ICU does tailoring of the en-US-POSIX locale and assigns different primary weights to 'i' and 'I'. You can see this in the ICU Collation Demo by looking at the raw collation elements. Different primary weights mean they are different letters, not the same letter with a difference in casing (which is a tertiary weight).

I think that we should update the check to actually compare using some of the Turkish characters instead of doing it the way we currently do.

I will fix this for RC2.
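
A minimal sketch of the kind of check described above, assuming the dotless ı (U+0131) is the character used for detection; the change that actually ships may differ. The class name is hypothetical. The idea is that a locale with Turkish casing treats "ı" and "I" as the same letter when case is ignored, while a locale that merely tailors 'i' and 'I' apart (like en-US-POSIX) should not.

```csharp
using System;
using System.Globalization;

public static class TurkishCasingCheckSketch
{
    // Hypothetical variant of NeedsTurkishCasing: compares U+0131 with 'I'
    // instead of 'i' with 'I', so collation tailoring alone does not trigger it.
    public static bool NeedsTurkishCasing(string localeName)
    {
        if (localeName == null) throw new ArgumentNullException(nameof(localeName));

        CompareInfo compareInfo = CultureInfo.GetCultureInfo(localeName).CompareInfo;
        return compareInfo.Compare("\u0131", "I", CompareOptions.IgnoreCase) == 0;
    }

    public static void Main()
    {
        // Culture names from this thread; availability depends on the platform's culture data.
        foreach (var name in new[] { "en-US", "tr-TR", "en-US-POSIX" })
        {
            Console.WriteLine("Culture: {0}\tTurkish: {1}", name, NeedsTurkishCasing(name));
        }
    }
}
```

The result for each culture depends on the collation data of the underlying platform, so the output is worth checking on the affected machine rather than assumed.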

@ellismg ellismg self-assigned this Jan 5, 2016
@stephentoub stephentoub changed the title from "String.ToLower returns the wrong value" to "String.ToLower uses Turkish casing rules with en-US-POSIX" Jan 5, 2016
ellismg referenced this issue in ellismg/coreclr Jan 6, 2016
Previously, we were using a comparison between "i" and "I" (while
ignoring case) to figure out if we needed to do Turkish casing (on the
assumption that locales which compared i and I as non-equal when
ignoring case were doing Turkish casing).

ICU does tailoring of the en-US-POSIX locale and assigns different
primary weights to 'i' and 'I'. You can see this in the ICU Collation
Demo by looking at the raw collation elements. Different primary weights
mean they are different letters, not the same letter with a difference in
casing (which is a tertiary weight).

This changes the check to compare using an actual Turkish i when doing
our detection, so it is not confused by these cases.

Fixes #2531
ellismg referenced this issue in ellismg/corefx Jan 6, 2016
Ensure that en-US-POSIX does not get Turkish casing behavior.
stephentoub referenced this issue in dotnet/corefx Jan 7, 2016
Add regression test for dotnet/coreclr#2531
ericeil referenced this issue in ericeil/corefx Jan 12, 2016
Ensure that en-US-POSIX does not get Turkish casing behavior.
@msftgits msftgits transferred this issue from dotnet/coreclr Jan 30, 2020
@msftgits msftgits added this to the 1.0.0-rc2 milestone Jan 30, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Jan 3, 2021