String.ToLower uses Turkish casing rules with en-US-POSIX #4894

zhichengzhu · 2016-01-05T18:36:51Z

Hi Guys,
Here is the repro code:

using System;
using System.IO;

namespace ConsoleApplication
{
    public class Program
    {
        public static void Main(string[] args)
        {
            string originStr = "#IF";
            string lowerStr = originStr.ToLower();
            Console.WriteLine("origin string:{0}", originStr);
            Console.WriteLine("lower string:{0}", lowerStr);
            Console.WriteLine("\"{0} == #if\" == {1}", lowerStr, lowerStr.Equals("#if"));
        }
    }
}

Start an ssh session (using putty.exe) from Windows 10 to Mac OS X (Yosemite, version 10.10.5)
Use the cli tool (https://github.com/dotnet/cli) to create a project
Run "dotnet restore"
Run "dotnet run"

Then you will see this result:
origin string:#IF
lower string:#ıf
"#ıf == #if" == False

But if you run it directly on the Mac, you will see the correct output:
origin string:#IF
lower string:#if
"#if == #if" == True

stephentoub · 2016-01-05T18:50:20Z

In both cases, what do you see if you output:

Console.WriteLine(CultureInfo.CurrentCulture);

?

The lower-casing being done in the False case is Turkish; I'm guessing that your environment in the two cases is causing you to have different cultures set up.

You can see the Turkish casing with code like:

using System;
using System.Globalization;

class Program
{
    static void Main(string[] args)
    {
        char I = 'I';
        Console.WriteLine("0x{0:X2}", (int)I);

        CultureInfo.CurrentCulture = new CultureInfo("tr-TR");
        Console.WriteLine("0x{0:X2}", (int)char.ToLower(I));

        CultureInfo.CurrentCulture = new CultureInfo("en-US");
        Console.WriteLine("0x{0:X2}", (int)char.ToLower(I));
    }
}

which outputs:

0x49
0x131
0x69

zhichengzhu · 2016-01-05T19:39:17Z

That's interesting. I didn't do anything special to set up the environment.
But Console.WriteLine(CultureInfo.CurrentCulture); does give me different results:

For ssh: en-US-POSIX
For mac: en-US

Even though ssh is en-US-POSIX, the "i" shouldn't be the Turkish "i".
Thanks

natemcmaster · 2016-01-05T20:26:38Z

@stephentoub we are seeing the same issue in @aspnet testing. On OSX, somehow the culture is set to "en-US-POSIX" and this causes string.ToLower() not to work correctly.
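
For strings like the "#IF" keyword in the repro, which are identifiers rather than user-facing text, one way to avoid depending on the current culture is to lower-case with invariant rules or to compare ordinally. A minimal sketch follows; the `CasingHelpers` class and `LowerForComparison` method are hypothetical names used only for illustration, not something from this thread.

```csharp
using System;

public static class CasingHelpers
{
    // Hypothetical helper: lower-cases using invariant-culture rules,
    // so the result does not depend on CultureInfo.CurrentCulture.
    public static string LowerForComparison(string value)
    {
        if (value == null) throw new ArgumentNullException(nameof(value));
        return value.ToLowerInvariant();
    }

    public static void Main()
    {
        // Invariant lower-casing maps 'I' to 'i' regardless of the current culture.
        Console.WriteLine(LowerForComparison("#IF") == "#if"); // True

        // An ordinal, case-insensitive comparison avoids creating a new string at all.
        Console.WriteLine(string.Equals("#IF", "#if", StringComparison.OrdinalIgnoreCase)); // True
    }
}
```

This does not fix the underlying bug discussed here; it only keeps keyword-style comparisons independent of whatever culture the environment selects.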

stephentoub · 2016-01-05T20:32:44Z

@natemcmaster, what the culture gets set to is based on the environment variables you have set in your environment, e.g. LC_ALL, LANG, LC_COLLATE, etc. That's by design. Check your environment variables (e.g. locale from your shell).

However, that en-US-POSIX is using Turkish casing rules is a bug. @ellismg, before I look into it more deeply, any idea why we're setting m_needsTurkishCasing to true for en-US-POSIX? e.g. on Ubuntu this:

using System;
using System.Reflection;
using System.Globalization;

class Program
{
    static void Main()
    {
        var cultures = new[] {
            new CultureInfo("en-US"),
            new CultureInfo("fr-FR"),
            new CultureInfo("tr-TR"),
            new CultureInfo("blah"),
            new CultureInfo("blah-BLAH"),
            new CultureInfo("blah-BLAH-BLAH"),
            new CultureInfo("en-US-blah"),
            new CultureInfo("en-US-POSIX"),
            new CultureInfo("zz-ZZ-POSIX")
        };

        var f = typeof(TextInfo).GetTypeInfo().GetDeclaredField("m_needsTurkishCasing");
        foreach (var c in cultures)
        {
            Console.WriteLine($"Culture: {c}\tTurkish: {f.GetValue(c.TextInfo)}");
        }
    }
}

outputs this:

Culture: en-US  Turkish: False
Culture: fr-FR  Turkish: False
Culture: tr-TR  Turkish: True
Culture: blah   Turkish: False
Culture: blah-Blah  Turkish: False
Culture: blah-Blah--BLAH    Turkish: False
Culture: en-US-BLAH Turkish: False
Culture: en-US-POSIX    Turkish: True
Culture: zz-ZZ-POSIX    Turkish: False

natemcmaster · 2016-01-05T20:41:35Z

@stephentoub yup, you were right. Our locale is set incorrectly on build agents.

LANG=
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
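
To check the same thing from inside a test run, a small diagnostic along these lines could print the locale-related environment variables next to the culture the runtime picked. This is only a sketch: the `LocaleDiagnostics` class is a hypothetical name, and the code makes no claim about which variable takes precedence.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class LocaleDiagnostics
{
    // The locale-related variables that appear in the output above.
    private static readonly string[] LocaleVariables =
    {
        "LANG", "LC_ALL", "LC_COLLATE", "LC_CTYPE",
        "LC_MESSAGES", "LC_MONETARY", "LC_NUMERIC", "LC_TIME"
    };

    // Returns one "NAME=value" line per variable, plus the current culture name.
    public static List<string> Describe()
    {
        var lines = new List<string>();
        foreach (string name in LocaleVariables)
        {
            string value = Environment.GetEnvironmentVariable(name);
            string shown = value == null ? "(unset)" : value;
            lines.Add(name + "=" + shown);
        }
        lines.Add("CurrentCulture=" + CultureInfo.CurrentCulture.Name);
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Describe())
        {
            Console.WriteLine(line);
        }
    }
}
```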

ellismg · 2016-01-05T21:01:20Z

"before I look into it more deeply, any idea why we're setting m_needsTurkishCasing to true for en-US-POSIX"

I have a hunch:

The source does the following:

        private bool NeedsTurkishCasing(string localeName)
        {
            Contract.Assert(localeName != null);
            return CultureInfo.GetCultureInfo(localeName).CompareInfo.Compare("i", "I", CompareOptions.IgnoreCase) != 0;
        }

ICU does tailoring of the en-US-POSIX locale and assigns different primary weights to 'i' and 'I'. You can see this in the ICU Collation Demo by looking at the raw collation elements. Different primary weights mean they are different letters, not the same letter with a difference in casing (which is a tertiary weight).

I think that we should update the check to actually compare using some of the Turkish characters instead of doing it the way we currently do.

I will fix this for RC2.
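
One possible shape for such a check, as a sketch only: compare the Turkish dotless i (U+0131) with "I" while ignoring case, and treat the locale as needing Turkish casing only when the two compare equal. This illustrates the idea and is not necessarily the code that was merged. The `TurkishCasingCheck` class name is hypothetical, and the results depend on the globalization data available on the platform (a locale such as en-US-POSIX may not be recognized everywhere).

```csharp
using System;
using System.Globalization;

public static class TurkishCasingCheck
{
    // Sketch: a locale needs Turkish casing if it treats the dotless i (U+0131)
    // and 'I' as the same letter when case is ignored. Comparing "i" with "I"
    // instead can be fooled by locales that merely sort upper and lower case apart.
    public static bool NeedsTurkishCasing(string localeName)
    {
        if (localeName == null) throw new ArgumentNullException(nameof(localeName));

        CompareInfo compareInfo = CultureInfo.GetCultureInfo(localeName).CompareInfo;
        return compareInfo.Compare("\u0131", "I", CompareOptions.IgnoreCase) == 0;
    }

    public static void Main()
    {
        // Output varies with the platform's culture data, so none is shown here.
        foreach (string name in new[] { "en-US", "tr-TR" })
        {
            Console.WriteLine(name + ": " + NeedsTurkishCasing(name));
        }
    }
}
```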

Previously, we were using a comparison between "i" and "I" (while ignoring case) to figure out if we needed to do Turkish casing (on the assumption that locales which compared i and I as non-equal when ignoring case were doing Turkish casing). ICU does tailoring of the en-US-POSIX locale and assigns different primary weights to 'i' and 'I'. You can see this in the ICU Collation Demo by looking at the raw collation elements. Different primary weights mean they are different letters, not the same letter with a difference in casing (which is a tertiary weight). This changes the check to compare using an actual Turkish i when doing our detection, so that it is not confused by these cases. Fixes #2531

ellismg self-assigned this Jan 5, 2016

stephentoub changed the title ~~String.ToLower returns the wrong value~~ String.ToLower uses Turkish casing rules with en-US-POSIX Jan 5, 2016

ellismg referenced this issue in ellismg/corefx Jan 6, 2016

Add regression test for dotnet/coreclr#2531

4c67129

Ensure that en-US-POSIX does not get Turkish casing behavior.

jkotas closed this as completed in dotnet/coreclr#2548 Jan 7, 2016

stephentoub referenced this issue in dotnet/corefx Jan 7, 2016

Merge pull request #5214 from ellismg/add-en-us-posix-casing-test

513efd8

Add regression test for dotnet/coreclr#2531

ericeil referenced this issue in ericeil/corefx Jan 12, 2016

Add regression test for dotnet/coreclr#2531

32c5583

Ensure that en-US-POSIX does not get Turkish casing behavior.

msftgits transferred this issue from dotnet/coreclr Jan 30, 2020

msftgits added this to the 1.0.0-rc2 milestone Jan 30, 2020

ghost locked as resolved and limited conversation to collaborators Jan 3, 2021
