Codec lookup failing under turkish locale #46138

arnimar · 2008-01-12T15:00:03Z

BPO	1813
Nosy	@malemburg, @pitrou, @vstinner, @jwilk, @djc, @bitdancer, @skrah
Files	verify_locale.py: Program to verify bug/fix turklocale.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = 'https://github.com/malemburg'
closed_at = <Date 2012-02-04.04:04:36.033>
created_at = <Date 2008-01-12.15:00:02.935>
labels = ['type-bug', 'library']
title = 'Codec lookup failing under turkish locale'
updated_at = <Date 2012-02-04.04:04:36.032>
user = 'https://bugs.python.org/arnimar'

bugs.python.org fields:

activity = <Date 2012-02-04.04:04:36.032>
actor = 'Arfrever'
assignee = 'lemburg'
closed = True
closed_date = <Date 2012-02-04.04:04:36.033>
closer = 'Arfrever'
components = ['Library (Lib)']
creation = <Date 2008-01-12.15:00:02.935>
creator = 'arnimar'
dependencies = []
files = ['9140', '9440']
hgrepos = []
issue_num = 1813
keywords = ['patch']
message_count = 31.0
messages = ['59821', '62386', '62433', '62463', '62464', '62466', '62472', '64109', '64162', '111605', '111765', '119686', '119692', '140399', '141028', '141029', '141030', '141190', '141191', '141193', '141196', '141262', '141322', '141550', '141551', '141559', '141561', '141562', '143954', '152461', '152462']
nosy_count = 13.0
nosy_names = ['lemburg', 'jafo', 'pitrou', 'vstinner', 'arnimar', 'jwilk', 'djc', 'Arfrever', 'r.david.murray', 'skrah', 'BreamoreBoy', 'python-dev', 'gkcn']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue1813'
versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']

arnimar · 2008-01-12T15:00:02Z

When switching to a Turkish locale, the codec registry fails on a codec
lookup that worked before the locale change.

This happens when the codec name contains an uppercase 'I'. Just before
the cache lookup, the string is normalized, which includes a call to
<ctype.h>'s tolower(). tolower() is locale dependent, and the Turkish
locale handles 'I' differently from other locales. The lookup therefore
fails, because the normalization behaves differently than it did before
the locale change.

Replacing the tolower() call with this made the lookup work:

int my_tolower(char c)
{
	if ('A' <= c && c <= 'Z')
		c += 32;

	return c;
}
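The failure mode can be simulated without a Turkish locale installed. The following is only an illustrative Python sketch: `turkish_lower` is a hypothetical stand-in for what a locale-aware tolower() does to 'I' under tr_TR (mapping it to dotless 'ı', U+0131), and `registry` is a made-up cache, not the real codec registry.

```python
def ascii_lower(name):
    """Lower-case only A-Z, like my_tolower() above; everything else passes through."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in name)


def turkish_lower(name):
    """Hypothetical stand-in for a locale-aware tolower() under tr_TR:
    'I' becomes dotless 'ı' (U+0131) instead of 'i'."""
    return "".join("\u0131" if c == "I" else ascii_lower(c) for c in name)


# Made-up cache, filled while the process was still in the default locale.
registry = {ascii_lower("ISO8859-9"): "<codec iso8859-9>"}

# After the locale switch the normalized key changes, so the lookup misses.
miss = registry.get(turkish_lower("ISO8859-9"))

# With ASCII-only folding the key is the same regardless of locale.
hit = registry.get(ascii_lower("ISO8859-9"))

print(miss, hit)
```

The point of the sketch is that the cache key depends on the case-folding function, so any locale that folds 'I' differently produces a key that was never stored.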

PS: If the Turkish locale is not installed, the following will enable it
on an Ubuntu system:

a) sudo cp /usr/share/i18n/SUPPORTED /var/lib/locales/supported.d/local
(or just copy the lines with "tr" in them)
b) sudo dpkg-reconfigure locales

pitrou · 2008-02-14T10:52:09Z

I can confirm this on SVN trunk on a Mandriva system.

arnimar · 2008-02-15T16:36:35Z

There is more to this bug than appears. I'm guessing that the name
mangling code in locale (e.g. the normalizing code) is locale dependent.

See this example:

#!/usr/bin/python2.5

import locale

print 'TR', locale.normalize('tr')

print locale.setlocale(locale.LC_ALL, ('tr_TR', 'ISO8859-9'))

# first issue, not quite the same coming out, as came in
print locale.getlocale()

# and this fails
print locale.setlocale(locale.LC_ALL, ('tr_TR', 'ISO8859-9'))

First, the value returned from getlocale is ('tr_TR', 'so8859-9'), not
('tr_TR', 'ISO8859-9'), and the second setlocale fails.

pitrou · 2008-02-16T19:34:21Z

The C library's tolower() and toupper() are used in a handful of source
files. It might make sense to replace some of those calls with
ascii-only versions of the corresponding functions.

Modules/_sre.c: return ((ch) < 256 ? (unsigned int)tolower((ch)) : ch);
Modules/_sqlite/cursor.c: *dst++ = tolower(*src++);
Modules/stropmodule.c: *s_new = tolower(c);
Modules/stropmodule.c: *s_new = toupper(c);
Modules/stropmodule.c: *s_new = toupper(c);
Modules/stropmodule.c: *s_new = tolower(c);
Modules/stropmodule.c: *s_new = toupper(c);
Modules/stropmodule.c: *s_new = tolower(c);
Modules/unicodedata.c: h = (h * scale) + (unsigned char) toupper(Py_CHARMASK(s[i]));
Modules/unicodedata.c: if (toupper(Py_CHARMASK(name[i])) != buffer[i])
Modules/_tkinter.c: argv0[0] = tolower(Py_CHARMASK(argv0[0]));
Modules/binascii.c: c = tolower(c);
Objects/stringobject.c: s[i] = _tolower(c);
Objects/stringobject.c: s[i] = _toupper(c);
Objects/stringobject.c: c = toupper(c);
Objects/stringobject.c: c = tolower(c);
Objects/stringobject.c: *s_new = toupper(c);
Objects/stringobject.c: *s_new = tolower(c);
Objects/stringobject.c: *s_new = toupper(c);
Objects/stringobject.c: *s_new = tolower(c);
Parser/tokenizer.c: else buf[i] = tolower(c);
Python/codecs.c: ch = tolower(Py_CHARMASK(ch));
Python/dynload_win.c: first = tolower(*string1);
Python/dynload_win.c: second = tolower(*string2);
Python/pystrcmp.c: while ((--size > 0) && (tolower(*s1) == tolower(*s2))) {
Python/pystrcmp.c: return tolower(*s1) - tolower(*s2);
Python/pystrcmp.c: while (*s1 && (tolower(*s1++) == tolower(*s2++))) {
Python/pystrcmp.c: return (tolower(*s1) - tolower(*s2));

pitrou · 2008-02-16T19:58:25Z

As for the .upper() and .lower() methods, they are used in quite a bunch
of standard library modules :-/...

Lib/base64.py
Lib/BaseHTTPServer.py
Lib/bsddb/test/test_compare.py
Lib/bsddb/test/test_dbobj.py
Lib/CGIHTTPServer.py
Lib/cgi.py
Lib/compiler/ast.py
Lib/ConfigParser.py
Lib/cookielib.py
Lib/Cookie.py
Lib/csv.py
Lib/ctypes/test/test_byteswap.py
Lib/ctypes/util.py
Lib/decimal.py
Lib/distutils/command/bdist_rpm.py
Lib/distutils/command/bdist_wininst.py
Lib/distutils/command/register.py
Lib/distutils/msvc9compiler.py
Lib/distutils/msvccompiler.py
Lib/distutils/sysconfig.py
Lib/distutils/tests/test_dist.py
Lib/distutils/util.py
Lib/email/charset.py
Lib/email/encoders.py
Lib/email/header.py
Lib/email/__init__.py
Lib/email/message.py
Lib/email/_parseaddr.py
Lib/email/test/test_email.py
Lib/email/test/test_email_renamed.py
Lib/encodings/idna.py
Lib/encodings/punycode.py
Lib/formatter.py
Lib/ftplib.py
Lib/gettext.py
Lib/htmllib.py
Lib/HTMLParser.py
Lib/httplib.py
Lib/idlelib/configDialog.py
Lib/idlelib/EditorWindow.py
Lib/idlelib/IOBinding.py
Lib/idlelib/keybindingDialog.py
Lib/idlelib/PyShell.py
Lib/idlelib/SearchDialogBase.py
Lib/idlelib/tabbedpages.py
Lib/idlelib/TreeWidget.py
Lib/imaplib.py
Lib/inspect.py
Lib/lib-tk/turtle.py
Lib/locale.py
Lib/logging/handlers.py
Lib/logging/__init__.py
Lib/_LWPCookieJar.py
Lib/macpath.py
Lib/mailcap.py
Lib/markupbase.py
Lib/mhlib.py
Lib/mimetools.py
Lib/mimetypes.py
Lib/mimify.py
Lib/msilib/__init__.py
Lib/nntplib.py
Lib/ntpath.py
Lib/nturl2path.py
Lib/optparse.py
Lib/os2emxpath.py
Lib/os.py
Lib/pdb.py
Lib/plat-irix5/flp.py
Lib/plat-irix6/flp.py
Lib/plat-mac/buildtools.py
Lib/plat-mac/gensuitemodule.py
Lib/plat-riscos/riscospath.py
Lib/pyclbr.py
Lib/rfc822.py
Lib/robotparser.py
Lib/sgmllib.py
Lib/SimpleHTTPServer.py
Lib/smtpd.py
Lib/smtplib.py
Lib/socket.py
Lib/sqlite3/test/hooks.py
Lib/sre_constants.py
Lib/stringold.py
Lib/stringprep.py
Lib/string.py
Lib/_strptime.py
Lib/subprocess.py
Lib/test/regrtest.py
Lib/test/test_bigmem.py
Lib/test/test_codeccallbacks.py
Lib/test/test_codecs.py
Lib/test/test_cookielib.py
Lib/test/test_datetime.py
Lib/test/test_decimal.py
Lib/test/test_deque.py
Lib/test/test_descr.py
Lib/test/test_fileinput.py
Lib/test/test_grp.py
Lib/test/test_hmac.py
Lib/test/test_httplib.py
Lib/test/test_os.py
Lib/test/test_smtplib.py
Lib/test/test_sort.py
Lib/test/test_ssl.py
Lib/test/test_strop.py
Lib/test/test_strptime.py
Lib/test/test_support.py
Lib/test/test_ucn.py
Lib/test/test_unicodedata.py
Lib/test/test_urllib2.py
Lib/test/test_urllib.py
Lib/test/test_wsgiref.py
Lib/test/test_xmlrpc.py
Lib/urllib2.py
Lib/urllib.py
Lib/urlparse.py
Lib/UserString.py
Lib/uuid.py
Lib/warnings.py
Lib/webbrowser.py
Lib/wsgiref/handlers.py
Lib/wsgiref/headers.py
Lib/wsgiref/simple_server.py
Lib/wsgiref/util.py
Lib/wsgiref/validate.py
Lib/xml/dom/minidom.py
Lib/xml/dom/xmlbuilder.py
Lib/xmllib.py

pitrou · 2008-02-16T20:04:33Z

Even if we don't fix all uses of (?to)(lower|upper) in the source tree,
I think it's important that codec and locale lookup work properly when
the current locale defines non-latin case folding for latin characters.
Here is a patch.

Perhaps also the str type should grow ascii_lower() and ascii_upper()
methods, since many cases of using lower() and upper() actually assume
ascii semantics (e.g. for parsing of HTTP or SMTP headers).

malemburg · 2008-02-16T22:20:15Z

I agree that it's a bit unfortunate that the 8-bit string APIs in Python
use the locale aware C functions per default (this should really be
reversed: there should be locale-aware .upper() and .lower() methods and
the standard ones should work just like the Unicode ones - without
dependency on the locale, using ASCII mappings), but for historical
reasons this cannot easily be changed.

.lower() and .upper() for 8-bit strings were always locale dependent and
before the addition of Unicode, setting the locale was the most common
way to make an application understand different character sets.

In Python 3k the problem will probably go away, since .lower() and
.upper() will then no longer depend on the locale.

Perhaps we should just convert a few of the cases you found to using
Unicode strings instead of 8-bit strings in 2.6 ?! That would both make
the code more portable and also provide a clear statement of "this is a
text string", making porting to Py3k easier.

jafo · 2008-03-19T21:44:49Z

Marc-Andre: How should we proceed with this bug? Discuss on python-dev
or c.l.python?

malemburg · 2008-03-20T10:20:48Z

Sean: I'd suggest to discuss this on python-dev.

Note that even if we do use Unicode for the cases in question, the
Turkish locale will still pose a problem - see bpo-1528802 for a discussion.

BreamoreBoy · 2010-07-26T12:24:23Z

Does anyone know if this was discussed on python-dev? I've tried searching the archives and didn't find anything, but that's not to say it isn't there.

vstinner · 2010-07-28T02:16:52Z

There is also a locale normalization function in unicodeobject.c: normalize_encoding(). This function uses "if (ISUPPER(*e)) *l++ = TOLOWER(*e++);" which uses Python's locale-independent implementation of ctype.

We should maybe use the ISUPPER / TOLOWER in codecs.c.

Anyway, a function should be fixed, but I don't know which one :-)

djc · 2010-10-27T10:30:20Z

We've included this patch in Gentoo for about two years now. Can we get some discussion going on doing something like this?

malemburg · 2010-10-27T11:27:26Z

Looking at this again, I think we should change the codec registry C code to use Py_TOLOWER() and the encoding search function code to use the .translate() approach that Antoine suggested.
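A minimal sketch of the .translate() idea for the Python-side encoding search function, written here for Python 3 `str`. The table and the `normalize_encoding_name` helper are illustrative names chosen for this example and are not taken from the committed patch.

```python
import string

# Translation table that maps only A-Z to a-z; every other character,
# including non-ASCII letters, is left untouched.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def normalize_encoding_name(name):
    """Hypothetical helper: lower-case an encoding name with ASCII-only
    rules, so the result does not depend on the current locale."""
    return name.translate(_ASCII_LOWER)


print(normalize_encoding_name("ISO8859-9"))
```

Because the mapping is a fixed table rather than a call into the C library, changing LC_CTYPE cannot alter the result.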

vstinner · 2011-07-15T09:14:54Z

The decimal module has been fixed in Python 2.7, 3.2 and 3.3 for the Turkish locale: issue bpo-11830.

python-dev · 2011-07-24T00:43:24Z

New changeset 92d02de91cc9 by Antoine Pitrou in branch '3.2':
Issue bpo-1813: Fix codec lookup under Turkish locales.
http://hg.python.org/cpython/rev/92d02de91cc9

New changeset a77a4df54b95 by Antoine Pitrou in branch '3.2':
Add a test for issue bpo-1813: getlocale() failing under a Turkish locale
http://hg.python.org/cpython/rev/a77a4df54b95

New changeset fe0caf8c48d2 by Antoine Pitrou in branch 'default':
Add a test for issue bpo-1813: getlocale() failing under a Turkish locale
http://hg.python.org/cpython/rev/fe0caf8c48d2

python-dev · 2011-07-24T00:52:27Z

New changeset 739958134fe5 by Antoine Pitrou in branch '2.7':
Issue bpo-1813: Fix codec lookup and setting/getting locales under Turkish locales.
http://hg.python.org/cpython/rev/739958134fe5

pitrou · 2011-07-24T00:53:04Z

Finally fixed in 2.7, 3.2, 3.3!

skrah · 2011-07-26T22:50:18Z

The Fedora bot fails because here ...

  locale.setlocale(locale.LC_CTYPE, loc)

loc = ('tr_TR', 'ISO8859-9'), and apparently setlocale can only
handle "tr_TR", but not "tr_TR.ISO8859-9":

144 if (locale) {
145 /* set locale */
146 result = setlocale(category, locale);
147 if (!result) {
148 /* operation failed, no setting was changed */
149 PyErr_SetString(Error, "unsupported locale setting");
150 return NULL;
(gdb) p result = setlocale(category, "tr_TR.ISO8859-9")
$8 = 0x0
(gdb) p result = setlocale(category, "tr_TR")
$9 = 0x96d770 "tr_TR"
(gdb) p locale
$10 = 0x7ffff0f6a5b0 "tr_TR.ISO8859-9"
(gdb)

skrah · 2011-07-26T23:01:52Z

Stefan Krah <[email protected]> wrote:

> (gdb) p result = setlocale(category, "tr_TR.ISO8859-9")
> $8 = 0x0
> (gdb) p result = setlocale(category, "tr_TR")
> $9 = 0x96d770 "tr_TR"
> (gdb) p locale
> $10 = 0x7ffff0f6a5b0 "tr_TR.ISO8859-9"
> (gdb)

Perhaps this is a bug in Fedora's setlocale that can't handle the turkish 'I'
in 'ISO' when CTYPE is turkish.

pitrou · 2011-07-26T23:02:52Z

Stefan Krah <[email protected]> wrote:
> (gdb) p result = setlocale(category, "tr_TR.ISO8859-9")
> $8 = 0x0
> (gdb) p result = setlocale(category, "tr_TR")
> $9 = 0x96d770 "tr_TR"
> (gdb) p locale
> $10 = 0x7ffff0f6a5b0 "tr_TR.ISO8859-9"
> (gdb)

> Perhaps this is a bug in Fedora's setlocale that can't handle the turkish 'I'
> in 'ISO' when CTYPE is turkish.

Perhaps indeed. Maybe you should try to report it.
It does look like an OS bug in any case.
(fortunately that buildbot is in the "unstable" bunch :-))

skrah · 2011-07-26T23:35:00Z

Yes, it's a bug. This works:

#include <stdio.h>
#include <locale.h>
int
main(void)
{
    char *s;
    printf("%s\n", setlocale(LC_CTYPE, "tr_TR.ISO8859-9"));
    printf("%s\n", setlocale(LC_CTYPE, NULL));
    s = setlocale(LC_CTYPE, "tr_TR.ISO8859-9");
    printf("%s\n", s ? s : "null");
    return 0;
}

But when I change the first setlocale call to "tr_TR", the result of
the last call is NULL.

bitdancer · 2011-07-27T18:42:42Z

I'm seeing this test failure in Gentoo, as well.

skrah · 2011-07-28T23:10:01Z

Fedora bug report:

https://bugzilla.redhat.com/show_bug.cgi?id=726536

skrah · 2011-08-02T09:41:45Z

Unrelated to the Fedora issue: The test is currently skipped on the
FreeBSD bot, but completes successfully with:

diff -r 0b52b6f1bfab Lib/test/test_locale.py
--- a/Lib/test/test_locale.py   Tue Aug 02 10:16:45 2011 +0200
+++ b/Lib/test/test_locale.py   Tue Aug 02 11:37:39 2011 +0200
@@ -399,7 +399,7 @@
         oldlocale = locale.setlocale(locale.LC_CTYPE)
         self.addCleanup(locale.setlocale, locale.LC_CTYPE, oldlocale)
         try:
-            locale.setlocale(locale.LC_CTYPE, 'tr_TR')
+            locale.setlocale(locale.LC_CTYPE, 'tr_TR.UTF-8')
         except locale.Error:
             # Unsupported locale on this system
             self.skipTest('test needs Turkish locale')
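The skip-if-unavailable pattern in the diff above can be sketched as a standalone Python 3 helper. This is not the test from test_locale.py; `lookup_under_locale` is a hypothetical name, and the choice of 'tr_TR.UTF-8' simply follows the diff.

```python
import codecs
import locale


def lookup_under_locale(codec_name, locale_name):
    """Return the canonical codec name found while LC_CTYPE is set to
    locale_name, or None if that locale is not available on this system."""
    old = locale.setlocale(locale.LC_CTYPE)  # query only, changes nothing
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
    except locale.Error:
        return None  # unsupported locale: a caller would skip, not fail
    try:
        return codecs.lookup(codec_name).name
    finally:
        # Always restore the previous setting, even if the lookup raised.
        locale.setlocale(locale.LC_CTYPE, old)


result = lookup_under_locale("ISO8859-9", "tr_TR.UTF-8")
print("skipped" if result is None else result)
```

Returning None for a missing locale keeps "locale not installed" separate from "lookup failed", which is the distinction the skip in the test is meant to preserve.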

skrah · 2011-08-02T10:21:35Z

As I wrote on python-dev, this test also fails on Debian lenny, which has
the same setlocale() bug as Fedora.

So, indeed the test should be skipped on a multitude of platforms.

bitdancer · 2011-08-02T11:34:30Z

On Tue, 02 Aug 2011 12:12:37 +0200, Stefan Krah <[email protected]> wrote:

> I suspect many buildbots are green because they don't have tr_TR and
> tr_TR.iso8859-9 installed.

This is true for my Gentoo buildbots. Once we've figured out the
best way to handle this, I'll fix that (install the other locales) for
my two.

When I run the C test program I get null as the final output of that
regardless of whether I use 'tr_TR' or 'tr_TR.utf8'.

This is with glibc-2.13-r2 (the r2 is Gentoo's mod number).

As someone pointed out on python-dev, if this isn't fixable then it should be an expected failure, not a skip.

One question is, is there any platform on which the turkish locale is installed where this test actually works?

skrah · 2011-08-02T12:01:12Z

[Re-opening to fix the skips]

Yes, the test works on:

Ubuntu Lucid (libc-2.11.1), OpenSUSE (libc-2.11.1), FreeBSD-8.2

Failure:

Fedora 14 (libc-2.13), Debian lenny (libc-2.7), Gentoo (libc-2.13-r2)

So perhaps this test should be marked as expected failure on Linux
altogether (unless we test for the libc version).

pitrou · 2011-08-02T12:06:32Z

> As someone pointed out on python-dev, if this isn't fixable then it
> should be an expected failure, not a skip.

The Python bug is fixed, the problem is apparently some libcs have the
same bug as we did...

> One question is, is there any platform on which the turkish locale is
> installed where this test actually works?

Well, it works here (Mageia).

skrah · 2011-09-13T11:39:30Z

https://bugzilla.redhat.com/show_bug.cgi?id=726536 claims that the
glibc issue (which is relevant for skipping the test case) is fixed
in glibc-2.14.90-8.

I suspect the only way of running the test case reliably is whitelisting
a couple of known good glibc versions.
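A rough sketch of what such a whitelist check could look like in Python 3. The version values are assumptions taken only from reports in this thread (2.11.1 reported working, 2.14.90 claimed fixed in the Red Hat bug); they are not verified, and `glibc_is_whitelisted` is a hypothetical helper, not part of the test suite.

```python
import platform
import re

# Assumptions drawn from this thread, not verified facts:
KNOWN_GOOD = {(2, 11, 1)}   # reported working on Ubuntu Lucid and OpenSUSE
FIXED_FROM = (2, 14, 90)    # the Red Hat report claims the fix landed here


def parse_version(text):
    """Turn '2.13-r2' into (2, 13); return () if no leading number is found."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    return tuple(int(p) for p in match.group().split(".")) if match else ()


def glibc_is_whitelisted(version_text):
    """Hypothetical check: True for versions listed as good or at/after the fix."""
    version = parse_version(version_text)
    return version in KNOWN_GOOD or version >= FIXED_FROM


# platform.libc_ver() returns a (library, version) pair, e.g. ('glibc', '2.31');
# both strings are empty when the C library cannot be identified.
lib, version = platform.libc_ver()
run_test = lib == "glibc" and glibc_is_whitelisted(version)
print(lib, version, run_test)
```

A test could consult `run_test` before attempting the Turkish locale round-trip; distro-patched builds may behave differently from what the bare version number suggests.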

python-dev · 2012-02-02T16:00:00Z

New changeset a55ffb6c1993 by Stefan Krah in branch '3.2':
Issue bpo-1813: Revert workaround for a glibc bug on the Fedora buildbot.
http://hg.python.org/cpython/rev/a55ffb6c1993

New changeset 4244e4348362 by Stefan Krah in branch 'default':
Issue bpo-1813: merge changeset that reverts a glibc workaround for the
http://hg.python.org/cpython/rev/4244e4348362

New changeset 0b8917fc6db5 by Stefan Krah in branch '2.7':
Issue bpo-1813: backport changeset that reverts a glibc workaround for the
http://hg.python.org/cpython/rev/0b8917fc6db5

skrah · 2012-02-02T16:06:37Z

I've upgraded the Fedora buildbot to Fedora-16. The specific glibc
workaround should not be necessary any more.

So the test will now fail again on all systems that a) have the bug
and b) have the tr_TR locale installed.

arnimar mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error labels Jan 12, 2008

arnimar mannequin added stdlib Python modules in the Lib dir and removed interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Feb 13, 2008

jafo mannequin assigned malemburg Mar 19, 2008

pitrou closed this as completed Jul 24, 2011

skrah mannequin reopened this Aug 2, 2011

Arfrever mannequin closed this as completed Feb 4, 2012

ezio-melotti transferred this issue from another repository Apr 10, 2022

ambv mentioned this issue Oct 25, 2023

LC_CTYPE incorrectly references case sensitivity of "the functions of module string" #111276

Closed
