str should support a few other string en/decodings #1771

killerswan · 2012-02-07T09:35:25Z

I've been thinking a little about how best to support the other string encodings we'll need, both to avoid the pitfalls of things like native filesystem paths and to read more of the common text encodings. In addition to the UTF-8 and ASCII which we already handle, we'll likely want simple to/from conversion methods for the following, perhaps under the umbrella of a string_encoding interface:

Latin-1
Windows-1252
whatever libuv on Windows needs... (16-bit anything-but-null wchar_t, like NTFS?)
UTF-16

These are likely to be only a small can of worms, so long as we don't make any plans to automatically recognize them or agonize about working with them internally...
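To illustrate how small the simplest case is, here is a minimal sketch of Latin-1 "to and from" conversions in present-day Rust. The function names `latin1_to_string` and `string_to_latin1` are hypothetical, not an existing or proposed API. It relies on the fact that each Latin-1 (ISO-8859-1) byte has the same numeric value as its Unicode code point. Windows-1252 would need an extra lookup table, since it assigns different characters in the 0x80..0x9F range.

```rust
/// Decode ISO-8859-1 (Latin-1) bytes into a UTF-8 `String`.
/// Hypothetical helper: each byte value is also its Unicode code point,
/// so the conversion cannot fail.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Encode a string as Latin-1 bytes.
/// Hypothetical helper: returns `None` if any character lies outside
/// U+0000..=U+00FF and so has no Latin-1 representation.
fn string_to_latin1(s: &str) -> Option<Vec<u8>> {
    s.chars()
        .map(|c| {
            let code_point = c as u32;
            if code_point <= 0xFF {
                Some(code_point as u8)
            } else {
                None
            }
        })
        .collect()
}
```

Decoding is total, while encoding is partial. That asymmetry is the main design question for any such interface: whether to fail, substitute a replacement byte, or skip characters that the target encoding cannot express.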

What am I missing?

kud1ing · 2012-02-07T09:56:41Z

As far as I understand, Haskell uses UTF-8 only and en/decodes during I/O.
Would that work for Rust?

killerswan · 2012-02-07T10:03:00Z

That's pretty much what I have in mind.

kud1ing · 2012-02-07T16:39:57Z

Somewhat related is #1557.

Haskell uses only '\n' internally, and converts during I/O.

killerswan · 2012-02-07T16:59:32Z

I care about operating system APIs because I have to, but don't want to think about 'locale' and would rather always output UTF-8 for stdout, writing files, and so on. I am not advocating that we dumb down our stuff to print Latin-1 data out, except in the cases where we actually have to be able to use APIs that require whatever weird alien stuff.

The \n vs. \r\n vs. \n\r (yes it exists) war is eternal, but significantly less important. The closest thing to a teletype machine that I know of is Cmd.exe...

graydon · 2012-02-09T02:14:03Z

Yeah, the str library probably needs a few common encoding-converters wired in, along with the assumption of a "full" iconv or libicu-like "convert to anything" layer further out (say in libstd).

I think you have a good list here. Un-tagging as [rfc] since this is not really a language change, just some library work. Totally legitimate library work mind you! Needs doing. All the wchar_t / unicode APIs in windows need such decoding.

(And at some point we will, indeed, need to think about locales. They're real. But I'm willing to leave that to a later cycle of stdlib design, or at least another bug.)

graydon · 2012-03-05T22:39:40Z

I landed UTF-16 helpers in 47e7a05. I believe that's "what uv needs on windows", in the sense that UTF-16 is the encoding that the wchar_t-based "W" APIs in win32 accept. Unicode is our primary concept of text, and UTF-16 is the "extra" silly encoding we need to speak unicode in most places.
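For readers on a current toolchain, a UTF-16 round trip can be sketched with today's standard library. This is not the code from 47e7a05. The wrapper names `to_utf16`, `from_utf16_strict`, and `from_utf16_lossy` are hypothetical, while `str::encode_utf16`, `String::from_utf16`, and `String::from_utf16_lossy` are the actual std calls they wrap.

```rust
/// Encode a string as UTF-16 code units. Characters outside the Basic
/// Multilingual Plane become a surrogate pair (two u16 values).
fn to_utf16(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

/// Decode UTF-16 strictly: `None` if the input contains an unpaired surrogate.
fn from_utf16_strict(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

/// Decode UTF-16 leniently: unpaired surrogates are replaced by U+FFFD.
fn from_utf16_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}
```

The strict and lossy variants matter on Windows in particular, because the "W" APIs can hand back sequences of 16-bit units that are not well-formed UTF-16.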

I don't think the latin-1-and-supersets stuff is really worth trying to support, or at least not in libcore. I'd prefer not to require libcore to model codepages or other locale artifacts. It's a huge task, we'll probably delegate most of it to libICU, and lots of software is never localized. For at least those reasons I'd prefer we leave locale stuff to libstd.

I do expect to modify all the windows API calls we use in libcore to call the "W" variants with UTF-16 input, not the "A" variants with (broken) UTF-8 as we're doing today.
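As a rough sketch of what preparing input for a "W" call involves, assuming present-day Rust and a hypothetical helper named `to_wide_nul`: the string is re-encoded as UTF-16 and given a trailing NUL, and strings with an interior NUL are rejected because the callee would see them as truncated. The FFI call itself is left out so the snippet compiles on any platform.

```rust
/// Build a NUL-terminated UTF-16 buffer suitable for passing (via `.as_ptr()`)
/// to a wchar_t-based Win32 "W" function.
/// Hypothetical helper: returns `None` if the string contains an interior NUL,
/// since the callee would treat that as the end of the string.
fn to_wide_nul(s: &str) -> Option<Vec<u16>> {
    if s.contains('\0') {
        return None;
    }
    Some(s.encode_utf16().chain(std::iter::once(0)).collect())
}
```

The returned `Vec<u16>` has to outlive the call that receives its pointer, so callers would typically bind it to a local before invoking the foreign function.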

Closing this.

graydon closed this as completed Mar 5, 2012

graydon mentioned this issue Mar 5, 2012

Modify the libc/os calls to win32 functions to use UTF-16 #1927

Closed

celinval pushed a commit to celinval/rust-dev that referenced this issue Jun 4, 2024

Fix undefined symbol errors when rustup defaults to nightly rust (rus…

79327dc

…t-lang#1771) * Also set RUSTUP_TOOLCHAIN
