str should support a few other string en/decodings #1771

Closed
killerswan opened this issue Feb 7, 2012 · 6 comments
@killerswan
Contributor

I've been thinking a little bit about how best to support the other string encodings we'll need, both to avoid the pitfalls of things like native filesystem paths and to read more of the common text encodings. In addition to UTF-8 and ASCII, which we already handle, we'll likely want simple to/from conversion methods for the following, perhaps under the umbrella of a string_encoding interface:

  • Latin-1
  • Windows-1252
  • whatever libuv on Windows needs... (16-bit anything-but-null wchar_t, like NTFS?)
  • UTF-16

These are likely to be only a small can of worms, so long as we don't make any plans to automatically recognize them or agonize about working with them internally...
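The to/from methods proposed above could be sketched roughly as follows. This is only an illustration in present-day Rust, not code from the issue: the `StringEncoding` trait, its method names, and the `Latin1` type are all hypothetical, and Latin-1 is used because its bytes map one-to-one onto the first 256 Unicode code points.

```rust
/// Hypothetical interface (an assumption, not an actual Rust API) for
/// converting between UTF-8 strings and some other byte encoding.
trait StringEncoding {
    /// Decode bytes in this encoding into a UTF-8 `String`.
    /// Returns `None` if the bytes are not valid in this encoding.
    fn decode(&self, bytes: &[u8]) -> Option<String>;

    /// Encode a string into this encoding.
    /// Returns `None` if some character cannot be represented.
    fn encode(&self, text: &str) -> Option<Vec<u8>>;
}

/// Illustrative Latin-1 (ISO-8859-1) implementation.
struct Latin1;

impl StringEncoding for Latin1 {
    fn decode(&self, bytes: &[u8]) -> Option<String> {
        // Every Latin-1 byte is the Unicode code point with the same value,
        // so decoding cannot fail.
        Some(bytes.iter().map(|&b| b as char).collect())
    }

    fn encode(&self, text: &str) -> Option<Vec<u8>> {
        // Only code points U+0000..=U+00FF fit in a single Latin-1 byte;
        // collecting into Option<Vec<_>> yields None on the first misfit.
        text.chars()
            .map(|c| if (c as u32) <= 0xFF { Some(c as u8) } else { None })
            .collect()
    }
}
```

With this sketch, `Latin1.decode(&[0x63, 0x61, 0x66, 0xE9])` yields `"café"`, while encoding a string containing `€` returns `None`, since that character has no Latin-1 representation. Windows-1252 would need a small lookup table for the 0x80..=0x9F range instead of the direct cast.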

What am I missing?

@kud1ing

kud1ing commented Feb 7, 2012

As far as I understand, Haskell uses UTF-8 only and en/decodes during I/O.
Would that work for Rust?

@killerswan
Contributor Author

That's pretty much what I have in mind.

@kud1ing

kud1ing commented Feb 7, 2012

Somewhat related is #1557.

Haskell uses only '\n' internally, and converts during I/O.

@killerswan
Contributor Author

I care about operating system APIs because I have to, but don't want to think about 'locale' and would rather always output UTF-8 for stdout, writing files, and so on. I am not advocating that we dumb down our stuff to print Latin-1 data out, except in the cases where we actually have to be able to use APIs that require whatever weird alien stuff.

The \n vs. \r\n vs. \n\r (yes it exists) war is eternal, but significantly less important. The closest thing to a teletype machine that I know of is Cmd.exe...

@graydon
Contributor

graydon commented Feb 9, 2012

Yeah, the str library probably needs a few common encoding-converters wired in, along with the assumption of a "full" iconv or libicu-like "convert to anything" layer further out (say in libstd).

I think you have a good list here. Un-tagging as [rfc] since this is not really a language change, just some library work. Totally legitimate library work, mind you! Needs doing. All the wchar_t / Unicode APIs in Windows need such decoding.

(And at some point we will, indeed, need to think about locales. They're real. But I'm willing to leave that to a later cycle of stdlib design, or at least another bug.)

@graydon
Contributor

graydon commented Mar 5, 2012

I landed UTF-16 helpers in 47e7a05. I believe that's "what uv needs on windows", in the sense that UTF-16 is the encoding that the wchar_t-based "W" APIs in win32 accept. Unicode is our primary concept of text, and UTF-16 is the "extra" silly encoding we need to speak Unicode in most places.
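For readers unfamiliar with what UTF-16 conversion involves, here is a minimal sketch of a round trip. It uses the present-day standard library calls `str::encode_utf16` and `String::from_utf16`, not the helpers from the commit mentioned above; the wrapper names `to_utf16` and `from_utf16` are chosen only for illustration.

```rust
/// Encode a UTF-8 string as a vector of UTF-16 code units.
/// Characters outside the Basic Multilingual Plane become surrogate pairs.
fn to_utf16(text: &str) -> Vec<u16> {
    text.encode_utf16().collect()
}

/// Decode UTF-16 code units back into a `String`.
/// Returns `None` if the input contains an unpaired surrogate.
fn from_utf16(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}
```

For example, `to_utf16("a𝄞")` produces three code units, `0x0061` followed by the surrogate pair `0xD834, 0xDD1E`, and a lone `0xD834` fails to decode.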

I don't think the latin-1-and-supersets stuff is really worth trying to support, or at least not in libcore. I'd prefer not to require libcore to model codepages or other locale artifacts. It's a huge task, we'll probably delegate most of it to libICU, and lots of software is never localized. For at least those reasons I'd prefer we leave locale stuff to libstd.

I do expect to modify all the Windows API calls we use in libcore to call the "W" variants with UTF-16 input, not the "A" variants with (broken) UTF-8 as we're doing today.
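Calling a "W" variant generally means handing it a NUL-terminated buffer of UTF-16 code units. A rough sketch of preparing such a buffer in present-day Rust follows; the helper name `to_wide_nul` is hypothetical, and no actual win32 call is made here.

```rust
/// Build a NUL-terminated UTF-16 buffer of the kind a wchar_t-based "W" API
/// expects. Returns `None` if the text contains an interior NUL, which
/// would silently truncate the string on the receiving side.
fn to_wide_nul(text: &str) -> Option<Vec<u16>> {
    if text.contains('\0') {
        return None;
    }
    let mut wide: Vec<u16> = text.encode_utf16().collect();
    wide.push(0); // terminating NUL code unit
    Some(wide)
}
```

A pointer to the resulting vector's data could then be passed to a "W" function through FFI. Rejecting interior NULs up front is a design choice of this sketch, not something the thread prescribes.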

Closing this.

3 participants