RFC: Allow non-ASCII characters in byte literals #455
If this change is to be made, I suggest having a lint for these literals: "Warning: this byte literal contains non-ASCII characters and will be assumed to be encoded in UTF-8." |
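A minimal sketch of the check such a lint might perform, assuming it only needs to look at the literal's source text. The helper names `non_ascii_chars` and `lint_message` are hypothetical and are not part of rustc or any existing lint.

```rust
/// Hypothetical helper (not a rustc API): collects the non-ASCII characters
/// found in the source text of a byte literal.
fn non_ascii_chars(literal_body: &str) -> Vec<char> {
    literal_body.chars().filter(|c| !c.is_ascii()).collect()
}

/// Builds a warning in the spirit of the one suggested above, or returns
/// `None` when the literal is pure ASCII and needs no warning.
fn lint_message(literal_body: &str) -> Option<String> {
    let offenders = non_ascii_chars(literal_body);
    if offenders.is_empty() {
        None
    } else {
        Some(format!(
            "warning: this byte literal contains non-ASCII characters {:?} \
             and will be assumed to be encoded in UTF-8",
            offenders
        ))
    }
}
```

A real lint would work on the compiler's token stream rather than on a plain `&str`, so this only illustrates the condition being tested.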
I disagree with this proposal. I really want to live in a world where of course everything is UTF-8, but this is not the case today. Mojibake is a common type of bug. Avoiding it requires a bit of rigor in separating Unicode data from bytes data, and for the latter keeping track of what encoding it’s supposed to be in. People give entire talks about this. The Rust type system helps here (with Byte literals are especially useful when the encoding of the data is unknown or not UTF-8 (though probably ASCII-compatible, since most encodings are). Therefore, having byte literals implicitly assume UTF-8 is doing a disservice to the developers IMO, as it might be a source of mojibake or other encoding-related bugs. If the data is known to be UTF-8, why not use
The fact that source code is stored as UTF-8 on disk is irrelevant. The parser works with Unicode input, and could use any encoding to encode the value of byte literals to bytes.
By the way, I think it should be renamed to |
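To make the mojibake concern concrete, here is a small sketch written against current stable Rust (an assumption, since the thread predates 1.0). It gets UTF-8 bytes explicitly from a `str` with `as_bytes`, then shows what happens when those bytes are read back under a different encoding. The helpers `decode_latin1` and `read_as_latin1` are hypothetical names introduced for illustration.

```rust
/// Decodes bytes as Latin-1 (ISO-8859-1): each byte maps to the code point
/// with the same value, so decoding never fails, even on the "wrong" input.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Encodes `text` as UTF-8 (what `str::as_bytes` exposes) and then
/// misreads those bytes as Latin-1, which is the classic mojibake recipe.
fn read_as_latin1(text: &str) -> String {
    decode_latin1(text.as_bytes())
}
```

With these, `read_as_latin1("é")` yields the two characters `Ã©`, because the UTF-8 encoding of `é` is the bytes `0xC3 0xA9`. Going the other way, the Latin-1 encoding of `é` is the single byte `0xE9`, which `String::from_utf8` rejects as invalid UTF-8.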
I agree that tracking an encoding is important. If I am to deal with different encodings, I will always use As the presentation you linked says, the biggest problem related to str/unicode (or str/bytes) that I had with Python 2 was that they coerce to each other. This style of implicit conversion is strictly prohibited in Rust, like you said. (the distinction between And yeah, as you said, we're already assuming the byte literals to be ASCII-compatible (neither UTF-16 nor UTF-32). The ASCII code points in the literal are interpreted as their ASCII-encoded byte sequences. From a pedantic point of view, byte literals should be specified only using numbers, like Java does. But in reality, almost all encodings are ASCII-compatible, and that's why we made an assumption when interpreting the literals. What I suggest is that it is still quite safe to extend that assumption to UTF-8, IMHO. (It brings some convenience too, of course.) The downside of |
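For reference, a short sketch of the ASCII-only rule as it behaves in current stable Rust (an assumption; the compiler discussed in this thread is older). ASCII characters in a byte-string literal become their ASCII byte values, and any other byte has to be spelled with a `\xNN` escape. The function names are hypothetical.

```rust
/// An ASCII-only byte-string literal: each character is stored as its
/// ASCII byte value (0x61, 0x62, 0x63 here).
fn ascii_literal() -> &'static [u8] {
    b"abc"
}

/// Non-ASCII bytes written with `\xNN` escapes. The two escaped bytes are
/// the UTF-8 encoding of 'é', which is what the proposal would let you
/// write directly as a character.
fn escaped_literal() -> &'static [u8] {
    b"caf\xC3\xA9"
}
```

The escaped form carries no encoding assumption: the same syntax can hold bytes that are not valid UTF-8 at all.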
Right, that part doesn’t apply to Rust. But the more general point of "be careful what encoding your bytes are in" does.
Since we don’t have a separate type for byte strings, in Rust they’re
Yes.
I understand that, but I disagree. Admittedly my argument is kinda weak given that I have the opposite position on assuming ASCII-compatibility.
It can. The macro expands to a bytes literal, no function involved. |
The "native" Windows encoding is not ASCII-compatible, and Windows is arguably a very large market.
Byte literals currently don't have a static lifetime, so they can't be used in all situations where a &'static [u8] is needed. |
I tested it previously, but it failed with the following error: const a: &'static [u8] = bytes!("hello");
Is this intended? Or maybe it is a bug, but I could not understand what the error message says. 😢 |
@barosl: The bytes macro hasn't been fixed after the const change. This is an easy fix. |
Windows uses UTF-16 (kinda) internally for its APIs, but represented by 16-bit units
That sounds like a fixable bug, maybe? |
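A quick sketch of the "16-bit units" point, using `str::encode_utf16` from the standard library in current stable Rust (an assumption relative to the thread's date). The wrapper name `utf16_units` is hypothetical.

```rust
/// Encodes `text` as UTF-16 code units. The result is a sequence of `u16`
/// values, not bytes, so it does not fit in a `[u8]` byte literal as-is.
fn utf16_units(text: &str) -> Vec<u16> {
    text.encode_utf16().collect()
}
```

For ASCII input each unit is just the ASCII value widened to 16 bits, for example `"Hi"` becomes `[0x0048, 0x0069]`. Characters outside the Basic Multilingual Plane take two units (a surrogate pair), for example U+1F622 becomes `[0xD83D, 0xDE22]`.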
If we are to fix the If
bytes!("Hello") // Encoded to UTF-8 as before
bytes!("Hello", "utf-16") // Encoded to UTF-16, maybe for the Windows users?
bytes!("Hello", "latin-1") // To handle HTTP headers correctly (it may seem pedantic, though)
This kind of work can be done at compile time, right? And not every encoding has to be supported, IMO. Supporting only the frequently used encodings would still be useful. |
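A rough runtime sketch of what an encoding-aware `bytes!` could compute, written as an ordinary function in current stable Rust. Everything here is an assumption for illustration: the function `encode` is hypothetical, "utf-16" is taken to mean little-endian without a byte-order mark, and a real macro would do this at compile time and report errors as compile errors.

```rust
/// Hypothetical stand-in for the proposed `bytes!(text, encoding)` form.
/// Returns the encoded bytes, or an error message for unsupported
/// encodings and unrepresentable characters.
fn encode(text: &str, encoding: &str) -> Result<Vec<u8>, String> {
    match encoding {
        // A `str` is already UTF-8, so this is a plain copy.
        "utf-8" => Ok(text.as_bytes().to_vec()),
        // Assumption: little-endian byte order, no byte-order mark.
        "utf-16" => Ok(text
            .encode_utf16()
            .flat_map(|unit| unit.to_le_bytes())
            .collect()),
        // Latin-1 covers exactly U+0000..=U+00FF, one byte per character.
        "latin-1" => text
            .chars()
            .map(|c| {
                let code_point = c as u32;
                if code_point <= 0xFF {
                    Ok(code_point as u8)
                } else {
                    Err(format!("{:?} is not representable in Latin-1", c))
                }
            })
            .collect(),
        other => Err(format!("unsupported encoding: {}", other)),
    }
}
```

Under these assumptions, `encode("Hi", "utf-16")` gives `[0x48, 0x00, 0x69, 0x00]`, and `encode("€", "latin-1")` is an error because U+20AC lies outside Latin-1.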
That would limit the usefulness of bytes. |
Your proposal will allow neither local encodings like EUC-KR nor x86 executable code in byte literals. Then what's the point of extending it outside plain ASCII? The "motivation" section lacks convincing use cases. |
@barosl am I correct that this could be added backwards-compatibly after 1.0? |
@pnkfelix Yup, I believe so. |
(also, reviewing the comments on this PR, I have not seen any posts apart from the author's in favor of this feature. Of course an obviously good idea can be adopted by the core team regardless of what the feedback is on the RFC PR, but I do not see this as an obviously good idea...) |
@pnkfelix Yeah, I admit that the benefit of this PR can be somewhat subjective, though I still think byte literals don't have to be ASCII-only. It is too restrictive, in my humble opinion. |
ping @pnkfelix, what's the status? |
I think we should postpone this. (or just close it outright; @nodakai 's note above makes me wonder if that is in fact the right answer.) |
In the interest of keeping the RFC queue in good shape, I'm going to go ahead and close this. @barosl, thanks for making this PR, and do feel free to continue discussing these ideas in other forums -- perhaps some consensus could be reached to make a change like this later on. |
I made an RFC proposal based on my thoughts in #454. Please review it!
Rendered