RFC: Allow non-ASCII characters in byte literals #455

Closed
wants to merge 1 commit into from

barosl
Contributor

@barosl barosl commented Nov 9, 2014

I made an RFC proposal based on my thoughts in #454. Please review it!

Rendered

@Manishearth
Member

If this change is to be made, I suggest adding a lint for these literals: "Warning: this byte literal contains non-ASCII characters and will be assumed to be encoded in UTF-8. Use #[allow(something)] to turn this off."

@SimonSapin
Contributor

I disagree with this proposal.

I really want to live in a world where of course everything is UTF-8, but this is not the case today.

Mojibake is a common type of bug. Avoiding it requires a bit of rigor in separating Unicode data from bytes data, and for the latter keeping track of what encoding it’s supposed to be in. People give entire talks about this. The Rust type system helps here (with str being a separate type from [u8]).
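To make the mojibake point concrete, here is a minimal sketch of what happens when UTF-8 bytes are decoded with the wrong encoding. The helper `decode_latin1` is a hypothetical name introduced for illustration; it relies on Latin-1 mapping each byte to the code point of the same value.

```rust
/// Hypothetical helper: decode bytes as Latin-1 (ISO-8859-1), where every
/// byte maps to the Unicode code point with the same numeric value.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

fn main() {
    let original = "é";
    // The UTF-8 encoding of U+00E9 is the two bytes 0xC3 0xA9.
    let utf8_bytes = original.as_bytes();
    // Decoding those bytes as Latin-1 yields two unrelated characters.
    let garbled = decode_latin1(utf8_bytes);
    println!("{} -> {}", original, garbled); // prints "é -> Ã©"
}
```

Because `str` and `[u8]` are distinct types, the wrong decoding has to be written out explicitly, as above, rather than happening through an implicit coercion.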

Byte literals are especially useful when the encoding of the data is unknown or not UTF-8 (though probably ASCII-compatible, since most encodings are). Therefore, having byte literals implicitly assume UTF-8 is doing a disservice to the developers IMO, as it might be a source of mojibake or other encoding-related bugs. If the data is known to be UTF-8, why not use String and str?

As the encoding of the source code is forced to UTF-8, we can directly know the representation of the non-ASCII characters in the byte literals.

The fact that source code is stored as UTF-8 on disk is irrelevant. The parser works with Unicode input, and could use any encoding to encode the value of byte literals to bytes.

The second one ["안녕".as_bytes()] works, though it needs an additional function call. It may get some optimization in near future, but is still a bit bothersome.

str::as_bytes() has zero cost at run-time. Its only downside is that it cannot be used in a static context (e.g. to initialize a const), since we don’t have compile-time function evaluation yet.
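A small sketch of the zero-cost claim: `as_bytes` returns a view of the same memory as the `str`, so no copy or re-encoding takes place. The helper name `shares_storage` is an assumption made for this example.

```rust
/// Hypothetical helper: true when `as_bytes` hands back the very same
/// memory region as the `str` it was called on (no copy, no re-encoding).
fn shares_storage(s: &str) -> bool {
    let bytes = s.as_bytes();
    bytes.as_ptr() == s.as_ptr() && bytes.len() == s.len()
}

fn main() {
    let greeting = "안녕";
    assert!(shares_storage(greeting));
    // The bytes are the UTF-8 encoding of the two Hangul syllables.
    assert_eq!(
        greeting.as_bytes(),
        &[0xECu8, 0x95, 0x88, 0xEB, 0x85, 0x95][..]
    );
}
```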

By the way, I think it should be renamed to as_utf8 (rust-lang/rust#14131) to make the choice of encoding explicit.

@barosl
Contributor Author

barosl commented Nov 10, 2014

@SimonSapin

I agree that tracking an encoding is important. If I am to deal with different encodings, I will always use str and encode/decode it using an appropriate encoding. But if the participating systems are ensured to use UTF-8, I will use the (UTF-8-assumed) byte literals happily.

As the presentation you linked says, the biggest problem related to str/unicode (or str/bytes) that I had with Python 2 was that they coerce to each other. This style of implicit conversion is strictly prohibited in Rust, as you said (the distinction between str and [u8]), so I think the risk is low.

And yeah, as you said, we're already assuming that byte literals are ASCII-compatible (neither UTF-16 nor UTF-32). The ASCII code points in the literal are interpreted as their ASCII-encoded byte sequences. From a pedantic point of view, byte literals should be specified only using numbers, like Java does. But in reality, almost all encodings are ASCII-compatible, and that's why we made an assumption when interpreting the literals. What I suggest is that it is still quite safe to extend that assumption to UTF-8, IMHO. (It brings some convenience too, of course.)
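As a minimal sketch of the ASCII assumption described above: each ASCII character in a byte literal becomes the byte with the same numeric value. The helper `fits_in_byte_literal` is a hypothetical name, used only to show which strings the ASCII-only rule accepts.

```rust
/// Hypothetical helper: true when `s` could be written as a `b"..."`
/// literal under the ASCII-only rule.
fn fits_in_byte_literal(s: &str) -> bool {
    s.is_ascii()
}

fn main() {
    let single: u8 = b'A';
    let text: &[u8] = b"Hi";
    // Each ASCII character maps to its ASCII code.
    assert_eq!(single, 65);
    assert_eq!(text, &[72u8, 105][..]);

    assert!(fits_in_byte_literal("hello"));
    // Writing b"안녕" directly is a compile error, which is the
    // restriction this RFC proposes to lift.
    assert!(!fits_in_byte_literal("안녕"));
}
```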

I hadn't thought of the downside that str::as_bytes() cannot be used in a static context. It is definitely an issue. One more thing, though I think it is a separate issue: even bytes! cannot be used in that context. How should we deal with it?

@SimonSapin
Contributor

the biggest problem related to str/unicode (or str/bytes) that I had with Python 2 was that they coerce to each other

Right, that part doesn’t apply to Rust. But the more general point of "be careful what encoding your bytes are in" does.

byte literals should be specified only using numbers

Since we don’t have a separate type for byte strings, in Rust they’re [u8] so you can use array literals.
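A minimal sketch of the array-literal approach, assuming the goal is the UTF-8 encoding of "안녕". The names `HELLO_UTF8` and `hello_utf8` are made up for this example.

```rust
// UTF-8 bytes of "안녕", spelled out numerically in an array literal.
// The name HELLO_UTF8 is an assumption for this sketch.
static HELLO_UTF8: [u8; 6] = [0xEC, 0x95, 0x88, 0xEB, 0x85, 0x95];

/// Borrow the static array as a byte slice.
fn hello_utf8() -> &'static [u8] {
    &HELLO_UTF8
}

fn main() {
    // Decoding the numeric bytes as UTF-8 gives back the original text.
    assert_eq!(std::str::from_utf8(hello_utf8()), Ok("안녕"));
}
```

The encoding is fully explicit here, at the cost of the literal no longer being readable as text.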

Almost all encodings are ASCII-compatible and that's why we made an assumption on interpreting the literals.

Yes.

And what I suggest is that it is still quite safe if we extend it to UTF-8

I understand that, but I disagree. Admittedly my argument is kinda weak given that I have the opposite position on assuming ASCII-compatibility.

even bytes! cannot be used in [static] context

It can. The macro expands to a bytes literal, no function involved.

@mahkoh
Contributor

mahkoh commented Nov 11, 2014

Almost all encodings are ASCII-compatible

The "native" Windows encoding is not, and Windows is arguably a very large market.

It can. The macro expands to a bytes literal

Byte literals currently don't have static lifetime, so they can't be used in all situations where a &'static [u8] is needed.

@barosl
Contributor Author

barosl commented Nov 11, 2014

even bytes! cannot be used in [static] context

It can. The macro expands to a bytes literal, no function involved.

I tested it previously, but it failed with the following error:

const a: &'static [u8] = bytes!("hello");
bytes.rs:2:26: 2:42 error: constants cannot refer to other statics, insert an intermediate constant instead
bytes.rs:2 const a: &'static [u8] = bytes!("hello");

Is this intended? Or maybe it is a bug, but I could not understand what the error message says. 😢
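For comparison, a sketch of the same constant written with a plain byte string literal instead of bytes!. This assumes a recent stable toolchain, where byte string literals are `'static` and `str::as_bytes` is usable in constants; neither was the case for the compiler discussed in this thread. The constant names are arbitrary.

```rust
// A byte string literal used directly as a constant.
const A: &'static [u8] = b"hello";

// Assumes a toolchain where `str::as_bytes` is a `const fn`.
const B: &[u8] = "hello".as_bytes();

fn main() {
    // Both constants hold the same five ASCII bytes.
    assert_eq!(A, B);
    assert_eq!(A.len(), 5);
}
```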

@mahkoh
Contributor

mahkoh commented Nov 11, 2014

@barosl: The bytes macro hasn't been fixed after the const change. This is an easy fix.

@SimonSapin
Contributor

Almost all encodings are ASCII-compatible

The "native" Windows encoding is not, and Windows is arguably a very large market.

Windows uses UTF-16 (kinda) internally for its APIs, but represented by 16-bit units [u16] (rather than [u8] where you’d have to deal with endianness). So it’s incompatible with bytes literals anyway. I don’t believe it’s that common, even on Windows, to use UTF-16 for bytes-oriented I/O.
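A short sketch of the distinction drawn above: UTF-16 is naturally a sequence of 16-bit units, and turning it into bytes forces a choice of endianness. The helper names `utf16_units` and `utf16_bytes` are assumptions for illustration.

```rust
/// Hypothetical helper: the UTF-16 code units of `s`, as `u16` values.
fn utf16_units(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

/// Hypothetical helper: serialize the UTF-16 code units to bytes,
/// which requires picking a byte order.
fn utf16_bytes(s: &str, little_endian: bool) -> Vec<u8> {
    s.encode_utf16()
        .flat_map(|unit| {
            if little_endian {
                unit.to_le_bytes()
            } else {
                unit.to_be_bytes()
            }
        })
        .collect()
}

fn main() {
    assert_eq!(utf16_units("Hi"), [0x0048, 0x0069]);
    // The same text gives different byte sequences per byte order.
    assert_eq!(utf16_bytes("Hi", true), [0x48, 0x00, 0x69, 0x00]);
    assert_eq!(utf16_bytes("Hi", false), [0x00, 0x48, 0x00, 0x69]);
}
```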

It can. The macro expands to a bytes literal

Byte literals currently don't have static lifetime, so they can't be used in all situations where a &'static [u8] is needed.

That sounds like a fixable bug, maybe?

@barosl
Contributor Author

barosl commented Nov 11, 2014

If we are to fix the bytes! macro, what will happen to its "deprecated" status? I thought it would be removed before we reach 1.0.

If bytes! is going to stay, what about this? We could add an (optional?) argument to bytes! that specifies the encoding used to turn the string into a byte literal. Like this:

bytes!("Hello") // Encoded to UTF-8 as before
bytes!("Hello", "utf-16") // Encoded to UTF-16, maybe for the Windows users?
bytes!("Hello", "latin-1") // To handle HTTP headers correctly (It may seem pedantic, though)

This kind of work can be done at compile time, right? Not every encoding has to be supported, IMO; supporting only the frequently used ones would still be useful.
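To illustrate the idea of an encoding hint, here is a rough run-time sketch. It is not the proposed macro: `Encoding` and `encode` are hypothetical names, the work happens at run time rather than at compile time, and only three encodings are covered.

```rust
/// Hypothetical set of supported encodings for this sketch.
#[derive(Clone, Copy)]
enum Encoding {
    Utf8,
    Utf16Le,
    Latin1,
}

/// Hypothetical run-time stand-in for `bytes!(text, encoding)`.
/// Returns `None` when `text` is not representable in `encoding`.
fn encode(text: &str, encoding: Encoding) -> Option<Vec<u8>> {
    match encoding {
        Encoding::Utf8 => Some(text.as_bytes().to_vec()),
        Encoding::Utf16Le => Some(
            text.encode_utf16()
                .flat_map(|unit| unit.to_le_bytes())
                .collect(),
        ),
        // Latin-1 covers exactly the code points U+0000 through U+00FF.
        Encoding::Latin1 => text
            .chars()
            .map(|c| {
                let cp = c as u32;
                if cp <= 0xFF { Some(cp as u8) } else { None }
            })
            .collect(),
    }
}

fn main() {
    assert_eq!(encode("Hello", Encoding::Utf8), Some(b"Hello".to_vec()));
    assert_eq!(encode("Hi", Encoding::Utf16Le), Some(vec![0x48, 0, 0x69, 0]));
    assert_eq!(encode("é", Encoding::Latin1), Some(vec![0xE9]));
    // Hangul is outside Latin-1, so the encoding fails.
    assert_eq!(encode("안", Encoding::Latin1), None);
}
```

Returning `Option` makes the unrepresentable case explicit; a compile-time macro would presumably report it as an error instead.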

@mahkoh
Contributor

mahkoh commented Nov 11, 2014

That would limit the usefulness of bytes.

@pnkfelix pnkfelix self-assigned this Nov 13, 2014
@nodakai

nodakai commented Nov 19, 2014

Your proposal will allow neither local encodings like EUC-KR nor x86 executable code in byte literals. Then what's the point of extending it outside plain ASCII? The "motivation" section lacks convincing use cases.
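A sketch of the two use cases mentioned above, written with `\x` escapes, which byte literals already accept. The constant names are made up, and the EUC-KR byte values for "안녕" are an assumption that has not been checked against a codec.

```rust
// Assumed EUC-KR encoding of "안녕"; the byte values are an assumption
// for this sketch, not verified against a codec.
const HELLO_EUC_KR: &[u8] = b"\xBE\xC8\xB3\xE7";

// x86 machine code: NOP (0x90) followed by RET (0xC3).
const NOP_RET: &[u8] = b"\x90\xC3";

fn main() {
    // Neither sequence is valid UTF-8, so neither could be expressed by
    // a byte literal that only admits UTF-8-encoded characters.
    assert!(std::str::from_utf8(HELLO_EUC_KR).is_err());
    assert!(std::str::from_utf8(NOP_RET).is_err());
}
```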

@pnkfelix
Member

@barosl am I correct that this could be added backwards-compatibly after 1.0?

@barosl
Contributor Author

barosl commented Dec 12, 2014

@pnkfelix Yup, I believe so.

@pnkfelix
Member

(also, reviewing the comments on this PR, I have not seen any posts apart from the author's in favor of this feature. Of course an obviously good idea can be adopted by the core team regardless of what the feedback is on the RFC PR, but I do not see this as an obviously good idea...)

@barosl
Contributor Author

barosl commented Dec 12, 2014

@pnkfelix Yeah, I admit that the benefit of this PR can be somewhat subjective, though I still think byte literals don't have to be ASCII-only. It is too restrictive, in my humble opinion.

@aturon
Member

aturon commented Mar 5, 2015

ping @pnkfelix, what's the status?

@pnkfelix
Member

pnkfelix commented Mar 5, 2015

I think we should postpone this.

(or just close it outright; @nodakai's note above makes me wonder if that is in fact the right answer.)

@aturon
Member

aturon commented Mar 10, 2015

In the interest of keeping the RFC queue in good shape, I'm going to go ahead and close this.

@barosl, thanks for making this PR, and do feel free to continue discussing these ideas in other forums -- perhaps some consensus could be reached to make a change like this later on.

@aturon aturon closed this Mar 10, 2015
7 participants