RFC: Allow non-ASCII characters in byte literals #455

barosl · 2014-11-09T14:44:34Z

I made an RFC proposal based on my thoughts in #454. Please review it!

Manishearth · 2014-11-09T21:39:13Z

If this change is to be made, I suggest having a lint for these literals: "Warning: this byte literal contains non-ASCII characters and will be assumed to be encoded in UTF-8. Use #[allow(something)] to turn this off."

SimonSapin · 2014-11-09T22:11:29Z

I disagree with this proposal.

I really want to live in a world where of course everything is UTF-8, but this is not the case today.

Mojibake is a common type of bug. Avoiding it requires a bit of rigor in separating Unicode data from bytes data, and for the latter keeping track of what encoding it’s supposed to be in. People give entire talks about this. The Rust type system helps here (with str being a separate type from [u8]).

Byte literals are especially useful when the encoding of the data is unknown or not UTF-8 (though probably ASCII-compatible, since most encodings are). Therefore, having byte literals implicitly assume UTF-8 is doing a disservice to the developers IMO, as it might be a source of mojibake or other encoding-related bugs. If the data is known to be UTF-8, why not use String and str?

> As the encoding of the source code is forced to UTF-8, we can directly know the representation of the non-ASCII characters in the byte literals.

The fact that source code is stored as UTF-8 on disk is irrelevant. The parser works with Unicode input, and could use any encoding to encode the value of byte literals to bytes.

> The second one ["안녕".as_bytes()] works, though it needs an additional function call. It may get some optimization in near future, but is still a bit bothersome.

str::as_bytes() has zero cost at run-time. Its only downside is that it cannot be used in a static context (e.g. to initialize a const), since we don’t have compile-time function evaluation yet.

By the way, I think it should be renamed to as_utf8 (rust-lang/rust#14131) to make the choice of encoding explicit.
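For illustration, here is a minimal sketch of the two spellings being compared, written against present-day Rust rather than the 2014 compiler discussed in this thread. The constant names are made up for the example, and the escape sequences are the UTF-8 encoding of 안녕 worked out by hand.

```rust
// The ASCII-only byte literal: the UTF-8 bytes of "안녕" written as escapes.
const ESCAPED: &[u8] = b"\xec\x95\x88\xeb\x85\x95";

// The string-literal route: `str::as_bytes` exposes the UTF-8 bytes of a `str`.
// On current Rust it is a `const fn`, so it can initialize a constant;
// that was not possible when this thread was written.
const FROM_STR: &[u8] = "안녕".as_bytes();

fn main() {
    // Both spellings name the same six bytes (two characters, three bytes each).
    assert_eq!(ESCAPED, FROM_STR);
    assert_eq!(FROM_STR.len(), 6);
    println!("{:?}", FROM_STR);
}
```

Either way the encoding is fixed to UTF-8; the difference is only whether the source spells that out as escapes or leaves it implied by the string literal.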

barosl · 2014-11-10T07:30:45Z

@SimonSapin

I agree that tracking an encoding is important. If I have to deal with different encodings, I will always use str and encode/decode it using an appropriate encoding. But if the participating systems are guaranteed to use UTF-8, I will happily use the (UTF-8-assumed) byte literals.

As the presentation you linked says, the biggest problem related to str/unicode (or str/bytes) that I had with Python 2 was that they coerce to each other. This style of implicit conversion is strictly prohibited in Rust, as you said (the distinction between str and [u8]), so I think the risk is low.

And yeah, as you said, we're already assuming the byte literals to be ASCII-compatible (that is, neither UTF-16 nor UTF-32). The ASCII code points in the literal are interpreted as their ASCII-encoded byte sequences. From a pedantic point of view, byte literals should be specified only using numbers, like Java does. But in reality, almost all encodings are ASCII-compatible, and that's why we made an assumption when interpreting the literals. What I suggest is that it is still quite safe if we extend it to UTF-8, IMHO. (It brings some convenience too, of course.)

The downside that str::as_bytes() cannot be used in a static context is something I hadn't thought of. It is definitely an issue. One more thing, though I think it is a different issue: even bytes! cannot be used in that context. How should we deal with it?

SimonSapin · 2014-11-11T15:56:23Z

> the biggest problem related to str/unicode (or, str/bytes) that I had with Python 2 was that they coerce to each other

Right, that part doesn’t apply to Rust. But the more general point of "be careful what encoding your bytes are in" does.

> byte literals should be specified only using numbers

Rust doesn’t have a separate type for byte strings; they’re [u8], so you can use array literals.

> Almost all encodings are ASCII-compatible and that's why we made an assumption on interpreting the literals.

Yes.

> And what I suggest is that it is still quite safe if we extend it to UTF-8

I understand that, but I disagree. Admittedly my argument is kinda weak given that I have the opposite position on assuming ASCII-compatibility.

> even bytes! cannot be used in [static] context

It can. The macro expands to a bytes literal, no function involved.
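As a rough sketch of the two points above (bytes spelled as numbers, and a byte literal used in a static context), the following compiles on present-day Rust. The constant names are invented for the example, and it says nothing about how the 2014 compiler or the bytes! macro behaved.

```rust
// An array literal spells the bytes as numbers, with no encoding assumed.
const AS_NUMBERS: [u8; 5] = [0x68, 0x65, 0x6c, 0x6c, 0x6f];

// A byte string literal spells the same bytes through their ASCII characters.
// Its type is `&'static [u8; 5]`, which coerces to a slice here.
const AS_LITERAL: &'static [u8] = b"hello";

fn main() {
    // Same bytes either way; only the notation differs.
    assert_eq!(&AS_NUMBERS[..], AS_LITERAL);
    println!("{:?}", AS_LITERAL);
}
```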

mahkoh · 2014-11-11T16:18:31Z

> Almost all encodings are ASCII-compatible

The "native" Windows encoding is not, and Windows is arguably a very large market.

> It can. The macro expands to a bytes literal

Byte literals currently don't have static lifetime so it can't be used in all situations where a &'static [u8] is needed.

barosl · 2014-11-11T16:23:00Z

> > even bytes! cannot be used in [static] context
>
> It can. The macro expands to a bytes literal, no function involved.

I tested it previously, but it failed with the following error:

const a: &'static [u8] = bytes!("hello");

bytes.rs:2:26: 2:42 error: constants cannot refer to other statics, insert an intermediate constant instead
bytes.rs:2 const a: &'static [u8] = bytes!("hello");

Is this intended? Or maybe it is a bug, but I could not understand what the error message says. 😢

mahkoh · 2014-11-11T16:24:13Z

@barosl: The bytes macro hasn't been fixed after the const change. This is an easy fix.

SimonSapin · 2014-11-11T16:27:27Z

> > Almost all encodings are ASCII-compatible
>
> The "native" Windows encoding is not, arguably a very large market.

Windows uses UTF-16 (kinda) internally for its APIs, but represented by 16-bit units [u16] (rather than [u8] where you’d have to deal with endianness). So it’s incompatible with bytes literals anyway. I don’t believe it’s that common, even on Windows, to use UTF-16 for bytes-oriented I/O.
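To make the endianness point concrete, here is a small sketch in present-day Rust. The helper names are hypothetical; `str::encode_utf16` and `u16::to_le_bytes` / `u16::to_be_bytes` are standard library methods.

```rust
// UTF-16 as 16-bit code units: no byte order is involved yet.
fn utf16_units(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

// Turning the units into bytes forces a choice of byte order.
fn utf16_bytes_le(s: &str) -> Vec<u8> {
    s.encode_utf16().flat_map(|u| u.to_le_bytes()).collect()
}

fn utf16_bytes_be(s: &str) -> Vec<u8> {
    s.encode_utf16().flat_map(|u| u.to_be_bytes()).collect()
}

fn main() {
    println!("{:04x?}", utf16_units("Hi"));
    println!("{:02x?}", utf16_bytes_le("Hi"));
    println!("{:02x?}", utf16_bytes_be("Hi"));
}
```

The same text yields two different byte sequences depending on byte order, which is why [u16] is the more natural representation for this encoding.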

> > It can. The macro expands to a bytes literal
>
> Byte literals currently don't have static lifetime so it can't be used in all situations where a &'static [u8] is needed.

That sounds like a fixable bug, maybe?

barosl · 2014-11-11T16:46:43Z

If we are to fix the bytes! macro, what will happen to its "deprecated" status? I thought it would be removed before we reach 1.0.

If bytes! is going to stay, what about this? We could add an (optional?) argument to bytes! that gives a hint for how to encode the string into a byte literal, like this:

bytes!("Hello") // Encoded to UTF-8 as before
bytes!("Hello", "utf-16") // Encoded to UTF-16, maybe for the Windows users?
bytes!("Hello", "latin-1") // To handle HTTP headers correctly (It may seem pedantic, though)

This kind of work can be done at compile time, right? And not every encoding has to be supported, IMO. Supporting only the frequently used encodings would still be useful.
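As a rough idea of what such an encoding hint would have to do, here is a run-time sketch in present-day Rust. It is not the bytes! macro and not a compile-time implementation: the `Encoding` enum and the `encode` function are hypothetical names, and "utf-16" is assumed to mean little-endian, which the proposal above does not specify.

```rust
// Hypothetical set of supported encodings (an assumption for this sketch).
#[derive(Clone, Copy)]
enum Encoding {
    Utf8,
    Utf16Le,
    Latin1,
}

// Encode `s` in the requested encoding.
// Returns None if a character cannot be represented (only possible for Latin-1).
fn encode(s: &str, encoding: Encoding) -> Option<Vec<u8>> {
    match encoding {
        // A Rust `str` is already UTF-8, so this is a plain copy.
        Encoding::Utf8 => Some(s.as_bytes().to_vec()),
        // Each 16-bit code unit becomes two bytes, low byte first.
        Encoding::Utf16Le => Some(s.encode_utf16().flat_map(|u| u.to_le_bytes()).collect()),
        // Latin-1 maps code points U+0000..=U+00FF to the byte of the same value.
        Encoding::Latin1 => s
            .chars()
            .map(|c| {
                let cp = c as u32;
                if cp <= 0xFF { Some(cp as u8) } else { None }
            })
            .collect(),
    }
}

fn main() {
    println!("{:02x?}", encode("Hello", Encoding::Utf8));
    println!("{:02x?}", encode("Hello", Encoding::Utf16Le));
    println!("{:02x?}", encode("\u{e9}", Encoding::Latin1));
    println!("{:?}", encode("안녕", Encoding::Latin1));
}
```

The Latin-1 branch shows a case the proposal would need to define: some strings simply have no representation in the requested encoding, so a compile-time version would have to report an error instead of returning None.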

mahkoh · 2014-11-11T16:47:43Z

That would limit the usefulness of bytes.

nodakai · 2014-11-19T00:21:48Z

Your proposal will allow neither local encodings like EUC-KR nor x86 executable code in byte literals. Then what's the point of extending it outside plain ASCII? The "motivation" section lacks convincing use cases.

pnkfelix · 2014-12-12T16:59:26Z

@barosl am I correct that this could be added backwards-compatibly after 1.0?

barosl · 2014-12-12T17:00:24Z

@pnkfelix Yup, I believe so.

pnkfelix · 2014-12-12T17:01:12Z

(also, reviewing the comments on this PR, I have not seen any posts apart from the author's in favor of this feature. Of course an obviously good idea can be adopted by the core team regardless of what the feedback is on the RFC PR, but I do not see this as an obviously good idea...)

barosl · 2014-12-12T17:07:33Z

@pnkfelix Yeah, I admit that the benefit of this PR can be somewhat subjective, though I still think byte literals don't have to be ASCII-only. It is too restrictive, in my humble opinion.

aturon · 2015-03-05T06:28:42Z

ping @pnkfelix, what's the status?

pnkfelix · 2015-03-05T15:11:08Z

I think we should postpone this.

(or just close it outright; @nodakai's note above makes me wonder if that is in fact the right answer.)

aturon · 2015-03-10T18:14:15Z

In the interest of keeping the RFC queue in good shape, I'm going to go ahead and close this.

@barosl, thanks for making this PR, and do feel free to continue discussing these ideas in other forums -- perhaps some consensus could be reached to make a change like this later on.

Add a proposal

16c40e3

barosl force-pushed the allow-non-ascii branch from 6adb291 to 16c40e3 Compare November 9, 2014 14:48

pnkfelix self-assigned this Nov 13, 2014

aturon closed this Mar 10, 2015
