Description
RFC 69 specifies that byte literals may contain only ASCII characters. This restriction has been there from the beginning, but I suggest changing it to allow any UTF-8 sequence in byte literals.
Following the history, it seems that this restriction originally came from rust-lang/rust#4334. It states that Python only allows ASCII inside the byte literals, to make a "very clear distinction" between bytes and strings. However, the original poster also says that this restriction may not be necessary.
In Rust, the source code is guaranteed to be UTF-8, so nothing prevents the compiler from interpreting byte literals as UTF-8. Python, on the other hand, had to be conservative about assuming UTF-8 because it allows source code encodings other than UTF-8.
Some would say that even if the source code is encoded as UTF-8, the intended encoding of a byte literal may differ. But we already make that assumption for string literals: they have the `str` type, which is just a `[u8]` guaranteed to be valid UTF-8.
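To make the point about string literals concrete, here is a minimal sketch showing that a string literal is already stored as UTF-8 bytes, and what the equivalent byte literal has to look like under the current ASCII-only rule. The helper names `string_literal_bytes` and `escaped_byte_literal` are made up for illustration.

```rust
/// Hypothetical helper: the bytes stored for the string literal "안녕".
fn string_literal_bytes() -> &'static [u8] {
    "안녕".as_bytes()
}

/// The same bytes written as a byte literal today: every non-ASCII
/// byte has to be spelled out as a `\xNN` escape.
fn escaped_byte_literal() -> &'static [u8] {
    b"\xEC\x95\x88\xEB\x85\x95"
}

fn main() {
    // Both literals denote the same six UTF-8 bytes.
    assert_eq!(string_literal_bytes(), escaped_byte_literal());
    println!("{:02X?}", string_literal_bytes());
}
```

Under the proposed change, the escaped form could presumably be written directly with the characters themselves and yield the same six bytes.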
The need for a "clear distinction" does not apply either, because that distinction should be made by the type system, not by allowing or forbidding certain characters. `"hello"` and `b"hello"` are different even though their characters (hello) are the same. Likewise, `"안녕"` and `b"안녕"` are clearly distinguishable although their characters are the same.
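As a rough illustration of how the type system already keeps the two kinds of literal apart, consider the sketch below. The functions `char_count` and `byte_len` are hypothetical, and the non-ASCII bytes are obtained through `as_bytes()` because a byte literal containing them does not compile today.

```rust
/// Hypothetical function that only accepts text.
fn char_count(s: &str) -> usize {
    s.chars().count()
}

/// Hypothetical function that only accepts raw bytes.
fn byte_len(b: &[u8]) -> usize {
    b.len()
}

fn main() {
    let text: &str = "hello";
    let bytes: &[u8; 5] = b"hello";

    // Same characters, same underlying bytes, different types.
    assert_eq!(text.as_bytes(), &bytes[..]);
    assert_eq!(char_count(text), 5);
    assert_eq!(byte_len(bytes), 5);

    // char_count(bytes); // would not compile: expected `&str`, found `&[u8; 5]`

    // For non-ASCII text the two views differ in length as well.
    assert_eq!(char_count("안녕"), 2);
    assert_eq!(byte_len("안녕".as_bytes()), 6);
}
```

The commented-out call is rejected at compile time regardless of which characters the literal contains, which is the distinction the argument above relies on.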