Allow non-ASCII characters in byte literals #454

Closed
@barosl

Description

RFC 69 specifies that byte literals may contain only ASCII characters. This restriction has been in place from the beginning, but I suggest relaxing it to allow any UTF-8 sequence in byte literals.

Tracing the history, this restriction seems to have come from rust-lang/rust#4334. That issue states that Python allows only ASCII inside byte literals, to make a "very clear distinction" between bytes and strings. However, the original poster also says that this restriction may not be necessary.

In Rust, source code is guaranteed to be UTF-8, so nothing prevents the compiler from interpreting the contents of byte literals as UTF-8. Python, on the other hand, had to be conservative about assuming UTF-8 because it allows source code encodings other than UTF-8.

Some would say that even if the source code encoding is UTF-8, the intended encoding of a byte literal may differ. But we already make the same assumption for string literals: they are given the type &str, and str is just a [u8] that is guaranteed to hold valid UTF-8.
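To illustrate the point about str being a UTF-8-checked [u8], here is a minimal sketch. The helper name utf8_bytes is hypothetical and exists only for this example. Because non-ASCII characters are currently rejected inside byte literals, the byte literal spells out the UTF-8 encoding of "안녕" with \x escapes.

```rust
// Hypothetical helper for illustration: view a string's UTF-8 bytes.
fn utf8_bytes(s: &str) -> &[u8] {
    s.as_bytes()
}

fn main() {
    // A string literal is stored as its UTF-8 encoding.
    let bytes = utf8_bytes("안녕");

    // The same six bytes, written as an ASCII-only byte literal.
    assert_eq!(bytes, b"\xec\x95\x88\xeb\x85\x95");

    // Going back from bytes to str only adds a UTF-8 validity check.
    assert_eq!(std::str::from_utf8(bytes), Ok("안녕"));

    // Arbitrary bytes are not necessarily valid UTF-8.
    assert!(std::str::from_utf8(b"\xff").is_err());
}
```

Under this proposal, b"안녕" would presumably denote the same six bytes as the escaped literal above.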

The need for a "clear distinction" does not apply here either, because the difference should be expressed through the type system, not by allowing or forbidding certain characters. "hello" and b"hello" are different even though their characters are the same ("hello"). Likewise, "안녕" and b"안녕" are clearly distinguishable although their characters are the same.
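A small sketch of how the type system already keeps the two kinds of literal apart, whatever characters they contain. The function names count_chars and count_bytes are hypothetical. Since b"안녕" is not accepted under the current rule, the example uses "안녕".as_bytes() as a stand-in for it.

```rust
// Hypothetical function that only accepts text.
fn count_chars(s: &str) -> usize {
    s.chars().count()
}

// Hypothetical function that only accepts raw bytes.
fn count_bytes(b: &[u8]) -> usize {
    b.len()
}

fn main() {
    // Same characters, different types.
    let text: &'static str = "hello";
    let bytes: &'static [u8; 5] = b"hello";

    assert_eq!(count_chars(text), 5);
    assert_eq!(count_bytes(bytes), 5);

    // Stand-in for the proposed b"안녕": two characters, six UTF-8 bytes.
    let korean_text: &str = "안녕";
    let korean_bytes: &[u8] = korean_text.as_bytes();

    assert_eq!(count_chars(korean_text), 2);
    assert_eq!(count_bytes(korean_bytes), 6);

    // Mixing them up is a compile-time type error, for example:
    // count_chars(bytes);      // expected `&str`, found `&[u8; 5]`
    // count_bytes(korean_text); // expected `&[u8]`, found `&str`
}
```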
