Description
Update: the first post (which you're reading) is imprecise in some ways, especially the use of the word "valid". Please skip ahead to this comment for a better take on the aims and goals here.
In generating some Unicode test data, I discovered that Zig doesn't allow surrogates, even when encoded with `\u`.
Minimal reproduction:

```zig
const invalid_str = "\u{d800}";
```
The error (Zig 0.12.0): "error: unicode escape does not correspond to a valid codepoint".
The error message is not correct: the UTF-16 surrogates are valid codepoints, in the category Other, Surrogate (Cs). Here's a property page for U+D800.
It makes sense to me that the parser should reject stray surrogates found raw in string data as not well-formed, just as it would balk at a control character or a bare `\xff`. Such random garbage is most likely a mistake. The same applies to overlong encodings: you could argue that they're "well encoded" in that they match the necessary pattern for a UTF-8-encoded codepoint, but the standard specifically forbids them. Unlike the surrogates, they are not codepoints.
But Zig does not demand that string data be well-formed UTF-8: the `\x` byte escape can represent arbitrary bytes within a string. In fact, `"\u{01}"` is valid, when the same byte embedded raw in the string would not be.
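As a sketch of that asymmetry (declaration names are mine; behavior as of Zig 0.12.0):

```zig
const std = @import("std");

// Accepted: `\x` escapes may put arbitrary, non-UTF-8 bytes in a string.
const arbitrary_bytes = "\xff\xfe";

// Accepted: U+0001 written as an escape, even though a raw 0x01 byte
// in the source of the literal would be rejected.
const control = "\u{01}";

test "escapes produce the expected bytes" {
    try std.testing.expect(arbitrary_bytes.len == 2 and arbitrary_bytes[0] == 0xff);
    try std.testing.expect(control.len == 1 and control[0] == 0x01);
}
```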
It doesn't make sense to me that a codepoint specifically written as e.g. `\u{d800}` should produce a compile error. That's the author affirmatively requesting that this codepoint be encoded into UTF-8 and added to the sequence; it's not the sort of thing that happens by accident. It has an exact interpretation, `.{ 0xed, 0xa0, 0x80 }`, which follows the UTF-8 encoding pattern. Contrast this with `\u{80000000}`, which can't be turned into bytes at all: that one is genuinely invalid. `.{ 0xc0, 0xaf }` is also invalid, despite having the superficial shape of a UTF-8-encoded codepoint, since it's overlong. There's no way to represent either of those in `U+` notation, so they're a bit off topic; the point is that I could stuff the overlong sequence into a string with `\xc0\xaf` and that would be fine with the compiler.
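To make the contrast concrete, here is a small sketch (identifiers are mine) of what the compiler accepts today when the same byte sequences are spelled with `\x`:

```zig
const std = @import("std");

// The three bytes \u{d800} would encode to, written out by hand: accepted.
const surrogate_bytes = "\xed\xa0\x80";

// The overlong two-byte sequence: also accepted as raw bytes.
const overlong_bytes = "\xc0\xaf";

// Rejected at parse time (Zig 0.12.0):
// const surrogate = "\u{d800}"; // error: unicode escape does not correspond to a valid codepoint

test "byte escapes pass through unchecked" {
    try std.testing.expect(surrogate_bytes.len == 3 and overlong_bytes.len == 2);
}
```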
There's no reason for the compiler to contain extra logic to prevent surrogates from being represented using `\u` notation, when it's specifically requested. In my case, it means that if I want test data which covers the entire three-byte range of Unicode, I must detect the surrogate ranges and special-case encode them as `\x` sequences. Or perhaps someone might be writing a fixup tool which detects an invalid encoding of surrogate pairs and produces correct UTF-8. There, too, the test data will have surrogates in it, and this behavior is an arbitrary limitation to work around. TL;DR: the compiler should accept `\u` notation for all codepoints in Unicode, surrogates included.
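For the test-data use case, the workaround looks roughly like the sketch below. It rests on my own assumptions: the function name, the writer-based interface, and the choice to emit the surrogate's three-byte pattern as `\x` escapes are all illustrative, not from the original report.

```zig
const std = @import("std");

// Emit a Zig escape sequence for `cp`: `\u{...}` normally, but raw `\x`
// byte escapes for surrogates, since `\u{...}` rejects them.
fn writeEscapedCodepoint(writer: anytype, cp: u21) !void {
    if (cp >= 0xD800 and cp <= 0xDFFF) {
        // Spell out the three bytes the UTF-8 pattern would use.
        try writer.print("\\x{x:0>2}\\x{x:0>2}\\x{x:0>2}", .{
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        });
    } else {
        try writer.print("\\u{{{x}}}", .{cp});
    }
}

test "surrogates fall back to byte escapes" {
    var buf: [32]u8 = undefined;
    var fbs = std.io.fixedBufferStream(&buf);
    try writeEscapedCodepoint(fbs.writer(), 0xD800);
    try std.testing.expectEqualStrings("\\xed\\xa0\\x80", fbs.getWritten());
}
```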