
Allow non-ascii identifiers #4151

Closed
@thejoshwolfe

Here is a concrete proposal for #3947 (comment).

Background

All Zig code is always encoded in UTF-8, and this proposal does not change that.

This proposal does not change the interpretation of ASCII codepoints anywhere in Zig code.

The only non-ASCII codepoints with special handling in Zig before this proposal are U+0085 (NEL), U+2028 (LS), and U+2029 (PS). This proposal does not change the interpretation of these codepoints; they are not allowed in identifiers.

Proposal

Zig's current lexical rule for identifiers is:

IDENTIFIER
    <- !keyword ("c" !["\\] / [A-Zabd-z_]) [A-Za-z0-9_]* skip
     / "@\"" string_char* "\""                            skip

This proposal adds the codepoints listed in the table below to both the ranges [A-Zabd-z_] and [A-Za-z0-9_] in the above rule.

00A0
00A8
00AA
00AD
00AF
00B2..00B5
00B7..00BA
00BC..00BE
00C0..00D6
00D8..00F6
00F8..200D
202A..202F
203F..2040
2054
205F..218F
2460..24FF
2776..2793
2C00..2DFF
2E80..3000
3004..3007
3021..302F
3031..D7FF
F900..FD3D
FD40..FDCF
FDF0..FE44
FE47..FFFD
10000..1FFFD
20000..2FFFD
30000..3FFFD
40000..4FFFD
50000..5FFFD
60000..6FFFD
70000..7FFFD
80000..8FFFD
90000..9FFFD
A0000..AFFFD
B0000..BFFFD
C0000..CFFFD
D0000..DFFFD
E0000..EFFFD
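
To make the rule concrete, here is a minimal sketch in Python (not Zig, and not part of the proposal) that checks a string against the extended character classes. The ranges are copied from the table above; the function names `is_allowed_non_ascii` and `is_bare_identifier` are made up for illustration. It assumes the input is a single standalone token, so the `!keyword` check and the `"c" !["\\]` lookahead are not modeled, and the `@"..."` form is not handled.

```python
import bisect
import string

# Non-ASCII ranges from the table in this proposal (inclusive bounds).
_BMP_RANGES = [
    (0x00A0, 0x00A0), (0x00A8, 0x00A8), (0x00AA, 0x00AA), (0x00AD, 0x00AD),
    (0x00AF, 0x00AF), (0x00B2, 0x00B5), (0x00B7, 0x00BA), (0x00BC, 0x00BE),
    (0x00C0, 0x00D6), (0x00D8, 0x00F6), (0x00F8, 0x200D), (0x202A, 0x202F),
    (0x203F, 0x2040), (0x2054, 0x2054), (0x205F, 0x218F), (0x2460, 0x24FF),
    (0x2776, 0x2793), (0x2C00, 0x2DFF), (0x2E80, 0x3000), (0x3004, 0x3007),
    (0x3021, 0x302F), (0x3031, 0xD7FF), (0xF900, 0xFD3D), (0xFD40, 0xFDCF),
    (0xFDF0, 0xFE44), (0xFE47, 0xFFFD),
]
# 10000..1FFFD through E0000..EFFFD: planes 1 to 14, minus the last two
# codepoints of each plane.
_PLANE_RANGES = [(p << 16, (p << 16) | 0xFFFD) for p in range(1, 15)]

RANGES = _BMP_RANGES + _PLANE_RANGES  # already sorted by lower bound
_STARTS = [lo for lo, _ in RANGES]

# ASCII parts of the existing rule. "c" is treated as an ordinary start
# character here because the lookahead only matters inside a token stream.
_ASCII_START = set(string.ascii_letters + "_")
_ASCII_CONTINUE = _ASCII_START | set(string.digits)


def is_allowed_non_ascii(cp):
    """Return True if codepoint cp falls inside one of the table's ranges."""
    i = bisect.bisect_right(_STARTS, cp) - 1
    return i >= 0 and cp <= RANGES[i][1]


def is_bare_identifier(text):
    """Check text against the extended classes (keywords are not checked)."""
    if not text:
        return False
    for i, ch in enumerate(text):
        cp = ord(ch)
        if cp < 0x80:
            ok = ch in (_ASCII_START if i == 0 else _ASCII_CONTINUE)
        else:
            # The proposal adds the same set to both the start class
            # and the continue class.
            ok = is_allowed_non_ascii(cp)
        if not ok:
            return False
    return True
```

Under this sketch, `is_bare_identifier("größe")` and `is_bare_identifier("变量")` are accepted, while `"x×y"` is rejected because U+00D7 falls in the gap between `00C0..00D6` and `00D8..00F6`, and `"a\u2028b"` is rejected because U+2028 is outside every range.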

Explanation

This set of codepoints was determined by following the recommendation here: https://unicode.org/reports/tr31/#Immutable_Identifier_Syntax . Specifically, this is the set of all characters except characters meeting any of these criteria:

  • Pattern_White_Space=True
  • Pattern_Syntax=True
  • General_Category=Private_Use, Surrogate, or Control
  • Noncharacter_Code_Point=True

Unicode Character Data version 5.2.0 was used to generate this list, but the list can remain stable forever regardless of future versions of the Unicode Character Data, as per the recommendation and discussion in TR31 linked above. (EDIT: @daurnimator pointed out that 5.2.0 is many major versions behind, but even with the latest version, 12.1.0, the list of codepoints in this proposal is identical.)

The code I used to generate the above set of codepoints can be found here: https://github.com/ziglang/zig/blob/6f8e2fad94fde6c9a8c4ca52d964d0616690ee4c/tools/gen_id_char_table.py
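
As a rough illustration of the exclusion criteria (this is not the script linked above), the following Python sketch tests three of the four criteria using only the standard library. Pattern_Syntax is deliberately left out because Python's `unicodedata` module does not expose that property, so this check is partial: a codepoint that passes it may still be excluded by Pattern_Syntax. The function name `excluded_by_partial_criteria` is hypothetical.

```python
import unicodedata

# Pattern_White_Space=True codepoints (a small, stable set).
PATTERN_WHITE_SPACE = {
    0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0020,
    0x0085, 0x200E, 0x200F, 0x2028, 0x2029,
}


def is_noncharacter(cp):
    """Noncharacter_Code_Point=True: U+FDD0..U+FDEF and U+xxFFFE/U+xxFFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def excluded_by_partial_criteria(cp):
    """True if cp meets a criterion other than Pattern_Syntax (not checked)."""
    if cp in PATTERN_WHITE_SPACE or is_noncharacter(cp):
        return True
    # Co = Private_Use, Cs = Surrogate, Cc = Control
    return unicodedata.category(chr(cp)) in ("Co", "Cs", "Cc")
```

For example, U+2028 (Pattern_White_Space), U+E000 (Private_Use), U+D800 (Surrogate), U+0085 (Control and Pattern_White_Space), and U+FDD0 (noncharacter) are all flagged, while U+00E9 is not, which agrees with the table.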

Labels: proposal