rustc_lexer's definition of ids are more general than lang ref's spec #85809

Closed
osa1 opened this issue May 29, 2021 · 3 comments

Comments

@osa1
Contributor

osa1 commented May 29, 2021

This is how Rust identifiers' lexical syntax is defined: https://doc.rust-lang.org/reference/identifiers.html
This is how the lexer for Rust identifiers is implemented:

/// True if `c` is valid as a first character of an identifier.
/// See [Rust language reference](https://doc.rust-lang.org/reference/identifiers.html) for
/// a formal definition of valid identifier name.
pub fn is_id_start(c: char) -> bool {
    // This is XID_Start OR '_' (which formally is not a XID_Start).
    // We also add fast-path for ascii idents
    ('a'..='z').contains(&c)
        || ('A'..='Z').contains(&c)
        || c == '_'
        || (c > '\x7f' && unicode_xid::UnicodeXID::is_xid_start(c))
}

/// True if `c` is valid as a non-first character of an identifier.
/// See [Rust language reference](https://doc.rust-lang.org/reference/identifiers.html) for
/// a formal definition of valid identifier name.
pub fn is_id_continue(c: char) -> bool {
    // This is exactly XID_Continue.
    // We also add fast-path for ascii idents
    ('a'..='z').contains(&c)
        || ('A'..='Z').contains(&c)
        || ('0'..='9').contains(&c)
        || c == '_'
        || (c > '\x7f' && unicode_xid::UnicodeXID::is_xid_continue(c))
}

/// The passed string is lexically an identifier.
pub fn is_ident(string: &str) -> bool {
    let mut chars = string.chars();
    if let Some(start) = chars.next() {
        is_id_start(start) && chars.all(is_id_continue)
    } else {
        false
    }
}

The specification says an identifier should start with an ASCII alphabetic character and continue with ASCII alphanumeric characters or underscores. The implementation, however, uses the Unicode default identifier syntax (XID_Start and XID_Continue, see http://www.unicode.org/reports/tr31/#Default_Identifier_Syntax), which is much more general than that as far as I understand.
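To make the gap concrete, here is a minimal sketch that compares the ASCII-only rule as described in this comment with a rough Unicode-aware check. The function names `is_ascii_ident` and `is_unicode_ident_approx` are hypothetical, and the second function is only an approximation: it uses the standard library's `char::is_alphabetic` and `char::is_alphanumeric` as stand-ins for XID_Start and XID_Continue, which are not the same sets that `unicode_xid` checks.

```rust
/// ASCII-only rule as described in the comment above: an ASCII letter,
/// followed by ASCII letters, digits, or underscores.
/// Hypothetical helper, not part of rustc_lexer.
fn is_ascii_ident(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(first) => {
            first.is_ascii_alphabetic()
                && chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
        }
        None => false,
    }
}

/// Rough stand-in for the lexer's rule.
/// ASSUMPTION: `char::is_alphabetic` and `char::is_alphanumeric` only
/// approximate XID_Start and XID_Continue; the real lexer uses `unicode_xid`.
fn is_unicode_ident_approx(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(first) => {
            (first == '_' || first.is_alphabetic())
                && chars.all(|c| c == '_' || c.is_alphanumeric())
        }
        None => false,
    }
}
```

Under these assumptions, a name such as `föö` is rejected by the ASCII-only check but accepted by the Unicode-aware one, which is the kind of mismatch described above.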

I think either the code or the language reference should be updated, but I'm not sure which one.

(I didn't check the lexing of other tokens. It might be useful to compare those with the language reference's definitions too.)

@PatchMixolydic
Contributor

The reference is outdated due to the stabilization of non-ASCII idents in 1.53 (tracking issue #55467).
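As an illustration of what that stabilization allows, the following sketch uses non-ASCII letters in a function name and its parameters. The names are made up for this example, and it assumes a toolchain of Rust 1.53 or newer.

```rust
/// Hypothetical helper whose name and parameters contain non-ASCII letters.
/// Accepted on Rust 1.53 and later; rejected by an ASCII-only identifier rule.
fn fläche(länge: u32, breite: u32) -> u32 {
    länge * breite
}
```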

@osa1
Contributor Author

osa1 commented May 29, 2021

Interesting, thanks @PatchMixolydic. I guess it would be helpful for the reader if that documentation at least had a line pointing to #55467.

@PatchMixolydic
Contributor

This seems to be fixed in the nightly version of the Reference.

@osa1 osa1 closed this as completed May 29, 2021