From 8d1e7bd9f4b812f95b45f2f2732017f6bcea959b Mon Sep 17 00:00:00 2001 From: Eric Huss Date: Wed, 9 Apr 2025 20:15:50 -0700 Subject: [PATCH 01/38] Add some missing Syntax markers --- src/trait-bounds.md | 1 + src/types.md | 1 + 2 files changed, 2 insertions(+) diff --git a/src/trait-bounds.md b/src/trait-bounds.md index badbda186..ec862ddbc 100644 --- a/src/trait-bounds.md +++ b/src/trait-bounds.md @@ -148,6 +148,7 @@ r[bound.higher-ranked] ## Higher-ranked trait bounds r[bound.higher-ranked.syntax] +> **Syntax**\ > _ForLifetimes_ :\ >    `for` [_GenericParams_] diff --git a/src/types.md b/src/types.md index 3b6ea8cfd..352d34462 100644 --- a/src/types.md +++ b/src/types.md @@ -103,6 +103,7 @@ r[type.name.parenthesized] ### Parenthesized types r[type.name.parenthesized.syntax] +> **Syntax**\ > _ParenthesizedType_ :\ >    `(` [_Type_] `)` From acd95af66ea0c3fe105de657668df7494f80e2ec Mon Sep 17 00:00:00 2001 From: Eric Huss Date: Wed, 9 Apr 2025 20:18:42 -0700 Subject: [PATCH 02/38] Add missing rules for syntax blocks --- src/attributes.md | 1 + src/comments.md | 3 ++- src/expressions/array-expr.md | 1 + src/expressions/block-expr.md | 2 ++ src/expressions/loop-expr.md | 1 + src/expressions/operator-expr.md | 1 + src/keywords.md | 1 + src/patterns.md | 1 + src/tokens.md | 1 + 9 files changed, 11 insertions(+), 1 deletion(-) diff --git a/src/attributes.md b/src/attributes.md index d81e50ced..89345f411 100644 --- a/src/attributes.md +++ b/src/attributes.md @@ -161,6 +161,7 @@ Various built-in attributes use different subsets of the meta item syntax to specify their inputs. 
The following grammar rules show some commonly used
forms:

+r[attributes.meta.builtin.syntax]
> **Syntax**\
> _MetaWord_:\
>    [IDENTIFIER]

diff --git a/src/comments.md b/src/comments.md
index b409d6117..208fe9a8d 100644
--- a/src/comments.md
+++ b/src/comments.md
@@ -1,6 +1,7 @@
-r[comments.syntax]
+r[comments]
# Comments

+r[comments.syntax]
> **Lexer**\
> LINE_COMMENT :\
>       `//` (~\[`/` `!` `\n`] | `//`) ~`\n`\*\

diff --git a/src/expressions/array-expr.md b/src/expressions/array-expr.md
index aa6a911b2..67bde3cb5 100644
--- a/src/expressions/array-expr.md
+++ b/src/expressions/array-expr.md
@@ -66,6 +66,7 @@ const EMPTY: Vec<i32> = Vec::new();
r[expr.array.index]
## Array and slice indexing expressions

+r[expr.array.index.syntax]
> **Syntax**\
> _IndexExpression_ :\
>    [_Expression_] `[` [_Expression_] `]`

diff --git a/src/expressions/block-expr.md b/src/expressions/block-expr.md
index a0ea1419a..9d8758f92 100644
--- a/src/expressions/block-expr.md
+++ b/src/expressions/block-expr.md
@@ -222,10 +222,12 @@ if false {
r[expr.block.unsafe]
## `unsafe` blocks

+r[expr.block.unsafe.syntax]
> **Syntax**\
> _UnsafeBlockExpression_ :\
>    `unsafe` _BlockExpression_

+r[expr.block.unsafe.intro]
_See [`unsafe` blocks] for more information on when to use `unsafe`_.

A block of code can be prefixed with the `unsafe` keyword to permit [unsafe operations].
diff --git a/src/expressions/loop-expr.md b/src/expressions/loop-expr.md index c3e3c1529..b7853b556 100644 --- a/src/expressions/loop-expr.md +++ b/src/expressions/loop-expr.md @@ -292,6 +292,7 @@ A `break` expression is only permitted in the body of a loop, and has one of the r[expr.loop.block-labels] ## Labelled block expressions +r[expr.loop.block-labels.syntax] > **Syntax**\ > _LabelBlockExpression_ :\ >    [_BlockExpression_] diff --git a/src/expressions/operator-expr.md b/src/expressions/operator-expr.md index 5fc1a93ce..9e7328e87 100644 --- a/src/expressions/operator-expr.md +++ b/src/expressions/operator-expr.md @@ -54,6 +54,7 @@ r[expr.operator.int-overflow.shift] r[expr.operator.borrow] ## Borrow operators +r[expr.operator.borrow.syntax] > **Syntax**\ > _BorrowExpression_ :\ >       (`&`|`&&`) [_Expression_]\ diff --git a/src/keywords.md b/src/keywords.md index cf2c5f8ac..525a49cd2 100644 --- a/src/keywords.md +++ b/src/keywords.md @@ -99,6 +99,7 @@ The following keywords are reserved beginning in the 2018 edition. > **Lexer 2018+**\ > KW_TRY : `try` +r[lex.keywords.reserved.edition2024] The following keywords are reserved beginning in the 2024 edition. > **Lexer 2024+**\ diff --git a/src/patterns.md b/src/patterns.md index 515c1fd58..8ece5d380 100644 --- a/src/patterns.md +++ b/src/patterns.md @@ -420,6 +420,7 @@ The wildcard pattern is always irrefutable. 
r[patterns.rest] ## Rest patterns +r[patterns.rest.syntax] > **Syntax**\ > _RestPattern_ :\ >    `..` diff --git a/src/tokens.md b/src/tokens.md index 2bb860ceb..d381aec5a 100644 --- a/src/tokens.md +++ b/src/tokens.md @@ -685,6 +685,7 @@ Examples of floating-point literals which are not accepted as literal expression r[lex.token.literal.reserved] #### Reserved forms similar to number literals +r[lex.token.literal.reserved.syntax] > **Lexer**\ > RESERVED_NUMBER :\ >       BIN_LITERAL \[`2`-`9`​]\ From f20820439e36b8f571ed427597b8d2d98929537b Mon Sep 17 00:00:00 2001 From: Eric Huss Date: Wed, 9 Apr 2025 20:23:52 -0700 Subject: [PATCH 03/38] Fix some minor grammar formatting issues Just fixing some small consistency and spacing mistakes. --- src/conditional-compilation.md | 8 ++++---- src/expressions/match-expr.md | 2 +- src/paths.md | 4 ++-- 3 files changed, 7 insertions(+), 7 deletions(-) diff --git a/src/conditional-compilation.md b/src/conditional-compilation.md index 4d9a4ea3b..62e771006 100644 --- a/src/conditional-compilation.md +++ b/src/conditional-compilation.md @@ -12,16 +12,16 @@ r[cfg.syntax] > _ConfigurationOption_ :\ >    [IDENTIFIER] (`=` ([STRING_LITERAL] | [RAW_STRING_LITERAL]))? > -> _ConfigurationAll_\ +> _ConfigurationAll_ :\ >    `all` `(` _ConfigurationPredicateList_? `)` > -> _ConfigurationAny_\ +> _ConfigurationAny_ :\ >    `any` `(` _ConfigurationPredicateList_? `)` > -> _ConfigurationNot_\ +> _ConfigurationNot_ :\ >    `not` `(` _ConfigurationPredicate_ `)` > -> _ConfigurationPredicateList_\ +> _ConfigurationPredicateList_ :\ >    _ConfigurationPredicate_ (`,` _ConfigurationPredicate_)\* `,`? 
r[cfg.general] diff --git a/src/expressions/match-expr.md b/src/expressions/match-expr.md index 9071d7bef..28c71d03a 100644 --- a/src/expressions/match-expr.md +++ b/src/expressions/match-expr.md @@ -9,7 +9,7 @@ r[expr.match.syntax] >       _MatchArms_?\ >    `}` > ->_Scrutinee_ :\ +> _Scrutinee_ :\ >    [_Expression_]_except struct expression_ > > _MatchArms_ :\ diff --git a/src/paths.md b/src/paths.md index 7ef97f88a..5d1743036 100644 --- a/src/paths.md +++ b/src/paths.md @@ -145,10 +145,10 @@ r[paths.type.syntax] >    _PathIdentSegment_ (`::`? ([_GenericArgs_] | _TypePathFn_))? > > _TypePathFn_ :\ -> `(` _TypePathFnInputs_? `)` (`->` [_TypeNoBounds_])? +>    `(` _TypePathFnInputs_? `)` (`->` [_TypeNoBounds_])? > > _TypePathFnInputs_ :\ -> [_Type_] (`,` [_Type_])\* `,`? +>    [_Type_] (`,` [_Type_])\* `,`? r[paths.type.intro] Type paths are used within type definitions, trait bounds, type parameter bounds, From b53a9eeed9c534cb868ef6a58f01fb866a9b0ab6 Mon Sep 17 00:00:00 2001 From: Eric Huss Date: Wed, 9 Apr 2025 20:24:44 -0700 Subject: [PATCH 04/38] Fix CfgAttribute name This rule was misnamed, colliding with the existing CfgAttrAttribute. --- src/conditional-compilation.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/conditional-compilation.md b/src/conditional-compilation.md index 62e771006..5e7fe0775 100644 --- a/src/conditional-compilation.md +++ b/src/conditional-compilation.md @@ -314,7 +314,7 @@ r[cfg.attr] r[cfg.attr.syntax] > **Syntax**\ -> _CfgAttrAttribute_ :\ +> _CfgAttribute_ :\ >    `cfg` `(` _ConfigurationPredicate_ `)` From 27e1ec97a75267d3c9efb9c91c7509eff98d11db Mon Sep 17 00:00:00 2001 From: Eric Huss Date: Wed, 9 Apr 2025 20:32:08 -0700 Subject: [PATCH 05/38] Rename IsolatedCR to CR This renames IsolatedCR to CR. I felt like it wasn't exactly necessary since we have rewritten things so that it is clear that there is an input transformation which resolves this (`input.crlf`). We also never really defined what it meant. 
I also felt like there was room for confusion. For example, an input
containing `CR CR LF` would get normalized to `CR LF`. The `CR` there
is not isolated.
---
src/comments.md | 11 ++++-------
src/notation.md |  7 +++++++
src/tokens.md   | 12 ++++++------
3 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/src/comments.md b/src/comments.md
index 208fe9a8d..189343a73 100644
--- a/src/comments.md
+++ b/src/comments.md
@@ -14,25 +14,22 @@ r[comments.syntax]
>    | `/***/`
>
> INNER_LINE_DOC :\
->    `//!` ~\[`\n` _IsolatedCR_]\*
+>    `//!` ~\[`\n` CR]\*
>
> INNER_BLOCK_DOC :\
->    `/*!` ( _BlockCommentOrDoc_ | ~\[`*/` _IsolatedCR_] )\* `*/`
+>    `/*!` ( _BlockCommentOrDoc_ | ~\[`*/` CR] )\* `*/`
>
> OUTER_LINE_DOC :\
->    `///` (~`/` ~\[`\n` _IsolatedCR_]\*)?
+>    `///` (~`/` ~\[`\n` CR]\*)?
>
> OUTER_BLOCK_DOC :\
>    `/**` (~`*` | _BlockCommentOrDoc_ )
-> (_BlockCommentOrDoc_ | ~\[`*/` _IsolatedCR_])\* `*/`
+> (_BlockCommentOrDoc_ | ~\[`*/` CR])\* `*/`
>
> _BlockCommentOrDoc_ :\
>       BLOCK_COMMENT\
>    | OUTER_BLOCK_DOC\
>    | INNER_BLOCK_DOC
->
-> _IsolatedCR_ :\
->    \\r

r[comments.normal]
## Non-doc comments

diff --git a/src/notation.md b/src/notation.md
index cb3d8f606..f3554602a 100644
--- a/src/notation.md
+++ b/src/notation.md
@@ -35,6 +35,13 @@ When such a string in `monospace` font occurs inside the grammar, it is an
implicit reference to a single member of such a string table
production. See [tokens] for more information.

+## Common productions
+
+The following are common definitions used in the grammar.
+ +> **Lexer**\ +> CR : U+000D + [binary operators]: expressions/operator-expr.md#arithmetic-and-logical-binary-operators [keywords]: keywords.md [tokens]: tokens.md diff --git a/src/tokens.md b/src/tokens.md index d381aec5a..7f4227375 100644 --- a/src/tokens.md +++ b/src/tokens.md @@ -157,7 +157,7 @@ r[lex.token.literal.str.syntax] > **Lexer**\ > STRING_LITERAL :\ >    `"` (\ ->       ~\[`"` `\` _IsolatedCR_]\ +>       ~\[`"` `\` CR]\ >       | QUOTE_ESCAPE\ >       | ASCII_ESCAPE\ >       | UNICODE_ESCAPE\ @@ -220,7 +220,7 @@ r[lex.token.literal.str-raw.syntax] >    `r` RAW_STRING_CONTENT SUFFIX? > > RAW_STRING_CONTENT :\ ->       `"` ( ~ _IsolatedCR_ )* (non-greedy) `"`\ +>       `"` ( ~ CR )* (non-greedy) `"`\ >    | `#` RAW_STRING_CONTENT `#` r[lex.token.literal.str-raw.intro] @@ -285,7 +285,7 @@ r[lex.token.str-byte.syntax] >    `b"` ( ASCII_FOR_STRING | BYTE_ESCAPE | STRING_CONTINUE )\* `"` SUFFIX? > > ASCII_FOR_STRING :\ ->    _any ASCII (i.e 0x00 to 0x7F), except_ `"`, `\` _and IsolatedCR_ +>    _any ASCII (i.e 0x00 to 0x7F) except_ `"`, `\`, _or CR_ r[lex.token.str-byte.intro] A non-raw _byte string literal_ is a sequence of ASCII characters and _escapes_, @@ -337,7 +337,7 @@ r[lex.token.str-byte-raw.syntax] >    | `#` RAW_BYTE_STRING_CONTENT `#` > > ASCII_FOR_RAW :\ ->    _any ASCII (i.e. 0x00 to 0x7F) except IsolatedCR_ +>    _any ASCII (i.e. 0x00 to 0x7F) except CR_ r[lex.token.str-byte-raw.intro] Raw byte string literals do not process any escapes. They start with the @@ -377,7 +377,7 @@ r[lex.token.str-c.syntax] > **Lexer**\ > C_STRING_LITERAL :\ >    `c"` (\ ->       ~\[`"` `\` _IsolatedCR_ _NUL_]\ +>       ~\[`"` `\` CR _NUL_]\ >       | BYTE_ESCAPE _except `\0` or `\x00`_\ >       | UNICODE_ESCAPE _except `\u{0}`, `\u{00}`, …, `\u{000000}`_\ >       | STRING_CONTINUE\ @@ -453,7 +453,7 @@ r[lex.token.str-c-raw.syntax] >    `cr` RAW_C_STRING_CONTENT SUFFIX? 
>
> RAW_C_STRING_CONTENT :\
->       `"` ( ~\[_IsolatedCR_ _NUL_] )* (non-greedy) `"`\
+>       `"` ( ~\[CR NUL] )* (non-greedy) `"`\
>    | `#` RAW_C_STRING_CONTENT `#`

r[lex.token.str-c-raw.intro]

From c1faa76e17306d5a65131b35d7498b9ffaaa75ab Mon Sep 17 00:00:00 2001
From: Eric Huss
Date: Wed, 9 Apr 2025 20:37:57 -0700
Subject: [PATCH 06/38] Name common ascii control characters

This removes all backslash escaped characters. This helps to avoid
confusing similarities with a literal backslash followed by a character
versus the interpreted escaped character.
---
src/comments.md | 6 +++---
src/notation.md | 6 ++++++
src/tokens.md   | 6 +++---
3 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/src/comments.md b/src/comments.md
index 189343a73..92a82190a 100644
--- a/src/comments.md
+++ b/src/comments.md
@@ -4,7 +4,7 @@ r[comments]
r[comments.syntax]
> **Lexer**\
> LINE_COMMENT :\
->       `//` (~\[`/` `!` `\n`] | `//`) ~`\n`\*\
+>       `//` (~\[`/` `!` LF] | `//`) ~LF\*\
>    | `//`
>
> BLOCK_COMMENT :\
@@ -14,13 +14,13 @@ r[comments.syntax]
>    | `/***/`
>
> INNER_LINE_DOC :\
->    `//!` ~\[`\n` CR]\*
+>    `//!` ~\[LF CR]\*
>
> INNER_BLOCK_DOC :\
>    `/*!` ( _BlockCommentOrDoc_ | ~\[`*/` CR] )\* `*/`
>
> OUTER_LINE_DOC :\
->    `///` (~`/` ~\[`\n` CR]\*)?
+>    `///` (~`/` ~\[LF CR]\*)?
>
> OUTER_BLOCK_DOC :\
>    `/**` (~`*` | _BlockCommentOrDoc_ )

diff --git a/src/notation.md b/src/notation.md
index f3554602a..f15cefa0b 100644
--- a/src/notation.md
+++ b/src/notation.md
@@ -40,6 +40,12 @@ The following are common definitions used in the grammar.
> **Lexer**\ +> NUL : U+0000 +> +> TAB : U+0009 +> +> LF : U+000A +> > CR : U+000D [binary operators]: expressions/operator-expr.md#arithmetic-and-logical-binary-operators diff --git a/src/tokens.md b/src/tokens.md index 7f4227375..0b870f0c3 100644 --- a/src/tokens.md +++ b/src/tokens.md @@ -133,7 +133,7 @@ r[lex.token.literal.char] r[lex.token.literal.char.syntax] > **Lexer**\ > CHAR_LITERAL :\ ->    `'` ( ~\[`'` `\` \\n \\r \\t] | QUOTE_ESCAPE | ASCII_ESCAPE | UNICODE_ESCAPE ) `'` SUFFIX? +>    `'` ( ~\[`'` `\` LF CR TAB] | QUOTE_ESCAPE | ASCII_ESCAPE | UNICODE_ESCAPE ) `'` SUFFIX? > > QUOTE_ESCAPE :\ >    `\'` | `\"` @@ -165,7 +165,7 @@ r[lex.token.literal.str.syntax] >    )\* `"` SUFFIX? > > STRING_CONTINUE :\ ->    `\` _followed by_ \\n +>    `\` _followed by_ LF r[lex.token.literal.str.intro] A _string literal_ is a sequence of any Unicode characters enclosed within two @@ -262,7 +262,7 @@ r[lex.token.byte.syntax] >    `b'` ( ASCII_FOR_CHAR | BYTE_ESCAPE ) `'` SUFFIX? > > ASCII_FOR_CHAR :\ ->    _any ASCII (i.e. 0x00 to 0x7F), except_ `'`, `\`, \\n, \\r or \\t +>    _any ASCII (i.e. 0x00 to 0x7F) except_ `'`, `\`, LF, CR, or TAB > > BYTE_ESCAPE :\ >       `\x` HEX_DIGIT HEX_DIGIT\ From 94acc5e925281fe3e5270fc73399ef9569387230 Mon Sep 17 00:00:00 2001 From: Eric Huss Date: Wed, 9 Apr 2025 20:38:57 -0700 Subject: [PATCH 07/38] Remove "followed by" in STRING_CONTINUE I don't exactly know why this was placed there, but we operate under the assumption that all lexical characters immediately follow one another. --- src/tokens.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/tokens.md b/src/tokens.md index 0b870f0c3..b1fa23091 100644 --- a/src/tokens.md +++ b/src/tokens.md @@ -165,7 +165,7 @@ r[lex.token.literal.str.syntax] >    )\* `"` SUFFIX? 
>
> STRING_CONTINUE :\
->    `\` _followed by_ LF
+>    `\` LF

r[lex.token.literal.str.intro]
A _string literal_ is a sequence of any Unicode characters enclosed within two

From 86a49fc0ab8ea0df3357f4f3f4ea78366de9bc86 Mon Sep 17 00:00:00 2001
From: Eric Huss
Date: Wed, 9 Apr 2025 20:46:58 -0700
Subject: [PATCH 08/38] Introduce a new "prose" terminal

This introduces a new terminal kind that I'm calling a "prose" which
describes what the terminal is. This is inspired by the IETF format
which uses angle brackets to describe terminals in English.
---
src/tokens.md | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/src/tokens.md b/src/tokens.md
index b1fa23091..7ea5748ac 100644
--- a/src/tokens.md
+++ b/src/tokens.md
@@ -262,7 +262,7 @@ r[lex.token.byte.syntax]
>    `b'` ( ASCII_FOR_CHAR | BYTE_ESCAPE ) `'` SUFFIX?
>
> ASCII_FOR_CHAR :\
->    _any ASCII (i.e. 0x00 to 0x7F) except_ `'`, `\`, LF, CR, or TAB
+>    \<any ASCII (i.e. 0x00 to 0x7F) except `'`, `\`, LF, CR, or TAB\>
>
> BYTE_ESCAPE :\
>       `\x` HEX_DIGIT HEX_DIGIT\
@@ -285,7 +285,7 @@ r[lex.token.str-byte.syntax]
>    `b"` ( ASCII_FOR_STRING | BYTE_ESCAPE | STRING_CONTINUE )\* `"` SUFFIX?
>
> ASCII_FOR_STRING :\
->    _any ASCII (i.e 0x00 to 0x7F) except_ `"`, `\`, _or CR_
+>    \<any ASCII (i.e. 0x00 to 0x7F) except `"`, `\`, or CR\>

r[lex.token.str-byte.intro]
A non-raw _byte string literal_ is a sequence of ASCII characters and _escapes_,
@@ -337,7 +337,7 @@ r[lex.token.str-byte-raw.syntax]
>    | `#` RAW_BYTE_STRING_CONTENT `#`
>
> ASCII_FOR_RAW :\
->    _any ASCII (i.e. 0x00 to 0x7F) except CR_
+>    \<any ASCII (i.e. 0x00 to 0x7F) except CR\>

r[lex.token.str-byte-raw.intro]
Raw byte string literals do not process any escapes.
They start with the
@@ -693,10 +693,10 @@ r[lex.token.literal.reserved.syntax]
>    | ( BIN_LITERAL | OCT_LITERAL | HEX_LITERAL ) `.` \
>          _(not immediately followed by `.`, `_` or an XID_Start character)_\
>    | ( BIN_LITERAL | OCT_LITERAL ) (`e`|`E`)\
->    | `0b` `_`\* _end of input or not BIN_DIGIT_\
->    | `0o` `_`\* _end of input or not OCT_DIGIT_\
->    | `0x` `_`\* _end of input or not HEX_DIGIT_\
->    | DEC_LITERAL ( . DEC_LITERAL)? (`e`|`E`) (`+`|`-`)? _end of input or not DEC_DIGIT_
+>    | `0b` `_`\* \<end of input or not BIN_DIGIT\>\
+>    | `0o` `_`\* \<end of input or not OCT_DIGIT\>\
+>    | `0x` `_`\* \<end of input or not HEX_DIGIT\>\
+>    | DEC_LITERAL ( . DEC_LITERAL)? (`e`|`E`) (`+`|`-`)? \<end of input or not DEC_DIGIT\>

r[lex.token.literal.reserved.intro]
The following lexical forms similar to number literals are _reserved forms_.

From a547e37e393f3440030657ed82feeb9714f1ea8e Mon Sep 17 00:00:00 2001
From: Eric Huss
Date: Wed, 9 Apr 2025 20:50:57 -0700
Subject: [PATCH 09/38] Normalize suffix capitalization

The grammar almost always uses lowercase, so let's standardize on that.
---
src/identifiers.md | 4 ++--
src/tokens.md | 10 +++++-----
2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/src/identifiers.md b/src/identifiers.md
index 864ab6499..fecb8b96a 100644
--- a/src/identifiers.md
+++ b/src/identifiers.md
@@ -7,9 +7,9 @@ r[ident.syntax]
>       XID_Start XID_Continue\*\
>    | `_` XID_Continue+
>
-> RAW_IDENTIFIER : `r#` IDENTIFIER_OR_KEYWORD *Except `crate`, `self`, `super`, `Self`*
+> RAW_IDENTIFIER : `r#` IDENTIFIER_OR_KEYWORD *except `crate`, `self`, `super`, `Self`*
>
-> NON_KEYWORD_IDENTIFIER : IDENTIFIER_OR_KEYWORD *Except a [strict] or [reserved] keyword*
+> NON_KEYWORD_IDENTIFIER : IDENTIFIER_OR_KEYWORD *except a [strict] or [reserved] keyword*
>
> IDENTIFIER :\
> NON_KEYWORD_IDENTIFIER | RAW_IDENTIFIER

diff --git a/src/tokens.md b/src/tokens.md
index 7ea5748ac..dd8ba6c1a 100644
--- a/src/tokens.md
+++ b/src/tokens.md
@@ -750,7 +750,7 @@ r[lex.token.life.syntax]
>    | RAW_LIFETIME
>
> RAW_LIFETIME :\
->    `'r#` [IDENTIFIER_OR_KEYWORD][identifier] *Except `crate`, `self`, `super`, `Self`*
+>    `'r#` [IDENTIFIER_OR_KEYWORD][identifier] *except `crate`, `self`, `super`, `Self`*
> _(not immediately followed by `'`)_
>
> RESERVED_RAW_LIFETIME : `'r#_`
@@ -849,10 +849,10 @@ r[lex.token.reserved-prefix]

r[lex.token.reserved-prefix.syntax]
> **Lexer 2021+**\
-> RESERVED_TOKEN_DOUBLE_QUOTE : ( IDENTIFIER_OR_KEYWORD _Except `b` or `c` or `r` or `br` or `cr`_ | `_` ) `"`\
-> RESERVED_TOKEN_SINGLE_QUOTE : ( IDENTIFIER_OR_KEYWORD _Except `b`_ | `_` ) `'`\
-> RESERVED_TOKEN_POUND : ( IDENTIFIER_OR_KEYWORD _Except `r` or `br` or `cr`_ | `_` ) `#`\
-> RESERVED_TOKEN_LIFETIME : `'` (IDENTIFIER_OR_KEYWORD _Except `r`_ | _) `#`
+> RESERVED_TOKEN_DOUBLE_QUOTE : ( IDENTIFIER_OR_KEYWORD _except `b` or `c` or `r` or `br` or `cr`_ | `_` ) `"`\
+> RESERVED_TOKEN_SINGLE_QUOTE : ( IDENTIFIER_OR_KEYWORD _except `b`_ | `_` ) `'`\
+> RESERVED_TOKEN_POUND : ( IDENTIFIER_OR_KEYWORD _except `r` or `br` or `cr`_ | `_` ) `#`\
+> RESERVED_TOKEN_LIFETIME : `'` ( IDENTIFIER_OR_KEYWORD _except `r`_ | `_` ) `#`

r[lex.token.reserved-prefix.intro]
Some lexical forms known as _reserved prefixes_ are reserved for future use.

From 13996e634c29ef46b7c55ce93ed3bbaf8b40ebd3 Mon Sep 17 00:00:00 2001
From: Eric Huss
Date: Wed, 9 Apr 2025 20:53:00 -0700
Subject: [PATCH 10/38] Remove parentheses around suffixes

This helps to standardize how suffixes are written. Normally they do
not use parentheses, and visually I don't think they're entirely
necessary.
--- src/tokens.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/src/tokens.md b/src/tokens.md index dd8ba6c1a..bf1e060e6 100644 --- a/src/tokens.md +++ b/src/tokens.md @@ -632,7 +632,7 @@ r[lex.token.literal.float.syntax] > **Lexer**\ > FLOAT_LITERAL :\ >       DEC_LITERAL `.` -> _(not immediately followed by `.`, `_` or an XID_Start character)_\ +> _not immediately followed by `.`, `_` or an XID_Start character_\ >    | DEC_LITERAL `.` DEC_LITERAL SUFFIX_NO_E?\ >    | DEC_LITERAL (`.` DEC_LITERAL)? FLOAT_EXPONENT SUFFIX? > @@ -691,7 +691,7 @@ r[lex.token.literal.reserved.syntax] >       BIN_LITERAL \[`2`-`9`​]\ >    | OCT_LITERAL \[`8`-`9`​]\ >    | ( BIN_LITERAL | OCT_LITERAL | HEX_LITERAL ) `.` \ ->          _(not immediately followed by `.`, `_` or an XID_Start character)_\ +>          _not immediately followed by `.`, `_` or an XID_Start character_\ >    | ( BIN_LITERAL | OCT_LITERAL ) (`e`|`E`)\ >    | `0b` `_`\* \\ >    | `0o` `_`\* \\ @@ -739,22 +739,22 @@ r[lex.token.life.syntax] > **Lexer**\ > LIFETIME_TOKEN :\ >       `'` [IDENTIFIER_OR_KEYWORD][identifier] -> _(not immediately followed by `'`)_\ +> _not immediately followed by `'`_\ >    | `'_` -> _(not immediately followed by `'`)_\ +> _not immediately followed by `'`_\ >    | RAW_LIFETIME > > LIFETIME_OR_LABEL :\ >       `'` [NON_KEYWORD_IDENTIFIER][identifier] -> _(not immediately followed by `'`)_\ +> _not immediately followed by `'`_\ >    | RAW_LIFETIME > > RAW_LIFETIME :\ >    `'r#` [IDENTIFIER_OR_KEYWORD][identifier] *except `crate`, `self`, `super`, `Self`* -> _(not immediately followed by `'`)_ +> _not immediately followed by `'`_ > > RESERVED_RAW_LIFETIME : `'r#_` -> _(not immediately followed by `'`)_ +> _not immediately followed by `'`_ r[lex.token.life.intro] Lifetime parameters and [loop labels] use LIFETIME_OR_LABEL tokens. 
Any

From 35c098a78ed9f3bf92465bb3594b0d9a897244ab Mon Sep 17 00:00:00 2001
From: Eric Huss
Date: Wed, 9 Apr 2025 20:54:07 -0700
Subject: [PATCH 11/38] Fix nonterminals of ConstParam

These two nonterminals were using the wrong name for the productions
for BlockExpression and LiteralExpression.
---
src/items/generics.md | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/items/generics.md b/src/items/generics.md
index 2819980c6..53de44b7e 100644
--- a/src/items/generics.md
+++ b/src/items/generics.md
@@ -17,7 +17,7 @@ r[items.generics.syntax]
>    [IDENTIFIER] ( `:` [_TypeParamBounds_]? )? ( `=` [_Type_] )?
>
> _ConstParam_:\
->    `const` [IDENTIFIER] `:` [_Type_] ( `=` _[Block][block]_ | [IDENTIFIER] | -?[LITERAL] )?
+>    `const` [IDENTIFIER] `:` [_Type_] ( `=` _[BlockExpression][block]_ | [IDENTIFIER] | `-`?[_LiteralExpression_][literal] )?

r[items.generics.syntax.intro]
[Functions], [type aliases], [structs], [enumerations], [unions], [traits], and

From 556df6babb366e873e11ce8b0b5cc27d98aa9d03 Mon Sep 17 00:00:00 2001
From: Eric Huss
Date: Wed, 9 Apr 2025 20:58:56 -0700
Subject: [PATCH 12/38] Rewrite how keywords are listed

This changes the keyword listings so that they are just lists instead
of lexer rules. We never used the named rules, and I don't foresee us
ever doing that. Also, the `IDENTIFIER_OR_KEYWORD` rule meant that we
never needed to explicitly identify these keywords as lexer tokens.

This helps avoid problems when building the grammar graph for missing
connections.
--- src/keywords.md | 125 +++++++++++++++++++++++------------------------- 1 file changed, 59 insertions(+), 66 deletions(-) diff --git a/src/keywords.md b/src/keywords.md index 525a49cd2..1413fd620 100644 --- a/src/keywords.md +++ b/src/keywords.md @@ -24,50 +24,50 @@ be used as the names of: * [Crates] r[lex.keywords.strict.list] -> **Lexer:**\ -> KW_AS : `as`\ -> KW_BREAK : `break`\ -> KW_CONST : `const`\ -> KW_CONTINUE : `continue`\ -> KW_CRATE : `crate`\ -> KW_ELSE : `else`\ -> KW_ENUM : `enum`\ -> KW_EXTERN : `extern`\ -> KW_FALSE : `false`\ -> KW_FN : `fn`\ -> KW_FOR : `for`\ -> KW_IF : `if`\ -> KW_IMPL : `impl`\ -> KW_IN : `in`\ -> KW_LET : `let`\ -> KW_LOOP : `loop`\ -> KW_MATCH : `match`\ -> KW_MOD : `mod`\ -> KW_MOVE : `move`\ -> KW_MUT : `mut`\ -> KW_PUB : `pub`\ -> KW_REF : `ref`\ -> KW_RETURN : `return`\ -> KW_SELFVALUE : `self`\ -> KW_SELFTYPE : `Self`\ -> KW_STATIC : `static`\ -> KW_STRUCT : `struct`\ -> KW_SUPER : `super`\ -> KW_TRAIT : `trait`\ -> KW_TRUE : `true`\ -> KW_TYPE : `type`\ -> KW_UNSAFE : `unsafe`\ -> KW_USE : `use`\ -> KW_WHERE : `where`\ -> KW_WHILE : `while` +The following keywords are in all editions: + +- `as` +- `break` +- `const` +- `continue` +- `crate` +- `else` +- `enum` +- `extern` +- `false` +- `fn` +- `for` +- `if` +- `impl` +- `in` +- `let` +- `loop` +- `match` +- `mod` +- `move` +- `mut` +- `pub` +- `ref` +- `return` +- `self` +- `Self` +- `static` +- `struct` +- `super` +- `trait` +- `true` +- `type` +- `unsafe` +- `use` +- `where` +- `while` r[lex.keywords.strict.edition2018] The following keywords were added beginning in the 2018 edition. -> **Lexer 2018+**\ -> KW_ASYNC : `async`\ -> KW_AWAIT : `await`\ -> KW_DYN : `dyn` +- `async` +- `await` +- `dyn` r[lex.keywords.reserved] ## Reserved keywords @@ -79,31 +79,28 @@ current programs forward compatible with future versions of Rust by forbidding them to use these keywords. 
r[lex.keywords.reserved.list] -> **Lexer**\ -> KW_ABSTRACT : `abstract`\ -> KW_BECOME : `become`\ -> KW_BOX : `box`\ -> KW_DO : `do`\ -> KW_FINAL : `final`\ -> KW_MACRO : `macro`\ -> KW_OVERRIDE : `override`\ -> KW_PRIV : `priv`\ -> KW_TYPEOF : `typeof`\ -> KW_UNSIZED : `unsized`\ -> KW_VIRTUAL : `virtual`\ -> KW_YIELD : `yield` +- `abstract` +- `become` +- `box` +- `do` +- `final` +- `macro` +- `override` +- `priv` +- `typeof` +- `unsized` +- `virtual` +- `yield` r[lex.keywords.reserved.edition2018] The following keywords are reserved beginning in the 2018 edition. -> **Lexer 2018+**\ -> KW_TRY : `try` +- `try` r[lex.keywords.reserved.edition2024] The following keywords are reserved beginning in the 2024 edition. -> **Lexer 2024+**\ -> KW_GEN : `gen` +- `gen` r[lex.keywords.weak] ## Weak keywords @@ -112,15 +109,11 @@ r[lex.keywords.weak.intro] These keywords have special meaning only in certain contexts. For example, it is possible to declare a variable or method with the name `union`. -> **Lexer**\ -> KW_MACRO_RULES : `macro_rules`\ -> KW_UNION : `union`\ -> KW_STATICLIFETIME : `'static`\ -> KW_SAFE : `safe`\ -> KW_RAW : `raw` -> -> **Lexer 2015**\ -> KW_DYN : `dyn` +- `'static` +- `macro_rules` +- `raw` +- `safe` +- `union` r[lex.keywords.weak.macro_rules] * `macro_rules` is used to create custom [macros]. From 963339e9d018bb64f2072aee17babdce9b041d62 Mon Sep 17 00:00:00 2001 From: Eric Huss Date: Wed, 9 Apr 2025 21:00:22 -0700 Subject: [PATCH 13/38] Fix `dyn` edition presentation Per our style, edition differences are supposed to be separated out into an edition block. 
--- src/keywords.md | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/src/keywords.md b/src/keywords.md index 1413fd620..2fd8046f1 100644 --- a/src/keywords.md +++ b/src/keywords.md @@ -131,19 +131,18 @@ r[lex.keywords.weak.lifetime-static] fn invalid_lifetime_parameter<'static>(s: &'static str) -> &'static str { s } ``` -r[lex.keywords.weak.dyn] -* In the 2015 edition, [`dyn`] is a keyword when used in a type position - followed by a path that does not start with `::` or `<`, a lifetime, a question mark, a `for` - keyword or an opening parenthesis. - - Beginning in the 2018 edition, `dyn` has been promoted to a strict keyword. - r[lex.keywords.weak.safe] * `safe` is used for functions and statics, which has meaning in [external blocks]. r[lex.keywords.weak.raw] * `raw` is used for [raw borrow operators], and is only a keyword when matching a raw borrow operator form (such as `&raw const expr` or `&raw mut expr`). +r[lex.keywords.weak.dyn.edition2018] +> [!EDITION-2018] +> In the 2015 edition, [`dyn`] is a keyword when used in a type position followed by a path that does not start with `::` or `<`, a lifetime, a question mark, a `for` keyword or an opening parenthesis. +> +> Beginning in the 2018 edition, `dyn` has been promoted to a strict keyword. + [items]: items.md [Variables]: variables.md [Type parameters]: types/parameters.md From 1c6587084c2db0cc12bf02af27fec5b59ef67fe0 Mon Sep 17 00:00:00 2001 From: Eric Huss Date: Wed, 9 Apr 2025 21:07:19 -0700 Subject: [PATCH 14/38] Add grammar rule for XID_Start and XID_Continue These were defined in prose below, but defining them here allows us to easily refer and link to them. 
--- src/identifiers.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/src/identifiers.md b/src/identifiers.md index fecb8b96a..f5845240b 100644 --- a/src/identifiers.md +++ b/src/identifiers.md @@ -7,6 +7,10 @@ r[ident.syntax] >       XID_Start XID_Continue\*\ >    | `_` XID_Continue+ > +> XID_Start : \<`XID_Start` defined by Unicode\> +> +> XID_Continue : \<`XID_Continue` defined by Unicode\> +> > RAW_IDENTIFIER : `r#` IDENTIFIER_OR_KEYWORD *except `crate`, `self`, `super`, `Self`* > > NON_KEYWORD_IDENTIFIER : IDENTIFIER_OR_KEYWORD *except a [strict] or [reserved] keyword* From 420f4d322ea22ce6ce45afca1e463cc4b7ff6828 Mon Sep 17 00:00:00 2001 From: Eric Huss Date: Wed, 9 Apr 2025 21:11:59 -0700 Subject: [PATCH 15/38] Add a lexer rule for punctuation This is intended to help define what a "token" is via the grammar (and to fill a missing hole in our token definition). I waffled on how to define delimiters, whether they should be separate somehow. In practice I think it should be fine to clump them all together. This mainly only matters for TokenTree which already excludes the delimiters. 
---
src/tokens.md | 58 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 58 insertions(+)

diff --git a/src/tokens.md b/src/tokens.md
index bf1e060e6..8d1612805 100644
--- a/src/tokens.md
+++ b/src/tokens.md
@@ -777,6 +777,64 @@ r[lex.token.life.raw.edition2021]
r[lex.token.punct]
## Punctuation

+r[lex.token.punct.syntax]
+> **Lexer**\
+> PUNCTUATION :\
+>       `=`\
+>    | `<`\
+>    | `<=`\
+>    | `==`\
+>    | `!=`\
+>    | `>=`\
+>    | `>`\
+>    | `&&`\
+>    | `||`\
+>    | `!`\
+>    | `~`\
+>    | `+`\
+>    | `-`\
+>    | `*`\
+>    | `/`\
+>    | `%`\
+>    | `^`\
+>    | `&`\
+>    | `|`\
+>    | `<<`\
+>    | `>>`\
+>    | `+=`\
+>    | `-=`\
+>    | `*=`\
+>    | `/=`\
+>    | `%=`\
+>    | `^=`\
+>    | `&=`\
+>    | `|=`\
+>    | `<<=`\
+>    | `>>=`\
+>    | `@`\
+>    | `.`\
+>    | `..`\
+>    | `...`\
+>    | `..=`\
+>    | `,`\
+>    | `;`\
+>    | `:`\
+>    | `::`\
+>    | `->`\
+>    | `<-`\
+>    | `=>`\
+>    | `#`\
+>    | `$`\
+>    | `?`\
+>    | `_`\
+>    | `{`\
+>    | `}`\
+>    | `[`\
+>    | `]`\
+>    | `(`\
+>    | `)`

r[lex.token.punct.intro]
Punctuation symbol tokens are listed here for completeness. Their individual
usages and meanings are defined in the linked pages.

From a8e1afb492ef832d6df78afb7a02026d23ef6f5e Mon Sep 17 00:00:00 2001
From: Eric Huss
Date: Wed, 9 Apr 2025 21:16:08 -0700
Subject: [PATCH 16/38] Add a grammar rule for reserved tokens

This adds a grammar rule that collects all the reserved token forms
into a single production rule so that we can define what a "token" is
by referring to this.
---
src/tokens.md | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)

diff --git a/src/tokens.md b/src/tokens.md
index 8d1612805..76cfa53a1 100644
--- a/src/tokens.md
+++ b/src/tokens.md
@@ -902,6 +902,26 @@ them are referred to as "token trees" in [macros].
The three types of brackets | `[` `]` | Square brackets | | `(` `)` | Parentheses | +r[lex.token.reserved] +## Reserved tokens + +r[lex.token.reserved.intro] +Several token forms are reserved for future use. It is an error for the source input to match one of these forms. + +r[lex.token.reserved.syntax] + +> **Lexer**\ +> RESERVED_TOKEN :\ +>       RESERVED_GUARDED_STRING_LITERAL\ +>    | RESERVED_NUMBER\ +>    | RESERVED_POUNDS\ +>    | RESERVED_RAW_IDENTIFIER\ +>    | RESERVED_RAW_LIFETIME\ +>    | RESERVED_TOKEN_DOUBLE_QUOTE\ +>    | RESERVED_TOKEN_LIFETIME\ +>    | RESERVED_TOKEN_POUND\ +>    | RESERVED_TOKEN_SINGLE_QUOTE + r[lex.token.reserved-prefix] ## Reserved prefixes From 13996e634c29ef46b7c55ce93ed3bbaf8b40ebd3 Mon Sep 17 00:00:00 2001 From: Eric Huss Date: Wed, 9 Apr 2025 21:18:03 -0700 Subject: [PATCH 17/38] Define the Token rule This defines a Token in the grammar so that we can easily refer to it (and to make it easier to see what all the tokens are). --- src/tokens.md | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/src/tokens.md b/src/tokens.md index 76cfa53a1..d7660c0ec 100644 --- a/src/tokens.md +++ b/src/tokens.md @@ -1,6 +1,25 @@ r[lex.token] # Tokens +r[lex.token.syntax] +> **Lexer**\ +> Token :\ +>       IDENTIFIER_OR_KEYWORD\ +>    | RAW_IDENTIFIER\ +>    | CHAR_LITERAL\ +>    | STRING_LITERAL\ +>    | RAW_STRING_LITERAL\ +>    | BYTE_LITERAL\ +>    | BYTE_STRING_LITERAL\ +>    | RAW_BYTE_STRING_LITERAL\ +>    | C_STRING_LITERAL\ +>    | RAW_C_STRING_LITERAL\ +>    | INTEGER_LITERAL\ +>    | FLOAT_LITERAL\ +>    | LIFETIME_TOKEN\ +>    | PUNCTUATION\ +>    | RESERVED_TOKEN + r[lex.token.intro] Tokens are primitive productions in the grammar defined by regular (non-recursive) languages. 
Rust source input can be broken down From 65febd6b0917d46347a9873d05dd84555441864d Mon Sep 17 00:00:00 2001 From: Eric Huss Date: Wed, 9 Apr 2025 21:22:39 -0700 Subject: [PATCH 18/38] Remove escape rule We no longer represent characters via escape sequences. These can be confused with the literal two bytes of backslash followed by a character. See the "common productions" list for how these are now referred to. --- src/notation.md | 1 - 1 file changed, 1 deletion(-) diff --git a/src/notation.md b/src/notation.md index f15cefa0b..7e915f5e6 100644 --- a/src/notation.md +++ b/src/notation.md @@ -9,7 +9,6 @@ The following notations are used by the *Lexer* and *Syntax* grammar snippets: | CAPITAL | KW_IF, INTEGER_LITERAL | A token produced by the lexer | | _ItalicCamelCase_ | _LetStatement_, _Item_ | A syntactical production | | `string` | `x`, `while`, `*` | The exact character(s) | -| \\x | \\n, \\r, \\t, \\0 | The character represented by this escape | | x? | `pub`? | An optional item | | x\* | _OuterAttribute_\* | 0 or more of x | | x+ | _MacroMatch_+ | 1 or more of x | From 2baaa05c58d664a854b57cfabe787c2c6b1f784b Mon Sep 17 00:00:00 2001 From: Eric Huss Date: Thu, 10 Apr 2025 10:55:13 -0700 Subject: [PATCH 19/38] Introduce a new grammar renderer This adds an extension to mdbook-spec that will parse code-blocks in a BNF-style grammar into a rendered format, in both markdown or as railroad diagrams. 
--- mdbook-spec/Cargo.lock | 16 + mdbook-spec/Cargo.toml | 2 + mdbook-spec/src/grammar.rs | 388 ++++++++++++++++++ mdbook-spec/src/grammar/parser.rs | 442 +++++++++++++++++++++ mdbook-spec/src/grammar/render_markdown.rs | 228 +++++++++++ mdbook-spec/src/grammar/render_railroad.rs | 235 +++++++++++ mdbook-spec/src/lib.rs | 6 + 7 files changed, 1317 insertions(+) create mode 100644 mdbook-spec/src/grammar.rs create mode 100644 mdbook-spec/src/grammar/parser.rs create mode 100644 mdbook-spec/src/grammar/render_markdown.rs create mode 100644 mdbook-spec/src/grammar/render_railroad.rs diff --git a/mdbook-spec/Cargo.lock b/mdbook-spec/Cargo.lock index c983d9842..1dea1df7b 100644 --- a/mdbook-spec/Cargo.lock +++ b/mdbook-spec/Cargo.lock @@ -412,6 +412,7 @@ dependencies = [ "once_cell", "pathdiff", "pulldown-cmark", + "railroad", "regex", "semver", "serde_json", @@ -569,6 +570,15 @@ dependencies = [ "proc-macro2", ] +[[package]] +name = "railroad" +version = "0.3.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0ecedffc46c1b2cb04f4b80e094eae6b3f3f470a9635f1f396dd5206428f6b58" +dependencies = [ + "unicode-width", +] + [[package]] name = "regex" version = "1.11.1" @@ -780,6 +790,12 @@ version = "1.0.14" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "adb9e6ca4f869e1180728b7950e35922a7fc6397f7b641499e8f3ef06e50dc83" +[[package]] +name = "unicode-width" +version = "0.1.14" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7dd6e30e90baa6f72411720665d41d89b9a3d039dc45b8faea1ddd07f617f6af" + [[package]] name = "utf8parse" version = "0.2.2" diff --git a/mdbook-spec/Cargo.toml b/mdbook-spec/Cargo.toml index 4422573a8..8bd02e444 100644 --- a/mdbook-spec/Cargo.toml +++ b/mdbook-spec/Cargo.toml @@ -5,6 +5,7 @@ edition = "2024" license = "MIT OR Apache-2.0" description = "An mdBook preprocessor to help with the Rust specification." 
 repository = "https://github.com/rust-lang/spec/"
+default-run = "mdbook-spec"
 
 # See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
 
@@ -15,6 +16,7 @@ once_cell = "1.19.0"
 pathdiff = "0.2.1"
 # Try to keep in sync with mdbook.
 pulldown-cmark = { version = "0.10.3", default-features = false }
+railroad = { version = "0.3.2", default-features = false }
 regex = "1.9.4"
 semver = "1.0.21"
 serde_json = "1.0.113"
diff --git a/mdbook-spec/src/grammar.rs b/mdbook-spec/src/grammar.rs
new file mode 100644
index 000000000..6901ef461
--- /dev/null
+++ b/mdbook-spec/src/grammar.rs
@@ -0,0 +1,388 @@
+//! Support for rendering the grammar.
+
+use crate::{Diagnostics, warn_or_err};
+use mdbook::book::{Book, BookItem, Chapter};
+use regex::{Captures, Regex};
+use std::collections::{HashMap, HashSet};
+use std::fmt::Write;
+use std::path::PathBuf;
+use std::sync::LazyLock;
+
+mod parser;
+mod render_markdown;
+mod render_railroad;
+
+#[derive(Debug, Default)]
+pub struct Grammar {
+    pub productions: HashMap<String, Production>,
+    /// The order that the production names were discovered.
+    pub name_order: Vec<String>,
+}
+
+#[derive(Debug)]
+pub struct Production {
+    name: String,
+    /// Category is from the markdown lang string, and defines how it is
+    /// grouped and organized on the summary page.
+    category: String,
+    expression: Expression,
+    /// The path to the chapter where this is defined.
+    path: PathBuf,
+}
+
+#[derive(Debug)]
+struct Expression {
+    kind: ExpressionKind,
+    /// Suffix is the `_foo_` part that is shown as a subscript.
+    suffix: Option<String>,
+    /// A footnote is a markdown footnote link.
+    footnote: Option<String>,
+}
+
+#[derive(Debug)]
+enum ExpressionKind {
+    /// `( A B C )`
+    Grouped(Box<Expression>),
+    /// `A | B | C`
+    Alt(Vec<Expression>),
+    /// `A B C`
+    Sequence(Vec<Expression>),
+    /// `A?`
+    Optional(Box<Expression>),
+    /// `A*`
+    Repeat(Box<Expression>),
+    /// `A*?`
+    RepeatNonGreedy(Box<Expression>),
+    /// `A+`
+    RepeatPlus(Box<Expression>),
+    /// `A+?`
+    RepeatPlusNonGreedy(Box<Expression>),
+    /// `A{2..4}`
+    RepeatRange(Box<Expression>, Option<u32>, Option<u32>),
+    /// `NonTerminal`
+    Nt(String),
+    /// `` `string` ``
+    Terminal(String),
+    /// `<prose text>`
+    Prose(String),
+    /// An LF followed by the given number of spaces.
+    ///
+    /// Used by the renderer to help format and structure the grammar.
+    Break(usize),
+    /// ``[`A`-`Z` `_` LF]``
+    Charset(Vec<Characters>),
+    /// ``~[` ` LF]``
+    NegExpression(Box<Expression>),
+    /// `U+0060`
+    Unicode(String),
+}
+
+#[derive(Debug)]
+enum Characters {
+    /// `LF`
+    Named(String),
+    /// `` `_` ``
+    Terminal(String),
+    /// `` `A`-`Z` ``
+    Range(char, char),
+}
+
+impl Grammar {
+    fn visit_nt(&self, callback: &mut dyn FnMut(&str)) {
+        for p in self.productions.values() {
+            p.expression.visit_nt(callback);
+        }
+    }
+}
+
+impl Expression {
+    fn visit_nt(&self, callback: &mut dyn FnMut(&str)) {
+        match &self.kind {
+            ExpressionKind::Grouped(e)
+            | ExpressionKind::Optional(e)
+            | ExpressionKind::Repeat(e)
+            | ExpressionKind::RepeatNonGreedy(e)
+            | ExpressionKind::RepeatPlus(e)
+            | ExpressionKind::RepeatPlusNonGreedy(e)
+            | ExpressionKind::RepeatRange(e, _, _)
+            | ExpressionKind::NegExpression(e) => {
+                e.visit_nt(callback);
+            }
+            ExpressionKind::Alt(es) | ExpressionKind::Sequence(es) => {
+                for e in es {
+                    e.visit_nt(callback);
+                }
+            }
+            ExpressionKind::Nt(nt) => {
+                callback(&nt);
+            }
+            ExpressionKind::Terminal(_)
+            | ExpressionKind::Prose(_)
+            | ExpressionKind::Break(_)
+            | ExpressionKind::Unicode(_) => {}
+            ExpressionKind::Charset(set) => {
+                for ch in set {
+                    match ch {
+                        Characters::Named(s) => callback(s),
+                        Characters::Terminal(_) | Characters::Range(_, _) => {}
+                    }
+                }
+            }
+        }
+    }
+
+    fn is_break(&self) -> bool {
+        matches!(self.kind,
ExpressionKind::Break(_))
+    }
+}
+
+static GRAMMAR_RE: LazyLock<Regex> =
+    LazyLock::new(|| Regex::new(r"(?ms)^```grammar,([^\n]+)\n(.*?)^```").unwrap());
+static NAMES_RE: LazyLock<Regex> =
+    LazyLock::new(|| Regex::new(r"(?m)^([A-Za-z0-9_]+)(?: \([^)]+\))? ->").unwrap());
+
+/// Loads the [`Grammar`] from the book.
+pub fn load_grammar(book: &Book, diag: &mut Diagnostics) -> Grammar {
+    let mut grammar = Grammar::default();
+    for item in book.iter() {
+        let BookItem::Chapter(ch) = item else {
+            continue;
+        };
+        if ch.is_draft_chapter() {
+            continue;
+        }
+        let path = ch.path.as_ref().unwrap().to_owned();
+        for cap in GRAMMAR_RE.captures_iter(&ch.content) {
+            let category = &cap[1];
+            let input = &cap[2];
+            if let Err(e) = parser::parse_grammar(input, &mut grammar, category, &path) {
+                warn_or_err!(diag, "failed to parse grammar in {path:?}: {e}");
+            }
+        }
+    }
+    check_undefined_nt(&grammar, diag);
+    check_unexpected_roots(&grammar, diag);
+    grammar
+}
+
+/// Checks for nonterminals that are used but not defined.
+fn check_undefined_nt(grammar: &Grammar, diag: &mut Diagnostics) {
+    grammar.visit_nt(&mut |nt| {
+        if !grammar.productions.contains_key(nt) {
+            warn_or_err!(diag, "non-terminal `{nt}` is used but not defined");
+        }
+    });
+}
+
+/// This checks that all the grammar roots are what we expect.
+///
+/// This is intended to help catch any unexpected misspellings, orphaned
+/// productions, or general mistakes.
+fn check_unexpected_roots(grammar: &Grammar, diag: &mut Diagnostics) {
+    let mut set: HashSet<_> = grammar.name_order.iter().map(|s| s.as_str()).collect();
+    grammar.visit_nt(&mut |nt| {
+        set.remove(nt);
+    });
+    // TODO: We may want to rethink how some of these are structured.
+ let expected: HashSet<_> = [ + "CfgAttrAttribute", + "CfgAttribute", + "Crate", + "INNER_LINE_DOC", + "LINE_COMMENT", + "MetaListIdents", + "MetaListNameValueStr", + "MetaListPaths", + "MetaWord", + "OUTER_LINE_DOC", + ] + .into_iter() + .collect(); + if set != expected { + let new: Vec<_> = set.symmetric_difference(&expected).collect(); + let removed: Vec<_> = expected.symmetric_difference(&set).collect(); + if !new.is_empty() { + warn_or_err!( + diag, + "New grammar production detected that is not used in any other production.\n\ + If this is expected, add it to the `check_unexpected_roots` function.\n\ + If not, make sure it is spelled correctly and used in another production.\n\ + The new names are: {new:?}\n" + ); + } else if !removed.is_empty() { + warn_or_err!( + diag, + "Old grammar production root seems to have been removed.\n\ + If this is expected, remove it from the `check_unexpected_roots` function.\n\ + The removed names are: {removed:?}\n" + ); + } else { + unreachable!("unexpected"); + } + } +} + +/// Replaces the text grammar in the given chapter with the rendered version. +pub fn insert_grammar(grammar: &Grammar, chapter: &Chapter, diag: &mut Diagnostics) -> String { + let link_map = make_relative_link_map(grammar, chapter); + + let mut content = GRAMMAR_RE + .replace_all(&chapter.content, |cap: &Captures<'_>| { + let names: Vec<_> = NAMES_RE + .captures_iter(&cap[2]) + .map(|cap| cap.get(1).unwrap().as_str()) + .collect(); + let for_lexer = &cap[1] == "lexer"; + render_names(grammar, &names, &link_map, for_lexer, chapter, diag) + }) + .to_string(); + + // Make all production names easily linkable. + let is_summary = is_summary(chapter); + for (name, path) in &link_map { + let id = render_markdown::markdown_id(name, is_summary); + if is_summary { + // On the summary page, link to the production on the summary page. 
+ writeln!(content, "[{name}]: #{id}").unwrap(); + } else { + // This includes two variants, one for convenience (like + // `[ArrayExpression]`), and one with the `grammar-` prefix to + // disambiguate links that have the same name as a rule (rules + // take precedence). + writeln!( + content, + "[{name}]: {path}#{id}\n\ + [grammar-{name}]: {path}#{id}" + ) + .unwrap(); + } + } + content +} + +/// Creates a map of production name -> relative link path. +fn make_relative_link_map(grammar: &Grammar, chapter: &Chapter) -> HashMap { + let current_path = chapter.path.as_ref().unwrap().parent().unwrap(); + grammar + .productions + .values() + .map(|p| { + let relative = pathdiff::diff_paths(&p.path, current_path).unwrap(); + // Adjust paths for Windows. + let relative = relative.display().to_string().replace('\\', "/"); + (p.name.clone(), relative) + }) + .collect() +} + +/// Helper to take a list of production names and to render all of those to a +/// mixture of markdown and HTML. +fn render_names( + grammar: &Grammar, + names: &[&str], + link_map: &HashMap, + for_lexer: bool, + chapter: &Chapter, + diag: &mut Diagnostics, +) -> String { + let for_summary = is_summary(chapter); + let mut output = String::new(); + output.push_str( + "
\n\ + \n", + ); + if for_lexer { + output.push_str("**Lexer**\n"); + } else { + output.push_str("**Syntax**\n"); + } + output.push_str("
\n"); + + // Convert the link map to add the id. + let updated_link_map = |get_id: fn(&str, bool) -> String| -> HashMap { + link_map + .iter() + .map(|(name, path)| { + let id = get_id(name, for_summary); + let path = if for_summary { + format!("#{id}") + } else { + format!("{path}#{id}") + }; + (name.clone(), path) + }) + .collect() + }; + + let markdown_link_map = updated_link_map(render_markdown::markdown_id); + if let Err(e) = grammar.render_markdown(&names, &markdown_link_map, &mut output, for_summary) { + warn_or_err!( + diag, + "grammar failed in chapter {:?}: {e}", + chapter.source_path.as_ref().unwrap() + ); + } + + output.push_str( + "\n\ + \n\ +
\n\ +
\n\ + \n", + ); + + // Modify the link map so that it contains the exact destination needed to + // link to the railroad productions, and to accommodate the summary + // chapter. + let railroad_link_map = updated_link_map(render_railroad::railroad_id); + if let Err(e) = grammar.render_railroad(&names, &railroad_link_map, &mut output, for_summary) { + warn_or_err!( + diag, + "grammar failed in chapter {:?}: {e}", + chapter.source_path.as_ref().unwrap() + ); + } + + output.push_str("
\n");
+
+    output
+}
+
+pub fn is_summary(chapter: &Chapter) -> bool {
+    chapter.name == "Grammar summary"
+}
+
+/// Inserts the summary of all grammar rules into the grammar summary chapter.
+pub fn insert_summary(grammar: &Grammar, chapter: &Chapter, diag: &mut Diagnostics) -> String {
+    let link_map = make_relative_link_map(grammar, chapter);
+    let mut seen = HashSet::new();
+    let categories: Vec<_> = grammar
+        .name_order
+        .iter()
+        .map(|name| &grammar.productions[name].category)
+        .filter(|cat| seen.insert(*cat))
+        .collect();
+    let mut grammar_summary = String::new();
+    for category in categories {
+        let mut chars = category.chars();
+        let cap = chars.next().unwrap().to_uppercase().collect::<String>() + chars.as_str();
+        write!(grammar_summary, "\n## {cap} summary\n\n").unwrap();
+        let names: Vec<_> = grammar
+            .name_order
+            .iter()
+            .filter(|name| grammar.productions[*name].category == *category)
+            .map(|s| s.as_str())
+            .collect();
+        let for_lexer = category == "lexer";
+        let s = render_names(grammar, &names, &link_map, for_lexer, chapter, diag);
+        grammar_summary.push_str(&s);
+    }
+
+    chapter
+        .content
+        .replace("{{ grammar-summary }}", &grammar_summary)
+}
diff --git a/mdbook-spec/src/grammar/parser.rs b/mdbook-spec/src/grammar/parser.rs
new file mode 100644
index 000000000..6197b11de
--- /dev/null
+++ b/mdbook-spec/src/grammar/parser.rs
@@ -0,0 +1,442 @@
+//! A parser of the EBNF-like grammar.
+
+use super::{Characters, Expression, ExpressionKind, Grammar, Production};
+use regex::{Captures, Regex};
+use std::fmt;
+use std::fmt::Display;
+use std::path::Path;
+use std::sync::LazyLock;
+
+struct Parser<'a> {
+    input: &'a str,
+    index: usize,
+}
+
+pub struct Error {
+    message: String,
+    line: String,
+    lineno: usize,
+    col: usize,
+}
+
+impl Display for Error {
+    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::result::Result<(), std::fmt::Error> {
+        let lineno = format!("{}", self.lineno);
+        let space = " ".repeat(lineno.len() + 1);
+        let col = " ".repeat(self.col);
+        let line = &self.line;
+        let message = &self.message;
+        write!(f, "\n{space}|\n{lineno} | {line}\n{space}|{col}^ {message}")
+    }
+}
+
+macro_rules! bail {
+    ($parser:expr, $($arg:tt)*) => {{
+        let mut msg = String::new();
+        fmt::write(&mut msg, format_args!($($arg)*)).unwrap();
+        return Err($parser.error(msg));
+    }};
+}
+
+type Result<T> = std::result::Result<T, Error>;
+
+pub fn parse_grammar(
+    input: &str,
+    grammar: &mut Grammar,
+    category: &str,
+    path: &Path,
+) -> Result<()> {
+    let mut parser = Parser { input, index: 0 };
+    loop {
+        let p = parser.parse_production(category, path)?;
+        grammar.name_order.push(p.name.clone());
+        if let Some(dupe) = grammar.productions.insert(p.name.clone(), p) {
+            bail!(parser, "duplicate production {} in grammar", dupe.name);
+        }
+        parser.take_while(&|ch| ch == '\n');
+        if parser.eof() {
+            break;
+        }
+    }
+    Ok(())
+}
+
+impl Parser<'_> {
+    fn take_while(&mut self, f: &dyn Fn(char) -> bool) -> &str {
+        let mut upper = 0;
+        let i = self.index;
+        let mut ci = self.input[i..].chars();
+        while let Some(ch) = ci.next() {
+            if !f(ch) {
+                break;
+            }
+            upper += ch.len_utf8();
+        }
+        self.index += upper;
+        &self.input[i..i + upper]
+    }
+
+    /// If the input matches the given regex, it is returned and the head is moved forward.
+    ///
+    /// Note that regexes must start with `^`.
+    fn take_re(&mut self, re: &Regex) -> Option<Captures<'_>> {
+        if let Some(cap) = re.captures(&self.input[self.index..]) {
+            self.index += cap[0].len();
+            Some(cap)
+        } else {
+            None
+        }
+    }
+
+    /// Returns whether or not the given string is next, and advances the head if it is.
+    fn take_str(&mut self, s: &str) -> bool {
+        if self.input[self.index..].starts_with(s) {
+            self.index += s.len();
+            true
+        } else {
+            false
+        }
+    }
+
+    /// Returns the next byte, or None if eof.
+    fn peek(&mut self) -> Option<u8> {
+        if self.index >= self.input.len() {
+            None
+        } else {
+            Some(self.input.as_bytes()[self.index])
+        }
+    }
+
+    fn eof(&mut self) -> bool {
+        self.index >= self.input.len()
+    }
+
+    /// Expects the next input to be the given string, and advances the head.
+    fn expect(&mut self, s: &str, err: &str) -> Result<()> {
+        if !self.input[self.index..].starts_with(s) {
+            bail!(self, "{err}");
+        };
+        self.index += s.len();
+        Ok(())
+    }
+
+    fn error(&mut self, message: String) -> Error {
+        let (line, lineno, col) = translate_position(self.input, self.index);
+        Error {
+            message,
+            line: line.to_string(),
+            lineno,
+            col,
+        }
+    }
+
+    /// Advances zero or more spaces.
+    fn space0(&mut self) -> &str {
+        self.take_while(&|ch| ch == ' ')
+    }
+
+    fn parse_production(&mut self, category: &str, path: &Path) -> Result<Production> {
+        let name = self
+            .parse_name()
+            .ok_or_else(|| self.error("expected production name".to_string()))?;
+        self.expect(" ->", "expected -> arrow")?;
+        let Some(expression) = self.parse_expression()?
else {
+            bail!(self, "expected an expression");
+        };
+        Ok(Production {
+            name,
+            category: category.to_string(),
+            expression,
+            path: path.to_owned(),
+        })
+    }
+
+    fn parse_name(&mut self) -> Option<String> {
+        let name = self.take_while(&|c: char| c.is_alphanumeric() || c == '_');
+        if name.is_empty() {
+            None
+        } else {
+            Some(name.to_string())
+        }
+    }
+
+    fn parse_expression(&mut self) -> Result<Option<Expression>> {
+        static ALT_RE: LazyLock<Regex> = LazyLock::new(|| Regex::new(r"^ *\| *").unwrap());
+
+        let mut es = Vec::new();
+        loop {
+            let Some(e) = self.parse_seq()? else { break };
+            es.push(e);
+            if self.take_re(&ALT_RE).is_none() {
+                break;
+            }
+        }
+        match es.len() {
+            0 => Ok(None),
+            1 => Ok(Some(es.pop().unwrap())),
+            _ => Ok(Some(Expression {
+                kind: ExpressionKind::Alt(es),
+                suffix: None,
+                footnote: None,
+            })),
+        }
+    }
+
+    fn parse_seq(&mut self) -> Result<Option<Expression>> {
+        let mut es = Vec::new();
+        loop {
+            self.space0();
+            let Some(e) = self.parse_expr1()? else {
+                break;
+            };
+            es.push(e);
+        }
+        match es.len() {
+            0 => Ok(None),
+            1 => Ok(Some(es.pop().unwrap())),
+            _ => Ok(Some(Expression {
+                kind: ExpressionKind::Sequence(es),
+                suffix: None,
+                footnote: None,
+            })),
+        }
+    }
+
+    fn parse_expr1(&mut self) -> Result<Option<Expression>> {
+        let Some(next) = self.peek() else {
+            return Ok(None);
+        };
+
+        let mut kind = if self.take_str("U+") {
+            self.parse_unicode()?
+        } else if self.input[self.index..]
+            .chars()
+            .next()
+            .map(|ch| ch.is_alphanumeric())
+            .unwrap_or(false)
+        {
+            self.parse_nonterminal()
+                .expect("first char already checked")
+        } else if self.take_str("\n") {
+            if self.eof() || self.take_str("\n") {
+                return Ok(None);
+            }
+            let space = self.take_while(&|ch| ch == ' ');
+            if space.len() == 0 {
+                bail!(self, "expected indentation on next line");
+            }
+            ExpressionKind::Break(space.len())
+        } else if next == b'`' {
+            self.parse_terminal()?
+        } else if next == b'[' {
+            self.parse_charset()?
+        } else if next == b'<' {
+            self.parse_prose()?
+        } else if next == b'(' {
+            self.parse_grouped()?
+        } else if next == b'~' {
+            self.parse_neg_expression()?
+        } else {
+            return Ok(None);
+        };
+
+        static REPEAT_RE: LazyLock<Regex> =
+            LazyLock::new(|| Regex::new(r"^ ?(\*\?|\+\?|\?|\*|\+)").unwrap());
+        static RANGE_RE: LazyLock<Regex> =
+            LazyLock::new(|| Regex::new(r"^\{([0-9]+)?\.\.([0-9]+)?\}").unwrap());
+        if let Some(cap) = self.take_re(&REPEAT_RE) {
+            kind = match &cap[1] {
+                "?" => ExpressionKind::Optional(box_kind(kind)),
+                "*" => ExpressionKind::Repeat(box_kind(kind)),
+                "*?" => ExpressionKind::RepeatNonGreedy(box_kind(kind)),
+                "+" => ExpressionKind::RepeatPlus(box_kind(kind)),
+                "+?" => ExpressionKind::RepeatPlusNonGreedy(box_kind(kind)),
+                s => panic!("unexpected `{s}`"),
+            };
+        } else if let Some(cap) = self.take_re(&RANGE_RE) {
+            let a = cap.get(1).map(|m| m.as_str().parse::<u32>().unwrap());
+            let b = cap.get(2).map(|m| m.as_str().parse::<u32>().unwrap());
+            kind = ExpressionKind::RepeatRange(box_kind(kind), a, b);
+        }
+
+        let suffix = self.parse_suffix()?;
+        let footnote = self.parse_footnote()?;
+
+        Ok(Some(Expression {
+            kind,
+            suffix,
+            footnote,
+        }))
+    }
+
+    fn parse_nonterminal(&mut self) -> Option<ExpressionKind> {
+        let nt = self.parse_name()?;
+        Some(ExpressionKind::Nt(nt))
+    }
+
+    fn parse_terminal(&mut self) -> Result<ExpressionKind> {
+        static TERMINAL_RE: LazyLock<Regex> =
+            LazyLock::new(|| Regex::new(r"^`([^`\n]+)`").unwrap());
+        match self.take_re(&TERMINAL_RE) {
+            Some(cap) => Ok(ExpressionKind::Terminal(cap[1].to_string())),
+            None => bail!(self, "unterminated terminal, expected closing backtick"),
+        }
+    }
+
+    fn parse_charset(&mut self) -> Result<ExpressionKind> {
+        self.expect("[", "expected opening [")?;
+        let mut characters = Vec::new();
+        loop {
+            self.space0();
+            let Some(ch) = self.parse_characters() else {
+                break;
+            };
+            characters.push(ch);
+        }
+        if characters.is_empty() {
+            bail!(self, "expected at least one character in character group");
+        }
+        self.space0();
+        self.expect("]", "expected closing ]")?;
+        Ok(ExpressionKind::Charset(characters))
+    }
+
+    fn parse_characters(&mut self) -> Option<Characters> {
+        
static RANGE_RE: LazyLock<Regex> = LazyLock::new(|| Regex::new(r"^`(.)`-`(.)`").unwrap());
+        static TERMINAL_RE: LazyLock<Regex> = LazyLock::new(|| Regex::new("^`([^`\n]+)`").unwrap());
+        if let Some(cap) = self.take_re(&RANGE_RE) {
+            let a = cap[1].chars().next().unwrap();
+            let b = cap[2].chars().next().unwrap();
+            Some(Characters::Range(a, b))
+        } else if let Some(cap) = self.take_re(&TERMINAL_RE) {
+            Some(Characters::Terminal(cap[1].to_string()))
+        } else {
+            let name = self.parse_name()?;
+            Some(Characters::Named(name))
+        }
+    }
+
+    fn parse_prose(&mut self) -> Result<ExpressionKind> {
+        static PROSE_RE: LazyLock<Regex> = LazyLock::new(|| Regex::new(r"^<([^>\n]+)>").unwrap());
+        match self.take_re(&PROSE_RE) {
+            Some(cap) => Ok(ExpressionKind::Prose(cap[1].to_string())),
+            None => bail!(self, "unterminated prose, expected closing `>`"),
+        }
+    }
+
+    fn parse_grouped(&mut self) -> Result<ExpressionKind> {
+        self.expect("(", "expected opening `(`")?;
+        self.space0();
+        let Some(e) = self.parse_expression()? else {
+            bail!(self, "expected expression in parenthesized group");
+        };
+        self.space0();
+        self.expect(")", "expected closing `)`")?;
+        Ok(ExpressionKind::Grouped(Box::new(e)))
+    }
+
+    fn parse_neg_expression(&mut self) -> Result<ExpressionKind> {
+        self.expect("~", "expected ~")?;
+        let Some(next) = self.peek() else {
+            bail!(self, "expected expression after ~");
+        };
+        let kind = match next {
+            b'[' => self.parse_charset()?,
+            b'`' => self.parse_terminal()?,
+            _ => self.parse_nonterminal().ok_or_else(|| {
+                self.error("expected a charset, terminal, or name after ~ negation".to_string())
+            })?,
+        };
+        Ok(ExpressionKind::NegExpression(box_kind(kind)))
+    }
+
+    fn parse_unicode(&mut self) -> Result<ExpressionKind> {
+        static UNICODE_RE: LazyLock<Regex> = LazyLock::new(|| Regex::new(r"^[A-Z0-9]{4}").unwrap());
+
+        match self.take_re(&UNICODE_RE) {
+            Some(s) => Ok(ExpressionKind::Unicode(s[0].to_string())),
+            None => bail!(self, "expected 4 hexadecimal uppercase digits after U+"),
+        }
+    }
+
+    fn parse_suffix(&mut self) -> Result<Option<String>> {
+        if !self.take_str(" 
_") {
+            return Ok(None);
+        }
+        let mut in_backtick = false;
+        let start = self.index;
+        loop {
+            let Some(next) = self.peek() else {
+                bail!(self, "failed to find end of _ suffixed text");
+            };
+            self.index += 1;
+            match next {
+                b'\n' => bail!(self, "failed to find end of _ suffixed text"),
+                b'`' => in_backtick = !in_backtick,
+                b'_' if !in_backtick => {
+                    if self
+                        .peek()
+                        .map(|b| matches!(b, b'\n' | b' '))
+                        .unwrap_or(true)
+                    {
+                        break;
+                    }
+                }
+                _ => {}
+            }
+        }
+        Ok(Some(self.input[start..self.index - 1].to_string()))
+    }
+
+    fn parse_footnote(&mut self) -> Result<Option<String>> {
+        static FOOTNOTE_RE: LazyLock<Regex> =
+            LazyLock::new(|| Regex::new(r"^([^\]\n]+)]").unwrap());
+        if !self.take_str("[^") {
+            return Ok(None);
+        }
+        match self.take_re(&FOOTNOTE_RE) {
+            Some(cap) => Ok(Some(cap[1].to_string())),
+            None => bail!(self, "unterminated footnote, expected closing `]`"),
+        }
+    }
+}
+
+fn box_kind(kind: ExpressionKind) -> Box<Expression> {
+    Box::new(Expression {
+        kind,
+        suffix: None,
+        footnote: None,
+    })
+}
+
+/// Helper to translate a byte index to a `(line, line_no, col_no)` (1-based).
+fn translate_position(input: &str, index: usize) -> (&str, usize, usize) { + if input.is_empty() { + return ("", 0, 0); + } + let index = index.min(input.len()); + + let mut line_start = 0; + let mut line_number = 0; + for line in input.lines() { + let line_end = line_start + line.len(); + if index >= line_start && index <= line_end { + let column_number = index - line_start + 1; + return (line, line_number + 1, column_number); + } + line_start = line_end + 1; + line_number += 1; + } + ("", line_number + 1, 0) +} + +#[test] +fn translate_tests() { + assert_eq!(translate_position("", 0), ("", 0, 0)); + assert_eq!(translate_position("test", 0), ("test", 1, 1)); + assert_eq!(translate_position("test", 3), ("test", 1, 4)); + assert_eq!(translate_position("test", 4), ("test", 1, 5)); + assert_eq!(translate_position("test\ntest2", 4), ("test", 1, 5)); + assert_eq!(translate_position("test\ntest2", 5), ("test2", 2, 1)); + assert_eq!(translate_position("test\ntest2\n", 11), ("", 3, 0)); +} diff --git a/mdbook-spec/src/grammar/render_markdown.rs b/mdbook-spec/src/grammar/render_markdown.rs new file mode 100644 index 000000000..f8964c2dd --- /dev/null +++ b/mdbook-spec/src/grammar/render_markdown.rs @@ -0,0 +1,228 @@ +//! Renders the grammar to markdown. 
+ +use super::{Characters, Expression, ExpressionKind, Production}; +use crate::grammar::Grammar; +use anyhow::bail; +use regex::Regex; +use std::borrow::Cow; +use std::collections::HashMap; +use std::fmt::Write; +use std::sync::LazyLock; + +impl Grammar { + pub fn render_markdown( + &self, + names: &[&str], + link_map: &HashMap, + output: &mut String, + for_summary: bool, + ) -> anyhow::Result<()> { + let mut iter = names.into_iter().peekable(); + while let Some(name) = iter.next() { + let prod = match self.productions.get(*name) { + Some(p) => p, + None => bail!("could not find grammar production named `{name}`"), + }; + prod.render_markdown(link_map, output, for_summary); + if iter.peek().is_some() { + output.push_str("\n"); + } + } + Ok(()) + } +} + +/// The HTML id for the production. +pub fn markdown_id(name: &str, for_summary: bool) -> String { + if for_summary { + format!("grammar-summary-{}", name) + } else { + format!("grammar-{}", name) + } +} + +impl Production { + fn render_markdown( + &self, + link_map: &HashMap, + output: &mut String, + for_summary: bool, + ) { + write!( + output, + "{name} → ", + id = markdown_id(&self.name, for_summary), + name = self.name + ) + .unwrap(); + self.expression.render_markdown(link_map, output); + output.push('\n'); + } +} + +impl Expression { + /// Returns the last [`ExpressionKind`] of this expression. 
+ fn last(&self) -> &ExpressionKind { + match &self.kind { + ExpressionKind::Alt(es) | ExpressionKind::Sequence(es) => es.last().unwrap().last(), + ExpressionKind::Grouped(_) + | ExpressionKind::Optional(_) + | ExpressionKind::Repeat(_) + | ExpressionKind::RepeatNonGreedy(_) + | ExpressionKind::RepeatPlus(_) + | ExpressionKind::RepeatPlusNonGreedy(_) + | ExpressionKind::RepeatRange(_, _, _) + | ExpressionKind::Nt(_) + | ExpressionKind::Terminal(_) + | ExpressionKind::Prose(_) + | ExpressionKind::Break(_) + | ExpressionKind::Charset(_) + | ExpressionKind::NegExpression(_) + | ExpressionKind::Unicode(_) => &self.kind, + } + } + + fn render_markdown(&self, link_map: &HashMap, output: &mut String) { + match &self.kind { + ExpressionKind::Grouped(e) => { + output.push_str("( "); + e.render_markdown(link_map, output); + if !matches!(e.last(), ExpressionKind::Break(_)) { + output.push(' '); + } + output.push(')'); + } + ExpressionKind::Alt(es) => { + let mut iter = es.iter().peekable(); + while let Some(e) = iter.next() { + e.render_markdown(link_map, output); + if iter.peek().is_some() { + if !matches!(e.last(), ExpressionKind::Break(_)) { + output.push(' '); + } + output.push_str("| "); + } + } + } + ExpressionKind::Sequence(es) => { + let mut iter = es.iter().peekable(); + while let Some(e) = iter.next() { + e.render_markdown(link_map, output); + if iter.peek().is_some() && !matches!(e.last(), ExpressionKind::Break(_)) { + output.push(' '); + } + } + } + ExpressionKind::Optional(e) => { + e.render_markdown(link_map, output); + output.push_str("?"); + } + ExpressionKind::Repeat(e) => { + e.render_markdown(link_map, output); + output.push_str("\\*"); + } + ExpressionKind::RepeatNonGreedy(e) => { + e.render_markdown(link_map, output); + output.push_str("\\* (non-greedy)"); + } + ExpressionKind::RepeatPlus(e) => { + e.render_markdown(link_map, output); + output.push_str("+"); + } + ExpressionKind::RepeatPlusNonGreedy(e) => { + e.render_markdown(link_map, output); + 
output.push_str("+ (non-greedy)"); + } + ExpressionKind::RepeatRange(e, a, b) => { + e.render_markdown(link_map, output); + write!( + output, + "{}..{}", + a.map(|v| v.to_string()).unwrap_or_default(), + b.map(|v| v.to_string()).unwrap_or_default(), + ) + .unwrap(); + } + ExpressionKind::Nt(nt) => { + let dest = link_map.get(nt).map_or("missing", |d| d.as_str()); + write!(output, "[{nt}]({dest})").unwrap(); + } + ExpressionKind::Terminal(t) => { + write!( + output, + "{}", + markdown_escape(t) + ) + .unwrap(); + } + ExpressionKind::Prose(s) => { + write!(output, "\\<{s}\\>").unwrap(); + } + ExpressionKind::Break(indent) => { + output.push_str("\\\n"); + output.push_str(&" ".repeat(*indent)); + } + ExpressionKind::Charset(set) => charset_render_markdown(set, link_map, output), + ExpressionKind::NegExpression(e) => { + output.push('~'); + e.render_markdown(link_map, output); + } + ExpressionKind::Unicode(s) => { + output.push_str("U+"); + output.push_str(s); + } + } + if let Some(suffix) = &self.suffix { + write!(output, "{suffix}").unwrap(); + } + if let Some(footnote) = &self.footnote { + // The ZeroWidthSpace is to avoid conflicts with markdown link references. 
+ write!(output, "​[^{footnote}]").unwrap(); + } + } +} + +fn charset_render_markdown( + set: &[Characters], + link_map: &HashMap, + output: &mut String, +) { + output.push_str("\\["); + let mut iter = set.iter().peekable(); + while let Some(chars) = iter.next() { + chars.render_markdown(link_map, output); + if iter.peek().is_some() { + output.push(' '); + } + } + output.push(']'); +} + +impl Characters { + fn render_markdown(&self, link_map: &HashMap, output: &mut String) { + match self { + Characters::Named(s) => { + let dest = link_map.get(s).map_or("missing", |d| d.as_str()); + write!(output, "[{s}]({dest})").unwrap(); + } + Characters::Terminal(s) => write!( + output, + "{}", + markdown_escape(s) + ) + .unwrap(), + Characters::Range(a, b) => write!( + output, + "{a}\ + -{b}" + ) + .unwrap(), + } + } +} + +/// Escapes characters that markdown would otherwise interpret. +fn markdown_escape(s: &str) -> Cow<'_, str> { + static ESC_RE: LazyLock = LazyLock::new(|| Regex::new(r#"[\\`_*\[\](){}'"]"#).unwrap()); + ESC_RE.replace_all(s, r"\$0") +} diff --git a/mdbook-spec/src/grammar/render_railroad.rs b/mdbook-spec/src/grammar/render_railroad.rs new file mode 100644 index 000000000..1255959c6 --- /dev/null +++ b/mdbook-spec/src/grammar/render_railroad.rs @@ -0,0 +1,235 @@ +//! Converts a [`Grammar`] to an SVG railroad diagram. 
+
+use super::{Characters, Expression, ExpressionKind, Production};
+use crate::grammar::Grammar;
+use anyhow::bail;
+use railroad::*;
+use regex::Regex;
+use std::collections::HashMap;
+use std::fmt::Write;
+use std::sync::LazyLock;
+
+impl Grammar {
+    pub fn render_railroad(
+        &self,
+        names: &[&str],
+        link_map: &HashMap<String, String>,
+        output: &mut String,
+        for_summary: bool,
+    ) -> anyhow::Result<()> {
+        for name in names {
+            let prod = match self.productions.get(*name) {
+                Some(p) => p,
+                None => bail!("could not find grammar production named `{name}`"),
+            };
+            prod.render_railroad(link_map, output, for_summary);
+        }
+        Ok(())
+    }
+}
+
+/// The HTML id for the production.
+pub fn railroad_id(name: &str, for_summary: bool) -> String {
+    if for_summary {
+        format!("railroad-summary-{}", name)
+    } else {
+        format!("railroad-{}", name)
+    }
+}
+
+impl Production {
+    fn render_railroad(
+        &self,
+        link_map: &HashMap<String, String>,
+        output: &mut String,
+        for_summary: bool,
+    ) {
+        let mut dia = self.make_diagram(false, link_map);
+        // If the diagram is very wide, try stacking it to reduce the width.
+        // This 900 is somewhat arbitrary based on looking at productions that
+        // looked too squished. If your diagram is still too squished,
+        // consider adding more rules to shorten it.
+        if dia.width() > 900 {
+            dia = self.make_diagram(true, link_map);
+        }
+        writeln!(
+            output,
+            "<div style=\"max-width: {width}px\" class=\"railroad-production\" id=\"{id}\">\n\
+             {dia}\n\
+             </div>",
+            width = dia.width(),
+            id = railroad_id(&self.name, for_summary),
+        )
+        .unwrap();
+    }
+
+    fn make_diagram(
+        &self,
+        stack: bool,
+        link_map: &HashMap<String, String>,
+    ) -> Diagram<Box<dyn Node>> {
+        let n = self.expression.render_railroad(stack, link_map);
+        let seq: Sequence<Box<dyn Node>> =
+            Sequence::new(vec![Box::new(SimpleStart), n.unwrap(), Box::new(SimpleEnd)]);
+        let vert = VerticalGrid::<Box<dyn Node>>::new(vec![
+            Box::new(Comment::new(self.name.clone())),
+            Box::new(seq),
+        ]);
+
+        Diagram::new(Box::new(vert))
+    }
+}
+
+impl Expression {
+    fn render_railroad(
+        &self,
+        stack: bool,
+        link_map: &HashMap<String, String>,
+    ) -> Option<Box<dyn Node>> {
+        let n: Box<dyn Node> = match &self.kind {
+            ExpressionKind::Grouped(e) => {
+                // I don't think this needs anything special. The grouped
+                // expression is usually an Alt or Optional or something like
+                // that which ends up as a distinct railroad node. But I'm not
+                // sure.
+                e.render_railroad(stack, link_map)?
+            }
+            ExpressionKind::Alt(es) => {
+                let choices: Vec<_> = es
+                    .iter()
+                    .map(|e| e.render_railroad(stack, link_map))
+                    .filter_map(|n| n)
+                    .collect();
+                Box::new(Choice::<Box<dyn Node>>::new(choices))
+            }
+            ExpressionKind::Sequence(es) => {
+                let make_seq = |es: &[Expression]| {
+                    let seq: Vec<_> = es
+                        .iter()
+                        .map(|e| e.render_railroad(stack, link_map))
+                        .filter_map(|n| n)
+                        .collect();
+                    let seq: Sequence<Box<dyn Node>> = Sequence::new(seq);
+                    Box::new(seq)
+                };
+
+                // If `stack` is true, split the sequence on Breaks and stack them vertically.
+                if stack {
+                    // First, trim a Break from the front and back.
+                    let es = if matches!(es.first(), Some(e) if e.is_break()) {
+                        &es[1..]
+                    } else {
+                        &es[..]
+                    };
+                    let es = if matches!(es.last(), Some(e) if e.is_break()) {
+                        &es[..es.len() - 1]
+                    } else {
+                        &es[..]
+                    };
+
+                    let mut breaks: Vec<_> =
+                        es.split(|e| e.is_break()).map(|es| make_seq(es)).collect();
+                    // If there aren't any breaks, don't bother stacking.
+                    if breaks.len() == 1 {
+                        breaks.pop().unwrap()
+                    } else {
+                        Box::new(Stack::new(breaks))
+                    }
+                } else {
+                    make_seq(es)
+                }
+            }
+            ExpressionKind::Optional(e) => {
+                let n = e.render_railroad(stack, link_map)?;
+                Box::new(Optional::new(n))
+            }
+            ExpressionKind::Repeat(e) => {
+                let n = e.render_railroad(stack, link_map)?;
+                Box::new(Repeat::new(railroad::Empty, n))
+            }
+            ExpressionKind::RepeatNonGreedy(e) => {
+                let n = e.render_railroad(stack, link_map)?;
+                let r = Box::new(Repeat::new(railroad::Empty, n));
+                let lbox = LabeledBox::new(r, Comment::new("non-greedy".to_string()));
+                Box::new(lbox)
+            }
+            ExpressionKind::RepeatPlus(e) => {
+                let n = e.render_railroad(stack, link_map)?;
+                Box::new(Repeat::new(n, railroad::Empty))
+            }
+            ExpressionKind::RepeatPlusNonGreedy(e) => {
+                let n = e.render_railroad(stack, link_map)?;
+                let r = Repeat::new(n, railroad::Empty);
+                let lbox = LabeledBox::new(r, Comment::new("non-greedy".to_string()));
+                Box::new(lbox)
+            }
+            ExpressionKind::RepeatRange(e, a, b) => {
+                let n = e.render_railroad(stack, link_map)?;
+                let cmt = match (a, b) {
+                    (Some(a), Some(b)) => format!("repeat between {a} and {b} times"),
+                    (None, Some(b)) => format!("repeat at most {b} times"),
+                    (Some(a), None) => format!("repeat at least {a} times"),
+                    (None, None) => panic!("infinite repeat should use *"),
+                };
+                let r = Repeat::new(n, Comment::new(cmt));
+                Box::new(r)
+            }
+            ExpressionKind::Nt(nt) => node_for_nt(link_map, nt),
+            ExpressionKind::Terminal(t) => Box::new(Terminal::new(t.clone())),
+            ExpressionKind::Prose(s) => Box::new(Terminal::new(s.clone())),
+            ExpressionKind::Break(_) => return None,
+            ExpressionKind::Charset(set) => {
+                let ns: Vec<_> = set.iter().map(|c| c.render_railroad(link_map)).collect();
+                Box::new(Choice::<Box<dyn Node>>::new(ns))
+            }
+            ExpressionKind::NegExpression(e) => {
+                let n = e.render_railroad(stack, link_map)?;
+                let lbox = LabeledBox::new(n, Comment::new("any character except".to_string()));
+                Box::new(lbox)
+            }
+            ExpressionKind::Unicode(s) => Box::new(Terminal::new(format!("U+{}", s))),
+        };
+        if let Some(suffix) = &self.suffix {
+            let suffix = strip_markdown(suffix);
+            let lbox = LabeledBox::new(n, Comment::new(suffix));
+            return Some(Box::new(lbox));
+        }
+        // Note: Footnotes aren't supported. They could be added as a comment
+        // on a vertical stack or a LabeledBox or something like that, but I
+        // don't feel like bothering.
+        Some(n)
+    }
+}
+
+impl Characters {
+    fn render_railroad(&self, link_map: &HashMap<String, String>) -> Box<dyn Node> {
+        match self {
+            Characters::Named(s) => node_for_nt(link_map, s),
+            Characters::Terminal(s) => Box::new(Terminal::new(s.clone())),
+            Characters::Range(a, b) => Box::new(Terminal::new(format!("{a}-{b}"))),
+        }
+    }
+}
+
+fn node_for_nt(link_map: &HashMap<String, String>, name: &str) -> Box<dyn Node> {
+    let dest = link_map
+        .get(name)
+        .map(|path| path.to_string())
+        .unwrap_or_else(|| format!("missing"));
+    let n = NonTerminal::new(name.to_string());
+    Box::new(Link::new(n, dest))
+}
+
+/// Removes some markdown so it can be rendered as text.
+fn strip_markdown(s: &str) -> String {
+    // Right now this just removes markdown linkifiers, but more can be added if needed.
+    static LINK_RE: LazyLock<Regex> =
+        LazyLock::new(|| Regex::new(r"(?s)\[([^\]]+)\](?:\[[^\]]*\]|\([^)]*\))?").unwrap());
+    LINK_RE.replace_all(s, "$1").to_string()
+}
diff --git a/mdbook-spec/src/lib.rs b/mdbook-spec/src/lib.rs
index a36e441d3..f26c98ccc 100644
--- a/mdbook-spec/src/lib.rs
+++ b/mdbook-spec/src/lib.rs
@@ -14,6 +14,7 @@ use std::io;
 use std::ops::Range;
 use std::path::PathBuf;
 
+pub mod grammar;
 mod rules;
 mod std_links;
 mod test_links;
@@ -268,6 +269,7 @@ impl Preprocessor for Spec {
         if diag.deny_warnings && self.rust_root.is_none() {
             bail!("error: SPEC_RUST_ROOT environment variable must be set");
         }
+        let grammar = grammar::load_grammar(&book, &mut diag);
         let rules = self.collect_rules(&book, &mut diag);
         let tests = self.collect_tests(&rules);
         let summary_table = test_links::make_summary_table(&book, &tests, &rules);
@@ -293,6 +295,10 @@ impl Preprocessor for Spec {
             if ch.name == "Test summary" {
                 ch.content = ch.content.replace("{{summary-table}}", &summary_table);
             }
+            if grammar::is_summary(ch) {
+                ch.content = grammar::insert_summary(&grammar, &ch, &mut diag);
+            }
+            ch.content = grammar::insert_grammar(&grammar, &ch, &mut diag);
         });
 
         // Final pass will resolve everything as a std link (or error if the

From 216bd246a7c589b3bf53191e1783a9d7ad84f22a Mon Sep 17 00:00:00 2001
From: Eric Huss
Date: Thu, 10 Apr 2025 10:56:09 -0700
Subject: [PATCH 20/38] Add the javascript hooks for handling the new railroad
 grammar

This adds the hooks to toggle the visibility of the railroad grammar.
The status is stored in localstorage to keep it sticky.
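The sticky-toggle pattern this commit describes can be sketched in isolation. This is a hypothetical, DOM-free version for illustration only — `getFlag`, `toggleFlag`, and the `storage` parameter are not part of the shipped `theme/reference.js`; `storage` stands in for `window.localStorage`:

```javascript
// Sticky boolean flag backed by a localStorage-like object.
// Reads and writes are wrapped in try/catch because storage access
// can throw (e.g. in private-browsing modes); failures simply make
// the toggle non-sticky rather than breaking the page.
function getFlag(storage, key) {
  let value = null;
  try {
    value = storage.getItem(key);
  } catch (e) {
    // Treat unavailable storage as "flag not set".
  }
  return value === 'true';
}

function toggleFlag(storage, key) {
  const next = !getFlag(storage, key);
  try {
    storage.setItem(key, String(next));
  } catch (e) {
    // Ignore write failures; the new value still applies to this view.
  }
  return next;
}
```

Note that the flag is stored as the string `'true'`/`'false'` and compared as a string, since localStorage only holds strings.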
--- theme/reference.js | 49 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 49 insertions(+) diff --git a/theme/reference.js b/theme/reference.js index 44a237034..b520de63a 100644 --- a/theme/reference.js +++ b/theme/reference.js @@ -22,3 +22,52 @@ function spec_toggle_tests(rule_id) { el.classList.remove('popup-hidden'); } } + +function toggle_grammar() { + const grammarRailroad = get_railroad(); + set_railroad(!grammarRailroad); + update_railroad(); +} + +function get_railroad() { + let grammarRailroad = null; + try { + grammarRailroad = localStorage.getItem('grammar-railroad'); + } catch (e) { + // Ignore error. + } + grammarRailroad = grammarRailroad === 'true' ? true : false; + return grammarRailroad; +} + +function set_railroad(newValue) { + try { + localStorage.setItem('grammar-railroad', newValue); + } catch (e) { + // Ignore error. + } +} + +function update_railroad() { + const grammarRailroad = get_railroad(); + const railroads = document.querySelectorAll('.grammar-railroad'); + railroads.forEach(element => { + if (grammarRailroad) { + element.classList.remove('grammar-hidden'); + } else { + element.classList.add('grammar-hidden'); + } + }); + const buttons = document.querySelectorAll('.grammar-toggle'); + buttons.forEach(button => { + if (grammarRailroad) { + button.innerText = "Hide Railroad"; + } else { + button.innerText = "Show Railroad"; + } + }); +} + +(function railroad_onload() { + update_railroad(); +})(); From ab8d215fe44d143b09c2d1da745e1a4a0bcc73a1 Mon Sep 17 00:00:00 2001 From: Eric Huss Date: Thu, 10 Apr 2025 10:56:28 -0700 Subject: [PATCH 21/38] Add styling for the new grammar and railroad diagrams --- theme/reference.css | 154 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 154 insertions(+) diff --git a/theme/reference.css b/theme/reference.css index 4dbd0a6e3..12f6c557a 100644 --- a/theme/reference.css +++ b/theme/reference.css @@ -551,3 +551,157 @@ main > .rule { z-index: 1000; padding: 1rem; } + +/* The box 
that contains the grammar. */ +.grammar-container { + font-family: var(--mono-font); + /* Enable absolute positioning for the target chevron. */ + position: relative; + background-color: var(--quote-bg); + border-block-start: .1em solid var(--quote-border); + border-block-end: .1em solid var(--quote-border); + margin-top: 4px; + margin-bottom: 4px; + padding: 0 20px; +} + +/* English words inside the grammar. */ +.grammar-text { + font-family: "Open Sans", sans-serif; +} + +/* Places a box around literals to differentiate from other grammar punctuation like | and ( . */ +.grammar-literal { + font-family: var(--mono-font); + border-radius: 4px; + border: solid 1px var(--theme-popup-border); + font-weight: bold; + font-size: var(--code-font-size); + padding: 1px 4px; + color: var(--inline-code-color); +} + +.light .grammar-literal { + background-color: #fafafa +} +.rust .grammar-literal { + background-color: #dedede +} +.coal .grammar-literal { + background-color: #1d1f21; +} +.navy .grammar-literal { + background-color: #1d1f21; +} +.ayu .grammar-literal { + background-color: #191f26 +} + +.grammar-production:target, .railroad-production:target { + scroll-margin-top: 50vh; +} + +.railroad-production { + /* Enables absolute positioning of the target chevron. */ + position: relative; +} + +/* Adds an indicator to the targeted production name. */ +.grammar-production:target::before, .railroad-production:target::before { + content: "»"; + position: absolute; + left: 3px; + font-size: 2rem; + font-weight: bolder; + /* For some reason, the vertical alignment is slightly off center. This helps + with that alignment. It was too difficult to try to fix that via + absolute positioning. */ + line-height: 1; +} + +/* Overrides the positioning of the chevron from the rule above. */ +.railroad-production:target::before { + left: -20px; + top: 8px; +} + +/* The toggle button. 
*/ +.grammar-toggle { + width: 120px; + padding: 5px 0px; + border-radius: 5px; + cursor: pointer; +} + +/* This is used to toggle the hidden status of the railroad diagrams. */ +.grammar-hidden { + display: none; +} + +:root { + --railroad-background-color: hsl(30, 20%, 95%); + --railroad-background-image: linear-gradient(to right, rgba(30, 30, 30, .05) 1px, transparent 1px), + linear-gradient(to bottom, rgba(30, 30, 30, .05) 1px, transparent 1px); + --railroad-path-stroke: black; + --railroad-rect-stroke: black; + --railroad-rect-fill: hsl(-290, 70%, 90%); +} + +.coal, .navy, .ayu { + --railroad-background-color: hsl(230, 10%, 20%); + --railroad-background-image: linear-gradient(to right, rgba(150, 150, 150, .05) 1px, transparent 1px), + linear-gradient(to bottom, rgba(150, 150, 150, .05) 1px, transparent 1px); + --railroad-path-stroke: hsl(200, 10%, 60%); + --railroad-text-fill: hsl(230, 30%, 80%); + --railroad-rect-stroke: hsl(200, 10%, 50%); + --railroad-rect-fill: hsl(230, 20%, 20%); +} + +svg.railroad { + background-color: var(--railroad-background-color); + background-size: 15px 15px; + background-image: var(--railroad-background-image); +} + +svg.railroad rect.railroad_canvas { + stroke-width: 0px; + fill: none; +} + +svg.railroad path { + stroke-width: 3px; + stroke: var(--railroad-path-stroke); + fill: none; +} + +svg.railroad .debug { + stroke-width: 1px; + stroke: red; +} + +svg.railroad text { + font: 14px monospace; + text-anchor: middle; + fill: var(--railroad-text-fill); +} + +svg.railroad .nonterminal text { + font-weight: bold; +} + +svg.railroad text.comment { + font: italic 12px monospace; +} + +svg.railroad rect { + stroke-width: 3px; + stroke: var(--railroad-rect-stroke); + fill: var(--railroad-rect-fill); +} + +svg.railroad g.labeledbox>rect { + stroke-width: 1px; + stroke: grey; + stroke-dasharray: 5px; + fill: rgba(90, 90, 150, .1); +} From 6c55e500e62351eabcf36dcad7085370ba35d973 Mon Sep 17 00:00:00 2001 From: Eric Huss Date: Thu, 10 Apr 
2025 10:57:04 -0700 Subject: [PATCH 22/38] Add a summary chapter that shows all of the grammar productions on one page --- src/SUMMARY.md | 1 + src/grammar.md | 5 +++++ 2 files changed, 6 insertions(+) create mode 100644 src/grammar.md diff --git a/src/SUMMARY.md b/src/SUMMARY.md index 60787c339..980a61ef0 100644 --- a/src/SUMMARY.md +++ b/src/SUMMARY.md @@ -132,6 +132,7 @@ - [The Rust runtime](runtime.md) - [Appendices](appendices.md) + - [Grammar summary](grammar.md) - [Macro Follow-Set Ambiguity Formal Specification](macro-ambiguity.md) - [Influences](influences.md) - [Test summary](test-summary.md) diff --git a/src/grammar.md b/src/grammar.md new file mode 100644 index 000000000..fa7ca7819 --- /dev/null +++ b/src/grammar.md @@ -0,0 +1,5 @@ +# Grammar summary + +The following is a summary of the grammar production rules. + +{{ grammar-summary }} From ea629b430c257dc33ac9c7497d77b39d3bf8f241 Mon Sep 17 00:00:00 2001 From: Eric Huss Date: Thu, 10 Apr 2025 10:57:38 -0700 Subject: [PATCH 23/38] Add some documentation for how to write grammar rules --- docs/authoring.md | 4 ++ docs/grammar.md | 120 ++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 124 insertions(+) create mode 100644 docs/grammar.md diff --git a/docs/authoring.md b/docs/authoring.md index 73ca9dbed..a465309c8 100644 --- a/docs/authoring.md +++ b/docs/authoring.md @@ -214,3 +214,7 @@ r[foo.bar.edition2021] > [!EDITION-2021] > Describe what changed in 2021. ``` + +## Grammar + +See [Grammar](grammar.md) for details on how to write grammar rules. diff --git a/docs/grammar.md b/docs/grammar.md new file mode 100644 index 000000000..8033a313e --- /dev/null +++ b/docs/grammar.md @@ -0,0 +1,120 @@ +# Grammar + +The Reference grammar is written in markdown code blocks using a modified BNF-like syntax (with a blend of regex and other arbitrary things). The `mdbook-spec` extension parses these rules and converts them to a renderable format, including railroad diagrams. 
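As an illustration of the lang-string convention described above, the kind of scan the extension performs can be sketched with a stdlib-only helper. This is not the actual `mdbook-spec` implementation — `extract_grammar_blocks` is a hypothetical name, and the real parser does much more than split fences:

```rust
/// Collect (category, body) pairs from fenced code blocks whose info
/// string is `grammar,<category>` (a minimal sketch, not mdbook-spec).
fn extract_grammar_blocks(md: &str) -> Vec<(String, String)> {
    let mut blocks = Vec::new();
    let mut lines = md.lines();
    while let Some(line) = lines.next() {
        if let Some(info) = line.trim_start().strip_prefix("```") {
            // Only blocks tagged `grammar,<category>` are of interest.
            if let Some(category) = info.trim().strip_prefix("grammar,") {
                let mut body = String::new();
                // Accumulate lines until the closing fence.
                for body_line in lines.by_ref() {
                    if body_line.trim_start().starts_with("```") {
                        break;
                    }
                    body.push_str(body_line);
                    body.push('\n');
                }
                blocks.push((category.to_string(), body));
            }
        }
    }
    blocks
}
```

Feeding it a page containing a ```` ```grammar,items ```` fence yields one `("items", …)` entry whose body is the production text between the fences.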
+
+The code block should have a lang string with the word "grammar", a comma, and the category of the grammar, like this:
+
+~~~
+```grammar,items
+ProductionName -> SomeExpression
+```
+~~~
+
+The category is used to group similar productions on the grammar summary page in the appendix.
+
+## Grammar syntax
+
+The syntax for the grammar itself is pretty close to what is described in the [Notation chapter](../src/notation.md), though there are some rendering differences.
+
+The syntax for the grammar itself (written in itself, hopefully that's not too confusing) is:
+
+```
+Grammar -> Production+
+
+BACKTICK -> U+0060
+
+LF -> U+000A
+
+Production -> Name ` ->` Expression
+
+Name -> <Alphanumerics and underscores>
+
+Expression -> Sequence (` `* `|` ` `* Sequence)*
+
+Sequence -> (` `* AdornedExpr)+
+
+AdornedExpr -> ExprRepeat Suffix? Footnote?
+
+Suffix -> ` _` <limited markdown>* `_`
+
+Footnote -> `[^` ~[`]` LF]+ `]`
+
+ExprRepeat ->
+      Expr1 `?`
+    | Expr1 `*?`
+    | Expr1 `*`
+    | Expr1 `+?`
+    | Expr1 `+`
+    | Expr1 `{` Range? `..` Range? `}`
+
+Range -> [0-9]+
+
+Expr1 ->
+      Unicode
+    | NonTerminal
+    | Break
+    | Terminal
+    | Charset
+    | Prose
+    | Group
+    | NegativeExpression
+
+Unicode -> `U+` [`A`-`Z` `0`-`9`]4..4
+
+NonTerminal -> Name
+
+Break -> LF ` `+
+
+Terminal -> BACKTICK ~[LF]+ BACKTICK
+
+Charset -> `[` (` `* Characters)+ ` `* `]`
+
+Characters ->
+      CharacterRange
+    | CharacterTerminal
+    | CharacterName
+
+CharacterRange -> BACKTICK <any character> BACKTICK `-` BACKTICK <any character> BACKTICK
+
+CharacterTerminal -> Terminal
+
+CharacterName -> Name
+
+Prose -> `<` ~[`>` LF]+ `>`
+
+Group -> `(` ` `* Expression ` `* `)`
+
+NegativeExpression -> `~` ( Charset | Terminal | NonTerminal )
+```
+
+The general format is a series of productions separated by blank lines. The expressions are:
+
+| Expression | Example | Description |
+|------------|---------|-------------|
+| Unicode | U+0060 | A single unicode character. |
+| NonTerminal | FunctionParameters | A reference to another production by name. |
+| Break | | This is used internally by the renderer to detect line breaks and indentation. |
+| Terminal | \`example\` | This is a sequence of exact characters, surrounded by backticks. |
+| Charset | [ \`A\`-\`Z\` \`0\`-\`9\` \`_\` ] | A choice from a set of characters, space separated. There are three different forms. |
+| CharacterRange | [ \`A\`-\`Z\` ] | A range of characters, each character should be in backticks. |
+| CharacterTerminal | [ \`x\` ] | A single character, surrounded by backticks. |
+| CharacterName | [ LF ] | A nonterminal, referring to another production. |
+| Prose | \<an English description\> | This is an English description of what should be matched, surrounded in angle brackets. |
+| Group | (\`,\` Parameter)+ | This groups an expression for the purpose of precedence, such as applying a repetition operator to a sequence of other expressions. |
+| NegativeExpression | ~[\` \` LF] | Matches anything except the given Charset, Terminal, or Nonterminal. |
+| Sequence | \`fn\` Name Parameters | A sequence of expressions, where they must match in order. |
+| Alternation | Expr1 \| Expr2 | Matches only one of the given expressions, separated by the vertical pipe character. |
+| Suffix | \_except \[LazyBooleanExpression\]\_ | This adds a suffix to the previous expression to provide an additional English description to it, rendered in subscript. This can have limited markdown, but try to avoid anything except basics like links. |
+| Footnote | \[^extern-safe\] | This adds a footnote, which can supply some extra information that may be helpful to the user. The footnote itself should be defined outside of the code block like a normal markdown footnote. |
+| Optional | Expr? | The preceding expression is optional. |
+| Repeat | Expr* | The preceding expression is repeated 0 or more times. |
+| Repeat (non-greedy) | Expr*? | The preceding expression is repeated 0 or more times without being greedy. |
+| RepeatPlus | Expr+ | The preceding expression is repeated 1 or more times. |
+| RepeatPlus (non-greedy) | Expr+? | The preceding expression is repeated 1 or more times without being greedy. |
+| RepeatRange | Expr{2..4} | The preceding expression is repeated between the range of times specified. Either bound can be excluded, which works just like Rust ranges. |
+
+## Automatic linking
+
+The plugin automatically adds markdown link definitions for all the production names on every page. If you want to link directly to a production name, all you need to do is surround it in square brackets, like `[ArrayExpression]`.
+
+In some cases there might be name collisions with the automatic linking of rule names. In that case, disambiguate with the `grammar-` prefix, such as `[Type][grammar-Type]`. You can also do that if you just feel like being more explicit.

From a954c17a58ab3d5f0535a535dbd1a1059ce4cd78 Mon Sep 17 00:00:00 2001
From: Eric Huss
Date: Thu, 10 Apr 2025 13:42:13 -0700
Subject: [PATCH 24/38] Fix rule reference links with multiple spaces

This fixes it so that rule links work correctly if there is more than
one space in a reference definition.
---
 mdbook-spec/src/lib.rs | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mdbook-spec/src/lib.rs b/mdbook-spec/src/lib.rs
index f26c98ccc..0f14819fc 100644
--- a/mdbook-spec/src/lib.rs
+++ b/mdbook-spec/src/lib.rs
@@ -27,7 +27,7 @@ static ADMONITION_RE: Lazy<Regex> = Lazy::new(|| {
 
 /// A primitive regex to find link reference definitions.
 static MD_LINK_REFERENCE_DEFINITION: Lazy<Regex> =
-    Lazy::new(|| Regex::new(r"(?m)^\[(?