Skip to content

[RFC] "Template"/Macro/Generic rules #261

@CAD97

Description

@CAD97

Motivation

One of the advertised features of LALRPOP, macros/templates/generics are a useful tool for factoring out common parts of your grammar. The common example is CommaSeparated<production> to represent production ~ ("," ~ production)* ~ ","?. (This can also be written (production ~ ",")* ~ production?, but I prefer the former formulation.)

In this RFC I lay out how a design for generic rules might look in pest, and attempt to make a case for their implementation.

A proposal for standard casing

In the 2.0 version of pest, the standard casing sees builtin rules in SHOUT_CASE and user rules recommended to be in snake_case. This RFC proposes that generic rules could be TitleCase by convention, along with their arguments, to distinguish them from normal rules.

Guide-Level Explanation

(Shamelessly adapted from the LALRPOP book, which is licensed MIT/Apache as is LALRPOP itself)

When writing grammars we encounter repetitive constructs that might normally be copy-and-pasted. A common example is something like a "comma-separated list". If we want to parse a comma-separated list of expressions, it might look something like:

expressions = { expression ~ ("," ~ expression)* ~ ","? }

But what happens if later we want a comma-separated list of terms, or anything else? For this, pest offers generic rules. By using a generic rule, we can factor out this common functionality into one place.

expressions = { CommaSeparated<expression> }
terms = { CommaSeparated<term> }

CommaSeparated<Rule> = _{ Rule ~ ("," ~ Rule)* ~ ","? }

Because CommaSeparated is marked as a silent rule with a _, this means this is functionally equivalent to inlining its structure into both expressions and terms. If a generic rule is not silenced, it will be included in the output structure just like any other rule.

Implementation-Level Explanation

There are two ways to handle generic rules. In the first, we treat it as a template, and generate multiple parsing functions for each instantiation. In the second, we pass along the generics to the Rust code.

I will explain via walking through in pseudocode the following example (note that no rules are silent, unlike above):

expressions = { CommaSeparated<expression> }
terms = { CommaSeparated<term> }

CommaSeparated<Rule> = { Rule ~ ("," ~ Rule)* ~ ","? }

Template desugaring

For each unique rule that is passed into a generic rule, desugar to a new rule instantiated with the concrete rule(s) passed to the generic rule.

expressions = { CommaSeparated__expression }
terms = { CommaSeparated__term }

CommaSeparated__expression = { expression ~ ("," ~ expression)* ~ ","? }
CommaSeparated__term = { term ~ ("," ~ term)* ~ ","? }

The generated rules do not correspond to unique Rule enum variants in the output, however; all generated rules from the same generic rule map to the same Rule enum variant.

Generic implementation

In addition to the parser state, the generated function for parsing this rule takes an argument representing what rule is passed as its generic argument. It then calls said function for any time the generic argument is present in the definition.

fn CommaSeparated(
    state: Box<::pest::ParserState<Rule>>,
    Rule: impl Fn(Box<::pest::ParserState<Rule>>) -> ::pest::ParseResult<Box<::pest::ParserState<Rule>>>,
) -> ::pest::ParseResult<Box<::pest::ParserState<Rule>>> {
    Rule(state)?
        .many(0.., |state| Rule(state.literal(",")?))?
        .optional(|state| state.literal(","))
}

Grammar changes

The terminal rule is changed to accommodate generic rules:

-terminal = _{ _push | identifier | string | insensitive_string | range }
+terminal = _{ _push | (identifier ~ ("<" ~ CommaSeparated<terminal> ~ ">")?) | string | insensitive_string | range }

In the future, we may wish to relax this such that a generic rule can take a term or even expression instead. Conversely, we may wish to only accept one terminal instead of a list to begin with.

Prior Art

  • LALRPOP, as linked above

Unresolved Questions

  • Do we want to pass along the genericness to the generated Rust code, or "monomorphize" the rules ourselves?
  • How generic should the generics be allowed to be? Do we want to support the more generic Separated<Term, By> = { Term ~ (By ~ Term)* ~ By? }? Do we need to support it?
  • Does paramaterizing generics over term or expression instead of terminal give any more convenience to the user? Any generic rule can be expressed solely by taking a terminal by just defining a silent terminal to be the desired more complicated expression.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions