More robust interpolation parsing #597

Closed
wants to merge 1 commit into from

mbostock
Member

@mbostock mbostock commented Jan 23, 2024

This rearchitects our Markdown parser to be more robust. The intent is to fix issues such as:

Rather than implementing inline expression parsing as a markdown-it plugin (which depends on markdown-it’s internal tokenization), we implement a preprocessing step that converts inline expressions ${…} to HTML and extracts the JavaScript source expressions. This HTML is then ignored by markdown-it since HTML is allowed within Markdown. We will continue to parse fenced code blocks using markdown-it, since these are already supported by Markdown, and since these always generate nodes.

Furthermore, by employing the same HTML parsing state machine as Hypertext Literal, we can be exact about where the expressions start and end, and in what context they need to be evaluated (for example, as an attribute or as a node).

For example, this inline expression:

Hello, ${'world'}!

is compiled to:

Hello, <!-- o:1 -->!

In addition, the 'world' JavaScript expression is extracted so that it can be used to define a runtime variable, which is then displayed on the client, replacing the generated comment. (Note that we could generate <span id="cell-1"></span> instead of the comment here, and we may end up doing that, but I’ll need to change the client to support evaluating dynamic attributes anyway, so I’m opportunistically seeing if this helps remove the wrapper span #11.)
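To make the text-context case concrete, here is a minimal sketch of the preprocessing idea, not the code in this PR. The function name extractInlineExpressions is hypothetical, and it finds the end of each expression by counting braces rather than by running the Hypertext Literal state machine, so it would be fooled by braces inside strings, template literals, or comments.

```javascript
// Hypothetical helper (not the PR's implementation): replace each ${…} in
// text with a comment placeholder and collect the JavaScript source.
// Assumption: brace counting is enough to find the end of an expression.
function extractInlineExpressions(source) {
  const expressions = [];
  let html = "";
  let i = 0;
  while (i < source.length) {
    const start = source.indexOf("${", i);
    if (start < 0) break;
    // Find the matching close brace by tracking nesting depth.
    let depth = 1;
    let j = start + 2;
    while (j < source.length && depth > 0) {
      const c = source[j++];
      if (c === "{") depth++;
      else if (c === "}") depth--;
    }
    if (depth > 0) break; // unterminated: leave the rest as literal text
    expressions.push(source.slice(start + 2, j - 1));
    html += source.slice(i, start) + `<!-- o:${expressions.length} -->`;
    i = j;
  }
  return {html: html + source.slice(i), expressions};
}

const result = extractInlineExpressions("Hello, ${'world'}!");
console.log(result.html); // Hello, <!-- o:1 -->!
// result.expressions holds one entry: the source text 'world', quotes included.
```

The placeholder number is just the 1-based position of the expression in the extracted list, which is one plausible way to pair a comment with its runtime variable.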

Similarly, this:

<div class=${'red'}>color</div>

is compiled to:

<div o:class="1">color</div>

where o: is a special prefix that denotes a dynamic attribute that will be computed on the client. The exact compiled HTML syntax is still to be determined: the current approach will require using a TreeWalker to find comments and attributes, and then bind them to the runtime variable with the associated identifier.
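The two compiled forms above can be reproduced with a very small two-state scanner (text versus inside a tag). This is only a sketch under loud assumptions: compileInterpolations is a hypothetical name, attribute values must be unquoted and written as name=${…}, expressions must not contain braces, and a bare < in text would confuse it. The real Hypertext Literal state machine tracks many more states.

```javascript
// Hypothetical sketch (not the PR's implementation): emit a comment
// placeholder in text context and an o:-prefixed attribute in tag context.
function compileInterpolations(source) {
  const expressions = [];
  let html = "";
  let inTag = false;
  let i = 0;
  while (i < source.length) {
    // Assumption: the expression ends at the first "}" (no nested braces).
    const end = source.startsWith("${", i) ? source.indexOf("}", i + 2) : -1;
    if (end >= 0) {
      // Inside a tag, only handle the `name=${…}` shape; look back at what
      // has already been emitted to find the attribute name.
      const attribute = inTag ? /([^\s=<>"']+)=$/.exec(html) : null;
      if (!inTag || attribute) {
        expressions.push(source.slice(i + 2, end));
        const id = expressions.length;
        html = attribute
          ? `${html.slice(0, attribute.index)}o:${attribute[1]}="${id}"`
          : `${html}<!-- o:${id} -->`;
        i = end + 1;
        continue;
      }
    }
    // Otherwise copy the character through and update the state.
    const c = source[i++];
    if (c === "<") inTag = true;
    else if (c === ">") inTag = false;
    html += c;
  }
  return {html, expressions};
}

console.log(compileInterpolations("<div class=${'red'}>color</div>").html);
// <div o:class="1">color</div>
```

On the client, something would then need to walk the DOM for o: attributes and o: comments and bind each to the variable with the matching number; that part is not sketched here.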

The last thing to fix here is probably to diff the parsed HTML rather than diffing the parsed Markdown pieces. This would alleviate the requirement that each Markdown “piece” corresponds to exactly one HTML element, which introduces the blank line quirk. Of course, not all of these depend on each other, so I might try to decouple them and approach this more incrementally.

@mbostock mbostock changed the title from “parseInterpolate” to “More robust interpolation parsing” Jan 23, 2024
@mbostock
Member Author

mbostock commented Feb 3, 2024

If we also eliminate the wrapper span #11, then this should in theory also be able to support cases like this:

<style>

body {
  background: ${color};
}

</style>

where color is a reactive variable! Pretty amazing.

@mkotelnikov

Hello,

@statewalker/tknz is a tokenizer (parser) for HTML and Markdown syntaxes.

  • It is very small (30k unminified and uncompressed) with no dependencies
  • Produces well-formed ASTs from documents with exact start/end positions for each token
  • Opening/closing HTML tags are balanced (by default)
  • Allows embedding of inline code
  • Produces a well-formed hierarchy of document sections based on headers

It seems to cover most of the issues mentioned here.
You can check it here: https://observablehq.com/@kotelnikov/statewalker-tknz

@mbostock
Member Author

mbostock commented Jun 1, 2024

We now diff HTML, and #1416 removes the wrapper span, so two of the pieces have fallen into place.

Yet one limitation I see with this approach is the assumption that the HTML tokenizer state machine can be applied to Markdown (as-is). I think it can in about 98% of cases because nearly all of Markdown is text context, but here’s at least one example where it fails:

[link](https://example.com/${"path"})

In the above case, ${"path"} would appear to the HTML tokenizer to be in the text context, but in fact it’s interpolating into the href attribute. There’s also the more pathological case where the same expression is interpolated into both the text and an attribute simultaneously:

<https://example.com/${"path"}>

And I have no idea how auto-linkify should work in this case…

https://example.com/${"path"}

These cases aren’t currently handled in main, either: the ${"path"} isn’t recognized as an inline expression. And it’s pretty easy to rewrite this as HTML if for some reason you wanted this dynamic behavior. But it suggests that the tokenizer would at least need to recognize Markdown’s link syntax. I probably need to read the CommonMark spec to see if there are other cases.
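To illustrate the mismatch, here is a rough sketch of how an HTML-only scanner would classify each ${ in these inputs. The helper htmlContexts is hypothetical and far cruder than a real HTML tokenizer: it only tracks whether it is between < and >, and it knows nothing about Markdown.

```javascript
// Hypothetical helper: report the context ("text" or "tag") that a naive
// HTML-only scanner assigns to each ${ in the source.
function htmlContexts(source) {
  const contexts = [];
  let inTag = false;
  for (let i = 0; i < source.length; i++) {
    const c = source[i];
    if (source.startsWith("${", i)) contexts.push(inTag ? "tag" : "text");
    else if (c === "<") inTag = true;
    else if (c === ">") inTag = false;
  }
  return contexts;
}

// Plain HTML: the expression is correctly seen inside the tag.
console.log(htmlContexts('<a href=${"path"}>link</a>')); // [ 'tag' ]

// Markdown link: reported as text, although it ends up in the href.
console.log(htmlContexts('[link](https://example.com/${"path"})')); // [ 'text' ]

// Markdown autolink: reported as inside a tag, although Markdown would put
// the URL in both the href and the link text.
console.log(htmlContexts('<https://example.com/${"path"}>')); // [ 'tag' ]
```

In both Markdown cases the scanner gives a single answer that disagrees with where the value lands after Markdown rendering, which is why some awareness of link syntax seems necessary.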

This was referenced Jun 3, 2024
2 participants