prevent unexpected post-processing & simplify internal feed. #679

kivikakk · 2025-11-02T05:40:28Z

Some minor fixes and cleanups, and the main course: prevent unexpected post-processing when elements are escaped.

This brought to my attention by - \[x] hello with tasklist (only) enabled raising an assertion: it's still trying to parse it as a tasklist, and the tasklist code asserts the text element's sourcepos start matches the containing paragraph's. In this case, it doesn't: the escape isn't included, so it throws for a mismatch.

The solution here seems to be to let the Escaped node actually always enter the document tree, preventing this from getting any post-processing treatment at all. We remove them after Text node unification and post-processing (unless the user wants them left in).

…essing matches. This prevents a sourcepos assertion failing in tasklists when the leading `[` is escaped. I think we expect the escaping of it should prevent parsing as a task list. This means our AST assertions include "escaped" elements, which I like the explicitness of, but let's not include them in XML output without the render option. This leaves a failing case which I've tried a few approaches to addressing, none working out so far: "\™" when outputting to CommonMark drops the "\", because when we're outputting the "&" (in a lone Escaped element), `nextb` is None. Buffering the character output by one (to reliably see the next character to be output) introduces all kinds of skew issues which make me think I'd be better off writing gateware.

When neither the parse nor render option is enabled, we then remove them post-hoc from the tree. We have to do it *after* all Text node processing is done, so we don't unify an Escaped->Text with a later Text and post-process it as if it were regular non-Escaped Text. It's not too much bookkeeping, and should be reasonably sound*. We still have the remaining issue of these Escaped elements, when left in the tree, preventing the CommonMark formatter from seeing that it needs to escape a given character when the following node is Escaped. \* famous last words.

github-actions · 2025-11-02T05:45:56Z

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`./bench.sh ./comrak-834aa08`	109.2 ± 1.3	107.2	112.7	1.84 ± 0.05
`./bench.sh ./comrak-main`	107.4 ± 1.0	105.8	110.8	1.81 ± 0.04
`./bench.sh ./pulldown-cmark`	59.3 ± 1.3	58.0	66.2	1.00
`./bench.sh ./cmark-gfm`	94.1 ± 6.3	83.5	106.2	1.59 ± 0.11
`./bench.sh ./markdown-it`	262.7 ± 4.3	257.1	280.8	4.43 ± 0.12

Run on Sun Nov 2 05:45:55 UTC 2025

kivikakk · 2025-11-02T06:07:48Z

Cursed. The following Markdown:

[email protected]_

formats to CommonMark as follows:

a.b-c\[email protected]\_

The status quo is that this is post-processed the same as the original input; one big text node with the escapes parsed out/removed:

document
- paragraph
  - text [email protected]_

With autolinks turned on, these aren't autolinked; the trailing underscore disqualifies it (I guess).

With this PR, the roundtripped CommonMark parses as:

document
- paragraph
  - text a.b-c
  - escaped
    - text _
  - text [email protected]
  - escaped
    - text _

This does get parsed as an autolink, and so we fail a roundtrip test. Grr.

I'm kind of inclined to think this parse is actually correct (although different to the status quo), and it's the formatter that's changing the meaning here and should be corrected. But that's a Hard problem, as it's difficult to fully generally determine when a character needs to be escaped to be reinterpreted the same way, seeing as preceding and following characters will influence its interpretation.

kivikakk · 2025-11-02T06:21:31Z

This change is necessary for escaped_char_spans to not bifurcate the interpretation of post-processed extensions in Comrak's eyes anyways: those Escaped elements being inserted only when the option is turned on visibly influence the interpretation. So we have to do it unilaterally, and remove the elements from the AST if not wanted, if we don't want it to actually change the meaning of the document when turned on.

The alternative is to track such characters without modifying the tree (e.g. vec of character indices), but that becomes hard to maintain. Hrmmmm.

kivikakk · 2025-11-02T06:27:55Z

This does get parsed as an autolink, and so we fail a roundtrip test. Grr.

I've decided this parse is correct: without the backslash, the _ is part of the surrounding text and the autolink rules say the underscore disqualifies it:

., -, and _ can occur on both sides of the @, but only . may occur at the end of the email address, in which case it will not be considered part of the address:

With the backslash, (a) it's not a _ at the end, it's a backslash, and (b) even if we say it's not a backslash, it's not a _, it's an escaped _. The purpose of escaping is to prevent the usual interpretation.

See #679 (comment).

kivikakk · 2025-11-02T06:39:32Z

It occurs to me that the answer here is to not keep the autolinks extension turned on while doing roundtrip tests: any GFM extension autolinks recognised in the original text will be written out as core spec links (of some variety). Doing further extension autolink parsing on the resulting roundtripped text may well not be stable, because we don't roundtrip GFM extension autolinks to begin with!

…sts.

This stops us from creating a larger intermediate value when there's a Text node on either side of an Escaped Text node; we append the Escaped Text to the left-hand Text, and then the right Text on the left-hand Text. Previously, we were appending the Escaped Text to the right-hand Text, and then appending _that_ to the left-hand Text.

See kivikakk#679 (comment).

kivikakk added 7 commits November 2, 2025 15:25

maybe don't allow arbitrary usize range here.

f1c4906

cm: do not recurse into output.

a381ac9

arena_tree: add some pretty-printing around nodes.

eaf755e

nodes: Sourcepos/Ast debug implementations easier to read.

dced134

add escaped_char_spans to fuzzer.

a458705

assert this behaviour.

b825e24

See #679 (comment).

kivikakk added 2 commits November 2, 2025 17:45

vendor/cmark-gfm: disable autolink after initial roundtrip in spec te…

b7afc3f

…sts.

kivikakk force-pushed the push-pmyomryuyqnw branch from 39cd9d4 to d5f88e4 Compare November 2, 2025 07:18

kivikakk added 4 commits November 2, 2025 18:58

vendor/cmark-gfm: silence some SyntaxWarnings.

55a19df

simplify parser feed.

e306ca1

simplify and normalise newline handling.

c9305aa

don't recycle the linebuf immediately.

d2d0cbb

kivikakk changed the title ~~prevent unexpected post-processing.~~ prevent unexpected post-processing & simplify internal feed. Nov 2, 2025

kivikakk force-pushed the push-pmyomryuyqnw branch from 8435939 to d2d0cbb Compare November 2, 2025 10:39

kivikakk enabled auto-merge November 2, 2025 10:39

kivikakk merged commit 46c38de into main Nov 2, 2025
88 checks passed

kivikakk deleted the push-pmyomryuyqnw branch November 2, 2025 10:42

druskus20 pushed a commit to druskus20/comrak that referenced this pull request Nov 5, 2025

assert this behaviour.

ddb8804

See kivikakk#679 (comment).

BrewTestBot mentioned this pull request Nov 13, 2025

comrak 0.48.0 Homebrew/homebrew-core#254238

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

prevent unexpected post-processing & simplify internal feed. #679

prevent unexpected post-processing & simplify internal feed. #679

Uh oh!

kivikakk commented Nov 2, 2025 •

edited

Loading

Uh oh!

github-actions bot commented Nov 2, 2025

Uh oh!

kivikakk commented Nov 2, 2025

Uh oh!

kivikakk commented Nov 2, 2025

Uh oh!

kivikakk commented Nov 2, 2025

Uh oh!

kivikakk commented Nov 2, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

prevent unexpected post-processing & simplify internal feed. #679

prevent unexpected post-processing & simplify internal feed. #679

Uh oh!

Conversation

kivikakk commented Nov 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Nov 2, 2025

Uh oh!

kivikakk commented Nov 2, 2025

Uh oh!

kivikakk commented Nov 2, 2025

Uh oh!

kivikakk commented Nov 2, 2025

Uh oh!

kivikakk commented Nov 2, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kivikakk commented Nov 2, 2025 •

edited

Loading