Skip to content

Conversation

@kivikakk
Copy link
Owner

@kivikakk kivikakk commented Nov 2, 2025

Some minor fixes and cleanups, and the main course: prevent unexpected post-processing when elements are escaped.

This brought to my attention by - \[x] hello with tasklist (only) enabled raising an assertion: it's still trying to parse it as a tasklist, and the tasklist code asserts the text element's sourcepos start matches the containing paragraph's. In this case, it doesn't: the escape isn't included, so it throws for a mismatch.

The solution here seems to be to let the Escaped node actually always enter the document tree, preventing this from getting any post-processing treatment at all. We remove them after Text node unification and post-processing (unless the user wants them left in).

…essing matches.

This prevents a sourcepos assertion failing in tasklists when the
leading `[` is escaped.  I think we expect the escaping of it should
prevent parsing as a task list.

This means our AST assertions include "escaped" elements, which I like
the explicitness of, but let's not include them in XML output without
the render option.

This leaves a failing case which I've tried a few approaches to
addressing, none working out so far: "\™" when outputting to
CommonMark drops the "\", because when we're outputting the "&" (in a
lone Escaped element), `nextb` is None.  Buffering the character output
by one (to reliably see the next character to be output) introduces
all kinds of skew issues which make me think I'd be better off writing
gateware.
When neither the parse nor render option is enabled, we then remove
them post-hoc from the tree. We have to do it *after* all Text node
processing is done, so we don't unify an Escaped->Text with a later Text
and post-process it as if it were regular non-Escaped Text.  It's not
too much bookkeeping, and should be reasonably sound*.

We still have the remaining issue of these Escaped elements, when left
in the tree, preventing the CommonMark formatter from seeing that it
needs to escape a given character when the following node is Escaped.

\* famous last words.
@github-actions
Copy link
Contributor

github-actions bot commented Nov 2, 2025

Command Mean [ms] Min [ms] Max [ms] Relative
./bench.sh ./comrak-834aa08 109.2 ± 1.3 107.2 112.7 1.84 ± 0.05
./bench.sh ./comrak-main 107.4 ± 1.0 105.8 110.8 1.81 ± 0.04
./bench.sh ./pulldown-cmark 59.3 ± 1.3 58.0 66.2 1.00
./bench.sh ./cmark-gfm 94.1 ± 6.3 83.5 106.2 1.59 ± 0.11
./bench.sh ./markdown-it 262.7 ± 4.3 257.1 280.8 4.43 ± 0.12

Run on Sun Nov 2 05:45:55 UTC 2025

@kivikakk
Copy link
Owner Author

kivikakk commented Nov 2, 2025

Cursed. The following Markdown:

formats to CommonMark as follows:

The status quo is that this is post-processed the same as the original input; one big text node with the escapes parsed out/removed:

With autolinks turned on, these aren't autolinked; the trailing underscore disqualifies it (I guess).

With this PR, the roundtripped CommonMark parses as:

This does get parsed as an autolink, and so we fail a roundtrip test. Grr.

I'm kind of inclined to think this parse is actually correct (although different to the status quo), and it's the formatter that's changing the meaning here and should be corrected. But that's a Hard problem, as it's difficult to fully generally determine when a character needs to be escaped to be reinterpreted the same way, seeing as preceding and following characters will influence its interpretation.

@kivikakk
Copy link
Owner Author

kivikakk commented Nov 2, 2025

This change is necessary for escaped_char_spans to not bifurcate the interpretation of post-processed extensions in Comrak's eyes anyways: those Escaped elements being inserted only when the option is turned on visibly influence the interpretation. So we have to do it unilaterally, and remove the elements from the AST if not wanted, if we don't want it to actually change the meaning of the document when turned on.

The alternative is to track such characters without modifying the tree (e.g. vec of character indices), but that becomes hard to maintain. Hrmmmm.

@kivikakk
Copy link
Owner Author

kivikakk commented Nov 2, 2025

This does get parsed as an autolink, and so we fail a roundtrip test. Grr.

I've decided this parse is correct: without the backslash, the _ is part of the surrounding text and the autolink rules say the underscore disqualifies it:

., -, and _ can occur on both sides of the @, but only . may occur at the end of the email address, in which case it will not be considered part of the address:

With the backslash, (a) it's not a _ at the end, it's a backslash, and (b) even if we say it's not a backslash, it's not a _, it's an escaped _. The purpose of escaping is to prevent the usual interpretation.

@kivikakk
Copy link
Owner Author

kivikakk commented Nov 2, 2025

It occurs to me that the answer here is to not keep the autolinks extension turned on while doing roundtrip tests: any GFM extension autolinks recognised in the original text will be written out as core spec links (of some variety). Doing further extension autolink parsing on the resulting roundtripped text may well not be stable, because we don't roundtrip GFM extension autolinks to begin with!

This stops us from creating a larger intermediate value when there's a
Text node on either side of an Escaped Text node; we append the Escaped
Text to the left-hand Text, and then the right Text on the left-hand
Text.

Previously, we were appending the Escaped Text to the right-hand Text,
and then appending _that_ to the left-hand Text.
@kivikakk kivikakk changed the title prevent unexpected post-processing. prevent unexpected post-processing & simplify internal feed. Nov 2, 2025
@kivikakk kivikakk enabled auto-merge November 2, 2025 10:39
@kivikakk kivikakk merged commit 46c38de into main Nov 2, 2025
88 checks passed
@kivikakk kivikakk deleted the push-pmyomryuyqnw branch November 2, 2025 10:42
druskus20 pushed a commit to druskus20/comrak that referenced this pull request Nov 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants