prevent unexpected post-processing & simplify internal feed. #679
…essing matches. This prevents a sourcepos assertion failing in tasklists when the leading `[` is escaped. I think we expect that escaping it should prevent parsing as a task list. This means our AST assertions include "escaped" elements, which I like the explicitness of, but let's not include them in XML output without the render option. This leaves a failing case which I've tried a few approaches to addressing, none working out so far: "\&trade;" when outputting to CommonMark drops the "\", because when we're outputting the "&" (in a lone Escaped element), `nextb` is None. Buffering the character output by one (to reliably see the next character to be output) introduces all kinds of skew issues which make me think I'd be better off writing gateware.
When neither the parse nor render option is enabled, we then remove them post-hoc from the tree. We have to do it *after* all Text node processing is done, so we don't unify an Escaped->Text with a later Text and post-process it as if it were regular non-Escaped Text. It's not too much bookkeeping, and should be reasonably sound*. We still have the remaining issue of these Escaped elements, when left in the tree, preventing the CommonMark formatter from seeing that it needs to escape a given character when the following node is Escaped. \* famous last words.
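The ordering described above (unify Text, post-process, and only then fold the Escaped nodes back in) can be sketched roughly as follows. This is a minimal illustration and not comrak's actual code: the `Inline` enum, `post_process`, `unify_text`, `strip_escaped` and `finish` are hypothetical names, the Escaped node is flattened to a string instead of wrapping a Text child, and the `[x]` rewrite is only a stand-in for real extension post-processing such as tasklists.

```rust
/// Hypothetical, simplified inline node; not comrak's real AST types.
#[derive(Debug, Clone, PartialEq)]
enum Inline {
    Text(String),
    /// A backslash-escaped run, kept apart so post-processing never sees it.
    Escaped(String),
}

/// Stand-in for extension post-processing: rewrites a leading "[x]" in
/// plain Text nodes only. Escaped nodes are left untouched.
fn post_process(nodes: &mut [Inline]) {
    for node in nodes.iter_mut() {
        if let Inline::Text(s) = node {
            if let Some(rest) = s.strip_prefix("[x]") {
                let replaced = format!("<task:done>{}", rest);
                *s = replaced;
            }
        }
    }
}

/// Merge adjacent Text nodes; any Escaped node acts as a barrier.
fn unify_text(nodes: Vec<Inline>) -> Vec<Inline> {
    let mut out: Vec<Inline> = Vec::new();
    for node in nodes {
        if let Inline::Text(next) = &node {
            if let Some(Inline::Text(prev)) = out.last_mut() {
                prev.push_str(next);
                continue;
            }
        }
        out.push(node);
    }
    out
}

/// Turn Escaped nodes back into Text and merge them with their neighbours.
fn strip_escaped(nodes: Vec<Inline>) -> Vec<Inline> {
    let plain: Vec<Inline> = nodes
        .into_iter()
        .map(|n| match n {
            Inline::Escaped(s) => Inline::Text(s),
            other => other,
        })
        .collect();
    unify_text(plain)
}

/// Unify, post-process, and only afterwards remove Escaped nodes
/// (unless the caller asked to keep them in the tree).
fn finish(nodes: Vec<Inline>, keep_escaped: bool) -> Vec<Inline> {
    let mut nodes = unify_text(nodes);
    post_process(&mut nodes);
    if keep_escaped {
        nodes
    } else {
        strip_escaped(nodes)
    }
}
```

In this sketch an escaped `[` followed by `x] hello` ends up as the single Text `[x] hello` without ever being treated as a checkbox, whereas an unescaped `[x] hello` is rewritten by the stand-in post-processor.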
Run on Sun Nov 2 05:45:55 UTC 2025
Cursed. The following Markdown: formats to CommonMark as follows: The status quo is that this is post-processed the same as the original input; one big text node with the escapes parsed out/removed:
With autolinks turned on, these aren't autolinked; the trailing underscore disqualifies it (I guess). With this PR, the roundtripped CommonMark parses as:
This does get parsed as an autolink, and so we fail a roundtrip test. Grr. I'm kind of inclined to think this parse is actually correct (although different to the status quo), and it's the formatter that's changing the meaning here and should be corrected. But that's a Hard problem, as it's difficult to fully generally determine when a character needs to be escaped to be reinterpreted the same way, seeing as preceding and following characters will influence its interpretation.
This change is necessary for The alternative is to track such characters without modifying the tree (e.g. vec of character indices), but that becomes hard to maintain. Hrmmmm.
I've decided this parse is correct: without the backslash, the
With the backslash, (a) it's not a
It occurs to me that the answer here is to not keep the autolinks extension turned on while doing roundtrip tests: any GFM extension autolinks recognised in the original text will be written out as core spec links (of some variety). Doing further extension autolink parsing on the resulting roundtripped text may well not be stable, because we don't roundtrip GFM extension autolinks to begin with!
This stops us from creating a larger intermediate value when there's a Text node on either side of an Escaped Text node; we append the Escaped Text to the left-hand Text, and then append the right-hand Text to the left-hand Text. Previously, we were appending the Escaped Text to the right-hand Text, and then appending _that_ to the left-hand Text.
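As a rough illustration of the two merge orders, here is a sketch that works on plain strings instead of tree nodes. Both function names are hypothetical and neither is taken from the codebase; the point is only that the first variant grows the left-hand buffer in place, while the second builds a combined escaped-plus-right value before appending it.

```rust
/// Hypothetical helper: append the escaped run, then the right-hand text,
/// directly onto the left-hand buffer. No combined temporary is built.
fn merge_into_left(left: &mut String, escaped: &str, right: &str) {
    left.reserve(escaped.len() + right.len());
    left.push_str(escaped);
    left.push_str(right);
}

/// The earlier order, for comparison: combine the escaped run with the
/// right-hand text first, then append that intermediate to the left.
fn merge_via_intermediate(left: &mut String, escaped: &str, right: &str) {
    let mut intermediate = String::from(escaped);
    intermediate.push_str(right);
    left.push_str(&intermediate);
}
```

Both produce the same text; the difference is only in how much temporary data is allocated along the way.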
Some minor fixes and cleanups, and the main course: prevent unexpected post-processing when elements are escaped.
This was brought to my attention by `- \[x] hello` with tasklist (only) enabled raising an assertion: it's still trying to parse it as a tasklist, and the tasklist code asserts the text element's sourcepos start matches the containing paragraph's. In this case, it doesn't: the escape isn't included, so it throws for a mismatch.
The solution here seems to be to let the `Escaped` node actually always enter the document tree, preventing this from getting any post-processing treatment at all. We remove them after Text node unification and post-processing (unless the user wants them left in).