md_in_html can process content out of order if block HTML is nested under it #1502

Closed
facelessuser opened this issue Jan 17, 2025 · 8 comments · Fixed by #1503

@facelessuser
Collaborator

I have Block plugins that expect a start fence (/// start) and an end fence (///). In almost all cases, it seems reasonable to assume that Python Markdown will process the start before the end, allowing the plugin to work.

In a case like the one below, this holds true. I added some logging to show the order in which the processed lines are encountered.

/// 1

<p>
Hello i'm a 'p' in a tab which in in a div!
</p>

/// 2

Order logged:

/// 1
/// 2

A problem is introduced though when md_in_html is used.

<div class="my-div" markdown>

/// 1

<p>
Hello i'm a 'p' in a tab which in in a div!
</p>

/// 2

</div>

Order logged:

/// 2
/// 1

As we can see, the lines are now processed out of order, and since our extension relies on this order, the extension breaks.

If the wrapping tag is inline, there is no problem. If the nested tag is inline, there is also no problem. So this only happens when block tags are nested within a block tag that is being processed by md_in_html.
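
For anyone wanting to reproduce the logging, here is a minimal sketch of how the processing order could be recorded. It is not the actual Block plugin code: FenceLogger, FenceLoggerExtension, logged_order, and the priority value 200 are hypothetical names and choices made for illustration, and the paragraph text is shortened. The processor only records `///` lines and then returns False, so the built-in processors still handle the block.

```python
import markdown
from markdown.blockprocessors import BlockProcessor
from markdown.extensions import Extension


class FenceLogger(BlockProcessor):
    """Hypothetical helper: record each `///` line as the block parser reaches it."""

    def __init__(self, parser, seen):
        super().__init__(parser)
        self.seen = seen

    def test(self, parent, block):
        # Match any block that contains a line starting with `///`.
        return any(line.strip().startswith('///') for line in block.splitlines())

    def run(self, parent, blocks):
        for line in blocks[0].splitlines():
            if line.strip().startswith('///'):
                self.seen.append(line.strip())
        # Returning False tells the parser this processor did not consume
        # the block, so the remaining processors handle it as usual.
        return False


class FenceLoggerExtension(Extension):
    def __init__(self, seen, **kwargs):
        super().__init__(**kwargs)
        self.seen = seen

    def extendMarkdown(self, md):
        # 200 is an arbitrary priority chosen to run before the built-in processors.
        md.parser.blockprocessors.register(
            FenceLogger(md.parser, self.seen), 'fence_logger', 200
        )


def logged_order(source, extra_extensions=()):
    """Convert `source` and return the `///` lines in the order they were seen."""
    seen = []
    md = markdown.Markdown(extensions=[FenceLoggerExtension(seen), *extra_extensions])
    md.convert(source)
    return seen


PLAIN = "/// 1\n\n<p>\nHello\n</p>\n\n/// 2"
NESTED = '<div class="my-div" markdown>\n\n' + PLAIN + '\n\n</div>'

print(logged_order(PLAIN))
print(logged_order(NESTED, ['md_in_html']))
```

The second call is the one whose order depends on the installed Python Markdown version: releases affected by this issue should show the reversed order described above.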

@facelessuser
Collaborator Author

facelessuser commented Jan 17, 2025

Let me be clear, this is a processing order problem, not an insertion order problem. The content is inserted in the proper order, but the block processor sees these lines in an unexpected order, which makes it difficult for an extension to discern when and where the lines actually occurred.

@waylan
Member

waylan commented Jan 21, 2025

Just curious, does this still occur if your start and end tags are not separated by raw HTML? For example with this input:

<div class="my-div" markdown>

/// 1

Some plain text

/// 2

</div>

Or how about if there are no blank lines?

<div class="my-div" markdown>

/// 1
Some plain text
/// 2

</div>

I'm just trying to work out what might be causing this.

@waylan
Member

waylan commented Jan 21, 2025

Something that is helpful when debugging raw HTML is to check htmlStash.rawHtmlBlocks and see what is actually recognized as a single block. Sometimes that shows that what you or I would expect to be a single block is actually multiple blocks. But I don't see that happening here. Everything is all one block.

>>> import markdown
>>> md = markdown.Markdown()
>>> md.convert('<div class="my-div" markdown>\n\n/// 1\n\n<p>\nHello i\'m a \'p\' in a tab which in in a div!\n</p>\n\n/// 2\n\n<div>')
'<div class="my-div" markdown>\n\n/// 1\n\n<p>\nHello i\'m a \'p\' in a tab which in in a div!\n</p>\n\n/// 2\n\n<div>'
>>> md.htmlStash.rawHtmlBlocks
[ '<div class="my-div" markdown>\n\n/// 1\n\n<p>\nHello i\'m a \'p\' in a tab which in in a div!\n</p>\n\n/// 2\n\n<div>\n\n']

@facelessuser
Collaborator Author

It only happens with raw HTML between them.

@facelessuser
Collaborator Author

I'll try to take a look at this sometime soon as well. I had a separate issue related to the opposite case (md_in_html appearing in a fenced block), but that was resolved in my library. This is the one issue that is not resolvable in my own library and seems fundamental to either md_in_html or the interaction between that extension and the HTML parser. I haven't really looked very closely yet, but I'll see if I can make some sense of this.

@facelessuser
Collaborator Author

It seems that I can probably fix the ordering issue, but it looks like it won't matter as the real problem is how "markdown" HTML blocks work.

Apparently, when processing a "markdown" block, the nested block HTML elements inside it are already real HTML elements. So a plugin like the fenced Block plugins I developed won't work with this setup, because the elements are not being passed as placeholders that are expanded as they are processed. In a situation like the following, where a nested HTML block is fenced, the fenced block will never see the HTML element as a child under it.

<div markdown>

/// block
<div>Nested</div>
///

</div>

Ideally md_in_html would simply expand nested block HTML from placeholders under parent "markdown" HTML blocks as they are encountered in the normal parsing pipeline. That would make Markdown processing straightforward and work like it does everywhere else and simplify the md_in_html processing such that ordering wasn't really a problem.
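
To illustrate what "placeholders" means here, the following is a small sketch of the normal pipeline without md_in_html. It runs only the preprocessing stage on the fenced content from the example above, which is the stage that stashes block HTML before any block processor runs. Driving the preprocessors by hand like this is just for demonstration and is not how an extension would normally work.

```python
import markdown

md = markdown.Markdown()
lines = ["/// block", "<div>Nested</div>", "///"]

# Run only the preprocessing stage, the same loop Markdown.convert()
# performs before handing the text to the block parser.
for prep in md.preprocessors:
    lines = prep.run(lines)

# The raw <div> has been swapped for a placeholder in the text...
print(lines)
# ...and the original HTML is kept in the stash until postprocessing.
print(md.htmlStash.rawHtmlBlocks)
```

Because the block processors only ever see the placeholder text, a fenced Block plugin can consume it like any other line between its start and end fences.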

If we were to just accept that nested HTML will be real elements in these cases, I'd probably have to rework md_in_html processing to fix the ordering quirk, and have Block extensions add a step where, while they haven't yet found the end fence, they look for children that appear after them and try to move them inside. I feel like this may be less reliable, but maybe not impossible to pull off.

I'll have to work on some custom solutions locally in my project to see what is the most viable and cleanest approach before proposing anything. In the worst case, if the changes are not desired here, I may have to provide a custom md_in_html alternative. Not ideal, but we'll have to see once I can nail down a solution. Having the Markdown parser behave very differently inside "markdown" blocks is far from ideal from the custom extension perspective, but I'm not sure how many people have tried to push extensions as far as I am with the generic fenced blocks.

@facelessuser
Collaborator Author

So, I was able to modify md_in_html, while keeping all existing tests passing, and restructure it so that the Markdown parsing process within "markdown" blocks functions the same as it does outside of "markdown" blocks. I will spend some time cleaning it up and post a PR. Hopefully, the change will be found to be acceptable as it provides a consistent expectation for extension developers.

@facelessuser
Collaborator Author

Draft PR #1503 is up with all unit tests passing.

facelessuser added a commit that referenced this issue Jan 28, 2025
…#1503)

Ensure `md_in_html` processes content inside a "markdown" block the same way content is processed outside of a "markdown" block.

- Flatten the HTML content into placeholders so that the parser will treat the "markdown" block content in the same way it does when `md_in_html` is not enabled. The placeholders are expanded once the parser reaches them in a linear fashion. This allows extensions to deal with HTML content and consume it the same way they do when the content is not nested under a "markdown" block.

- Instead of content being processed in dummy tags, content is now processed under the real parent allowing extensions to have better context to make better decisions.

Additionally, fix some issues with tags and inline code.

Also, fix some issues with one-liner block tags, e.g. `<tag><tag>...`

Resolves #1502 
Resolves #1075
Resolves #1074