Description
What problem is this issue looking to solve?
I'd like Gutenberg PHP code to process HTML using an actual parser instead of regular expressions as it does now.
A few examples of regexps in action:
- Remove whitespace from the block markup
- Inject inline styles into HTML tags
- Inject CSS class names into HTML tags
Regular expressions are error-prone, hard to debug, and can fail on unexpected corner cases.
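To make the failure mode concrete, here is a minimal sketch of a regex-based class injection. `naive_add_class()` is a hypothetical helper written for this illustration, not code taken from Gutenberg; it only shows the kind of corner case a pattern like `[^>]*` can miss.

```php
<?php
// Hypothetical helper for illustration only; not an actual Gutenberg function.
function naive_add_class( $html, $class_name ) {
	// Treats everything up to the first ">" as the opening tag.
	return preg_replace(
		'/^<(\w+)([^>]*)>/',
		'<$1$2 class="' . $class_name . '">',
		$html,
		1
	);
}

// Simple markup works as intended.
echo naive_add_class( '<div id="a">Hello</div>', 'wp-block' ), "\n";
// <div id="a" class="wp-block">Hello</div>

// An attribute value containing ">" ends the match too early and corrupts the tag.
echo naive_add_class( '<div title="a > b">Hello</div>', 'wp-block' ), "\n";
// <div title="a  class="wp-block"> b">Hello</div>

// An existing class attribute is not merged; a duplicate attribute is emitted.
echo naive_add_class( '<div class="x">Hello</div>', 'wp-block' ), "\n";
// <div class="x" class="wp-block">Hello</div>
```

A tokenizer that tracks quoted attribute values would not stop at the `>` inside `title`, which is the main argument for leaning on a parser here.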
What solution does this issue propose?
Let's lean on an HTML parser.
However, let's avoid using DOMDocument if possible. A related discussion surfaced the following problems with it:
- It requires a PHP extension that may not be available, meaning we'd need a fallback
- It's an XML/HTML4 parser at heart and is known to have deficiencies in parsing modern HTML
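The first point implies a runtime check along these lines. This is only a sketch: `choose_html_strategy()` and the strategy labels are hypothetical names; `class_exists()` and the `DOMDocument` class are the only real PHP APIs used.

```php
<?php
// Hypothetical helper: reports which processing path the current host could support.
function choose_html_strategy() {
	// DOMDocument is provided by PHP's "dom" extension, which a host may not have enabled.
	if ( class_exists( 'DOMDocument' ) ) {
		return 'domdocument';
	}
	// Placeholder label for whatever fallback implementation would be required.
	return 'fallback';
}

echo choose_html_strategy(), "\n";
```

Every code path that relied on DOMDocument would need an equivalent branch, so the fallback would effectively have to be a second implementation.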
What would we use instead?
Step 1: Identify a parser to lean on
I've searched Google and GitHub for PHP HTML parsers, DOM parsers, component libraries and frameworks, HTML formatters, and HTML tokenizers. I also went through this great StackOverflow answer.
Here's the list of libraries I found:
Compatible with PHP 5 and maintained recently
- hQuery – Parses HTML, but doesn't support updating attributes or injecting nodes.
- HTML Purifier – Not a parser library, but ships with an HTML tokenizer that Gutenberg could reuse.
Neither gets us 100% there, but they'd make great starting points.
Unmaintained recently
Incompatible with PHP 5 or dependent on DOMDocument
- HtmlTokenizer
- Symfony DomCrawler
- html5-php
- HTML5 DOM Document
- Simple HTML DOM
- php-html-parser
- Laminas DOM
- DiDOM
- Zend DOM Query
Step 2: Figure out what's next
There are three possible outcomes:
- Use a non-DOMDocument parser – how can it be brought to Gutenberg and then to WordPress?
- Use a DOMDocument parser – what's the best way to document its limitations?
- Don't use any parser at all – would that mean regular expressions are the way to parse HTML in Gutenberg, then?
CC @azaozz @hellofromtonya @dmsnell @draganescu @getdave @scruffian @mtias @youknowriad @anton-vlasenko @noisysocks @Mamaduka @paaljoachim @mcsf