Blocks: Process HTML With a Parser (Instead of Regular Expressions) #42128

@adamziel

What problem is this issue looking to solve?

I'd like Gutenberg PHP code to process HTML using an actual parser instead of regular expressions as it does now.

A few examples of regexps in action:

Regular expressions are error-prone, hard to debug, and can fail due to unexpected corner-cases.
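To make the corner-case risk concrete, here is a minimal sketch. It is written in Python only so that it is short and runnable; it is not Gutenberg code, and the markup, the regex, and the helper names (`classes_via_regex`, `classes_via_parser`, `ImgClassCollector`) are all hypothetical. It reads the class names of an `<img>` tag once with a naive regex and once with a tokenizer.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup: the alt text contains ">" and the class attribute
# uses single quotes. Both are valid HTML.
HTML = "<img alt=\"a > b\" class='wp-image-1 alignleft' src=\"cat.png\">"


def classes_via_regex(html):
    """Naive approach: assumes double quotes and no ">" inside attribute values."""
    match = re.search(r'<img[^>]*class="([^"]*)"', html)
    return match.group(1).split() if match else []


class ImgClassCollector(HTMLParser):
    """Collects class names from every <img> start tag the tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.classes = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        for name, value in attrs:
            if name == "class" and value:
                self.classes.extend(value.split())


def classes_via_parser(html):
    collector = ImgClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


print(classes_via_regex(HTML))
print(classes_via_parser(HTML))
```

The regex finds nothing because `[^>]*` stops at the `>` inside the alt text, and it would also miss the single-quoted value. The tokenizer tracks quoting state, so it reports both class names. Patching the regex for these two cases is possible, but each patch tends to expose another case.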

What solution does this issue propose

Let's lean on an HTML parser.

However, let's avoid using DOMDocument if possible. A related discussion surfaced the following problems with it:

  • It requires a PHP extension that may not be available, meaning we'd need a fallback
  • It's an XML/HTML4 parser at heart and is known to have deficiencies in parsing modern HTML

What would we use instead?

Step 1: Identify a parser to lean on

I've searched Google and GitHub for PHP HTML parsers, DOM parsers, component libraries and frameworks, HTML formatters, and HTML tokenizers. I also went through this great StackOverflow answer.

Here's the list of libraries I found:

Compatible with PHP 5 and maintained recently

  • hQuery – Parses HTML, but doesn't support updating attributes or injecting nodes.
  • HTML Purifier – Not a parser library, but ships with an HTML tokenizer that Gutenberg could reuse.

Neither gets us 100% there, but they'd make great starting points.
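As a rough illustration of what building on a tokenizer could look like, the sketch below re-emits a document token by token and sets one attribute on matching tags along the way. It is a Python stand-in using the standard library's `html.parser`, not the API of hQuery or HTML Purifier, and the names `AttributeSetter` and `set_attribute` are hypothetical.

```python
from html import escape
from html.parser import HTMLParser


class AttributeSetter(HTMLParser):
    """Re-emits the input token by token, setting one attribute on matching tags."""

    def __init__(self, tag, name, value):
        # Keep character references as separate tokens so they pass through unchanged.
        super().__init__(convert_charrefs=False)
        self.target = (tag, name, value)
        self.out = []

    def _start(self, tag, attrs, suffix):
        target_tag, name, value = self.target
        if tag != target_tag:
            # Tags we do not modify are copied verbatim from the source.
            self.out.append(self.get_starttag_text())
            return
        # Drop any existing copy of the attribute, then append the new value.
        attrs = [(k, v) for k, v in attrs if k != name] + [(name, value)]
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in attrs
        )
        self.out.append(f"<{tag}{rendered}{suffix}>")

    def handle_starttag(self, tag, attrs):
        self._start(tag, attrs, "")

    def handle_startendtag(self, tag, attrs):
        self._start(tag, attrs, " /")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def set_attribute(html, tag, name, value):
    setter = AttributeSetter(tag, name, value)
    setter.feed(html)
    setter.close()
    return "".join(setter.out)


print(set_attribute('<p class="old">Tom &amp; Jerry <br/>x</p>', "p", "class", "new"))
```

The point of the design is that only the targeted start tags are rewritten; everything else is copied through, so an update does not require building or serializing a full DOM tree. A production version would need to handle many more cases than this sketch does.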

Unmaintained recently

Incompatible with PHP 5 or dependent on DOMDocument

Step 2: Figure out what's next

There are three possible outcomes:

  • Use a non-DOMDocument parser – how can it be brought to Gutenberg and then to WordPress?
  • Use a DOMDocument parser – what's the best way to document its limitations?
  • Don't use any parser at all – would that mean regular expressions are the way to parse HTML in Gutenberg, then?

CC @azaozz @hellofromtonya @dmsnell @draganescu @getdave @scruffian @mtias @youknowriad @anton-vlasenko @noisysocks @Mamaduka @paaljoachim @mcsf

Metadata

    Labels

    • Developer Experience: Ideas about improving block and theme developer experience
    • Needs Technical Feedback: Needs testing from a developer perspective.
    • [Type] Code Quality: Issues or PRs that relate to code quality
    • [Type] New API: New API to be used by plugin developers or package users.
