Commit 15863a8

dmsnell, adamziel, ockham, and gziolo committed
Introduce HTML Tag Processor
This commit pulls in the HTML Tag Processor from the Gutenberg repository. The Tag Processor attempts to be an HTML5-spec-compliant parser that provides the ability in PHP to find specific HTML tags and then add, remove, or update attributes on those tags. It provides a safe and reliable way to modify the attributes on HTML tags.

```php
// Add missing `rel` attribute to links.
$p = new WP_HTML_Tag_Processor( $block_content );
if ( $p->next_tag( 'A' ) && empty( $p->get_attribute( 'rel' ) ) ) {
	$p->set_attribute( 'rel', 'noopener nofollow' );
}
return $p->get_updated_html();
```

Introduced originally in WordPress/gutenberg#42485 and developed within the Gutenberg repository, this HTML parsing system was built to address a persistent need: properly modifying HTML tag attributes. It was motivated by a series of block editor defects that stemmed from mismatches between actual HTML code and the expectations for HTML input built into existing naive string-search-based solutions.

The Tag Processor is intended to operate fast enough to avoid being an obstacle on page render while using as little memory overhead as possible. It is practically a zero-memory-overhead system: it only allocates memory as changes to the input HTML document are enqueued, and it releases that memory when flushing those changes to the document, moving on to find the next tag, or flushing its entire output via `get_updated_html()`.

Care has been taken to ensure that the Tag Processor will not be confused by unexpected or non-normative HTML input, including issues arising from quoting, from different syntax rules within `<title>`, `<textarea>`, and `<script>` tags, from the appearance of rare but legitimate comment and XML-like regions, and from a variety of syntax abnormalities such as unbalanced tags, incomplete syntax, and overlapping tags.

The Tag Processor is constrained to parsing an HTML document as a stream of tokens. It will not build an HTML tree or generate a DOM representation of a document. It is designed to start at the beginning of an HTML document and linearly scan through it, potentially modifying that document as it scans. It has no access to the markup inside or around tags, and it has no ability to determine which tag openers and tag closers belong to each other or to determine the nesting depth of a given tag.

It includes a primitive bookmarking system to remember tags it has previously visited. These bookmarks refer to specific tags, not to string offsets, and continue to point to the same place in the document as edits are applied. By asking the Tag Processor to seek to a given bookmark, it's possible to back up and reprocess content that has already been traversed.

Attribute values are sanitized with `esc_attr()` and rendered as double-quoted attributes. On read they are unescaped and unquoted. Authors relying on the Tag Processor are therefore free to pass data around as normal strings.

Convenience methods for adding and removing CSS class names exist to remove the need to process the `class` attribute directly.

```php
// Update heading block class names.
$p = new WP_HTML_Tag_Processor( $html );
while ( $p->next_tag() ) {
	switch ( $p->get_tag() ) {
		case 'H1':
		case 'H2':
		case 'H3':
		case 'H4':
		case 'H5':
		case 'H6':
			$p->remove_class( 'wp-heading' );
			$p->add_class( 'wp-block-heading' );
			break;
	}
}
return $p->get_updated_html();
```

The Tag Processor is intended to be a reliable low-level library for traversing HTML documents, and higher-level APIs are to be built upon it. In the immediate term, and in core Gutenberg blocks, it is meant to replace HTML modification that currently relies on RegExp patterns and simpler string replacements.

See the following for examples of such replacement:

WordPress/gutenberg@1315784
https://github.com/WordPress/gutenberg/pull/45469/files#diff-dcd9e1f9b87ca63efe9f1e834b4d3048778d3eca41aa39c636f8b16a5bb452d2L46
WordPress/gutenberg#46625

Co-Authored-By: Adam Zielinski <[email protected]>
Co-Authored-By: Bernie Reiter <[email protected]>
Co-Authored-By: Grzegorz Ziolkowski <[email protected]>
1 parent 9356d97 commit 15863a8

8 files changed: +4572 -0 lines changed
Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
+<?php
+/**
+ * HTML Tag Processor: Attribute token structure class.
+ *
+ * @package WordPress
+ * @subpackage HTML
+ * @since 6.2.0
+ */
+
+if ( ! class_exists( 'WP_HTML_Attribute_Token' ) ) :
+
+/**
+ * Data structure for the attribute token, which allows for drastically improved performance.
+ *
+ * This class is for internal usage of the WP_HTML_Tag_Processor class.
+ *
+ * @access private
+ * @since 6.2.0
+ *
+ * @see WP_HTML_Tag_Processor
+ */
+class WP_HTML_Attribute_Token {
+	/**
+	 * Attribute name.
+	 *
+	 * @since 6.2.0
+	 * @var string
+	 */
+	public $name;
+
+	/**
+	 * The string offset where the attribute value starts.
+	 *
+	 * @since 6.2.0
+	 * @var int
+	 */
+	public $value_starts_at;
+
+	/**
+	 * How many bytes the value occupies in the input HTML.
+	 *
+	 * @since 6.2.0
+	 * @var int
+	 */
+	public $value_length;
+
+	/**
+	 * The string offset where the attribute name starts.
+	 *
+	 * @since 6.2.0
+	 * @var int
+	 */
+	public $start;
+
+	/**
+	 * The string offset after the attribute value or its name.
+	 *
+	 * @since 6.2.0
+	 * @var int
+	 */
+	public $end;
+
+	/**
+	 * Whether the attribute is a boolean attribute with value `true`.
+	 *
+	 * @since 6.2.0
+	 * @var bool
+	 */
+	public $is_true;
+
+	/**
+	 * Constructor.
+	 *
+	 * @since 6.2.0
+	 *
+	 * @param string $name         Attribute name.
+	 * @param int    $value_start  The string offset where the attribute value starts.
+	 * @param int    $value_length Number of bytes attribute value spans.
+	 * @param int    $start        The string offset where the attribute name starts.
+	 * @param int    $end          The string offset after the attribute value or its name.
+	 * @param bool   $is_true      Whether the attribute is a boolean attribute with true value.
+	 */
+	public function __construct( $name, $value_start, $value_length, $start, $end, $is_true ) {
+		$this->name            = $name;
+		$this->value_starts_at = $value_start;
+		$this->value_length    = $value_length;
+		$this->start           = $start;
+		$this->end             = $end;
+		$this->is_true         = $is_true;
+	}
+}
+
+endif;
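
The attribute token stores offsets into the input HTML rather than copies of the attribute text. A minimal sketch of how those offsets might be turned back into strings follows. It assumes a stand-in class, `Example_Attribute_Token`, with the same public fields so that it runs without WordPress loaded; the helper functions, the hand-computed offsets, and the offsets chosen for the boolean attribute are all hypothetical and are not taken from the Tag Processor itself.

```php
<?php
// Hypothetical stand-in mirroring the public fields of WP_HTML_Attribute_Token.
final class Example_Attribute_Token {
	public $name;
	public $value_starts_at;
	public $value_length;
	public $start;
	public $end;
	public $is_true;

	public function __construct( $name, $value_start, $value_length, $start, $end, $is_true ) {
		$this->name            = $name;
		$this->value_starts_at = $value_start;
		$this->value_length    = $value_length;
		$this->start           = $start;
		$this->end             = $end;
		$this->is_true         = $is_true;
	}
}

// Hypothetical helper: slice the raw (still escaped) value out of the document.
function example_attribute_value( $html, $token ) {
	if ( $token->is_true ) {
		return true; // Boolean attributes carry no value text.
	}
	return substr( $html, $token->value_starts_at, $token->value_length );
}

// Hypothetical helper: slice the whole attribute, name through value.
function example_attribute_source( $html, $token ) {
	return substr( $html, $token->start, $token->end - $token->start );
}

$html = '<a href="/about" hidden>About</a>';

// Offsets are found by hand here; a real parser would record them while scanning.
$name_at  = strpos( $html, 'href' );
$value_at = strpos( $html, '/about' );
$href     = new Example_Attribute_Token(
	'href',
	$value_at,
	strlen( '/about' ),
	$name_at,
	$value_at + strlen( '/about' ) + 1, // One past the closing quote.
	false
);

// Assumption: a boolean attribute is given an empty value range at its end.
$hidden_at = strpos( $html, 'hidden' );
$hidden    = new Example_Attribute_Token( 'hidden', $hidden_at + 6, 0, $hidden_at, $hidden_at + 6, true );

echo example_attribute_value( $html, $href ), "\n";  // /about
echo example_attribute_source( $html, $href ), "\n"; // href="/about"
```

Because only integers are stored per attribute, no substring is copied until a caller actually asks for the value.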
Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
+<?php
+/**
+ * HTML Span: Represents a textual span inside an HTML document.
+ *
+ * @package WordPress
+ * @subpackage HTML
+ * @since 6.2.0
+ */
+
+if ( ! class_exists( 'WP_HTML_Span' ) ) :
+
+/**
+ * Represents a textual span inside an HTML document.
+ *
+ * This is a two-tuple in disguise, used to avoid the memory
+ * overhead involved in using an array for the same purpose.
+ *
+ * This class is for internal usage of the WP_HTML_Tag_Processor class.
+ *
+ * @access private
+ * @since 6.2.0
+ *
+ * @see WP_HTML_Tag_Processor
+ */
+class WP_HTML_Span {
+	/**
+	 * Byte offset into document where span begins.
+	 *
+	 * @since 6.2.0
+	 * @var int
+	 */
+	public $start;
+
+	/**
+	 * Byte offset into document where span ends.
+	 *
+	 * @since 6.2.0
+	 * @var int
+	 */
+	public $end;
+
+	/**
+	 * Constructor.
+	 *
+	 * @since 6.2.0
+	 *
+	 * @param int $start Byte offset into document where replacement span begins.
+	 * @param int $end   Byte offset into document where replacement span ends.
+	 */
+	public function __construct( $start, $end ) {
+		$this->start = $start;
+		$this->end   = $end;
+	}
+}
+
+endif;
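
The commit message notes that bookmarks keep pointing at the same place in the document as edits are applied. One way a start/end pair like this span could support that is sketched below. This is only an illustration: `Example_Span`, `example_span_text()`, and `example_shift_span()` are hypothetical names, the end offset is assumed to be exclusive, and the shifting logic is not taken from the Tag Processor.

```php
<?php
// Hypothetical stand-in with the same two public fields as WP_HTML_Span.
final class Example_Span {
	public $start;
	public $end;

	public function __construct( $start, $end ) {
		$this->start = $start;
		$this->end   = $end;
	}
}

// Return the text a span covers, assuming $end is an exclusive byte offset.
function example_span_text( $html, $span ) {
	return substr( $html, $span->start, $span->end - $span->start );
}

// Shift a span when $delta bytes are inserted at $edit_at, at or before its start.
// Edits that land inside or after the span are deliberately ignored in this sketch.
function example_shift_span( $span, $edit_at, $delta ) {
	if ( $edit_at <= $span->start ) {
		$span->start += $delta;
		$span->end   += $delta;
	}
	return $span;
}

$html = '<p>Hello</p><img src="a.png">';

// Remember where the IMG tag sits.
$img = new Example_Span( strpos( $html, '<img' ), strlen( $html ) );
echo example_span_text( $html, $img ), "\n"; // <img src="a.png">

// Insert an attribute into the earlier P tag, then shift the span by the same amount.
$insert = ' class="intro"';
$edited = substr_replace( $html, $insert, 2, 0 );
example_shift_span( $img, 2, strlen( $insert ) );

echo $edited, "\n";                            // <p class="intro">Hello</p><img src="a.png">
echo example_span_text( $edited, $img ), "\n"; // <img src="a.png">
```

Keeping the span as a small object with two integer fields means each update is two additions, which fits the stated goal of avoiding array overhead.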
