Merged
179 changes: 164 additions & 15 deletions lib/experimental/html/class-wp-html-tag-processor.php
@@ -17,6 +17,9 @@
* E.g. match having class `1<"2` needs to recognize `class="1&lt;&quot;2"`.
* @TODO: Decode character references in `get_attribute()`
* @TODO: Properly escape attribute value in `set_attribute()`
* @TODO: Add slow mode to escape character entities in CSS class names?
* (This requires a custom decoder since `html_entity_decode()`
* doesn't handle attribute character reference decoding rules.)
*
* @package WordPress
* @subpackage HTML
@@ -28,6 +31,152 @@
* of patches to that input. Tokenizes HTML but does not fully
* parse the input document.
*
* ## Usage
*
* Use of this class requires three steps:
*
* 1. Create a new class instance with your input HTML document.
* 2. Find the tag(s) you are looking for.
* 3. Request changes to the attributes in those tag(s).
*
* Example:
* ```php
* $tags = new WP_HTML_Tag_Processor( $html );
* if ( $tags->next_tag( [ 'tag_name' => 'option' ] ) ) {
* $tags->set_attribute( 'selected', true );
* }
* ```
*
* ### Finding tags
*
* The `next_tag()` function moves the internal cursor through
* your input HTML document until it finds a tag meeting all of
* the supplied restrictions in the optional query argument. If
* no argument is provided then it will find the next HTML tag,
* regardless of what kind it is.
*
* If you want to _find whatever the next tag is_:
* ```php
* $tags->next_tag();
* ```
*
* | Goal | Query |
* |-----------------------------------------------------------|----------------------------------------------------------------------------|
* | Find any tag. | `$tags->next_tag();` |
* | Find next image tag. | `$tags->next_tag( [ 'tag_name' => 'img' ] );` |
* | Find next tag containing the `fullwidth` CSS class. | `$tags->next_tag( [ 'class_name' => 'fullwidth' ] );` |
* | Find next image tag containing the `fullwidth` CSS class. | `$tags->next_tag( [ 'tag_name' => 'img', 'class_name' => 'fullwidth' ] );` |
*
* If a tag was found meeting your criteria then `next_tag()`
* will return `true` and you can proceed to modify it. If it
* returns `false`, however, it failed to find the tag and
* moved the cursor to the end of the file.
*
* Once the cursor reaches the end of the file the processor
* is done and if you want to reach an earlier tag you will
* need to recreate the processor and start over. The internal
* cursor can only proceed forward, never backing up.
*
* #### Custom queries
*
* Sometimes it's necessary to inspect an HTML tag more closely than
* the query syntax here permits. In these cases one may further
* inspect the search results using the read-only functions
* provided by the processor, or using external state or variables.
*
* Example:
* ```php
* // Paint up to the first five DIV or SPAN tags marked with the "jazzy" style.
* $remaining_count = 5;
* while ( $remaining_count > 0 && $tags->next_tag() ) {
* if (
* ( 'DIV' === $tags->get_tag() || 'SPAN' === $tags->get_tag() ) &&
* 'jazzy' === $tags->get_attribute( 'data-style' )
* ) {
* $tags->add_class( 'theme-style-everest-jazz' );
* $remaining_count--;
* }
* }
* ```
*
* `get_attribute()` will return `null` if the attribute wasn't present
* on the tag when it was called. It may return `""` (the empty string)
* in cases where the attribute was present but its value was empty.
* For boolean attributes, those whose name is present but no value is
* given, it will return `true` (the only way to set `false` for an
* attribute is to remove it).
*
* ### Modifying HTML attributes for a found tag
*
* Once you've found the start of an opening tag you can modify
* any number of the attributes on that tag. You can set a new
* value for an attribute, remove the entire attribute, or do
* nothing and move on to the next opening tag.
*
* Example:
* ```php
* if ( $tags->next_tag( [ 'class_name' => 'wp-group-block' ] ) ) {
* $tags->set_attribute( 'title', 'This groups the contained content.' );
* $tags->remove_attribute( 'data-test-id' );
* }
* ```
*
* If `set_attribute()` is called for an existing attribute it will
* overwrite the existing value. Similarly, calling `remove_attribute()`
* for a non-existing attribute has no effect on the document. Both
* of these methods are safe to call without knowing if a given attribute
* exists beforehand.
*
* ### Modifying CSS classes for a found tag
*
* The tag processor treats the `class` attribute as a special case.
* Because adding or removing CSS classes is a common operation, you
* can do so directly with `add_class()` and `remove_class()`.
*
* As with attribute values, adding or removing CSS classes is a safe
* operation that doesn't require checking if the attribute or class
* exists before making changes. If removing the only class then the
@adamziel (Contributor), Sep 27, 2022
Suggested change:
- * exists before making changes. If removing the only class then the
+ * exists before making changes. If removing the only class name, the
* entire `class` attribute will be removed.
*
* Example:
* ```php
* // from `<span>Yippee!</span>`
* // to `<span class="is-active">Yippee!</span>`
* $tags->add_class( 'is-active' );
*
* // from `<span class="excited">Yippee!</span>`
* // to `<span class="excited is-active">Yippee!</span>`
* $tags->add_class( 'is-active' );
*
* // from `<span class="is-active heavy-accent">Yippee!</span>`
* // to `<span class="is-active heavy-accent">Yippee!</span>`
* $tags->add_class( 'is-active' );
*
* // from `<input type="text" class="is-active rugby not-disabled" length="24">`
* // to `<input type="text" class="is-active not-disabled" length="24">`
* $tags->remove_class( 'rugby' );
*
* // from `<input type="text" class="rugby" length="24">`
* // to `<input type="text" length="24">`
* $tags->remove_class( 'rugby' );
*
* // from `<input type="text" length="24">`
* // to `<input type="text" length="24">`
* $tags->remove_class( 'rugby' );
* ```
*
* ## Design limitations
*
* @TODO: Expand this section
*
* - no nesting: cannot match open and close tag
* - only move forward, never backward
* - class names not decoded if they contain character references
* - only secures against HTML escaping issues; requires
* manually sanitizing or escaping values based on the needs of
* each individual attribute, since different attributes have
* different needs.
*
* @since 6.2.0
*/
class WP_HTML_Tag_Processor {
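The `add_class()` / `remove_class()` outcomes listed in the docblock examples above can be modelled on a plain `class` attribute string. This is only a sketch of the documented results, not the processor's implementation: `sketch_add_class()` and `sketch_remove_class()` are hypothetical helpers, `null` stands in for a missing attribute, and class names are compared exactly here, which may differ from how the processor compares them.

```php
<?php
/**
 * Hypothetical helper (not part of the patch): returns the `class`
 * attribute value after adding $name, leaving it unchanged when the
 * name is already present. Whitespace between names is normalized.
 */
function sketch_add_class( ?string $class_attr, string $name ): string {
    $names = null === $class_attr
        ? array()
        : preg_split( '/\s+/', $class_attr, -1, PREG_SPLIT_NO_EMPTY );

    if ( ! in_array( $name, $names, true ) ) {
        $names[] = $name;
    }

    return implode( ' ', $names );
}

/**
 * Hypothetical helper (not part of the patch): returns the `class`
 * attribute value after removing $name, or null when no class names
 * remain, i.e. the whole attribute would be dropped.
 */
function sketch_remove_class( ?string $class_attr, string $name ): ?string {
    if ( null === $class_attr ) {
        return null; // Nothing to remove; the attribute stays absent.
    }

    $names = preg_split( '/\s+/', $class_attr, -1, PREG_SPLIT_NO_EMPTY );
    $names = array_values( array_diff( $names, array( $name ) ) );

    return array() === $names ? null : implode( ' ', $names );
}

var_dump( sketch_add_class( 'excited', 'is-active' ) ); // string(17) "excited is-active"
var_dump( sketch_remove_class( 'rugby', 'rugby' ) );    // NULL
```

Both helpers can be called without first checking whether the attribute or the class name exists, which mirrors the "safe operation" property the docblock describes.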
@@ -136,16 +285,16 @@ class WP_HTML_Tag_Processor {
* // and stops after recognizing the `id` attribute
* // <div id="test-4" class=outline title="data:text/plain;base64=asdk3nk1j3fo8">
* // ^ parsing will continue from this point
* $this->attributes = array(
* $this->attributes = [
* 'id' => new WP_HTML_Attribute_Match( 'id', null, 6, 17 )
* );
* ];
*
* // when picking up parsing again, or when asking to find the
* // `class` attribute we will continue and add to this array
* $this->attributes = array(
* 'id' => new WP_HTML_Attribute_Match( 'id', null, 6, 17 ),
* $this->attributes = [
* 'id' => new WP_HTML_Attribute_Match( 'id', null, 6, 17 ),
* 'class' => new WP_HTML_Attribute_Match( 'class', 'outline', 18, 32 )
* );
* ];
*
* // Note that only the `class` attribute value is stored in the index.
* // That's because it is the only value used by this class at the moment.
@@ -170,11 +319,11 @@ class WP_HTML_Tag_Processor {
* Example:
* <code>
* // Add the `WP-block-group` class, remove the `WP-group` class.
* $class_changes = array(
* $class_changes = [
* // Indexed by a comparable class name
* 'wp-block-group' => new WP_Class_Name_Operation( 'WP-block-group', WP_Class_Name_Operation::ADD ),
* 'wp-group' => new WP_Class_Name_Operation( 'WP-group', WP_Class_Name_Operation::REMOVE )
* );
* ];
* </code>
*
* @since 6.2.0
@@ -206,9 +355,9 @@ class WP_HTML_Tag_Processor {
*
* // Correspondingly, something like this
* // will appear in the replacements array.
* $replacements = array(
* $replacements = [
* WP_HTML_Text_Replacement( 14, 28, 'https://my-site.my-domain/wp-content/uploads/2014/08/kittens.jpg' )
* );
* ];
* </code>
*
* @since 6.2.0
@@ -270,9 +419,9 @@ public function next_tag( $query = null ) {
if ( 's' === $t || 'S' === $t || 't' === $t || 'T' === $t ) {
$tag_name = $this->get_tag();

if ( 'script' === $tag_name ) {
if ( 'SCRIPT' === $tag_name ) {
$this->skip_script_data();
} elseif ( 'textarea' === $tag_name || 'title' === $tag_name ) {
} elseif ( 'TEXTAREA' === $tag_name || 'TITLE' === $tag_name ) {
$this->skip_rcdata( $tag_name );
}
}
@@ -318,7 +467,7 @@ private function skip_rcdata( $tag_name ) {
$tag_char = $tag_name[ $i ];
$html_char = $html[ $at + $i ];

if ( $html_char !== $tag_char && strtolower( $html_char ) !== $tag_char ) {
if ( $html_char !== $tag_char && strtoupper( $html_char ) !== $tag_char ) {
$at += $i;
continue 2;
}
@@ -937,7 +1086,7 @@ public function get_tag() {

$tag_name = substr( $this->html, $this->tag_name_starts_at, $this->tag_name_length );

return strtolower( $tag_name );
return strtoupper( $tag_name );
}

/**
@@ -1189,7 +1338,7 @@ private function matches() {

/*
* Otherwise we have to check for each character if they
* are the same, and only `strtolower()` if we have to.
* are the same, and only `strtoupper()` if we have to.
* Presuming that most people will supply lowercase tag
* names and most HTML will contain lowercase tag names,
* most of the time this runs we shouldn't expect to
@@ -1199,7 +1348,7 @@ private function matches() {
$html_char = $this->html[ $this->tag_name_starts_at + $i ];
$tag_char = $this->sought_tag_name[ $i ];

if ( $html_char !== $tag_char && strtolower( $html_char ) !== $tag_char ) {
if ( $html_char !== $tag_char && strtoupper( $html_char ) !== $tag_char ) {
return false;
}
}
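The character comparison that this change switches from `strtolower()` to `strtoupper()` can be tried in isolation. The sketch below assumes the sought tag name has already been upper-cased, which the diff does not show directly; `tag_name_matches_at()` is a hypothetical helper written for illustration, not a method of the class.

```php
<?php
/**
 * Hypothetical helper (not part of the patch): checks whether the bytes
 * of $html starting at offset $at spell $sought_upper, ignoring ASCII case.
 * $sought_upper is assumed to be upper-case already.
 */
function tag_name_matches_at( string $html, int $at, string $sought_upper ): bool {
    $length = strlen( $sought_upper );

    // Not enough input left to hold the whole tag name.
    if ( $at + $length > strlen( $html ) ) {
        return false;
    }

    for ( $i = 0; $i < $length; $i++ ) {
        $html_char = $html[ $at + $i ];
        $tag_char  = $sought_upper[ $i ];

        // Compare the raw bytes first; only call strtoupper() on a mismatch.
        if ( $html_char !== $tag_char && strtoupper( $html_char ) !== $tag_char ) {
            return false;
        }
    }

    return true;
}

var_dump( tag_name_matches_at( '<textarea>', 1, 'TEXTAREA' ) ); // bool(true)
var_dump( tag_name_matches_at( '<TextArea>', 1, 'TEXTAREA' ) ); // bool(true)
var_dump( tag_name_matches_at( '<title>', 1, 'TEXTAREA' ) );    // bool(false)
```

With an upper-case sought name, a lower-case document pays for one `strtoupper()` call per character, while an upper-case document takes the direct-comparison path.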
22 changes: 11 additions & 11 deletions phpunit/html/wp-html-tag-processor-standalone-test.php
@@ -68,7 +68,7 @@ public function test_get_tag_returns_null_when_not_in_open_tag() {
public function test_get_tag_returns_open_tag_name() {
$p = new WP_HTML_Tag_Processor( '<div>Test</div>' );
$this->assertTrue( $p->next_tag( 'div' ), 'Querying an existing tag did not return true' );
$this->assertSame( 'div', $p->get_tag(), 'Accessing an existing tag name did not return "div"' );
$this->assertSame( 'DIV', $p->get_tag(), 'Accessing an existing tag name did not return "div"' );
}

/**
@@ -841,7 +841,7 @@ public function test_setting_a_boolean_attribute_to_a_string_value_adds_explicit
public function test_unclosed_script_tag_should_not_cause_an_infinite_loop() {
$p = new WP_HTML_Tag_Processor( '<script>' );
$p->next_tag();
$this->assertSame( 'script', $p->get_tag() );
$this->assertSame( 'SCRIPT', $p->get_tag() );
$p->next_tag();
}

@@ -855,9 +855,9 @@ public function test_unclosed_script_tag_should_not_cause_an_infinite_loop() {
public function test_next_tag_ignores_the_contents_of_a_script_tag( $script_then_div ) {
$p = new WP_HTML_Tag_Processor( $script_then_div );
$p->next_tag();
$this->assertSame( 'script', $p->get_tag(), 'The first found tag was not "script"' );
$this->assertSame( 'SCRIPT', $p->get_tag(), 'The first found tag was not "script"' );
$p->next_tag();
$this->assertSame( 'div', $p->get_tag(), 'The second found tag was not "∂iv"' );
$this->assertSame( 'DIV', $p->get_tag(), 'The second found tag was not "div"' );
}

/**
@@ -934,7 +934,7 @@ public function test_next_tag_ignores_the_contents_of_a_rcdata_tag( $rcdata_then
$p->next_tag();
$this->assertSame( $rcdata_tag, $p->get_tag(), "The first found tag was not '$rcdata_tag'" );
$p->next_tag();
$this->assertSame( 'div', $p->get_tag(), "The second found tag was not 'div'" );
$this->assertSame( 'DIV', $p->get_tag(), "The second found tag was not 'div'" );
}

/**
@@ -951,32 +951,32 @@ public function data_rcdata_state() {
$examples = array();
$examples['Simple textarea'] = array(
'<textarea><span class="d-none d-md-inline">Back to notifications</span></textarea><div></div>',
'textarea',
'TEXTAREA',
);

$examples['Simple title'] = array(
'<title><span class="d-none d-md-inline">Back to notifications</title</span></title><div></div>',
'title',
'TITLE',
);

$examples['Comment opener inside a textarea tag should be ignored'] = array(
'<textarea class="d-md-none"><!--</textarea><div></div>-->',
'textarea',
'TEXTAREA',
);

$examples['Textarea closer with another textarea tag in closer attributes'] = array(
'<textarea><span class="d-none d-md-inline">Back to notifications</title</span></textarea <textarea><div></div>',
'textarea',
'TEXTAREA',
);

$examples['Textarea closer with attributes'] = array(
'<textarea class="d-md-none"><span class="d-none d-md-inline">Back to notifications</span></textarea id="test"><div></div>',
'textarea',
'TEXTAREA',
);

$examples['Textarea opener with title closer inside'] = array(
'<textarea class="d-md-none"></title></textarea><div></div>',
'textarea',
'TEXTAREA',
);
return $examples;
}
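The cases in `data_rcdata_state()` above rely on markup inside `<textarea>` and `<title>` being treated as text until a matching closer appears. A rough standalone sketch of that rule follows. `find_rcdata_end()` is a hypothetical helper and is not the patch's `skip_rcdata()`; it assumes `$at` is a valid offset into `$html` and that the closer's attributes contain no `>` inside quoted values.

```php
<?php
/**
 * Hypothetical helper (not part of the patch): returns the offset just
 * past the closing tag of an RCDATA element, or strlen( $html ) when
 * the element is never closed.
 */
function find_rcdata_end( string $html, int $at, string $tag_name ): int {
    $closer = '</' . $tag_name;
    $length = strlen( $html );

    // stripos() makes the search for the closer case-insensitive.
    while ( false !== ( $pos = stripos( $html, $closer, $at ) ) ) {
        $after = $pos + strlen( $closer );
        if ( $after >= $length ) {
            return $length;
        }

        // The tag name must be followed by whitespace, "/" or ">";
        // otherwise something like "</title</span>" is still text.
        $c = $html[ $after ];
        if ( '>' === $c || '/' === $c || ' ' === $c || "\t" === $c || "\n" === $c || "\f" === $c || "\r" === $c ) {
            $end = strpos( $html, '>', $after );
            return false === $end ? $length : $end + 1;
        }

        $at = $after;
    }

    return $length;
}

$html = '<title><span>Back</title</span></title><div></div>';
var_dump( substr( $html, find_rcdata_end( $html, 7, 'title' ) ) ); // string(11) "<div></div>"
```

Scanning would resume at the returned offset, which is why the tests expect the next tag found to be the `DIV` rather than the `span` inside the RCDATA content.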