40 changes: 40 additions & 0 deletions CHANGELOG.md
@@ -26,6 +26,13 @@ Released: TBD (Not before 2025-05-01)
the type has changed to `PeggySyntaxError`, which may cause some slight need
for rework in TypeScript-aware projects. This was the main driver behind
moving away from ES5. [#593](https://github.com/peggyjs/peggy/pull/593)
- BREAKING: The grammar parser now uses your JavaScript environment's understanding
Contributor Author

I don't expect this to be a problem in practice. The previous code was trying to do ID_Start and ID_Continue, kept to the BMP. All of the things that used to be identifiers should still work, in any JS runtime recent enough to support ES2020.

There are 93778 characters that are out of the BMP that are valid ID_Start or ID_Continue, and they're now valid Peggy identifiers. But note that none of them are Emoji, if you're worried about that.

of Unicode classes, rather than a partial copy of Unicode 8 as before. This
should be more correct and evolve over time while staying backward-compatible
to the extent that the Unicode Consortium keeps to its goals. Because this
might slightly affect what rule names are valid, we are marking this as a
breaking change just in case.
[#602](https://github.com/peggyjs/peggy/pull/602)
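
A minimal sketch of what "your JavaScript environment's understanding of Unicode classes" can mean in practice. It uses plain JavaScript regular expressions, not Peggy's own grammar parser. The helper name `looksLikeIdentifier` and the exact pattern (ID_Start or `_`, followed by ID_Continue) are assumptions made for illustration; Peggy's real identifier rule may differ in detail.

```javascript
// Hypothetical helper, NOT Peggy's implementation: approximates a rule-name
// check using the runtime's own Unicode tables via \p{...} property escapes.
// Assumption: a name is ID_Start (or "_") followed by any number of ID_Continue.
const IDENTIFIER = /^[\p{ID_Start}_][\p{ID_Continue}]*$/u;

function looksLikeIdentifier(name) {
  return IDENTIFIER.test(name);
}

// A BMP-only name keeps working.
console.log(looksLikeIdentifier("start")); // true

// U+10000 (LINEAR B SYLLABLE B008 A) is outside the BMP and is ID_Start.
console.log(looksLikeIdentifier("\u{10000}rule")); // true

// Emoji are neither ID_Start nor ID_Continue.
console.log(looksLikeIdentifier("\u{1F4AA}")); // false
```

Because the property tables come from the runtime, the set of accepted names can grow as the runtime adopts newer Unicode versions.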

### New features
- Extend library mode to include a success flag and a function for throwing syntax errors when needed.
@@ -47,6 +54,39 @@ Released: TBD (Not before 2025-05-01)
generated for each of these removals. Like merged class rules above, this
should only be removing dead code.
[#594](https://github.com/peggyjs/peggy/pull/594)
- Character classes now process characters not in the Basic Multilingual
Contributor Author

Note that using a character not in the BMP automatically puts the character class in Unicode mode.

Plane (BMP) correctly. This feature requires a JavaScript environment
that supports the `u` flag to regular expressions. The `u` flag will only
be used on character classes that make use of this new feature.
[#602](https://github.com/peggyjs/peggy/pull/602)
- Unicode characters may now be specified with the `\u{hex}` syntax, allowing
easier inclusion of characters not in the BMP (such as newer emoji). This
syntax works both in string literals and in character classes.
[#602](https://github.com/peggyjs/peggy/pull/602)
- Errors pointing to non-BMP characters as the "found" text will now show the
full character and not the replacement character for the first surrogate in
the UTF-16 representation.
Contributor Author

As long as the error doesn't point to the middle of a surrogate pair. Then you still see the replacement character, and that's probably correct.

[#602](https://github.com/peggyjs/peggy/pull/602)
- Character classes can now be annotated with the "u" flag, which will force
the character class into Unicode mode, where one full codepoint will be matched.
Contributor Author

Move this to the beginning, then refer to Unicode mode throughout.

For example, `[^a]u` will match 💪 (U+1F4AA). Without the "u" flag, `[^a]`
would only match \uD83D, the first surrogate that makes up U+1F4AA in UTF-16
encoding. [#602](https://github.com/peggyjs/peggy/pull/602)
- Empty inverted character classes such as `[^]u` now match one character,
because they match "not-nothing". Without the "u" flag, this is the same as
`.`. With the "u" flag, this matches an entire codepoint, not just a single
UTF-16 code unit (JS character). Previously, this expression compiled but
was useless.
[#602](https://github.com/peggyjs/peggy/pull/602)
- String literals may now contain characters from outside the BMP.
Contributor Author

This is mentioned above.

[#602](https://github.com/peggyjs/peggy/pull/602)
- Character classes may now contain `\p{}` or `\P{}` escapes to match
Unicode properties or their complements. See
[MDN](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Unicode_character_class_escape)
for more details. If you are generating code for a non-JavaScript environment
using a plugin, this may be somewhat challenging for the plugin author.
Please file an issue on Peggy for help.
[#602](https://github.com/peggyjs/peggy/pull/602)
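
The entries above say that Unicode mode maps onto a JavaScript regular expression with the `u` flag. The following sketch shows the underlying regex behavior the changelog relies on. It uses plain JavaScript regexes, not Peggy-generated code, and the variable names are illustrative only.

```javascript
// Plain JavaScript regular expressions, NOT code generated by Peggy.
const flex = "\u{1F4AA}";  // 💪, stored in UTF-16 as \uD83D \uDCAA
const sloth = "\u{1F9A5}"; // 🦥, also a surrogate pair

// Without "u", a negated class consumes one UTF-16 code unit.
const codeUnit = /[^a]/.exec(flex)[0];
// With "u", the same class consumes the whole codepoint.
const codePoint = /[^a]/u.exec(flex)[0];
console.log(codeUnit === "\uD83D"); // true
console.log(codePoint === flex);    // true

// \u{hex} inside a class needs the "u" flag to mean a codepoint.
const braceEscape = /^[\u{1F4AA}]$/u.test(flex);
console.log(braceEscape); // true

// \p{} and \P{} match a Unicode property or its complement.
const asciiA = /^[\p{ASCII}]$/u.test("a");
const nonAsciiSloth = /^[\P{ASCII}]$/u.test(sloth);
console.log(asciiA, nonAsciiSloth); // true true

// An empty inverted class matches "anything": one code unit without "u",
// one full codepoint with "u".
const emptyNoU = /^[^]$/.test(flex);
const emptyU = /^[^]$/u.test(flex);
console.log(emptyNoU, emptyU); // false true
```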

### Bug fixes

3 changes: 3 additions & 0 deletions docs/css/documentation.css
@@ -17,3 +17,6 @@
.error {
color: red;
}
.dim {
color: #999999;
}
90 changes: 73 additions & 17 deletions docs/documentation.html
@@ -857,7 +857,7 @@ <h3 id="grammar-syntax-and-semantics-parsing-expression-types">Parsing Expressio
<dt><code>.</code> (U+002E: FULL STOP, or "period")</dt>

<dd>
<p>Match exactly one character and return it as a string.</p>
<p>Match exactly one JavaScript character (UTF-16 code unit) and return it as a string.</p>
<div class="example">
<div>
<div><em>Example:</em> <code>any = .</code></div>
@@ -906,16 +906,29 @@ <h3 id="grammar-syntax-and-semantics-parsing-expression-types">Parsing Expressio
</div>
</dd>

<dt><code>[<em>characters</em>]</code></dt>
<dt><code>[<span class="dim">^</span><em>characters</em>]<span class="dim">iu</span></code></dt>

<dd>
<p>Match one character from a set and return it as a string. The characters
in the list can be escaped in exactly the same way as in JavaScript string.
<p>Match one character from a character class and return it as a string. The characters
in the list can be escaped in exactly the same way as in a JavaScript string, using
<code>\uXXXX</code> or <code>\u{XXXX}</code>.
The list of characters can also contain ranges (e.g. <code>[a-z]</code>
means “all lowercase letters”). Preceding the characters with <code>^</code>
inverts the matched set (e.g. <code>[^a-z]</code> means “all character but
lowercase letters”). Appending <code>i</code> right after the class makes
the match case-insensitive.</p>
inverts the matched set (e.g. <code>[^a-z]</code> means “all characters except
lowercase letters”). Appending <code>i</code> after the class makes
the match case-insensitive. Appending <code>u</code> after the class forces
the class into Unicode mode, where an entire codepoint will be matched, even
if it takes up two JavaScript characters in a UTF-16 surrogate pair. If
any of the characters in the class are outside the range 0x0-0xFFFF (the
Basic Multilingual Plane: BMP), the class is automatically forced into
Unicode mode even if the "u" flag is not specified. Note: the Unicode mode
generates a JavaScript regular expression with the "u" flag set.</p>
<p>The list of characters may also contain the special escape sequences
<code>\p{}</code> or <code>\P{}</code>. These escape sequences
are used to match Unicode properties. See
<a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Unicode_character_class_escape">MDN: Unicode character class escape</a>
for more information. When one or more of these escapes are included,
the class is automatically put in Unicode mode.</p>

<div class="example">
<div>
@@ -942,6 +955,54 @@ <h3 id="grammar-syntax-and-semantics-parsing-expression-types">Parsing Expressio
<div class="result"></div>
</div>
</div>

<div class="example">
<div>
<div><em>Example:</em> <code>not_class_u = [^a-z]u</code></div>
<div><em>Matches:</em> <code>"🦥"</code></div>
<div><em>Does not match:</em> <code>"f"</code>, <code>""</code></div>
</div>
<div class="try">
<em>Try it:</em>
<input type="text" value="🦥" class="exampleInput" name="not_class_u">
<div class="result"></div>
</div>
</div>

<div class="example">
<div>
<div><em>Example:</em> <code>class_p = [\p{ASCII}]</code></div>
<div><em>Matches:</em> <code>"a"</code>, <code>"_"</code></div>
<div><em>Does not match:</em> <code>"ø"</code>, <code>"🦥"</code></div>
</div>
<div class="try">
<em>Try it:</em>
<input type="text" value="a" class="exampleInput" name="class_p">
<div class="result"></div>
</div>
</div>

<div class="example">
<div>
<div><em>Example:</em> <code>class_P = [\P{ASCII}]</code></div>
<div><em>Matches:</em> <code>"ø"</code>, <code>"🦥"</code></div>
<div><em>Does not match:</em> <code>"a"</code>, <code>"_"</code></div>
</div>
<div class="try">
<em>Try it:</em>
<input type="text" value="ø" class="exampleInput" name="class_P">
<div class="result"></div>
</div>
</div>
</dd>

<dt><code>[^]<span class="dim">u</span></code> (not-nothing)</dt>
<dd>
<p>This is a special case of a character class, which is defined to match
exactly one character. If the "u" flag is not specified, this is the same as the
<code>.</code> expression. If the "u" flag is specified, this matches a
whole Unicode codepoint, which may be one or two JavaScript characters
(UTF-16 code units).</p>
</dd>

<dt><code><em>rule</em></code></dt>
@@ -1718,17 +1779,12 @@ <h2 id="locations">Locations</h2>
<p>All of the notes about values for <code>location()</code> object are also
applicable to the <code>range()</code> and <code>offset()</code> calls.</p>

<p>Currently, Peggy grammars may only contain codepoints from the
<p>Peggy grammars work one UTF-16 code unit at a time, except for string
literals containing characters from outside the
<a href="https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane">Basic
Multilingual Plane (BMP)</a> of Unicode.
This means that all offsets are measured in UTF-16 code units. If you
include characters outside this Plane (for example, emoji, or any
surrogate pairs), you may get an offset inside a code point.</p>

<p>Changing this behavior might be a breaking change, so it will likely cause
a major version number increase if it happens. You can join to the discussion
for this topic on the <a href="https://github.com/peggyjs/peggy/discussions/15">GitHub Discussions
page</a>.</p>
Multilingual Plane (BMP)</a> of Unicode or character classes in Unicode mode.
All offsets are measured in UTF-16 code units (JavaScript characters). It
is possible to get an offset in the middle of a UTF-16 surrogate pair.</p>
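
To make the point about offsets concrete, here is a small sketch in plain JavaScript. The helper `isInsideSurrogatePair` is hypothetical and not part of Peggy's API; it only illustrates how a consumer of `location()`, `range()`, or `offset()` values might detect an offset that falls between the two halves of a surrogate pair.

```javascript
// Hypothetical helper, NOT part of Peggy: reports whether a UTF-16 offset
// lands between the high and low halves of a surrogate pair.
function isInsideSurrogatePair(text, offset) {
  if (offset <= 0 || offset >= text.length) {
    return false;
  }
  const prev = text.charCodeAt(offset - 1);
  const next = text.charCodeAt(offset);
  const prevIsHigh = prev >= 0xd800 && prev <= 0xdbff;
  const nextIsLow = next >= 0xdc00 && next <= 0xdfff;
  return prevIsHigh && nextIsLow;
}

const input = "a\u{1F9A5}b"; // "a🦥b": 🦥 (U+1F9A5) takes two UTF-16 code units

console.log(input.length);      // 4 (UTF-16 code units)
console.log([...input].length); // 3 (codepoints)

console.log(isInsideSurrogatePair(input, 1)); // false: start of the pair
console.log(isInsideSurrogatePair(input, 2)); // true: middle of the pair
console.log(isInsideSurrogatePair(input, 3)); // false: just after the pair
```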

<h2 id="plugins-api">Plugins API</h2>
