peggyjs · hildjj · Apr 23, 2025 · Apr 11, 2025 · Apr 17, 2025 · Apr 17, 2025
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -26,6 +26,13 @@ Released: TBD (Not before 2025-05-01)
   the type has changed to `PeggySyntaxError`, which may cause some slight need
   for rework in TypeScript-aware projects.  This was the main driver behind
   moving away from ES5. [#593](https://github.com/peggyjs/peggy/pull/593)
+- BREAKING: The grammar parser now uses your JavaScript environment's understanding
+  of Unicode classes, rather than a partial copy of Unicode 8 as before.  This
+  should be more correct and evolve over time while staying backward-compatible
+  to the extent that the Unicode Consortium keeps to its goals.  Because this
+  might slightly affect what rule names are valid, we are marking this as a
+  breaking change just in case.
+  [#602](https://github.com/peggyjs/peggy/pull/602)
 
 ### New features
 - Extend library mode to include a success flag and a function for throwing syntax errors when needed.
@@ -47,6 +54,39 @@ Released: TBD (Not before 2025-05-01)
   generated for each of these removals.  Like merged class rules above, this
   should only be removing dead code.
   [#594](https://github.com/peggyjs/peggy/pull/594)
+- Character classes now correctly process characters outside the Basic
+  Multilingual Plane (BMP).  This feature requires a JavaScript environment
+  that supports the `u` flag on regular expressions.  The `u` flag will only
+  be used on character classes that make use of this new feature.
+  [#602](https://github.com/peggyjs/peggy/pull/602)
+- Unicode characters may now be specified with the `\u{hex}` syntax, allowing
+  easier inclusion of characters not in the BMP (such as newer emoji).  This
+  syntax works both in string literals and in character classes.
+  [#602](https://github.com/peggyjs/peggy/pull/602)
+- Errors pointing to non-BMP characters as the "found" text will now show the
+  full character and not the replacement character for the first surrogate in
+  the UTF-16 representation.
+  [#602](https://github.com/peggyjs/peggy/pull/602)
+- Character classes can now be annotated with the "u" flag, which will force
+  the character class into Unicode mode, where one full codepoint will be matched.
+  For example, `[^a]u` will match 💪 (U+1F4AA).  Without the "u" flag, `[^a]`
+  would only match `\uD83D`, the first surrogate that makes up U+1F4AA in UTF-16
+  encoding.  [#602](https://github.com/peggyjs/peggy/pull/602)
+- Empty inverted character classes such as `[^]u` now match one character,
+  because they match "not-nothing". Without the "u" flag, this is the same as
+  `.`.  With the "u" flag, this matches an entire codepoint, not just a single
+  UTF-16 code unit (JS character).  Previously, this expression compiled but
+  was useless.
+  [#602](https://github.com/peggyjs/peggy/pull/602)
+- String literals may now contain characters from outside the BMP.
+  [#602](https://github.com/peggyjs/peggy/pull/602)
+- Character classes may now contain `\p{}` or `\P{}` escapes, which match
+  characters that have or lack a given Unicode property, respectively.  See
+  [MDN](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Unicode_character_class_escape)
+  for more details.  If you are generating code for a non-JavaScript environment
+  using a plugin, this may be somewhat challenging for the plugin author.
+  Please file an issue on Peggy for help.
+  [#602](https://github.com/peggyjs/peggy/pull/602)
 
 ### Bug fixes
 

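The changelog entries above describe how Unicode mode changes what a character class matches. The sketch below illustrates that distinction using plain JavaScript regular expressions only. It is not Peggy's generated code, and it assumes (per the entries above) that a class in Unicode mode is backed by a regular expression with the `u` flag. The variable names are made up for illustration.

```javascript
// Plain JavaScript regular expressions, used only to illustrate the
// code unit vs. codepoint distinction. This is NOT Peggy's generated output.

// U+1F4AA (💪) is stored in a JS string as the surrogate pair \uD83D\uDCAA.
const flex = "\u{1F4AA}";

// Without the `u` flag, a negated class consumes one UTF-16 code unit.
const withoutU = /^[^a]/.exec(flex)[0];

// With the `u` flag, the same class consumes the whole codepoint.
const withU = /^[^a]/u.exec(flex)[0];

console.log(withoutU.length, withoutU === "\uD83D");
console.log(withU.length, withU === flex);

// Unicode property escapes are only valid with the `u` flag (or `v`).
const ascii = /^\p{ASCII}$/u;
const nonAscii = /^\P{ASCII}$/u;

console.log(ascii.test("a"), ascii.test("_"), ascii.test("ø"));
console.log(nonAscii.test("ø"), nonAscii.test("🦥"), nonAscii.test("a"));
```

The same input therefore yields a match of length 1 or 2 depending only on the flag, which is why a class containing non-BMP characters or property escapes has to be in Unicode mode.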
diff --git a/docs/css/documentation.css b/docs/css/documentation.css
@@ -17,3 +17,6 @@
 .error {
   color: red;
 }
+.dim {
+  color: #999999;
+}
diff --git a/docs/documentation.html b/docs/documentation.html
@@ -857,7 +857,7 @@ <h3 id="grammar-syntax-and-semantics-parsing-expression-types">Parsing Expressio
   <dt><code>.</code> (U+002E: FULL STOP, or "period")</dt>
 
   <dd>
-    <p>Match exactly one character and return it as a string.</p>
+    <p>Match exactly one JavaScript character (UTF-16 code unit) and return it as a string.</p>
     <div class="example">
       <div>
         <div><em>Example:</em> <code>any = .</code></div>
@@ -906,16 +906,29 @@ <h3 id="grammar-syntax-and-semantics-parsing-expression-types">Parsing Expressio
     </div>
   </dd>
 
-  <dt><code>[<em>characters</em>]</code></dt>
+  <dt><code>[<span class="dim">^</span><em>characters</em>]<span class="dim">iu</span></code></dt>
 
   <dd>
-    <p>Match one character from a set and return it as a string. The characters
-      in the list can be escaped in exactly the same way as in JavaScript string.
+    <p>Match one character from a character class and return it as a string. The characters
+      in the list can be escaped in exactly the same way as in a JavaScript string, including
+      <code>\uXXXX</code> and <code>\u{XXXX}</code>.
       The list of characters can also contain ranges (e.g. <code>[a-z]</code>
       means “all lowercase letters”). Preceding the characters with <code>^</code>
-      inverts the matched set (e.g. <code>[^a-z]</code> means “all character but
-      lowercase letters”). Appending <code>i</code> right after the class makes
-      the match case-insensitive.</p>
+      inverts the matched set (e.g. <code>[^a-z]</code> means “all characters except
+      lowercase letters”). Appending <code>i</code> after the class makes
+      the match case-insensitive.  Appending <code>u</code> after the class forces
+      the class into Unicode mode, where an entire codepoint will be matched, even
+      if it takes up two JavaScript characters in a UTF-16 surrogate pair.  If
+      any of the characters in the class are outside the range 0x0-0xFFFF (the
+      Basic Multilingual Plane: BMP), the class is automatically forced into
+      Unicode mode even if the "u" flag is not specified.  Note: Unicode mode
+      generates a JavaScript regular expression with the "u" flag set.</p>
+    <p>The list of characters may also contain the special escape sequences
+      <code>\p{}</code> or <code>\P{}</code>.  These escape sequences
+      are used to match Unicode properties.  See
+      <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Unicode_character_class_escape">MDN: Unicode character class escape</a>
+      for more information.  When one or more of these escapes are included,
+      the class is automatically put in Unicode mode.</p>
 
     <div class="example">
       <div>
@@ -942,6 +955,54 @@ <h3 id="grammar-syntax-and-semantics-parsing-expression-types">Parsing Expressio
         <div class="result"></div>
       </div>
     </div>
+
+    <div class="example">
+      <div>
+        <div><em>Example:</em> <code>not_class_u = [^a-z]u</code></div>
+        <div><em>Matches:</em> <code>"🦥"</code></div>
+        <div><em>Does not match:</em> <code>"f"</code>, <code>""</code></div>
+      </div>
+      <div class="try">
+        <em>Try it:</em>
+        <input type="text" value="🦥" class="exampleInput" name="not_class_u">
+        <div class="result"></div>
+      </div>
+    </div>
+
+    <div class="example">
+      <div>
+        <div><em>Example:</em> <code>class_p = [\p{ASCII}]</code></div>
+        <div><em>Matches:</em> <code>"a"</code>, <code>"_"</code></div>
+        <div><em>Does not match:</em> <code>"ø"</code>, <code>"🦥"</code></div>
+      </div>
+      <div class="try">
+        <em>Try it:</em>
+        <input type="text" value="a" class="exampleInput" name="class_p">
+        <div class="result"></div>
+      </div>
+    </div>
+
+    <div class="example">
+      <div>
+        <div><em>Example:</em> <code>class_P = [\P{ASCII}]</code></div>
+        <div><em>Matches:</em> <code>"ø"</code>, <code>"🦥"</code></div>
+        <div><em>Does not match:</em> <code>"a"</code>, <code>"_"</code></div>
+      </div>
+      <div class="try">
+        <em>Try it:</em>
+        <input type="text" value="ø" class="exampleInput" name="class_P">
+        <div class="result"></div>
+      </div>
+    </div>
+  </dd>
+
+  <dt><code>[^]<span class="dim">u</span></code> (not-nothing)</dt>
+  <dd>
+    <p>This is a special case of a character class, which is defined to match
+      exactly one character.  If the "u" flag is not specified, this is the same as the
+      <code>.</code> expression.  If the "u" flag is specified, this matches a
+      whole Unicode codepoint, which may be one or two JavaScript characters
+      (UTF-16 code units).</p>
   </dd>
 
   <dt><code><em>rule</em></code></dt>
@@ -1718,17 +1779,12 @@ <h2 id="locations">Locations</h2>
 <p>All of the notes about values for <code>location()</code> object are also
   applicable to the <code>range()</code> and <code>offset()</code> calls.</p>
 
-<p>Currently, Peggy grammars may only contain codepoints from the
+<p>Peggy grammars work one UTF-16 code unit at a time, except for string
+  literals containing characters from outside the
   <a href="https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane">Basic
-  Multilingual Plane (BMP)</a> of Unicode.
-  This means that all offsets are measured in UTF-16 code units. If you
-  include characters outside this Plane (for example, emoji, or any
-  surrogate pairs), you may get an offset inside a code point.</p>
-
-<p>Changing this behavior might be a breaking change, so it will likely cause
-  a major version number increase if it happens. You can join to the discussion
-  for this topic on the <a href="https://github.com/peggyjs/peggy/discussions/15">GitHub Discussions
-    page</a>.</p>
+  Multilingual Plane (BMP)</a> of Unicode, and except for character classes in Unicode mode.
+  All offsets are measured in UTF-16 code units (JavaScript characters).  It
+  is possible to get an offset in the middle of a UTF-16 surrogate pair.</p>
 
 <h2 id="plugins-api">Plugins API</h2>