Add unicode support to classes and extended unicode escapes by hildjj · Pull Request #602 · peggyjs/peggy

hildjj · 2025-04-16T19:07:28Z

Fixes #462

Still needs work to add #375.

hildjj · 2025-04-16T19:08:11Z

@frostburn please give this a try?

hildjj · 2025-04-16T19:08:34Z

@Mingun this adds opcode 42 (I can change that to another number if you prefer)

Mingun · 2025-04-16T19:25:17Z

You may reuse it. In any case, future PRs will need a full review of my existing code.

frostburn · 2025-04-16T19:35:34Z

> @frostburn please give this a try?

$ npx peggy sharp.peggy -t '𝄪'
'𝄪'

#462 Works as desired now. 👍

hildjj · 2025-04-17T14:20:37Z

This is now ready for review.

hildjj

Some notes for other reviewers.

hildjj · 2025-04-18T16:16:28Z

CHANGELOG.md

  the type has changed to `PeggySyntaxError`, which may cause some slight need
  for rework in TypeScript-aware projects.  This was the main driver behind
  moving away from ES5. [#593](https://github.com/peggyjs/peggy/pull/593)
+- BREAKING: The grammar parser now uses your JavaScript environment's understanding


I don't expect this to be a problem in practice. The previous code tried to implement ID_Start and ID_Continue, restricted to the BMP. Everything that used to be a valid identifier should still work in any JS runtime recent enough to support ES2020.

There are 93778 characters outside the BMP that are valid ID_Start or ID_Continue, and they're now valid Peggy identifiers. Note that none of them are emoji, in case that worries you.
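To get a feel for what deferring to the runtime means, here is a minimal sketch that checks identifier-like strings with the engine's own `\p{ID_Start}` and `\p{ID_Continue}` property escapes. `looksLikeIdentifier` is a hypothetical helper for illustration only; it is not Peggy's actual identifier rule, which may accept additional characters.

```javascript
// Hypothetical helper, not Peggy's grammar: the first code point must be
// ID_Start and the rest ID_Continue, as defined by the running JS engine.
const IDENTIFIER = /^\p{ID_Start}\p{ID_Continue}*$/u;

function looksLikeIdentifier(text) {
  return IDENTIFIER.test(text);
}

console.log(looksLikeIdentifier("rule1"));         // true: BMP-only, as before
console.log(looksLikeIdentifier("\u{1D49C}lpha")); // true: starts with U+1D49C, a non-BMP letter
console.log(looksLikeIdentifier("\u{1F600}"));     // false: an emoji is not ID_Start
```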

hildjj · 2025-04-18T16:17:31Z

CHANGELOG.md

  generated for each of these removals.  Like merged class rules above, this
  should only be removing dead code.
  [#594](https://github.com/peggyjs/peggy/pull/594)
+- Character classes now process characters not in the Basic Multi-Lingual


Note that using a character not in the BMP automatically puts the character class in Unicode mode.
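A rough analogy in plain JavaScript shows why a non-BMP character needs this mode. The sketch uses ordinary `RegExp` objects rather than Peggy itself, and `hasNonBmp` is a hypothetical helper, not a function from the Peggy code base.

```javascript
const sharp = "\u{1D12A}"; // U+1D12A MUSICAL SYMBOL DOUBLE SHARP, outside the BMP

// One code point, but two UTF-16 code units.
console.log(sharp.length);      // 2
console.log([...sharp].length); // 1

// Hypothetical check: does the text contain any code point above U+FFFF?
function hasNonBmp(text) {
  for (const ch of text) { // string iteration walks code points, not code units
    if (ch.codePointAt(0) > 0xFFFF) return true;
  }
  return false;
}

// Without "u", the class contains two separate surrogates and matches one code unit.
const unitClass = new RegExp("^[" + sharp + "]$");
// With "u", the class contains a single code point.
const codePointClass = new RegExp("^[" + sharp + "]$", "u");

console.log(unitClass.test(sharp));      // false
console.log(codePointClass.test(sharp)); // true
```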

hildjj · 2025-04-18T16:18:31Z

CHANGELOG.md

+  [#602](https://github.com/peggyjs/peggy/pull/602)
+- Errors pointing to non-BMP characters as the "found" text will now show the
+  full character and not the replacement character for the first surrogate in
+  the UTF-16 representation.


This holds as long as the error doesn't point to the middle of a surrogate pair. In that case you still see the replacement character, which is probably correct.
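The difference between the two positions can be seen with plain string methods. This is only an illustration of the underlying JavaScript behavior, not code from the generated parser.

```javascript
const input = "a\u{1D12A}"; // "a" followed by one non-BMP character (two code units)

// Position 1 is the start of the surrogate pair: the whole character comes back.
const whole = String.fromCodePoint(input.codePointAt(1));
console.log(whole === "\u{1D12A}"); // true

// charAt at the same position returns only the high surrogate.
console.log(input.charAt(1) === "\uD834"); // true

// Position 2 is the middle of the pair: only the low surrogate is left,
// which most terminals render as the replacement character.
const half = String.fromCodePoint(input.codePointAt(2));
console.log(half === "\uDD2A"); // true
```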

hildjj · 2025-04-18T16:19:20Z

CHANGELOG.md

+  the UTF-16 representation.
+  [#602](https://github.com/peggyjs/peggy/pull/602)
+- Character classes can now be annotated with the "u" flag, which will force
+  the character class into Unicode mode, where one full Codepoint will be matched.


Move this to the beginning, then refer to Unicode mode throughout.

hildjj · 2025-04-18T16:20:02Z

CHANGELOG.md

+  UTF-16 code unit (JS character).  Previously, this expression compiled but
+  was useless.
+  [#602](https://github.com/peggyjs/peggy/pull/602)
+- String literals may now contain characters from outside the BMP.


This is mentioned above.

hildjj · 2025-04-18T16:31:15Z

lib/compiler/passes/generate-js.js

      "        );",
      "",
-      "        return \"[\" + (expectation.inverted ? \"^\" : \"\") + escapedParts.join(\"\") + \"]\";",
+      "        return \"[\" + (expectation.inverted ? \"^\" : \"\") + escapedParts.join(\"\") + \"]\" + (expectation.unicode ? \"u\" : \"\");",


This will output expectations like "expected [^a-z]u", which might not be enough for casual users. Watch for confusion in the userbase and find something better if needed.
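For reference, the shape of that output can be reproduced with a small stand-in. `describeClass` below is a hypothetical, simplified version of the generated describer: it assumes parts are either a single character or a `[start, end]` pair and skips the escaping the real code performs.

```javascript
// Hypothetical, simplified stand-in for the generated class describer.
function describeClass(expectation) {
  const parts = expectation.parts.map(
    part => (Array.isArray(part) ? part[0] + "-" + part[1] : part)
  );
  return "[" + (expectation.inverted ? "^" : "")
    + parts.join("")
    + "]" + (expectation.unicode ? "u" : "");
}

console.log(describeClass({ parts: [["a", "z"]], inverted: true, unicode: true }));
// [^a-z]u
console.log(describeClass({ parts: ["x", ["0", "9"]], inverted: false, unicode: false }));
// [x0-9]
```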

hildjj · 2025-04-18T16:31:35Z

lib/compiler/passes/generate-js.js

+      "function peg$getUnicode(pos = peg$currPos) {",
+      "  const cp = input.codePointAt(pos);",
+      "  if (cp === undefined) {",
+      "    return \"\";",


This is possible at the end of the string: `codePointAt` returns `undefined` when `pos` is at or past the end of the input.
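A standalone sketch of the helper makes the end-of-input case easy to try. The diff only shows the first few lines of `peg$getUnicode`, so the final `return` here is an assumption, and `getUnicode` takes `input` as a parameter instead of closing over it.

```javascript
// Standalone sketch; the generated helper closes over `input` and defaults
// `pos` to `peg$currPos` rather than taking both as arguments.
function getUnicode(input, pos) {
  const cp = input.codePointAt(pos);
  if (cp === undefined) {
    return ""; // pos is at or beyond input.length
  }
  // Assumption: the part of the function not shown in the diff does this.
  return String.fromCodePoint(cp);
}

console.log(getUnicode("ab", 1));                 // b
console.log(JSON.stringify(getUnicode("ab", 2))); // ""
```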

hildjj · 2025-04-18T16:31:57Z

lib/compiler/passes/generate-js.js

      "    throw peg$buildStructuredError(",
      "      peg$maxFailExpected,",
-      "      peg$maxFailPos < input.length ? input.charAt(peg$maxFailPos) : null,",
+      "      peg$maxFailPos < input.length ? peg$getUnicode(peg$maxFailPos) : null,",


Always get a full character (code point), not just the first UTF-16 code unit.

hildjj · 2025-04-18T16:34:19Z

lib/compiler/passes/merge-character-classes.js

-      parts.splice(i--, 1);
-      parts[i] = [prevStart, prevEnd = curEnd];
-      continue;
+    if ((typeof curStart === "string") && (typeof curEnd === "string")


Parts may now be `{type: "classEscape"}` objects, which never combine with neighboring parts.
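The idea can be sketched as follows. This is not the code from `merge-character-classes.js`: `mergeRanges` is a hypothetical helper that assumes every string part has already been normalized to a `[start, end]` pair of single characters, and the `value` field on the escape object is made up for the example.

```javascript
// Hypothetical, simplified model: a part is either a [start, end] pair of
// single-character strings or an opaque object like { type: "classEscape" }.
function mergeRanges(parts) {
  const ranges = [];
  const others = [];
  for (const part of parts) {
    if (Array.isArray(part)) {
      ranges.push([part[0].codePointAt(0), part[1].codePointAt(0)]);
    } else {
      others.push(part); // class escapes are kept untouched, never merged
    }
  }
  ranges.sort((a, b) => a[0] - b[0]);

  const merged = [];
  for (const [start, end] of ranges) {
    const prev = merged[merged.length - 1];
    if (prev && start <= prev[1] + 1) {
      prev[1] = Math.max(prev[1], end); // overlapping or adjacent: extend
    } else {
      merged.push([start, end]);
    }
  }
  return [
    ...merged.map(([s, e]) => [String.fromCodePoint(s), String.fromCodePoint(e)]),
    ...others,
  ];
}

const result = mergeRanges([
  ["a", "c"],
  { type: "classEscape", value: "ASCII" }, // `value` is an illustrative field name
  ["d", "f"],
]);
console.log(JSON.stringify(result));
// [["a","f"],{"type":"classEscape","value":"ASCII"}]
```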

hildjj · 2025-04-18T16:45:05Z

lib/compiler/passes/merge-character-classes.js


 /** @type {PEG.compiler.visitor} */
 const visitor = require("../visitor");
+const { codePointLen1 } = require("../utils");


Check this whole file to see if I missed a case.

hildjj mentioned this pull request Apr 17, 2025

Allow use of Unicode property escapes to match a character #375

Closed

hildjj marked this pull request as ready for review April 17, 2025 14:20

hildjj mentioned this pull request Apr 17, 2025

Relatively small Unicode wins #290

Closed

hildjj force-pushed the unicode-classes branch 2 times, most recently from 52c21c9 to 1ce98c2 Compare April 17, 2025 20:41

hildjj mentioned this pull request Apr 18, 2025

Soft mode #502

Merged

hildjj added 7 commits April 22, 2025 17:56

Add unicode support to classes and extended unicode escapes

8e4981a

Add explicitly-unicode classes.

4a736ea

This allows Peggy to match on full codepoints without sacrificing backward-compatibility.

Add support for \p{} Unicode class escapes

49c188f

You may now specify a class like `[\p{ASCII}]`. Fixes peggyjs#375.
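JavaScript regular expressions accept the same escape syntax, which gives a quick way to explore what a property matches. The sketch below uses plain `RegExp` only; which properties Peggy accepts inside a class is not stated here, so treat the second pattern as a JavaScript example rather than a Peggy one.

```javascript
// JS regex analogy for the class [\p{ASCII}]; the "u" flag is required for \p{}.
const ascii = /^[\p{ASCII}]$/u;
console.log(ascii.test("a"));      // true
console.log(ascii.test("\u00E9")); // false: U+00E9 is outside ASCII

// In a JS regex, property escapes combine with ordinary class parts.
const greekOrDigit = /^[\p{Script=Greek}0-9]$/u;
console.log(greekOrDigit.test("\u03BB")); // true: U+03BB GREEK SMALL LETTER LAMDA
console.log(greekOrDigit.test("7"));      // true
console.log(greekOrDigit.test("a"));      // false
```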

BREAKING: move away from copies of Unicode classes to your environment's definition.

580d2fc

Add not-nothing.

717f186

`[^]` is the same as `.`. `[^]u` is similar, but matches an entire codepoint.
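JavaScript regular expressions behave the same way with and without the `u` flag, so the distinction can be checked without building a grammar. This is an analogy in plain `RegExp`, not Peggy output.

```javascript
const text = "\u{1D12A}!"; // one non-BMP character followed by "!"

// Without "u", [^] consumes a single UTF-16 code unit (half of the pair).
const unit = text.match(/^[^]/)[0];
console.log(unit.length); // 1

// With "u", [^] consumes the whole code point.
const codePoint = text.match(/^[^]/u)[0];
console.log(codePoint.length);          // 2
console.log(codePoint === "\u{1D12A}"); // true
```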

lint

a00dd99

Small refactor and some comments.

df4df94

hildjj force-pushed the unicode-classes branch from 438e187 to df4df94 Compare April 22, 2025 23:57

hildjj merged commit 2434f35 into peggyjs:main Apr 23, 2025
10 checks passed

hildjj deleted the unicode-classes branch April 23, 2025 18:22

hildjj mentioned this pull request Apr 23, 2025

Warn if an always-match clause is not the last alternative #576

Closed
