Add unicode support to classes and extended unicode escapes #602
Changes from all commits
8e4981a
4a736ea
49c188f
580d2fc
717f186
a00dd99
df4df94
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -26,6 +26,13 @@ Released: TBD (Not before 2025-05-01) | |
| the type has changed to `PeggySyntaxError`, which may cause some slight need | ||
| for rework in TypeScript-aware projects. This was the main driver behind | ||
| moving away from ES5. [#593](https://github.com/peggyjs/peggy/pull/593) | ||
| - BREAKING: The grammar parser now uses your JavaScript environment's understanding | ||
| of Unicode classes, rather than a partial copy of Unicode 8 as before. This | ||
| should be more correct and evolve over time while staying backward-compatible | ||
| to the extent that the Unicode Consortium keeps to its goals. Because this | ||
| might slightly affect what rule names are valid, we are marking this as a | ||
| breaking change just in case. | ||
| [#602](https://github.com/peggyjs/peggy/pull/602) | ||
|
|
||
| ### New features | ||
| - Extend library mode to include a success flag and a function for throwing syntax errors when needed. | ||
|
|
@@ -47,6 +54,39 @@ Released: TBD (Not before 2025-05-01) | |
| generated for each of these removals. Like merged class rules above, this | ||
| should only be removing dead code. | ||
| [#594](https://github.com/peggyjs/peggy/pull/594) | ||
| - Character classes now process characters not in the Basic Multi-Lingual | ||
|
Review comment (Contributor, Author): Note that using a character not in the BMP automatically puts the character class in Unicode mode. |
||
| Plane (BMP) correctly. This feature requires a JavaScript environment | ||
| that supports the `u` flag to regular expressions. The `u` flag will only | ||
| be used on character classes that make use of this new feature. | ||
| [#602](https://github.com/peggyjs/peggy/pull/602) | ||
| - Unicode characters may now be specified with the `\u{hex}` syntax, allowing | ||
| easier inclusion of characters not in the BMP (such as newer emoji). This | ||
| syntax works both in string literals and in character classes. | ||
| [#602](https://github.com/peggyjs/peggy/pull/602) | ||
| - Errors pointing to non-BMP characters as the "found" text will now show the | ||
| full character and not the replacement character for the first surrogate in | ||
| the UTF-16 representation. | ||
|
Review comment (Contributor, Author): This holds as long as the error doesn't point to the middle of a surrogate pair. In that case you still see the replacement character, and that's probably correct. |
||
| [#602](https://github.com/peggyjs/peggy/pull/602) | ||
| - Character classes can now be annotated with the "u" flag, which will force | ||
| the character class into Unicode mode, where one full Codepoint will be matched. | ||
|
Review comment (Contributor, Author): Move this to the beginning, then refer to Unicode mode throughout. |
||
| For example, `[^a]u` will match 💪 (U+1F4AA). Without the "u" flag, `[^a]` | ||
| would only match \uD83D, the first surrogate that makes up U+1F4AA in UTF-16 | ||
| encoding. [#602](https://github.com/peggyjs/peggy/pull/602) | ||
| - Empty inverted character classes such as `[^]u` now match one character, | ||
| because they match "not-nothing". Without the "u" flag, this is the same as | ||
| `.`. With the "u" flag, this matches an entire codepoint, not just a single | ||
| UTF-16 code unit (JS character). Previously, this expression compiled but | ||
| was useless. | ||
| [#602](https://github.com/peggyjs/peggy/pull/602) | ||
| - String literals may now contain characters from outside the BMP. | ||
|
Review comment (Contributor, Author): This is mentioned above. |
||
| [#602](https://github.com/peggyjs/peggy/pull/602) | ||
| - Character classes may now contain `\p{}` or `\P{}` escapes to match or | ||
| inverted-match Unicode properties. See | ||
| [MDN](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Unicode_character_class_escape) | ||
| for more details. If you are generating code for a non-JavaScript environment | ||
| using a plugin, this may be somewhat challenging for the plugin author. | ||
| Please file an issue on Peggy for help. | ||
| [#602](https://github.com/peggyjs/peggy/pull/602) | ||
|
|
||
| ### Bug fixes | ||
|
|
||
|
|
||
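The changelog entries above describe Unicode mode in terms of the JavaScript `u` flag. As a rough illustration, the following sketch uses plain JavaScript `RegExp` objects, not Peggy's generated parser code, to show the difference between matching a UTF-16 code unit and matching a full code point. The helper name `matchAtStart` is hypothetical, and the assumption that a Peggy class like `[^a]u` behaves like `/[^a]/u` is taken from the changelog text rather than from Peggy's source.

```javascript
// Plain JavaScript RegExp sketch; this is NOT Peggy's generated code.
// U+1F4AA (the example from the changelog) is two UTF-16 code units.
const flex = "\u{1F4AA}";

// Hypothetical helper: return the text matched at offset 0, or null.
function matchAtStart(re, input) {
  const m = re.exec(input);
  return m !== null && m.index === 0 ? m[0] : null;
}

// Without the u flag, the class consumes a single code unit,
// which here is the first surrogate of the pair.
const withoutU = matchAtStart(/[^a]/, flex);

// With the u flag, the class consumes the whole code point.
const withU = matchAtStart(/[^a]/u, flex);

// An empty inverted class matches "not-nothing": any one code point in u mode.
const emptyInverted = matchAtStart(/[^]/u, flex);

// Property escapes inside a class need the u flag.
// U+1D400 (MATHEMATICAL BOLD CAPITAL A) is an uppercase letter outside the BMP.
const upperMath = /^[\p{Lu}]$/u.test("\u{1D400}");

console.log(withoutU.length, withU.length, emptyInverted.length, upperMath);
```

In a Peggy grammar, the corresponding expressions would presumably be written as `[^a]`, `[^a]u`, `[^]u`, and `[\p{Lu}]`, per the entries above.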
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -17,3 +17,6 @@ | |
| .error { | ||
| color: red; | ||
| } | ||
| .dim { | ||
| color: #999999; | ||
| } | ||
Review comment:
I don't expect this to be a problem in practice. The previous code was trying to do ID_Start and ID_Continue, kept to the BMP. All of the things that used to be identifiers should still work, in any JS runtime recent enough to support ES2020.
There are 93778 characters outside the BMP that are valid ID_Start or ID_Continue, and they're now valid in Peggy identifiers. But note that none of them are emoji, if you're worried about that.
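To make the ID_Start / ID_Continue point above concrete, here is a minimal sketch of an identifier check built on the JavaScript runtime's own Unicode property data. The function name `looksLikeIdentifier` and the exact pattern (ID_Start or underscore first, then ID_Continue) are assumptions for illustration; Peggy's actual identifier rule may allow a different set of characters.

```javascript
// Hypothetical identifier check using the runtime's Unicode tables.
// Assumption: first character is ID_Start or "_", the rest are ID_Continue.
// This is not copied from Peggy's grammar.
const IDENTIFIER = /^[\p{ID_Start}_][\p{ID_Continue}]*$/u;

function looksLikeIdentifier(name) {
  return IDENTIFIER.test(name);
}

// BMP identifiers behave as before.
console.log(looksLikeIdentifier("rule1"));        // ASCII letters and digit
console.log(looksLikeIdentifier("1rule"));        // starts with a digit

// U+10000 (LINEAR B SYLLABLE B008 A) is a letter outside the BMP.
console.log(looksLikeIdentifier("\u{10000}x"));

// U+1F4AA is an emoji, which is neither ID_Start nor ID_Continue.
console.log(looksLikeIdentifier("\u{1F4AA}"));
```

Because the pattern relies on whatever Unicode version the runtime ships, the set of accepted names can grow as runtimes update, which matches the reason the changelog marks this as a breaking change.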