Add unicode support to classes and extended unicode escapes #602
hildjj merged 7 commits into peggyjs:main
@frostburn please give this a try?
@Mingun this adds opcode 42 (I can change that to another number if you prefer)
You may reuse it. Anyway, any future PRs will need a full review of my existing code.
$ npx peggy sharp.peggy -t '𝄪'
'𝄪'
#462 Works as desired now. 👍
This is now ready for review. |
52c21c9 to 1ce98c2
hildjj left a comment
Some notes for other reviewers.
| the type has changed to `PeggySyntaxError`, which may cause some slight need | ||
| for rework in TypeScript-aware projects. This was the main driver behind | ||
| moving away from ES5. [#593](https://github.com/peggyjs/peggy/pull/593) | ||
| - BREAKING: The grammar parser now uses your JavaScript environment's understanding |
I don't expect this to be a problem in practice. The previous code tried to implement ID_Start and ID_Continue, restricted to the BMP. Everything that used to be an identifier should still work in any JS runtime recent enough to support ES2020.
There are 93778 characters outside the BMP that are valid ID_Start or ID_Continue, and they are now valid in Peggy identifiers. Note that none of them are Emoji, if you're worried about that.
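To illustrate the point about identifiers, here is a minimal sketch using JavaScript's Unicode property escapes. The regex and the `looksLikeIdentifier` helper are assumptions made for illustration, not Peggy's actual grammar rule (which may accept additional characters such as `_` or `$`).

```javascript
// Hypothetical check, not Peggy's implementation: one ID_Start code point
// followed by any number of ID_Continue code points.
const identifierLike = /^\p{ID_Start}\p{ID_Continue}*$/u;

function looksLikeIdentifier(text) {
  return identifierLike.test(text);
}

console.log(looksLikeIdentifier("rule1"));         // true: BMP-only name
console.log(looksLikeIdentifier("\u{1D400}bc"));   // true: starts with U+1D400, outside the BMP
console.log(looksLikeIdentifier("\u{1F600}"));     // false: the emoji U+1F600 is not ID_Start
```

The `u` flag is what lets the regex treat U+1D400 as a single code point rather than two surrogate code units.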
| generated for each of these removals. Like merged class rules above, this | ||
| should only be removing dead code. | ||
| [#594](https://github.com/peggyjs/peggy/pull/594) | ||
| - Character classes now process characters not in the Basic Multi-Lingual |
Note that using a character not in the BMP automatically puts the character class in Unicode mode.
| [#602](https://github.com/peggyjs/peggy/pull/602) | ||
| - Errors pointing to non-BMP characters as the "found" text will now show the | ||
| full character and not the replacement character for the first surrogate in | ||
| the UTF-16 representation. |
This holds as long as the error doesn't point to the middle of a surrogate pair. In that case you still see the replacement character, and that's probably correct.
| the UTF-16 representation. | ||
| [#602](https://github.com/peggyjs/peggy/pull/602) | ||
| - Character classes can now be annotated with the "u" flag, which will force | ||
| the character class into Unicode mode, where one full Codepoint will be matched. |
Move this to the beginning, then refer to Unicode mode throughout.
| UTF-16 code unit (JS character). Previously, this expression compiled but | ||
| was useless. | ||
| [#602](https://github.com/peggyjs/peggy/pull/602) | ||
| - String literals may now contain characters from outside the BMP. |
This is mentioned above.
| " );", | ||
| "", | ||
| " return \"[\" + (expectation.inverted ? \"^\" : \"\") + escapedParts.join(\"\") + \"]\";", | ||
| " return \"[\" + (expectation.inverted ? \"^\" : \"\") + escapedParts.join(\"\") + \"]\" + (expectation.unicode ? \"u\" : \"\");", |
This will output expectations like "expected [^a-z]u", which might not be enough for casual users. Watch for confusion in the userbase and find something better if needed.
lib/compiler/passes/generate-js.js
Outdated
| "function peg$getUnicode(pos = peg$currPos) {", | ||
| " const cp = input.codePointAt(pos);", | ||
| " if (cp === undefined) {", | ||
| " return \"\";", |
This is possible at the end of the string.
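The helper quoted above can be tried outside the generated parser. The following is a standalone sketch: the `(input, pos)` signature and the `String.fromCodePoint` return are assumptions, since the diff excerpt only shows the first lines of the function.

```javascript
// Standalone sketch; not the generated parser's internal API.
function getUnicode(input, pos) {
  const cp = input.codePointAt(pos);
  if (cp === undefined) {
    // pos is at or past the end of the input.
    return "";
  }
  // Assumed completion: convert the code point back to a string.
  return String.fromCodePoint(cp);
}

const sample = "a\u{1D12A}"; // "a" followed by U+1D12A, stored as two UTF-16 code units

console.log(sample.length);                          // 3
console.log(getUnicode(sample, 1) === "\u{1D12A}");  // true: the full code point
console.log(getUnicode(sample, 2).length);           // 1: a lone low surrogate when pointing mid-pair
console.log(getUnicode(sample, 3) === "");           // true: end of input
```

The third call shows the mid-surrogate case mentioned in the changelog discussion: `codePointAt` returns only the trailing surrogate there, so the result is still a single code unit.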
| " throw peg$buildStructuredError(", | ||
| " peg$maxFailExpected,", | ||
| " peg$maxFailPos < input.length ? input.charAt(peg$maxFailPos) : null,", | ||
| " peg$maxFailPos < input.length ? peg$getUnicode(peg$maxFailPos) : null,", |
Always get a full char.
| parts.splice(i--, 1); | ||
| parts[i] = [prevStart, prevEnd = curEnd]; | ||
| continue; | ||
| if ((typeof curStart === "string") && (typeof curEnd === "string") |
Parts may now be {type: "classEscape"}, which never combine.
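A rough sketch of the idea, under assumptions: parts are modeled here as either `[start, end]` arrays of single-code-point strings or `{ type: "classEscape" }` objects, and `mergeRanges` is a hypothetical helper. It sorts and rebuilds the list rather than splicing in place as the pass in this PR does, so it is an illustration of the rule, not the actual code.

```javascript
// Hypothetical part shapes, assumed for illustration:
//   [start, end]             a range of single code points
//   { type: "classEscape" }  an escape such as \p{...}, never merged
function mergeRanges(parts) {
  const ranges = parts
    .filter(p => Array.isArray(p))
    .map(([s, e]) => [s.codePointAt(0), e.codePointAt(0)])
    .sort((a, b) => a[0] - b[0]);
  const escapes = parts.filter(p => !Array.isArray(p));

  const merged = [];
  for (const [start, end] of ranges) {
    const prev = merged[merged.length - 1];
    if (prev && start <= prev[1] + 1) {
      // Overlapping or adjacent: extend the previous range.
      prev[1] = Math.max(prev[1], end);
    } else {
      merged.push([start, end]);
    }
  }
  return [
    ...merged.map(([s, e]) => [String.fromCodePoint(s), String.fromCodePoint(e)]),
    ...escapes, // escapes are passed through untouched
  ];
}

const result = mergeRanges([
  ["a", "f"],
  { type: "classEscape" },
  ["d", "k"],
  ["l", "z"],
]);
console.log(JSON.stringify(result)); // [["a","z"],{"type":"classEscape"}]
```

Comparing by code point rather than by string keeps the ordering correct for characters outside the BMP.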
| /** @type {PEG.compiler.visitor} */ | ||
| const visitor = require("../visitor"); | ||
| const { codePointLen1 } = require("../utils"); |
Check this whole file to see if I missed a case.
This allows Peggy to match on full codepoints without sacrificing backward-compatibility.
You may now specify a class like `[\p{ASCII}]`.
Fixes peggyjs#375.
`[^]` is the same as `.`. `[^]u` is similar, but matches an entire codepoint.
Fixes #462
Still needs work to add #375.
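The difference between matching a code unit and matching a full codepoint can be seen with plain JavaScript regular expressions, which have an analogous `u` flag. This is only an analogy under that assumption; it does not run Peggy itself.

```javascript
const doubleSharp = "\u{1D12A}"; // one code point, two UTF-16 code units

console.log(doubleSharp.length);               // 2
console.log(/^[^]$/.test(doubleSharp));        // false: without "u", one code unit is matched
console.log(/^[^]$/u.test(doubleSharp));       // true: with "u", one full code point is matched
console.log(/^[\p{ASCII}]$/u.test("a"));       // true: property escapes inside a class
console.log(/^[\p{ASCII}]$/u.test(doubleSharp)); // false: U+1D12A is not ASCII
```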