Add unicode support to classes and extended unicode escapes#602

Merged: hildjj merged 7 commits into peggyjs:main from hildjj:unicode-classes on Apr 23, 2025

hildjj commented Apr 16, 2025

Fixes #462

Still needs work to add #375.

hildjj commented Apr 16, 2025

@frostburn please give this a try?

hildjj commented Apr 16, 2025

@Mingun this adds opcode 42 (I can change that to another number if you prefer)

Mingun commented Apr 16, 2025

You may reuse it. Anyway, any future PRs will need a full review of my existing code.

frostburn commented

> @frostburn please give this a try?

$ npx peggy sharp.peggy -t '𝄪'
'𝄪'

#462 Works as desired now. 👍

hildjj commented Apr 17, 2025

This is now ready for review.

@hildjj hildjj marked this pull request as ready for review April 17, 2025 14:20
@hildjj hildjj force-pushed the unicode-classes branch 2 times, most recently from 52c21c9 to 1ce98c2 on April 17, 2025 20:41

@hildjj hildjj left a comment

Some notes for other reviewers.

the type has changed to `PeggySyntaxError`, which may cause some slight need
for rework in TypeScript-aware projects. This was the main driver behind
moving away from ES5. [#593](https://github.com/peggyjs/peggy/pull/593)
- BREAKING: The grammar parser now uses your JavaScript environment's understanding

I don't expect this to be a problem in practice. The previous code tried to implement ID_Start and ID_Continue, restricted to the BMP. Everything that used to be a valid identifier should still work in any JS runtime recent enough to support ES2020.

There are 93778 characters that are out of the BMP that are valid ID_Start or ID_Continue, and they're now valid Peggy identifiers. But note that none of them are Emoji, if you're worried about that.
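For anyone who wants to spot-check this in their own runtime, here is a minimal sketch using JavaScript's Unicode property escapes. The helper names `isIdStart` and `isIdContinue` are made up for illustration and are not part of Peggy; the sample characters are arbitrary picks.

```javascript
// Hypothetical helpers (not Peggy's code): classify a single code point
// with the Unicode properties named in the comment above.
const isIdStart = ch => /^\p{ID_Start}$/u.test(ch);
const isIdContinue = ch => /^\p{ID_Continue}$/u.test(ch);

// U+10400 DESERET CAPITAL LETTER LONG I: a letter outside the BMP.
console.log(isIdStart("\u{10400}")); // true

// U+1F600 GRINNING FACE: an emoji, neither ID_Start nor ID_Continue.
console.log(isIdStart("\u{1F600}"), isIdContinue("\u{1F600}")); // false false

// Ordinary BMP identifier characters behave as before.
console.log(isIdStart("a"), isIdContinue("9")); // true true
```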

generated for each of these removals. Like merged class rules above, this
should only be removing dead code.
[#594](https://github.com/peggyjs/peggy/pull/594)
- Character classes now process characters not in the Basic Multi-Lingual

Note that using a character outside the BMP automatically puts the character class in Unicode mode.
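A rough sketch of how such a check could look, assuming class parts are either single-character strings or `[start, end]` pairs. The names `hasAstral` and `needsUnicodeMode` are hypothetical and this is not the code in the PR.

```javascript
// Hypothetical sketch, not the PR's implementation.
// True if the string contains any code point above U+FFFF.
function hasAstral(str) {
  for (const ch of str) { // for...of iterates by code point, not code unit
    if (ch.codePointAt(0) > 0xFFFF) {
      return true;
    }
  }
  return false;
}

// Assumed part shapes: "a" for a single character, ["a", "z"] for a range.
// An explicit "u" flag forces Unicode mode regardless of the parts.
function needsUnicodeMode(parts, explicitFlag = false) {
  return explicitFlag
    || parts.some(p => (Array.isArray(p) ? p.some(s => hasAstral(s)) : hasAstral(p)));
}

console.log(needsUnicodeMode(["a", ["0", "9"]]));       // false
console.log(needsUnicodeMode(["a", "\u{1D12A}"]));      // true
console.log(needsUnicodeMode(["a"], true));             // true
```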

[#602](https://github.com/peggyjs/peggy/pull/602)
- Errors pointing to non-BMP characters as the "found" text will now show the
full character and not the replacement character for the first surrogate in
the UTF-16 representation.

This holds as long as the error doesn't point to the middle of a surrogate pair. In that case you still see the replacement character, and that's probably correct.
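To illustrate the surrogate-pair case with plain string methods (this is standard JavaScript behavior, not Peggy-specific code; the sample input is arbitrary):

```javascript
// "a" followed by U+1D12A, which takes two UTF-16 code units.
const input = "a\u{1D12A}";
console.log(input.length); // 3

// Pointing at the start of the pair yields the full character.
const full = String.fromCodePoint(input.codePointAt(1));
console.log(full === "\u{1D12A}"); // true

// Pointing at the middle of the pair yields only the lone low surrogate,
// which most terminals render as the replacement character.
const mid = String.fromCodePoint(input.codePointAt(2));
console.log(mid === "\uDD2A"); // true
```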

the UTF-16 representation.
[#602](https://github.com/peggyjs/peggy/pull/602)
- Character classes can now be annotated with the "u" flag, which will force
the character class into Unicode mode, where one full Codepoint will be matched.

Move this to the beginning, then refer to Unicode mode throughout.

UTF-16 code unit (JS character). Previously, this expression compiled but
was useless.
[#602](https://github.com/peggyjs/peggy/pull/602)
- String literals may now contain characters from outside the BMP.

This is mentioned above.

" );",
"",
- "      return \"[\" + (expectation.inverted ? \"^\" : \"\") + escapedParts.join(\"\") + \"]\";",
+ "      return \"[\" + (expectation.inverted ? \"^\" : \"\") + escapedParts.join(\"\") + \"]\" + (expectation.unicode ? \"u\" : \"\");",

This will output expectations like "expected [^a-z]u", which might not be enough for casual users. Watch for confusion in the userbase and find something better if needed.
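A standalone sketch of what the changed line produces, to make the resulting message text concrete. The function name `describeClass` is hypothetical, and `parts` is assumed to hold already-escaped strings; only the `inverted` and `unicode` fields come from the diff.

```javascript
// Hypothetical standalone version of the changed return statement.
function describeClass(expectation) {
  return "[" + (expectation.inverted ? "^" : "")
    + expectation.parts.join("")
    + "]" + (expectation.unicode ? "u" : "");
}

console.log(describeClass({ parts: ["a-z"], inverted: true, unicode: true }));
// [^a-z]u
console.log(describeClass({ parts: ["a-z"], inverted: false, unicode: false }));
// [a-z]
```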

"function peg$getUnicode(pos = peg$currPos) {",
" const cp = input.codePointAt(pos);",
" if (cp === undefined) {",
" return \"\";",

This is possible at the end of the string.
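A minimal sketch of the same idea outside the generated parser. Here `input` is passed as a parameter instead of being closed over, and the final `String.fromCodePoint` return is an assumption, since the hunk above is cut off before that line.

```javascript
// Sketch only; the last return statement is assumed, not copied from the PR.
function getUnicode(input, pos) {
  const cp = input.codePointAt(pos);
  if (cp === undefined) {
    return ""; // pos is at or past the end of the input
  }
  return String.fromCodePoint(cp);
}

const text = "x\u{1D12A}"; // 3 UTF-16 code units
console.log(getUnicode(text, 1) === "\u{1D12A}"); // true: the full character
console.log(text.charAt(1) === "\uD834");         // true: charAt gives half of it
console.log(getUnicode(text, 3) === "");          // true: end of string
```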

" throw peg$buildStructuredError(",
" peg$maxFailExpected,",
- "    peg$maxFailPos < input.length ? input.charAt(peg$maxFailPos) : null,",
+ "    peg$maxFailPos < input.length ? peg$getUnicode(peg$maxFailPos) : null,",

Always get a full char.

parts.splice(i--, 1);
parts[i] = [prevStart, prevEnd = curEnd];
continue;
if ((typeof curStart === "string") && (typeof curEnd === "string")

Parts may now be {type: "classEscape"}, which never combine.
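A simplified sketch of merging with such parts in the mix. The part shapes, the `value` field, and the function name `mergeParts` are assumptions for illustration. This version expects ranges to arrive sorted by start and only merges immediate neighbors, which is not necessarily how the real pass works.

```javascript
// Hypothetical sketch, not the PR's code. Assumed part shapes:
//   "a"                                   single character
//   ["a", "z"]                            range
//   { type: "classEscape", value: "…" }   passed through, never merged
function mergeParts(parts) {
  const out = [];
  for (const part of parts) {
    if (typeof part === "object" && !Array.isArray(part)) {
      out.push(part); // class escapes never combine with anything
      continue;
    }
    const [curStart, curEnd] = Array.isArray(part) ? part : [part, part];
    const prev = out[out.length - 1];
    if (Array.isArray(prev)
        && curStart.codePointAt(0) <= prev[1].codePointAt(0) + 1) {
      // Overlapping or adjacent: extend the previous range if needed.
      if (curEnd.codePointAt(0) > prev[1].codePointAt(0)) {
        prev[1] = curEnd;
      }
      continue;
    }
    out.push([curStart, curEnd]);
  }
  return out;
}

const merged = mergeParts([
  ["a", "c"],
  ["b", "f"],
  { type: "classEscape", value: "\\p{ASCII}" },
  "g",
]);
console.log(merged.length); // 3
console.log(merged[0]);     // [ 'a', 'f' ]
```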


/** @type {PEG.compiler.visitor} */
const visitor = require("../visitor");
const { codePointLen1 } = require("../utils");

Check this whole file to see if I missed a case.

@hildjj hildjj mentioned this pull request Apr 18, 2025
hildjj added 7 commits April 22, 2025 17:56
This allows Peggy to match on full codepoints without
sacrificing backward-compatibility.
You may now specify a class like `[\p{ASCII}]`.
Fixes peggyjs#375.
`[^]` is the same as `.`
`[^]u` is similar, but matches an entire codepoint.
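The closest JavaScript regular expression analogs behave the same way, which may help when reasoning about these commits. This sketch uses plain `RegExp`, not Peggy's generated code, and the sample strings are arbitrary.

```javascript
const sharp = "\u{1D12A}"; // 𝄪, outside the BMP (two UTF-16 code units)

// Without the u flag, [^] consumes a single code unit.
console.log(sharp.match(/[^]/)[0].length);  // 1

// With the u flag, [^] consumes the whole code point.
console.log(sharp.match(/[^]/u)[0].length); // 2

// Property escapes inside a class require the u flag in JS regexes.
console.log(/^[\p{ASCII}]$/u.test("A"));      // true
console.log(/^[\p{ASCII}]$/u.test("\u00E9")); // false (é is not ASCII)
```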
@hildjj hildjj merged commit 2434f35 into peggyjs:main Apr 23, 2025
10 checks passed
@hildjj hildjj deleted the unicode-classes branch April 23, 2025 18:22

Successfully merging this pull request may close these issues.

Chunk unicode similar to spread syntax and less like regex
