Unicode correctness #119
-- | Combinator which returns both the result of a parse and the portion of
-- | the input that was consumed while it was being parsed.
match :: forall m a. Monad m => ParserT String m a -> ParserT String m (Tuple String a)
In Attoparsec and Megaparsec this combinator is named `match`, but I've always felt it was misnamed and that it should be named `capture`.
It seems similar to JS in that `match`ing returns `capture`s as part of the result. Maybe a `capture` helper would discard the result and only return the consumed portion.
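The distinction discussed in this thread can be sketched outside the library. PureScript `String`s are JavaScript strings on the JS backend, so the sketch below uses TypeScript. The `Parser` type, `digits`, and `capture` are hypothetical names for illustration only; they are not the library's `ParserT` or part of this PR.

```typescript
// Hypothetical minimal parser type (NOT the library's ParserT): returns the
// parsed value plus the remaining input, or null on failure.
// Assumption: `rest` is always a suffix of the input that was passed in.
type Parser<A> = (input: string) => { value: A; rest: string } | null;

// match: run `p` and pair its result with the slice of input it consumed.
function match<A>(p: Parser<A>): Parser<[string, A]> {
  return (input) => {
    const r = p(input);
    if (r === null) return null;
    const consumed = input.slice(0, input.length - r.rest.length);
    return { value: [consumed, r.value], rest: r.rest };
  };
}

// capture: the helper floated in the comment above, which discards the
// parse result and keeps only the consumed text.
function capture<A>(p: Parser<A>): Parser<string> {
  return (input) => {
    const r = match(p)(input);
    return r === null ? null : { value: r.value[0], rest: r.rest };
  };
}

// Example parser: one or more ASCII digits, returned as a number.
const digits: Parser<number> = (input) => {
  const m = /^[0-9]+/.exec(input);
  return m === null ? null : { value: Number(m[0]), rest: input.slice(m[0].length) };
};
```

With these definitions, `match(digits)` applied to `"042abc"` yields both the consumed text `"042"` and the value `42`, while `capture(digits)` yields only `"042"`.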
ParseState remainder
(updatePosString position str)
true
put $ ParseState remainder (updatePosString position str) true
This applies to the previous implementation as well, but if passed the empty string, won't this mark the input as consumed without actually consuming anything?
This is not feedback, just a discussion point.
Yeah, good point, I'm not sure what the correct behavior should be.
I made a new issue for this.
updatePosString :: Position -> String -> Position
updatePosString pos str = case uncons str of
  Nothing -> pos
  Just {head,tail} -> updatePosString (updatePosSingle pos head) tail -- tail recursive
Just {head,tail} -> updatePosString (updatePosSingle pos head) tail -- tail recursive
Just { head, tail } -> updatePosString (updatePosSingle pos head) tail
Since it's optimised by the compiler I don't think the comment is necessary.
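For readers unfamiliar with the optimisation mentioned here, a tail-recursive walk like `updatePosString` is equivalent to a plain loop. The TypeScript sketch below shows that loop form, stepping one code point at a time. The `Position` shape and the newline-only rule in `updatePosSingle` are simplifying assumptions for illustration; the library's actual `updatePosSingle` may handle more cases.

```typescript
// Assumed position record with 1-based line and column.
interface Position {
  line: number;
  column: number;
}

// Simplified single-code-point update (assumption): a line feed moves to the
// start of the next line; anything else advances the column by one.
function updatePosSingle(pos: Position, codePoint: number): Position {
  return codePoint === 0x0a
    ? { line: pos.line + 1, column: 1 }
    : { line: pos.line, column: pos.column + 1 };
}

// Loop form of the tail-recursive definition: consume one code point per
// iteration, treating a valid surrogate pair as a single code point.
function updatePosString(pos: Position, str: string): Position {
  let current = pos;
  let i = 0;
  while (i < str.length) {
    const hi = str.charCodeAt(i);
    const lo = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
    const isPair = hi >= 0xd800 && hi <= 0xdbff && lo >= 0xdc00 && lo <= 0xdfff;
    const codePoint = isPair ? (hi - 0xd800) * 0x400 + (lo - 0xdc00) + 0x10000 : hi;
    current = updatePosSingle(current, codePoint);
    i += isPair ? 2 : 1;
  }
  return current;
}
```

Because a surrogate pair counts as one step, an astral character advances the column by one rather than two in this sketch.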
src/Text/Parsing/Parser/String.purs
Outdated
-- The CodePoint newtype constructor is not exported, so here's a helper.
-- This will break at runtime if the definition of CodePoint ever changes
-- to something other than `newtype CodePoint = CodePoint Int`.
deconstructCodePoint :: CodePoint -> Int
How about `unCodePoint`, since it strips the type-level abstraction to expose the runtime representation?
That's better.
Done.
@@ -746,7 +746,7 @@ whiteSpace' langDef@(LanguageDef languageDef)
  skipMany (simpleSpace <|> oneLineComment langDef <|> multiLineComment langDef <?> "")

simpleSpace :: forall m . Monad m => ParserT String m Unit
simpleSpace = skipMany1 (satisfyCP isSpace)
simpleSpace = skipMany1 (satisfyCodePoint isSpace)
Can `satisfyCP` be removed, or is it still being used?
Still being used.
Correctly handle UTF-16 surrogate pairs in `String`s.

We keep all of the API, but we change the primitive parsers so that instead of succeeding and incorrectly returning half of a surrogate pair, they will fail.

All prior tests pass with no modifications. Add a few new tests.

Non-breaking changes
====================

Add primitive parsers `anyCodePoint` and `satisfyCodePoint` for parsing `CodePoint`s.

Add the `match` combinator.

Move `updatePosString` to the `Text.Parsing.Parser.String` module and don't export it.

Split dev dependencies into spago-dev.dhall.

Add benchmark suite.

Add astral UTF-16 test.

Breaking changes
================

Change the definition of `whiteSpace` and `skipSpaces` to use `Data.CodePoint.Unicode.isSpace`.

To make this library handle Unicode correctly, it is necessary to either alter the `StringLike` class or delete it. We decided to delete it. The `String` module will now operate only on inputs of the concrete `String` type. `StringLike` has no laws, and during the five years of its life, no one on GitHub has ever written another instance of `StringLike`.

https://github.com/search?l=&q=StringLike+language%3APureScript&type=code

The last time someone tried to alter `StringLike`, this is what happened: purescript-contrib#62

Breaking changes which won’t be caught by the compiler
======================================================

Fundamentally, we change the way we consume the next input character from `Data.String.CodeUnits.uncons` to `Data.String.CodePoints.uncons`.

`anyChar` will no longer always succeed. It will only succeed on a Basic Multilingual Plane character. The new parser `anyCodePoint` will always succeed.

We are not quite “making the default `CodePoint`”, as was discussed in purescript-contrib#76 (comment). Rather, we are keeping most of the current API and making it work properly with astral Unicode.

We keep the `Char` parsers for backward compatibility. We also keep the `Char` parsers for ergonomic reasons. For example, take the parser `char :: forall s m. Monad m => Char -> ParserT s m Char`. This parser is usually called with a literal like `char 'a'`. It would be annoying to call this parser with `char (codePointFromChar 'a')`.

Benchmarks
==========

For Unicode correctness, we're now consuming characters with `Data.String.CodePoints.uncons` instead of `Data.String.CodeUnits.uncons`. If that were going to affect performance, then the effect would show up in the `runParser parse23` benchmark, but it doesn’t.

Before
------

```
runParser parse23
mean   = 43.36 ms
stddev = 6.75 ms
min    = 41.12 ms
max    = 124.65 ms
runParser parseSkidoo
mean   = 22.53 ms
stddev = 3.86 ms
min    = 21.40 ms
max    = 61.76 ms
```

After
-----

```
runParser parse23
mean   = 42.90 ms
stddev = 6.01 ms
min    = 40.97 ms
max    = 115.74 ms
runParser parseSkidoo
mean   = 22.03 ms
stddev = 2.79 ms
min    = 20.78 ms
max    = 53.34 ms
```
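The behavioural change described in this commit message, taking the next character by code point rather than by code unit, can be sketched in TypeScript, since PureScript `String`s are JavaScript UTF-16 strings on the JS backend. The function names mirror the PureScript ones for readability, but everything here is a hypothetical re-implementation for illustration, not the library's code. In particular, how the real `anyChar` treats a lone surrogate is not shown in the source, so this sketch makes no claim about it.

```typescript
const isHighSurrogate = (u: number): boolean => u >= 0xd800 && u <= 0xdbff;
const isLowSurrogate = (u: number): boolean => u >= 0xdc00 && u <= 0xdfff;

// Code-unit uncons: always splits off exactly one UTF-16 code unit, so it
// can return half of a surrogate pair.
function unconsCodeUnit(s: string): { head: string; tail: string } | null {
  return s.length === 0 ? null : { head: s.charAt(0), tail: s.slice(1) };
}

// Code-point uncons: a valid surrogate pair is taken as one code point.
function unconsCodePoint(s: string): { head: number; tail: string } | null {
  if (s.length === 0) return null;
  const hi = s.charCodeAt(0);
  if (isHighSurrogate(hi) && s.length > 1) {
    const lo = s.charCodeAt(1);
    if (isLowSurrogate(lo)) {
      const codePoint = (hi - 0xd800) * 0x400 + (lo - 0xdc00) + 0x10000;
      return { head: codePoint, tail: s.slice(2) };
    }
  }
  return { head: hi, tail: s.slice(1) };
}

// Sketch of `anyCodePoint`: succeeds on any non-empty input.
const anyCodePoint = unconsCodePoint;

// Sketch of the new `anyChar`: succeeds only when the next code point lies
// in the Basic Multilingual Plane, i.e. fits in a single code unit.
function anyChar(s: string): { head: string; tail: string } | null {
  const r = unconsCodePoint(s);
  if (r === null || r.head > 0xffff) return null;
  return { head: String.fromCharCode(r.head), tail: r.tail };
}
```

On input beginning with an astral character such as U+1D7DA, `unconsCodeUnit` returns only the high surrogate, `anyCodePoint` returns the whole code point, and `anyChar` fails instead of returning half a pair.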
Some more historical context: the |
…emoved in purescript-contrib/purescript-parsing#119 bc was not used)
Resolves #109
Correctly handle UTF-16 surrogate pairs in `String`s. This is intended to be a conservative change to the package. We keep all of the API, but we change the primitive parsers so that instead of succeeding and incorrectly returning half of a surrogate pair, they will fail.

All prior tests pass with no modifications. Add a few new tests.

If merged, this PR will allow easy solutions for #110 and others.

Non-breaking changes

Add primitive parsers `anyCodePoint` and `satisfyCodePoint` for parsing `CodePoint`s.

Add the `match` combinator.

Breaking changes

Move `updatePosString` to the `Text.Parsing.Parser.String` module and don't export it.

Change the definition of `whiteSpace` and `skipSpaces` to use `Data.CodePoint.Unicode.isSpace`.

To make this library handle Unicode correctly, it is necessary to either alter the `StringLike` class or delete it. We decided to delete it. The `String` module will now operate only on inputs of the concrete `String` type.

`StringLike` has no laws, and during the five years of its life, no one on GitHub has ever written another instance of `StringLike`.

https://github.com/search?l=&q=StringLike+language%3APureScript&type=code

Breaking changes which won’t be caught by the compiler

Fundamentally, we change the way we consume the next input character from `Data.String.CodeUnits.uncons` to `Data.String.CodePoints.uncons`.

`anyChar` will no longer always succeed. It will only succeed on a Basic Multilingual Plane character. The new parser `anyCodePoint` will always succeed.

We are not quite “making the default `CodePoint`”, as was discussed in #76 (comment). Rather, we are keeping most of the current API and making it work properly with astral Unicode.

We keep the `Char` parsers for backward compatibility. We also keep the `Char` parsers for ergonomic reasons. For example, take the parser `char :: forall m. Monad m => Char -> ParserT String m Char`. This parser is usually called with a literal like `char 'a'`. It would be annoying to call this parser with `char (codePointFromChar 'a')`.

Benchmarks

For Unicode correctness, we're now consuming characters with `Data.String.CodePoints.uncons` instead of `Data.String.CodeUnits.uncons`. If that were going to affect performance, then the effect would show up in the `runParser parse23` benchmark, but it doesn’t.

Before

After