Skip to content

Unicode correctness #119

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 6, 2021

Conversation

jamesdbrock
Copy link
Member

@jamesdbrock jamesdbrock commented Sep 22, 2021

Resolves #109

Correctly handle UTF-16 surrogate pairs in Strings. This is intended to be a conservative change to the package. We keep all of the API, but we change the primitive parsers so that instead of succeeding and incorrectly returning half of a surrogate pair, they will fail.

All prior tests pass with no modifications. Add a few new tests.

If merged, this PR will allow easy solutions for #110 and others.

Non-breaking changes

Add primitive parsers anyCodePoint and satisfyCodePoint for parsing CodePoints.

Add the match combinator.

Breaking changes

Move updatePosString to the Text.Parsing.Parser.String module and don'texport it.

Change the definition of whiteSpace and skipSpaces toData.CodePoint.Unicode.isSpace.

To make this library handle Unicode correctly, it is necessary to either alter the StringLike class or delete it. We decided to delete it. The String module will now operate only on inputs of the concrete String type. StringLike has no laws, and during the five years of its life, no-one on Github has ever written another instance of StringLike.
https://github.com/search?l=&q=StringLike+language%3APureScript&type=code

Breaking changes which won’t be caught by the compiler

Fundamentally, we change the way we consume the next input character from Data.String.CodeUnits.uncons to Data.String.CodePoints.uncons.

anyChar will no longer always succeed. It will only succeed on a Basic Multilingual Plane character. The new parser anyCodePoint will always succeed.

We are not quite “making the default CodePoint”, as was discussed in
#76 (comment) .
Rather we are keeping most of the current API and making it work properly with astral Unicode.

We keep the Char parsers for backward compatibility. We also keep the Char parsers for ergonomic reasons. For example the parser char :: forall m. Monad m => Char -> ParserT String m Char. This parser is usually called with a literal like char 'a'. It would be annoying to call this parser with char (codePointFromChar 'a').

Benchmarks

For Unicode correctness, we're now consuming characters with Data.String.CodePoints.uncons instead of Data.String.CodeUnits.uncons. If that were going to effect performance, then the effect would show up in the runParser parse23 benchmark, but it doesn’t.

Before

runParser parse23
mean   = 43.36 ms
stddev = 6.75 ms
min    = 41.12 ms
max    = 124.65 ms
    
runParser parseSkidoo
mean   = 22.53 ms
stddev = 3.86 ms
min    = 21.40 ms
max    = 61.76 ms

After

runParser parse23
mean   = 42.90 ms
stddev = 6.01 ms
min    = 40.97 ms
max    = 115.74 ms
   
runParser parseSkidoo
mean   = 22.03 ms
stddev = 2.79 ms
min    = 20.78 ms
max    = 53.34 ms

@jamesdbrock jamesdbrock force-pushed the uncons-codepoints branch 3 times, most recently from b8291d8 to 0b605a7 Compare September 23, 2021 10:54

-- | Combinator which returns both the result of a parse and the portion of
-- | the input that was consumed while it was being parsed.
match :: forall m a. Monad m => ParserT String m a -> ParserT String m (Tuple String a)
Copy link
Member Author

@jamesdbrock jamesdbrock Sep 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Attoparsec and Megaparsec this combinator is named match, but I've always felt it was misnamed and that it should be named capture.

Copy link
Collaborator

@robertdp robertdp Oct 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems similar to JS in that matching returns captures as part of the result. Maybe a capture helper would discard the result and only return the consumed portion.

@jamesdbrock jamesdbrock force-pushed the uncons-codepoints branch 4 times, most recently from 39b3a10 to f948996 Compare September 24, 2021 15:25
jamesdbrock added a commit to jamesdbrock/purescript-parsing that referenced this pull request Sep 24, 2021
@jamesdbrock jamesdbrock marked this pull request as ready for review September 24, 2021 15:51
@jamesdbrock jamesdbrock requested a review from garyb September 24, 2021 15:54
jamesdbrock added a commit to jamesdbrock/purescript-parsing that referenced this pull request Sep 24, 2021
jamesdbrock added a commit to jamesdbrock/purescript-parsing that referenced this pull request Sep 24, 2021
jamesdbrock added a commit to jamesdbrock/purescript-parsing that referenced this pull request Sep 28, 2021
@jamesdbrock jamesdbrock force-pushed the uncons-codepoints branch 2 times, most recently from 60ade0c to ab104c3 Compare September 29, 2021 03:55
ParseState remainder
(updatePosString position str)
true
put $ ParseState remainder (updatePosString position str) true
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This applies to the previous implementation as well, but if passed the empty string won't this mark the input as consumed without actually consuming anything?

This is not feedback, just a discussion point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good point, I'm not sure what the correct behavior should be.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a new issue for this.

updatePosString :: Position -> String -> Position
updatePosString pos str = case uncons str of
Nothing -> pos
Just {head,tail} -> updatePosString (updatePosSingle pos head) tail -- tail recursive
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Just {head,tail} -> updatePosString (updatePosSingle pos head) tail -- tail recursive
Just { head, tail } -> updatePosString (updatePosSingle pos head) tail

Since it's optimised by the compiler I don't think the comment is necessary.

-- The CodePoint newtype constructor is not exported, so here's a helper.
-- This will break at runtime if the definition of CodePoint ever changes
-- to something other than `newtype CodePoint = CodePoint Int`.
deconstructCodePoint :: CodePoint -> Int
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about unCodePoint, since it strips the type-level abstraction to expose the runtime representation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -746,7 +746,7 @@ whiteSpace' langDef@(LanguageDef languageDef)
skipMany (simpleSpace <|> oneLineComment langDef <|> multiLineComment langDef <?> "")

simpleSpace :: forall m . Monad m => ParserT String m Unit
simpleSpace = skipMany1 (satisfyCP isSpace)
simpleSpace = skipMany1 (satisfyCodePoint isSpace)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can satisfyCP be removed, or is it still being used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still being used.

Correctly handle UTF-16 surrogate pairs in `String`s.
We keep all of the API, but we change the primitive parsers so that instead
of succeeding and incorrectly returning half of a surrogate pair, they will fail.

All prior tests pass with no modifications. Add a few new tests.

Non-breaking changes
====================

Add primitive parsers `anyCodePoint` and `satisfyCodePoint` for parsing
`CodePoint`s.

Add the `match` combinator.

Move `updatePosString` to the `Text.Parsing.Parser.String` module and don't
export it.

Split dev dependencies into spago-dev.dhall.

Add benchmark suite.

Add astral UTF-16 test.

Breaking changes
================

Change the definition of `whiteSpace` and `skipSpaces` to
`Data.CodePoint.Unicode.isSpace`.

To make this library handle Unicode correctly, it is necessary to
either alter the `StringLike` class or delete it.
We decided to delete it. The `String` module will now operate only
on inputs of the concrete `String` type.
`StringLike` has no laws, and during the five years of its life,
no-one on Github has ever written another instance of `StringLike`.
https://github.com/search?l=&q=StringLike+language%3APureScript&type=code
The last time someone tried to alter `StringLike`, this is what
happened:
purescript-contrib#62

Breaking changes which won’t be caught by the compiler
======================================================

Fundamentally, we change the way we consume the next input character from
`Data.String.CodeUnits.uncons` to `Data.String.CodePoints.uncons`.

`anyChar` will no longer always succeed. It will only succeed on a Basic
Multilingual Plane character. The new parser `anyCodePoint` will always succeed.

We are not quite “making the default `CodePoint`”, as was discussed in
purescript-contrib#76 (comment) .
Rather we are keeping most of the current API and making it work
properly with astral Unicode.

We keep the `Char` parsers for backward compatibility.
We also keep the `Char` parsers for ergonomic reasons. For example
the parser `char :: forall s m. Monad m => Char -> ParserT s m Char`.
This parser is usually called with a literal like `char 'a'`. It would
be annoying to call this parser with `char (codePointFromChar 'a')`.

Benchmarks
==========

For Unicode correctness, we're now consuming characters with
`Data.String.CodePoints.uncons` instead of
`Data.String.CodeUnits.uncons`. If that were going to effect
performance, then the effect would show up in the `runParser parse23`
benchmark, but it doesn’t.

Before
------

```
runParser parse23
mean   = 43.36 ms
stddev = 6.75 ms
min    = 41.12 ms
max    = 124.65 ms

runParser parseSkidoo
mean   = 22.53 ms
stddev = 3.86 ms
min    = 21.40 ms
max    = 61.76 ms
```

After
-----

```
runParser parse23
mean   = 42.90 ms
stddev = 6.01 ms
min    = 40.97 ms
max    = 115.74 ms

runParser parseSkidoo
mean   = 22.03 ms
stddev = 2.79 ms
min    = 20.78 ms
max    = 53.34 ms
```
@jamesdbrock jamesdbrock merged commit b5ac522 into purescript-contrib:main Oct 6, 2021
@jamesdbrock jamesdbrock deleted the uncons-codepoints branch October 6, 2021 06:25
@jamesdbrock
Copy link
Member Author

Some more historical context: the StringLike typeclass was introduced in #36

@jamesdbrock jamesdbrock mentioned this pull request Jan 23, 2022
srghma added a commit to srghma/purescript-eth-core that referenced this pull request Jan 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

CodePoints uncons? Deprecate drop?
2 participants