Unicode correctness #119

jamesdbrock · 2021-09-22T14:51:18Z

Resolves #109

Correctly handle UTF-16 surrogate pairs in Strings. This is intended to be a conservative change to the package. We keep all of the API, but we change the primitive parsers so that instead of succeeding and incorrectly returning half of a surrogate pair, they will fail.

All prior tests pass with no modifications. Add a few new tests.

If merged, this PR will allow easy solutions for #110 and others.

Non-breaking changes

Add primitive parsers anyCodePoint and satisfyCodePoint for parsing CodePoints.

Add the match combinator.

Breaking changes

Move updatePosString to the Text.Parsing.Parser.String module and don't export it.

Change the definition of whiteSpace and skipSpaces to Data.CodePoint.Unicode.isSpace.

To make this library handle Unicode correctly, it is necessary to either alter the StringLike class or delete it. We decided to delete it. The String module will now operate only on inputs of the concrete String type. StringLike has no laws, and during the five years of its life, no one on GitHub has ever written another instance of StringLike.
https://github.com/search?l=&q=StringLike+language%3APureScript&type=code

Breaking changes which won’t be caught by the compiler

Fundamentally, we change the way we consume the next input character from Data.String.CodeUnits.uncons to Data.String.CodePoints.uncons.
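To make the difference between the two `uncons` functions concrete, here is a minimal sketch that only uses the `Data.String.CodeUnits` and `Data.String.CodePoints` modules from purescript-strings. The sample string and the `astral` name are chosen for illustration and do not come from this PR.

```purescript
module Main where

import Prelude

import Data.Maybe (Maybe(..))
import Data.String.CodePoints as CodePoints
import Data.String.CodeUnits as CodeUnits
import Effect (Effect)
import Effect.Console (log, logShow)

-- U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 stores it
-- as a surrogate pair, i.e. two code units. (Illustrative sample input.)
astral :: String
astral = "😀"

main :: Effect Unit
main = do
  logShow (CodeUnits.length astral)  -- 2: counts UTF-16 code units
  logShow (CodePoints.length astral) -- 1: counts Unicode code points

  -- Taking one code unit leaves the second half of the pair in the tail.
  case CodeUnits.uncons astral of
    Nothing -> log "empty"
    Just r -> logShow (CodeUnits.length r.tail) -- 1

  -- Taking one code point consumes the whole pair.
  case CodePoints.uncons astral of
    Nothing -> log "empty"
    Just r -> logShow (CodeUnits.length r.tail) -- 0
```

A parser built on the code-unit version can therefore stop in the middle of a character, which is the behavior this change removes.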

anyChar will no longer always succeed. It will only succeed on a Basic Multilingual Plane character. The new parser anyCodePoint will always succeed.
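A small sketch of what this means for callers, assuming the module layout of this PR (`Text.Parsing.Parser` and `Text.Parsing.Parser.String`). The inputs are illustrative, and the comments state the behavior described above rather than captured output. Note that both parsers still fail on empty input.

```purescript
module Main where

import Prelude

import Data.Either (isRight)
import Effect (Effect)
import Effect.Console (logShow)
import Text.Parsing.Parser (runParser)
import Text.Parsing.Parser.String (anyChar, anyCodePoint)

main :: Effect Unit
main = do
  -- 'a' is a Basic Multilingual Plane character, so anyChar should succeed.
  logShow (isRight (runParser "a" anyChar))
  -- U+1F600 needs a surrogate pair, so anyChar should now fail
  -- instead of returning half of the pair.
  logShow (isRight (runParser "😀" anyChar))
  -- anyCodePoint should accept the same input as a single CodePoint.
  logShow (isRight (runParser "😀" anyCodePoint))
```

Code that relied on `anyChar` to skip over arbitrary input will compile unchanged but may start failing on astral characters, which is why this is listed as a change the compiler won't catch.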

We are not quite “making the default CodePoint”, as was discussed in #76 (comment). Rather, we are keeping most of the current API and making it work properly with astral Unicode.

We keep the Char parsers for backward compatibility and also for ergonomic reasons. For example, the parser char :: forall m. Monad m => Char -> ParserT String m Char is usually called with a literal, like char 'a'. It would be annoying to have to call this parser with char (codePointFromChar 'a').
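The ergonomic argument can be seen side by side in the sketch below. The names `letterA` and `letterA'` are hypothetical, and the second parser is just one way to express the same match with the new `satisfyCodePoint` primitive, not something this PR adds.

```purescript
module Main where

import Prelude

import Data.Either (isRight)
import Data.String.CodePoints (CodePoint, codePointFromChar)
import Effect (Effect)
import Effect.Console (logShow)
import Text.Parsing.Parser (Parser, runParser)
import Text.Parsing.Parser.String (char, satisfyCodePoint)

-- The existing Char API: the literal goes straight in.
letterA :: Parser String Char
letterA = char 'a'

-- The same match written against the CodePoint API needs an explicit
-- conversion from the Char literal. (Hypothetical, for comparison only.)
letterA' :: Parser String CodePoint
letterA' = satisfyCodePoint (_ == codePointFromChar 'a')

main :: Effect Unit
main = do
  logShow (isRight (runParser "abc" letterA))
  logShow (isRight (runParser "abc" letterA'))
```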

Benchmarks

For Unicode correctness, we're now consuming characters with Data.String.CodePoints.uncons instead of Data.String.CodeUnits.uncons. If that were going to affect performance, then the effect would show up in the runParser parse23 benchmark, but it doesn’t.

Before

runParser parse23
mean   = 43.36 ms
stddev = 6.75 ms
min    = 41.12 ms
max    = 124.65 ms
    
runParser parseSkidoo
mean   = 22.53 ms
stddev = 3.86 ms
min    = 21.40 ms
max    = 61.76 ms

After

runParser parse23
mean   = 42.90 ms
stddev = 6.01 ms
min    = 40.97 ms
max    = 115.74 ms
   
runParser parseSkidoo
mean   = 22.03 ms
stddev = 2.79 ms
min    = 20.78 ms
max    = 53.34 ms

jamesdbrock · 2021-09-23T10:55:53Z

src/Text/Parsing/Parser/String.purs

+
+-- | Combinator which returns both the result of a parse and the portion of
+-- | the input that was consumed while it was being parsed.
+match :: forall m a. Monad m => ParserT String m a -> ParserT String m (Tuple String a)


In Attoparsec and Megaparsec this combinator is named match, but I've always felt it was misnamed and that it should be named capture.

It seems similar to JS in that matching returns captures as part of the result. Maybe a capture helper would discard the result and only return the consumed portion.
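For reference, a sketch of how `match` and the `capture` helper floated above might look in use. `capture` is hypothetical and not part of this PR; it is defined here on top of `match` only to illustrate the idea. The input string is made up.

```purescript
module Main where

import Prelude

import Data.Tuple (fst)
import Effect (Effect)
import Effect.Console (logShow)
import Text.Parsing.Parser (ParserT, runParser)
import Text.Parsing.Parser.String (char, match)

-- Hypothetical helper: discard the parse result and keep only the
-- portion of the input that was consumed.
capture :: forall m a. Monad m => ParserT String m a -> ParserT String m String
capture p = fst <$> match p

main :: Effect Unit
main = do
  -- match should pair the consumed text "ab" with the parser's result 'b'.
  logShow (runParser "abc" (match (char 'a' *> char 'b')))
  -- capture should return only the consumed text "ab".
  logShow (runParser "abc" (capture (char 'a' *> char 'b')))
```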

robertdp · 2021-10-06T03:20:10Z

src/Text/Parsing/Parser/String.purs

-        ParseState remainder
-                   (updatePosString position str)
-                   true
+      put $ ParseState remainder (updatePosString position str) true


This applies to the previous implementation as well, but if passed the empty string, won't this mark the input as consumed without actually consuming anything?

This is not feedback, just a discussion point.

Yeah, good point, I'm not sure what the correct behavior should be.

I made a new issue for this.

robertdp · 2021-10-06T03:27:48Z

src/Text/Parsing/Parser/String.purs

+updatePosString :: Position -> String -> Position
+updatePosString pos str = case uncons str of
+  Nothing -> pos
+  Just {head,tail} -> updatePosString (updatePosSingle pos head) tail -- tail recursive


Suggested change

-  Just {head,tail} -> updatePosString (updatePosSingle pos head) tail -- tail recursive
+  Just { head, tail } -> updatePosString (updatePosSingle pos head) tail

Since it's optimised by the compiler I don't think the comment is necessary.

robertdp · 2021-10-06T03:30:36Z

src/Text/Parsing/Parser/String.purs

+-- The CodePoint newtype constructor is not exported, so here's a helper.
+-- This will break at runtime if the definition of CodePoint ever changes
+-- to something other than `newtype CodePoint = CodePoint Int`.
+deconstructCodePoint :: CodePoint -> Int


How about unCodePoint, since it strips the type-level abstraction to expose the runtime representation?

That's better.

robertdp · 2021-10-06T03:37:24Z

src/Text/Parsing/Parser/Token.purs

@@ -746,7 +746,7 @@ whiteSpace' langDef@(LanguageDef languageDef)
        skipMany (simpleSpace <|> oneLineComment langDef <|> multiLineComment langDef <?> "")

 simpleSpace :: forall m . Monad m => ParserT String m Unit
-simpleSpace = skipMany1 (satisfyCP isSpace)
+simpleSpace = skipMany1 (satisfyCodePoint isSpace)


Can satisfyCP be removed, or is it still being used?

Still being used.

Correctly handle UTF-16 surrogate pairs in `String`s.

We keep all of the API, but we change the primitive parsers so that instead of succeeding and incorrectly returning half of a surrogate pair, they will fail.

All prior tests pass with no modifications. Add a few new tests.

Non-breaking changes
====================

Add primitive parsers `anyCodePoint` and `satisfyCodePoint` for parsing `CodePoint`s.

Add the `match` combinator.

Move `updatePosString` to the `Text.Parsing.Parser.String` module and don't export it.

Split dev dependencies into spago-dev.dhall.

Add benchmark suite.

Add astral UTF-16 test.

Breaking changes
================

Change the definition of `whiteSpace` and `skipSpaces` to `Data.CodePoint.Unicode.isSpace`.

To make this library handle Unicode correctly, it is necessary to either alter the `StringLike` class or delete it. We decided to delete it. The `String` module will now operate only on inputs of the concrete `String` type. `StringLike` has no laws, and during the five years of its life, no one on GitHub has ever written another instance of `StringLike`.
https://github.com/search?l=&q=StringLike+language%3APureScript&type=code

The last time someone tried to alter `StringLike`, this is what happened: purescript-contrib#62

Breaking changes which won’t be caught by the compiler
======================================================

Fundamentally, we change the way we consume the next input character from `Data.String.CodeUnits.uncons` to `Data.String.CodePoints.uncons`.

`anyChar` will no longer always succeed. It will only succeed on a Basic Multilingual Plane character. The new parser `anyCodePoint` will always succeed.

We are not quite “making the default `CodePoint`”, as was discussed in purescript-contrib#76 (comment). Rather, we are keeping most of the current API and making it work properly with astral Unicode.

We keep the `Char` parsers for backward compatibility. We also keep the `Char` parsers for ergonomic reasons. For example the parser `char :: forall s m. Monad m => Char -> ParserT s m Char`. This parser is usually called with a literal like `char 'a'`. It would be annoying to call this parser with `char (codePointFromChar 'a')`.

Benchmarks
==========

For Unicode correctness, we're now consuming characters with `Data.String.CodePoints.uncons` instead of `Data.String.CodeUnits.uncons`. If that were going to affect performance, then the effect would show up in the `runParser parse23` benchmark, but it doesn’t.

Before
------

```
runParser parse23
mean   = 43.36 ms
stddev = 6.75 ms
min    = 41.12 ms
max    = 124.65 ms

runParser parseSkidoo
mean   = 22.53 ms
stddev = 3.86 ms
min    = 21.40 ms
max    = 61.76 ms
```

After
-----

```
runParser parse23
mean   = 42.90 ms
stddev = 6.01 ms
min    = 40.97 ms
max    = 115.74 ms

runParser parseSkidoo
mean   = 22.03 ms
stddev = 2.79 ms
min    = 20.78 ms
max    = 53.34 ms
```

jamesdbrock · 2022-01-11T03:32:01Z

Some more historical context: the StringLike typeclass was introduced in #36

jamesdbrock force-pushed the uncons-codepoints branch 3 times, most recently from b8291d8 to 0b605a7 Compare September 23, 2021 10:54

jamesdbrock force-pushed the uncons-codepoints branch 4 times, most recently from 39b3a10 to f948996 Compare September 24, 2021 15:25

jamesdbrock added a commit to jamesdbrock/purescript-parsing that referenced this pull request Sep 24, 2021

CHANGELOG for purescript-contrib#119

2c6d0bc

jamesdbrock force-pushed the uncons-codepoints branch from f948996 to 2c6d0bc Compare September 24, 2021 15:42

jamesdbrock marked this pull request as ready for review September 24, 2021 15:51

jamesdbrock requested a review from garyb September 24, 2021 15:54

jamesdbrock added a commit to jamesdbrock/purescript-parsing that referenced this pull request Sep 24, 2021

CHANGELOG for purescript-contrib#119

9392607

jamesdbrock force-pushed the uncons-codepoints branch from 2c6d0bc to 9392607 Compare September 24, 2021 16:06

jamesdbrock added a commit to jamesdbrock/purescript-parsing that referenced this pull request Sep 24, 2021

CHANGELOG for purescript-contrib#119

9846f24

jamesdbrock force-pushed the uncons-codepoints branch from 9392607 to 9846f24 Compare September 24, 2021 16:07

jamesdbrock requested a review from thomashoneyman September 24, 2021 18:08

jamesdbrock added a commit to jamesdbrock/purescript-parsing that referenced this pull request Sep 28, 2021

CHANGELOG for purescript-contrib#119

37f2619

jamesdbrock force-pushed the uncons-codepoints branch from 9846f24 to 37f2619 Compare September 28, 2021 00:39

jamesdbrock mentioned this pull request Sep 28, 2021

CodePoints uncons? Deprecate drop? #109

Closed

jamesdbrock force-pushed the uncons-codepoints branch 2 times, most recently from 60ade0c to ab104c3 Compare September 29, 2021 03:55

jamesdbrock removed request for garyb and thomashoneyman October 6, 2021 01:42

robertdp approved these changes Oct 6, 2021

jamesdbrock mentioned this pull request Oct 6, 2021

empty string parser consumed? #122

Closed

jamesdbrock force-pushed the uncons-codepoints branch from ab104c3 to a1413b1 Compare October 6, 2021 06:22

jamesdbrock merged commit b5ac522 into purescript-contrib:main Oct 6, 2021

jamesdbrock deleted the uncons-codepoints branch October 6, 2021 06:25

fsoikin mentioned this pull request Jan 5, 2022

CodePoint versions of oneOf and noneOf #127

Merged

4 tasks

jamesdbrock mentioned this pull request Jan 23, 2022

Notes on performance #144

Closed

srghma added a commit to srghma/purescript-eth-core that referenced this pull request Jan 28, 2022

feat: update spago deps, fix error unknown class StringLike (it was r…

591308a

…emoved in purescript-contrib/purescript-parsing#119 bc was not used)
