-
Notifications
You must be signed in to change notification settings - Fork 51
Generalize StringLike to StreamLike fix #58 #62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I was thinking on how to solve the issue but decided to actually open pr instead of writing this as comment. API of whitespace has changed. We can rename let me know what you think @paf31 |
src/Text/Parsing/Parser/String.purs
Outdated
-- | | ||
class StreamLike f c | f -> c where | ||
uncons :: f -> Maybe { head :: c, tail :: f, updatePos :: (Position -> Position) } | ||
drop :: Prefix f -> f -> Maybe { rest :: f, updatePos :: (Position -> Position) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can name it stripPrefix
src/Text/Parsing/Parser/String.purs
Outdated
uncons :: f -> Maybe { head :: c, tail :: f, updatePos :: (Position -> Position) } | ||
drop :: Prefix f -> f -> Maybe { rest :: f, updatePos :: (Position -> Position) } | ||
|
||
instance stringLikeString :: StreamLike String Char where |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stringLikeString
here should be streamLikeString
? Same below.
src/Text/Parsing/Parser/String.purs
Outdated
instance listcharLikeString :: (Eq a, HasUpdatePosition a) => StreamLike (L.List a) a where | ||
uncons f = L.uncons f <#> \({ head, tail}) -> | ||
{ head: head, updatePos: (_ `updatePos` head), tail} | ||
drop (Prefix p') s' = case (tailRecM3 go p' s' id) of -- no MonadRec for Maybe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's worth adding stripPrefix
to Data.List
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we define it like in String ? (ie add Pattern
type)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure. I'm tempted to say no, but if you'd like to open a PR, we can discuss it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/Text/Parsing/Parser/String.purs
Outdated
-- | Instances must satisfy the following laws: | ||
-- | | ||
class StreamLike f c | f -> c where | ||
uncons :: f -> Maybe { head :: c, tail :: f, updatePos :: (Position -> Position) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parens are redundant here around the type of updatePos
.
Sorry for the delay. I really like this, once the tailrec and list dependencies are updated, I'll merge this and make a major release. Thanks! |
src/Text/Parsing/Parser/String.purs
Outdated
cs <- many $ satisfy \c -> c == '\n' || c == '\r' || c == ' ' || c == '\t' | ||
pure $ fromCharArray cs | ||
-- | Match many whitespace characters. | ||
whiteSpace :: forall f m g. StreamLike f Char => Unfoldable g => Monoid f => Monad m => ParserT f m (g Char) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you fine with the signature?
src/Text/Parsing/Parser/String.purs
Outdated
|
||
-- | Match a whitespace characters but returns them as Array. | ||
whiteSpace' :: forall f m. StreamLike f Char => Monad m => ParserT f m (Array Char) | ||
whiteSpace' = many $ satisfy \c -> c == '\n' || c == '\r' || c == ' ' || c == '\t' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure Data.Array.many
is fine here? maybe we should use catenable list or something similar.
src/Text/Parsing/Parser/String.purs
Outdated
|
||
-- | Match end-of-file. | ||
eof :: forall s m. StringLike s => Monad m => ParserT s m Unit | ||
-- | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this description is outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we fine with description and the law?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine, yeah.
we should update descriptions here and possibly remove some of combinators from Token |
👍 Looks good to me, thanks! I'd be fine with removing overlapping functionality from @garyb Any comments before I merge this? Obviously it'll be a breaking change. |
Looks like it's conflicting currently? |
LGTM if it LGTY though! |
@safareli Can you please merge with what's on master? |
@paf31 Will tackle this on weekend. |
instead String{anyChar,satisfy,char} chould be used
src/Text/Parsing/Parser/String.purs
Outdated
-- | | ||
class StreamLike f c | f -> c where | ||
uncons :: f -> Maybe { head :: c, tail :: f, updatePos :: Position -> Position } | ||
drop :: Prefix f -> f -> Maybe { rest :: f, updatePos :: Position -> Position } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we fine with this name? stipPrefix could be used instead too
src/Text/Parsing/Parser/String.purs
Outdated
|
||
-- | Match the specified string. | ||
string :: forall s m. StringLike s => Monad m => String -> ParserT s m String | ||
-- | Match the specified stream. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we fine wis descriptions?
We can use function names from Token instead of char and anyChar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think that's better. So maybe prefix
or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok will use:
match
forchar
token
foranyChar
prefix
forstring
Will also rename drop
from StreamLike
to stipPrefix
so it lines up with prefix
.
src/Text/Parsing/Parser/String.purs
Outdated
cs <- many $ satisfy \c -> c == '\n' || c == '\r' || c == ' ' || c == '\t' | ||
pure $ fromCharArray cs | ||
-- | Match many whitespace character in some Unfoldable. | ||
whiteSpace :: forall f m g. StreamLike f Char => Unfoldable g => Monoid f => Monad m => ParserT f m (g Char) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove this function as it's still braking change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remind me again why we'd need to remove it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are operating on list of some tokens you most likely are not gonna use it.
Major use case of this would be to get String
as result, but String
is not Unfoldable
, so you would still need to map over it with stringFromChars
.
I think we can just returning Array Char
is fine, and if client wants a string they can map over it (as they would need to do it any ways).
If you agree i would remove this function and rename whitespace'
to whitespace
(this way we wouldn't have two whitespace
functions)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, thanks!
@paf31 can you take a look, I have renamed StreamLike to Stream and added |
src/Text/Parsing/Parser/Stream.purs
Outdated
class StreamLike f c | f -> c where | ||
uncons :: f -> Maybe { head :: c, tail :: f, updatePos :: Position -> Position } | ||
stripPrefix :: Prefix f -> f -> Maybe { rest :: f, updatePos :: Position -> Position } | ||
class StreamLike s m t | s -> t where |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we should add s -> m
here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure, in parsec this is how it looks class (Monad m) => Stream s m t | s -> t
Looks good to me! Just one comment about that potential fundep. What do you think? Also, could you please add a small test for a custom stream type which involves a non-trivial monad - |
Will try to add tests using |
@safareli Any more thoughts on testing? If you don't have time, I'm happy to merge this and leave it until later. |
this comment made me think that maybe if you think current signature is fine we can merge it. |
@paf31 I have addressed last issue I had can you take a look? |
s -> m (Maybe { head :: t, tail :: s, updatePos :: Position -> Position }) instead of having updatePos as a result of uncons or stripPrefix now this operations take position with input which is part of a parser state. this way we should allocation less of intermediate objects.
Closing due to lack of activity. |
Also, it sounded like the design needed to be thought through more to account for |
It's exactly what was stated in the last comment of this PR #62 (comment) |
I skimmed through I'll keep this PR open. |
I'm not actively doing PS any more (for now), so this could be closed. (if anyone is up for taking this on they can just checkout from this branch or copy paste code) |
Correctly handle UTF-16 surrogate pairs in `String`s. All prior tests pass with no modifications. Add a few new tests. Non-breaking changes ==================== Add primitive parsers `anyCodePoint` and `satisfyCodePoint` for parsing `CodePoint`s. Add the `match` combinator. Move `updatePosString` to the `Text.Parsing.Parser.String` module and don't export it. Breaking changes ================ Change the definition of `whiteSpace` and `skipSpaces` to `Data.CodePoint.Unicode.isSpace`. Move the character class parsers from `Text.Parsing.Parser.Token` module into the `Text.Parsing.Parser.String` module. To make this library handle Unicode correctly, it is necessary to either alter the `StringLike` class or delete it. We decided to delete it. The `String` module will now operate only on inputs of the concrete `String` type. `StringLike` has no laws, and during the five years of its life, no-one on Github has ever written another instance of `StringLike`. https://github.com/search?l=&q=StringLike+language%3APureScript&type=code The last time someone tried to alter `StringLike`, this is what happened: purescript-contrib#62 Breaking changes which won’t be caught by the compiler ====================================================== Fundamentally, we change the way we consume the next input character from `Data.String.CodeUnits.uncons` to `Data.String.CodePoints.uncons`. `anyChar` will no longer always succeed. It will only succeed on a Basic Multilingual Plane character. The new parser `anyCodePoint` will always succeed. We are not quite “making the default `CodePoint`”, as was discussed in purescript-contrib#76 (comment) . Rather we are keeping most of the current API and making it work properly with astral Unicode. We keep the `Char` parsers for backward compatibility. We also keep the `Char` parsers for ergonomic reasons. For example the parser `char :: forall s m. Monad m => Char -> ParserT s m Char`. This parser is usually called with a literal like `char 'a'`. It would be annoying to call this parser with `char (codePointFromChar 'a')`. Benchmarks ========== For Unicode correctness, we're now consuming characters with `Data.String.CodePoints.uncons` instead of `Data.String.CodeUnits.uncons`. If that were going to effect performance, then the effect would show up in the `runParser parse23` benchmark, but it doesn’t. Before ------ ``` runParser parse23 mean = 43.36 ms stddev = 6.75 ms min = 41.12 ms max = 124.65 ms runParser parseSkidoo mean = 22.53 ms stddev = 3.86 ms min = 21.40 ms max = 61.76 ms ``` After ----- ``` runParser parse23 mean = 42.90 ms stddev = 6.01 ms min = 40.97 ms max = 115.74 ms runParser parseSkidoo mean = 22.03 ms stddev = 2.79 ms min = 20.78 ms max = 53.34 ms ```
Correctly handle UTF-16 surrogate pairs in `String`s. All prior tests pass with no modifications. Add a few new tests. Non-breaking changes ==================== Add primitive parsers `anyCodePoint` and `satisfyCodePoint` for parsing `CodePoint`s. Add the `match` combinator. Move `updatePosString` to the `Text.Parsing.Parser.String` module and don't export it. Breaking changes ================ Change the definition of `whiteSpace` and `skipSpaces` to `Data.CodePoint.Unicode.isSpace`. Move the character class parsers from `Text.Parsing.Parser.Token` module into the `Text.Parsing.Parser.String` module. To make this library handle Unicode correctly, it is necessary to either alter the `StringLike` class or delete it. We decided to delete it. The `String` module will now operate only on inputs of the concrete `String` type. `StringLike` has no laws, and during the five years of its life, no-one on Github has ever written another instance of `StringLike`. https://github.com/search?l=&q=StringLike+language%3APureScript&type=code The last time someone tried to alter `StringLike`, this is what happened: purescript-contrib#62 Breaking changes which won’t be caught by the compiler ====================================================== Fundamentally, we change the way we consume the next input character from `Data.String.CodeUnits.uncons` to `Data.String.CodePoints.uncons`. `anyChar` will no longer always succeed. It will only succeed on a Basic Multilingual Plane character. The new parser `anyCodePoint` will always succeed. We are not quite “making the default `CodePoint`”, as was discussed in purescript-contrib#76 (comment) . Rather we are keeping most of the current API and making it work properly with astral Unicode. We keep the `Char` parsers for backward compatibility. We also keep the `Char` parsers for ergonomic reasons. For example the parser `char :: forall s m. Monad m => Char -> ParserT s m Char`. This parser is usually called with a literal like `char 'a'`. It would be annoying to call this parser with `char (codePointFromChar 'a')`. Benchmarks ========== For Unicode correctness, we're now consuming characters with `Data.String.CodePoints.uncons` instead of `Data.String.CodeUnits.uncons`. If that were going to effect performance, then the effect would show up in the `runParser parse23` benchmark, but it doesn’t. Before ------ ``` runParser parse23 mean = 43.36 ms stddev = 6.75 ms min = 41.12 ms max = 124.65 ms runParser parseSkidoo mean = 22.53 ms stddev = 3.86 ms min = 21.40 ms max = 61.76 ms ``` After ----- ``` runParser parse23 mean = 42.90 ms stddev = 6.01 ms min = 40.97 ms max = 115.74 ms runParser parseSkidoo mean = 22.03 ms stddev = 2.79 ms min = 20.78 ms max = 53.34 ms ```
Correctly handle UTF-16 surrogate pairs in `String`s. All prior tests pass with no modifications. Add a few new tests. Non-breaking changes ==================== Add primitive parsers `anyCodePoint` and `satisfyCodePoint` for parsing `CodePoint`s. Add the `match` combinator. Move `updatePosString` to the `Text.Parsing.Parser.String` module and don't export it. Breaking changes ================ Change the definition of `whiteSpace` and `skipSpaces` to `Data.CodePoint.Unicode.isSpace`. Move the character class parsers from `Text.Parsing.Parser.Token` module into the `Text.Parsing.Parser.String` module. To make this library handle Unicode correctly, it is necessary to either alter the `StringLike` class or delete it. We decided to delete it. The `String` module will now operate only on inputs of the concrete `String` type. `StringLike` has no laws, and during the five years of its life, no-one on Github has ever written another instance of `StringLike`. https://github.com/search?l=&q=StringLike+language%3APureScript&type=code The last time someone tried to alter `StringLike`, this is what happened: purescript-contrib#62 Breaking changes which won’t be caught by the compiler ====================================================== Fundamentally, we change the way we consume the next input character from `Data.String.CodeUnits.uncons` to `Data.String.CodePoints.uncons`. `anyChar` will no longer always succeed. It will only succeed on a Basic Multilingual Plane character. The new parser `anyCodePoint` will always succeed. We are not quite “making the default `CodePoint`”, as was discussed in purescript-contrib#76 (comment) . Rather we are keeping most of the current API and making it work properly with astral Unicode. We keep the `Char` parsers for backward compatibility. We also keep the `Char` parsers for ergonomic reasons. For example the parser `char :: forall s m. Monad m => Char -> ParserT s m Char`. This parser is usually called with a literal like `char 'a'`. It would be annoying to call this parser with `char (codePointFromChar 'a')`. Benchmarks ========== For Unicode correctness, we're now consuming characters with `Data.String.CodePoints.uncons` instead of `Data.String.CodeUnits.uncons`. If that were going to effect performance, then the effect would show up in the `runParser parse23` benchmark, but it doesn’t. Before ------ ``` runParser parse23 mean = 43.36 ms stddev = 6.75 ms min = 41.12 ms max = 124.65 ms runParser parseSkidoo mean = 22.53 ms stddev = 3.86 ms min = 21.40 ms max = 61.76 ms ``` After ----- ``` runParser parse23 mean = 42.90 ms stddev = 6.01 ms min = 40.97 ms max = 115.74 ms runParser parseSkidoo mean = 22.03 ms stddev = 2.79 ms min = 20.78 ms max = 53.34 ms ```
Correctly handle UTF-16 surrogate pairs in `String`s. We keep all of the API, but we change the primitive parsers so that instead of succeeding and incorrectly returning half of a surrogate pair, they will fail. All prior tests pass with no modifications. Add a few new tests. Non-breaking changes ==================== Add primitive parsers `anyCodePoint` and `satisfyCodePoint` for parsing `CodePoint`s. Add the `match` combinator. Move `updatePosString` to the `Text.Parsing.Parser.String` module and don't export it. Split dev dependencies into spago-dev.dhall. Add benchmark suite. Add astral UTF-16 test. Breaking changes ================ Change the definition of `whiteSpace` and `skipSpaces` to `Data.CodePoint.Unicode.isSpace`. To make this library handle Unicode correctly, it is necessary to either alter the `StringLike` class or delete it. We decided to delete it. The `String` module will now operate only on inputs of the concrete `String` type. `StringLike` has no laws, and during the five years of its life, no-one on Github has ever written another instance of `StringLike`. https://github.com/search?l=&q=StringLike+language%3APureScript&type=code The last time someone tried to alter `StringLike`, this is what happened: purescript-contrib#62 Breaking changes which won’t be caught by the compiler ====================================================== Fundamentally, we change the way we consume the next input character from `Data.String.CodeUnits.uncons` to `Data.String.CodePoints.uncons`. `anyChar` will no longer always succeed. It will only succeed on a Basic Multilingual Plane character. The new parser `anyCodePoint` will always succeed. We are not quite “making the default `CodePoint`”, as was discussed in purescript-contrib#76 (comment) . Rather we are keeping most of the current API and making it work properly with astral Unicode. We keep the `Char` parsers for backward compatibility. We also keep the `Char` parsers for ergonomic reasons. For example the parser `char :: forall s m. Monad m => Char -> ParserT s m Char`. This parser is usually called with a literal like `char 'a'`. It would be annoying to call this parser with `char (codePointFromChar 'a')`. Benchmarks ========== For Unicode correctness, we're now consuming characters with `Data.String.CodePoints.uncons` instead of `Data.String.CodeUnits.uncons`. If that were going to effect performance, then the effect would show up in the `runParser parse23` benchmark, but it doesn’t. Before ------ ``` runParser parse23 mean = 43.36 ms stddev = 6.75 ms min = 41.12 ms max = 124.65 ms runParser parseSkidoo mean = 22.53 ms stddev = 3.86 ms min = 21.40 ms max = 61.76 ms ``` After ----- ``` runParser parse23 mean = 42.90 ms stddev = 6.01 ms min = 40.97 ms max = 115.74 ms runParser parseSkidoo mean = 22.03 ms stddev = 2.79 ms min = 20.78 ms max = 53.34 ms ```
Correctly handle UTF-16 surrogate pairs in `String`s. We keep all of the API, but we change the primitive parsers so that instead of succeeding and incorrectly returning half of a surrogate pair, they will fail. All prior tests pass with no modifications. Add a few new tests. Non-breaking changes ==================== Add primitive parsers `anyCodePoint` and `satisfyCodePoint` for parsing `CodePoint`s. Add the `match` combinator. Move `updatePosString` to the `Text.Parsing.Parser.String` module and don't export it. Split dev dependencies into spago-dev.dhall. Add benchmark suite. Add astral UTF-16 test. Breaking changes ================ Change the definition of `whiteSpace` and `skipSpaces` to `Data.CodePoint.Unicode.isSpace`. To make this library handle Unicode correctly, it is necessary to either alter the `StringLike` class or delete it. We decided to delete it. The `String` module will now operate only on inputs of the concrete `String` type. `StringLike` has no laws, and during the five years of its life, no-one on Github has ever written another instance of `StringLike`. https://github.com/search?l=&q=StringLike+language%3APureScript&type=code The last time someone tried to alter `StringLike`, this is what happened: purescript-contrib#62 Breaking changes which won’t be caught by the compiler ====================================================== Fundamentally, we change the way we consume the next input character from `Data.String.CodeUnits.uncons` to `Data.String.CodePoints.uncons`. `anyChar` will no longer always succeed. It will only succeed on a Basic Multilingual Plane character. The new parser `anyCodePoint` will always succeed. We are not quite “making the default `CodePoint`”, as was discussed in purescript-contrib#76 (comment) . Rather we are keeping most of the current API and making it work properly with astral Unicode. We keep the `Char` parsers for backward compatibility. We also keep the `Char` parsers for ergonomic reasons. For example the parser `char :: forall s m. Monad m => Char -> ParserT s m Char`. This parser is usually called with a literal like `char 'a'`. It would be annoying to call this parser with `char (codePointFromChar 'a')`. Benchmarks ========== For Unicode correctness, we're now consuming characters with `Data.String.CodePoints.uncons` instead of `Data.String.CodeUnits.uncons`. If that were going to effect performance, then the effect would show up in the `runParser parse23` benchmark, but it doesn’t. Before ------ ``` runParser parse23 mean = 43.36 ms stddev = 6.75 ms min = 41.12 ms max = 124.65 ms runParser parseSkidoo mean = 22.53 ms stddev = 3.86 ms min = 21.40 ms max = 61.76 ms ``` After ----- ``` runParser parse23 mean = 42.90 ms stddev = 6.01 ms min = 40.97 ms max = 115.74 ms runParser parseSkidoo mean = 22.03 ms stddev = 2.79 ms min = 20.78 ms max = 53.34 ms ```
In v7.0.0 we eliminated the |
fix #58