Skip to content

v7 proposals #118

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jamesdbrock opened this issue Sep 18, 2021 · 10 comments
Closed

v7 proposals #118

jamesdbrock opened this issue Sep 18, 2021 · 10 comments

Comments

@jamesdbrock
Copy link
Member

jamesdbrock commented Sep 18, 2021

Here are some things I would like to see in v7 of this package.

The target design space for this package should be similar to MegaParsec: intended for users who prefer correctness and feature-completeness to speed. Anyone who wants speed in a V8 runtime environment will use the built-in Regex.

Text.Parsing.Parser

-- | Contains the remaining input and current position.
data ParseState s = ParseState s Position Boolean

Change the definition of ParseState so that we can have cursor-based state in parsers, and so that line-column state is optional.

Tracking the newline-based line and column position is an important feature but it’s expensive and rarely-used. I would like to try to make that optional.

  1. I'd like to switch to a cursor-based state for String parsers, instead of a state which tracks “the remaining input”.

Do we need the Boolean “consumed flag” in the ParseState? As far as I can tell this is set but never tested. Nothing cares what the “consumed flag” value is?

  1. Make the Position zero-based. Position line and column should start at zero #94
data ParseState s state = ParseState s state

Text.Parsing.Parser.Combinators

  1. Add combinators manyTill, many1Till_ manyTill_ combinator #108

Text.Parsing.Parser.String

  1. UTF-16 correctness. We should always handle UTF-16 surrogate paris correctly, and that means always treating token as CodePoint instead of CodeUnit. CodePoints uncons? Deprecate drop? #109
  2. Delete the StringLike typeclass. Has anyone ever created an instance of this class for a type other than String?
  3. Add combinator match match combinator #107

Text.Parsing.Parser.DataView

  1. Add DataView parsing to this package? Merge this repo into purescript-parsing? rowtype-yoga/purescript-parsing-dataview#10

Module names

  1. Remove the Text. prefix from all module names.
@jamesdbrock jamesdbrock changed the title WIP v7 proposals v7 proposals Sep 18, 2021
@jamesdbrock
Copy link
Member Author

@paf31 introduces StringLike #36 to “support more efficient string representations.”

What kind of representations? The only thing I can think of is something like CatList<String>, which could be a more efficient “string” representation if the the “string” is a large document which is being edited, or is being lazily read in chunks out of a large file?

@jamesdbrock
Copy link
Member Author

We should combine parsing and string-parsers #69

@jamesdbrock
Copy link
Member Author

@jamesdbrock
Copy link
Member Author

jamesdbrock commented Sep 22, 2021

I would want this library to be as full-featured as possible, and to have these properties (which we mostly already have):

  • Stack-safe
  • Auto-backtracking (if a parser fails then it consumes no input)
  • Monad-transformable
  • Input streams extendable with build-in support for
    • String UCS-2 Big-Endian
    • String UCS-2 Little-Endian
    • String UTF-16 Big-Endian
    • String UTF-16 Little-Endian
    • UInt8Array UTF-8
    • DataView
    • forall token. List<token>

Node.js only supports UTF-16 Little-Endian https://nodejs.org/api/buffer.html#buffer_buffers_and_character_encodings

https://mathiasbynens.be/notes/javascript-encoding

https://kevin.burke.dev/kevin/node-js-string-encoding/

@jamesdbrock
Copy link
Member Author

jamesdbrock commented Sep 22, 2021

The purescript-string-parsers Text.Parsing.StringParser.CodePoints module has the design decision to

  • Use a cursor in units of code points.
  • Return a Char.

A better design would be to

  • Use a cursor in units of code units (and increment by two for astral characters)
  • Return a CodePoint.

purescript-contrib/purescript-string-parsers#48

We could use this getWholeChar function.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/charAt#getting_whole_characters

Or actually the CodePoints.uncons might suffice

https://github.com/purescript/purescript-strings/blob/157e372a23e4becd594d7e7bff6f372a6f63dd82/src/Data/String/CodePoints.purs#L191-L191

@jamesdbrock
Copy link
Member Author

Actually, maybe there is no performance improvement to be had with a cursor-based parser state?

purescript/purescript-strings#120

@jamesdbrock
Copy link
Member Author

jamesdbrock commented Sep 27, 2021

Instructions for how to parse a String with Regex, then switch to Parser, then switch back to Regex. This should be a supported use case, considering that Parser is 100× slower than Regex.

Also support the opposite case, with a parseRegex :: Regex -> ParserT String m (Array String).

@jamesdbrock
Copy link
Member Author

More package properties

  • Does not release (does not free the memory of) input already consumed.
  • Does not allow for continuation of more input received (like Attoparsec).

@jamesdbrock
Copy link
Member Author

instance altParserT :: Monad m => Alt (ParserT s m) where
alt p1 p2 = (ParserT <<< ExceptT <<< StateT) \(s@(ParseState i p _)) -> do
Tuple e s'@(ParseState _ _ c') <- runStateT (runExceptT (unwrap p1)) (ParseState i p false)
case e of
Left _
| not c' -> runStateT (runExceptT (unwrap p2)) s
_ -> pure (Tuple e s')

I think the purpose of the consumed flag is to do something like this?

Note that if p succeeds without consuming input the second alternative is favored if it consumes input. This implements the “longest match” rule.

https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/parsec-paper-letter.pdf p.11

Except the way that I read this Alt instance is that it favors p2 if p1 failed and consumed no input.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

1 participant