Skip to content

StringParser.CodePoints quadratic scaling #77

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jamesdbrock opened this issue Jan 19, 2022 · 14 comments · Fixed by #83
Closed

StringParser.CodePoints quadratic scaling #77

jamesdbrock opened this issue Jan 19, 2022 · 14 comments · Fixed by #83

Comments

@jamesdbrock
Copy link
Member

jamesdbrock commented Jan 19, 2022

I made a benchmark program for purescript-parsing which benchmarks against this purescript-string-parsers package.

https://github.com/purescript-contrib/purescript-parsing/blob/main/bench/Main.purs

Here are the results of benchmarking many anyDigit

Text.Parsing.StringParser.CodePoints

StringParser.runParser parse23Points
mean   = 746.70 ms
stddev = 12.21 ms
min    = 738.42 ms
max    = 794.27 ms

Text.Parsing.StringParser.CodeUnits

StringParser.runParser parse23Units
mean   = 10.10 ms
stddev = 1.13 ms
min    = 9.46 ms
max    = 24.07 ms

Text.Parsing.Parser.String

runParser parse23
mean   = 44.20 ms
stddev = 6.38 ms
min    = 42.25 ms
max    = 113.16 ms

Data.String.Regex

Regex.match pattern23
mean   = 728.23 μs
stddev = 339.32 μs
min    = 613.72 μs
max    = 2.97 ms

There is something terribly wrong with Text.Parsing.StringParser.CodePoints.anyDigit and I think I know what. The problem is that codePointAt is linear, so many anyDigit then becomes quadratic.

(EDIT deleted notes about stuff not related to this issue)

@jamesdbrock
Copy link
Member Author

jamesdbrock commented Jan 19, 2022

I think the linear CodePoints.anyChar could be fixed by using instead of codePointAt, a custom foreign function with techniques similar to what's described here: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/charAt#getting_whole_characters

@jamesdbrock
Copy link
Member Author

It looks like there have already been attempts to fix this. #67

@jamesdbrock
Copy link
Member Author

Here's a way this could be solved, but it requires a new foreign function.

We could change the internal semantics of the Pos so that it always means CodeUnits index, even when using a CodePoints parsers.

Then we use this new foreign function to implement anyChar.

-- | Takes a `String`` and a `CodeUnit` index into the `String`.
-- |
-- | Returns
-- | * the `CodePoint` at the `CodeUnit` index.
-- | * the `CodeUnit` width of the returned `CodePoint` (either `1` or `2`).
-- |
-- | The index must point to a Basic Multilingual Plane character or the
-- | first (high) character of a surrogate pair. If the index is out of bounds
-- | or points to the low character of a surrogate pair then this
-- | function returns `Nothing`.
codePointAtIndexUnit :: Int -> String -> Maybe (Tuple CodePoint Int)
codePointAtIndexUnit i s = runFn4 _codePointAtIndexUnit _codePointAtIndexSuccess Nothing i s

foreign import _codePointAtIndexUnit :: Fn4
  (Fn2 CodePoint Int (Maybe (Tuple CodePoint Int))) -- success
  (Maybe (Tuple CodePoint Int)) -- failure
  Int
  String
  (Maybe (Tuple CodePoint Int))

_codePointAtIndexSuccess :: Fn2 CodePoint Int (Maybe (Tuple CodePoint Int))
_codePointAtIndexSuccess = mkFn2 \cp n -> Just (Tuple cp n)
"use strict"

exports._codePointAtIndexUnit = function (just, nothing, i, s) {
  // There is a codePointAt function
  // https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/codePointAt
  // but we're not using it because
  // 1. It seems less supported
  // 2. It returns the CodePoint but doesn't return the information we need
  //    about whether the CodePoint was 1 or 2 CodeUnits.
  // 3. It wastes time checking if the index is at the low unit of a surrogate pair
  // So instead we'll use the charCodeAt function.
  // https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/charCodeAt
  let c1 = s.charCodeAt(i);
  if (isNaN(c1)) return nothing; // index is out of bounds
  if (0xDC00 <= c1 && c1 <= 0xDFFF) return nothing; // c1 is the low unit of a surrogate pair
  if (0xD800 <= c1 && c1 <= 0xD8FF) { // c1 is the high unit of a surrogate pair
    let low = s.charCodeAt(i+1); // the low unit of the surrogate pair
    if (isNaN(low)) return nothing; // index is out of bounds
    return just(((c1 - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000, 2);
  }
  return just(c1,1);
}

If we were to use this foreign function, it would probably be best to add it to purescript-strings.

@jamesdbrock
Copy link
Member Author

With the codePointAtIndexUnit function, we would have to maintain the invariant that the Pos always points to a Basic Multilingual Plane Char or the first (high) character of a surrogate pair. Which is easy.

@chtenb
Copy link
Member

chtenb commented Feb 2, 2022

We could change the internal semantics of the Pos so that it always means CodeUnits index, even when using a CodePoints parsers.

With the codePointAtIndexUnit function, we would have to maintain the invariant that the Pos always points to a Basic Multilingual Plane Char or the first (high) character of a surrogate pair. Which is easy.

Would it be an idea to maintain both unit and point offsets? I think the point offset would make more sense from a user perspective when reporting parse errors.

@chtenb
Copy link
Member

chtenb commented Feb 2, 2022

It looks like there have already been attempts to fix this. #67

Any idea why it was reverted? The threads don't seem to mention any of that

@jamesdbrock
Copy link
Member Author

Would it be an idea to maintain both unit and point offsets? I think the point offset would make more sense from a user perspective when reporting parse errors.

That's true, users would want the code point offset in many cases.

The code point offset could be calculated after the error is reported.

-- | Calculate the code point offset in the `String` from a code unit offset in the `String`. 
-- | O(n).
offsetCodePointFromCodeUnit :: String -> Int -> Int

@chtenb
Copy link
Member

chtenb commented Feb 6, 2022

I think this would improve the runtime complexity of the regex and string parser too. Currently these call drop from CodePoints which takes a linear amount of time. If we had the CodeUnit index handy we would have to call the drop from CodeUnits which defers to the JS substring function. Although I suppose that creating a substring in javascript will still be linear in the length of the substring.

@jamesdbrock
Copy link
Member Author

jamesdbrock commented Feb 7, 2022

Although I suppose that creating a substring in javascript will still be linear in the length of the substring.

It's constant-time, conveniently.

@chtenb
Copy link
Member

chtenb commented Feb 7, 2022

It's constant-time, conveniently.

Oh that's indeed convenient. Where did you find that information? I looked for it but couldn't find anything.

@jamesdbrock
Copy link
Member Author

Oh that's indeed convenient. Where did you find that information? I looked for it but couldn't find anything.

There's some good discussion about string slicing here purescript/purescript-strings#120

@chtenb chtenb mentioned this issue Feb 7, 2022
3 tasks
@chtenb
Copy link
Member

chtenb commented Feb 9, 2022

You proposed _codePointAtIndexUnit implementation and usage looks good to me, and I'd like to attempt to rework the CodePoint module to use this approach. Do you prefer to first push this into purescript-strings and then use that directly, or can we first use it here and later move it upstream?

@jamesdbrock
Copy link
Member Author

You proposed _codePointAtIndexUnit implementation and usage looks good to me, and I'd like to attempt to rework the CodePoint module to use this approach. Do you prefer to first push this into purescript-strings and then use that directly, or can we first use it here and later move it upstream?

Because codePointAtIndexUnit needs FFI I think it would be better to add it to purescript-strings. There are no FFI functions in purescript-string-parsers.

If Unicode-correct parsing is what you want, then I recommend https://pursuit.purescript.org/packages/purescript-parsing/

Do you want to talk more? Let's chat on the PureScript discord #contrib channel https://discord.com/channels/864614189094928394/938253816862736405

@chtenb
Copy link
Member

chtenb commented Feb 18, 2022

#83 looks promising

@chtenb chtenb closed this as completed in #83 Mar 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging a pull request may close this issue.

2 participants