StringParser.CodePoints quadratic scaling #77

jamesdbrock · 2022-01-19T07:18:02Z

I made a benchmark program for purescript-parsing which benchmarks against this purescript-string-parsers package.

https://github.com/purescript-contrib/purescript-parsing/blob/main/bench/Main.purs

Here are the results of benchmarking many anyDigit

Text.Parsing.StringParser.CodePoints

StringParser.runParser parse23Points
mean   = 746.70 ms
stddev = 12.21 ms
min    = 738.42 ms
max    = 794.27 ms

Text.Parsing.StringParser.CodeUnits

StringParser.runParser parse23Units
mean   = 10.10 ms
stddev = 1.13 ms
min    = 9.46 ms
max    = 24.07 ms

Text.Parsing.Parser.String

runParser parse23
mean   = 44.20 ms
stddev = 6.38 ms
min    = 42.25 ms
max    = 113.16 ms

Data.String.Regex

Regex.match pattern23
mean   = 728.23 μs
stddev = 339.32 μs
min    = 613.72 μs
max    = 2.97 ms

There is something terribly wrong with Text.Parsing.StringParser.CodePoints.anyDigit and I think I know what. The problem is that codePointAt is linear, so many anyDigit then becomes quadratic.

(EDIT deleted notes about stuff not related to this issue)

The text was updated successfully, but these errors were encountered:

jamesdbrock · 2022-01-19T10:02:25Z

I think the linear CodePoints.anyChar could be fixed by using instead of codePointAt, a custom foreign function with techniques similar to what's described here: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/charAt#getting_whole_characters

jamesdbrock · 2022-01-23T07:17:04Z

It looks like there have already been attempts to fix this. #67

jamesdbrock · 2022-01-28T03:42:25Z

Here's a way this could be solved, but it requires a new foreign function.

We could change the internal semantics of the Pos so that it always means CodeUnits index, even when using a CodePoints parsers.

Then we use this new foreign function to implement anyChar.

-- | Takes a `String`` and a `CodeUnit` index into the `String`.
-- |
-- | Returns
-- | * the `CodePoint` at the `CodeUnit` index.
-- | * the `CodeUnit` width of the returned `CodePoint` (either `1` or `2`).
-- |
-- | The index must point to a Basic Multilingual Plane character or the
-- | first (high) character of a surrogate pair. If the index is out of bounds
-- | or points to the low character of a surrogate pair then this
-- | function returns `Nothing`.
codePointAtIndexUnit :: Int -> String -> Maybe (Tuple CodePoint Int)
codePointAtIndexUnit i s = runFn4 _codePointAtIndexUnit _codePointAtIndexSuccess Nothing i s

foreign import _codePointAtIndexUnit :: Fn4
  (Fn2 CodePoint Int (Maybe (Tuple CodePoint Int))) -- success
  (Maybe (Tuple CodePoint Int)) -- failure
  Int
  String
  (Maybe (Tuple CodePoint Int))

_codePointAtIndexSuccess :: Fn2 CodePoint Int (Maybe (Tuple CodePoint Int))
_codePointAtIndexSuccess = mkFn2 \cp n -> Just (Tuple cp n)

"use strict"

exports._codePointAtIndexUnit = function (just, nothing, i, s) {
  // There is a codePointAt function
  // https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/codePointAt
  // but we're not using it because
  // 1. It seems less supported
  // 2. It returns the CodePoint but doesn't return the information we need
  //    about whether the CodePoint was 1 or 2 CodeUnits.
  // 3. It wastes time checking if the index is at the low unit of a surrogate pair
  // So instead we'll use the charCodeAt function.
  // https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/charCodeAt
  let c1 = s.charCodeAt(i);
  if (isNaN(c1)) return nothing; // index is out of bounds
  if (0xDC00 <= c1 && c1 <= 0xDFFF) return nothing; // c1 is the low unit of a surrogate pair
  if (0xD800 <= c1 && c1 <= 0xD8FF) { // c1 is the high unit of a surrogate pair
    let low = s.charCodeAt(i+1); // the low unit of the surrogate pair
    if (isNaN(low)) return nothing; // index is out of bounds
    return just(((c1 - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000, 2);
  }
  return just(c1,1);
}

If we were to use this foreign function, it would probably be best to add it to purescript-strings.

jamesdbrock · 2022-01-28T03:55:56Z

With the codePointAtIndexUnit function, we would have to maintain the invariant that the Pos always points to a Basic Multilingual Plane Char or the first (high) character of a surrogate pair. Which is easy.

chtenb · 2022-02-02T08:26:43Z

We could change the internal semantics of the Pos so that it always means CodeUnits index, even when using a CodePoints parsers.

With the codePointAtIndexUnit function, we would have to maintain the invariant that the Pos always points to a Basic Multilingual Plane Char or the first (high) character of a surrogate pair. Which is easy.

Would it be an idea to maintain both unit and point offsets? I think the point offset would make more sense from a user perspective when reporting parse errors.

chtenb · 2022-02-02T08:31:12Z

It looks like there have already been attempts to fix this. #67

Any idea why it was reverted? The threads don't seem to mention any of that

jamesdbrock · 2022-02-02T09:27:09Z

Would it be an idea to maintain both unit and point offsets? I think the point offset would make more sense from a user perspective when reporting parse errors.

That's true, users would want the code point offset in many cases.

The code point offset could be calculated after the error is reported.

-- | Calculate the code point offset in the `String` from a code unit offset in the `String`. 
-- | O(n).
offsetCodePointFromCodeUnit :: String -> Int -> Int

chtenb · 2022-02-06T09:37:52Z

I think this would improve the runtime complexity of the regex and string parser too. Currently these call drop from CodePoints which takes a linear amount of time. If we had the CodeUnit index handy we would have to call the drop from CodeUnits which defers to the JS substring function. Although I suppose that creating a substring in javascript will still be linear in the length of the substring.

jamesdbrock · 2022-02-07T04:55:58Z

Although I suppose that creating a substring in javascript will still be linear in the length of the substring.

It's constant-time, conveniently.

chtenb · 2022-02-07T08:37:38Z

It's constant-time, conveniently.

Oh that's indeed convenient. Where did you find that information? I looked for it but couldn't find anything.

jamesdbrock · 2022-02-07T09:02:47Z

Oh that's indeed convenient. Where did you find that information? I looked for it but couldn't find anything.

There's some good discussion about string slicing here purescript/purescript-strings#120

chtenb · 2022-02-09T16:41:01Z

You proposed _codePointAtIndexUnit implementation and usage looks good to me, and I'd like to attempt to rework the CodePoint module to use this approach. Do you prefer to first push this into purescript-strings and then use that directly, or can we first use it here and later move it upstream?

jamesdbrock · 2022-02-09T16:56:24Z

You proposed _codePointAtIndexUnit implementation and usage looks good to me, and I'd like to attempt to rework the CodePoint module to use this approach. Do you prefer to first push this into purescript-strings and then use that directly, or can we first use it here and later move it upstream?

Because codePointAtIndexUnit needs FFI I think it would be better to add it to purescript-strings. There are no FFI functions in purescript-string-parsers.

If Unicode-correct parsing is what you want, then I recommend https://pursuit.purescript.org/packages/purescript-parsing/

Do you want to talk more? Let's chat on the PureScript discord #contrib channel https://discord.com/channels/864614189094928394/938253816862736405

chtenb · 2022-02-18T10:30:39Z

#83 looks promising

jamesdbrock mentioned this issue Jan 28, 2022

Unify CodePoints and CodeUnits parsers #48

Open

chtenb mentioned this issue Feb 7, 2022

Add a simple benchmark module #79

Merged

3 tasks

chtenb mentioned this issue Feb 9, 2022

Efficient CodePoint indexing function? purescript/purescript-strings#155

Closed

chtenb mentioned this issue Feb 17, 2022

Use slices instead of cursors #83

Merged

5 tasks

chtenb closed this as completed in #83 Mar 5, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

StringParser.CodePoints quadratic scaling #77

StringParser.CodePoints quadratic scaling #77

jamesdbrock commented Jan 19, 2022 •

edited

Loading

jamesdbrock commented Jan 19, 2022 •

edited

Loading

jamesdbrock commented Jan 23, 2022

jamesdbrock commented Jan 28, 2022

jamesdbrock commented Jan 28, 2022

chtenb commented Feb 2, 2022

chtenb commented Feb 2, 2022

jamesdbrock commented Feb 2, 2022

chtenb commented Feb 6, 2022

jamesdbrock commented Feb 7, 2022 •

edited

Loading

chtenb commented Feb 7, 2022

jamesdbrock commented Feb 7, 2022

chtenb commented Feb 9, 2022

jamesdbrock commented Feb 9, 2022

chtenb commented Feb 18, 2022

StringParser.CodePoints quadratic scaling #77

StringParser.CodePoints quadratic scaling #77

Comments

jamesdbrock commented Jan 19, 2022 • edited Loading

Text.Parsing.StringParser.CodePoints

Text.Parsing.StringParser.CodeUnits

Text.Parsing.Parser.String

Data.String.Regex

jamesdbrock commented Jan 19, 2022 • edited Loading

jamesdbrock commented Jan 23, 2022

jamesdbrock commented Jan 28, 2022

jamesdbrock commented Jan 28, 2022

chtenb commented Feb 2, 2022

chtenb commented Feb 2, 2022

jamesdbrock commented Feb 2, 2022

chtenb commented Feb 6, 2022

jamesdbrock commented Feb 7, 2022 • edited Loading

chtenb commented Feb 7, 2022

jamesdbrock commented Feb 7, 2022

chtenb commented Feb 9, 2022

jamesdbrock commented Feb 9, 2022

chtenb commented Feb 18, 2022

jamesdbrock commented Jan 19, 2022 •

edited

Loading

jamesdbrock commented Jan 19, 2022 •

edited

Loading

jamesdbrock commented Feb 7, 2022 •

edited

Loading