StringParser.CodePoints quadratic scaling #77
Comments
I think the linear |
It looks like there have already been attempts to fix this. #67 |
Here's a way this could be solved, but it requires a new foreign function. We could change the internal semantics of the Then we use this new foreign function to implement
-- | Takes a `String` and a `CodeUnit` index into the `String`.
-- |
-- | Returns
-- | * the `CodePoint` at the `CodeUnit` index.
-- | * the `CodeUnit` width of the returned `CodePoint` (either `1` or `2`).
-- |
-- | The index must point to a Basic Multilingual Plane character or the
-- | first (high) character of a surrogate pair. If the index is out of bounds
-- | or points to the low character of a surrogate pair then this
-- | function returns `Nothing`.
codePointAtIndexUnit :: Int -> String -> Maybe (Tuple CodePoint Int)
codePointAtIndexUnit i s = runFn4 _codePointAtIndexUnit _codePointAtIndexSuccess Nothing i s
foreign import _codePointAtIndexUnit :: Fn4
(Fn2 CodePoint Int (Maybe (Tuple CodePoint Int))) -- success
(Maybe (Tuple CodePoint Int)) -- failure
Int
String
(Maybe (Tuple CodePoint Int))
_codePointAtIndexSuccess :: Fn2 CodePoint Int (Maybe (Tuple CodePoint Int))
_codePointAtIndexSuccess = mkFn2 \cp n -> Just (Tuple cp n)
"use strict";
exports._codePointAtIndexUnit = function (just, nothing, i, s) {
// There is a codePointAt function
// https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/codePointAt
// but we're not using it because
// 1. It seems less supported
// 2. It returns the CodePoint but doesn't return the information we need
// about whether the CodePoint was 1 or 2 CodeUnits.
// 3. It wastes time checking if the index is at the low unit of a surrogate pair
// So instead we'll use the charCodeAt function.
// https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/charCodeAt
let c1 = s.charCodeAt(i);
if (isNaN(c1)) return nothing; // index is out of bounds
if (0xDC00 <= c1 && c1 <= 0xDFFF) return nothing; // c1 is the low unit of a surrogate pair
if (0xD800 <= c1 && c1 <= 0xDBFF) { // c1 is the high unit of a surrogate pair (0xD800 to 0xDBFF)
let low = s.charCodeAt(i+1); // the low unit of the surrogate pair
if (isNaN(low)) return nothing; // index is out of bounds
return just(((c1 - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000, 2);
}
return just(c1,1);
}
If we were to use this foreign function, it would probably be best to add it to purescript-strings. |
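To illustrate why a code unit index helps, here is a minimal standalone sketch in plain JavaScript. The names `codePointAtUnit` and `codePoints` are hypothetical and not part of purescript-strings or this package; the helper mirrors the proposed foreign function but returns a `[codePoint, width]` array (or `null`) instead of taking `just`/`nothing` callbacks. Each lookup is constant-time, so walking the whole string is linear.

```javascript
"use strict";

// Hypothetical helper: returns [codePoint, widthInCodeUnits] for the
// code unit index i, or null if i is out of bounds or points at the
// low unit of a surrogate pair.
function codePointAtUnit(s, i) {
  const c1 = s.charCodeAt(i);
  if (Number.isNaN(c1)) return null;               // index is out of bounds
  if (0xDC00 <= c1 && c1 <= 0xDFFF) return null;   // low unit of a surrogate pair
  if (0xD800 <= c1 && c1 <= 0xDBFF) {              // high unit of a surrogate pair
    const low = s.charCodeAt(i + 1);
    if (Number.isNaN(low)) return null;            // truncated pair at end of string
    return [(c1 - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000, 2];
  }
  return [c1, 1];                                  // Basic Multilingual Plane
}

// Hypothetical driver: visits every code point once, advancing the
// code unit index by the width reported for each code point.
function codePoints(s) {
  const out = [];
  let i = 0;
  while (i < s.length) {
    const r = codePointAtUnit(s, i);
    if (r === null) break;                         // stop on an unpaired low unit
    out.push(r[0]);
    i += r[1];
  }
  return out;
}
```

A parser that stores the code unit index as its position could advance the same way, one constant-time step per consumed code point.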
With the |
Would it be an idea to maintain both unit and point offsets? I think the point offset would make more sense from a user perspective when reporting parse errors. |
Any idea why it was reverted? The threads don't seem to mention any of that |
That's true, users would want the code point offset in many cases. The code point offset could be calculated after the error is reported.
-- | Calculate the code point offset in the `String` from a code unit offset in the `String`.
-- | O(n).
offsetCodePointFromCodeUnit :: String -> Int -> Int |
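As a rough sketch of what that conversion could look like on the JavaScript side (an assumption for illustration, not an existing function in any library), the code below counts the code points that start before the given code unit offset. It scans from the start of the string, so it is O(n), which should be acceptable if it only runs once when an error is reported.

```javascript
"use strict";

// Hypothetical sketch: number of code points that begin before the
// given code unit offset. A well-formed surrogate pair counts as one
// code point; an unpaired surrogate counts as one code point too.
function offsetCodePointFromCodeUnit(s, unitOffset) {
  let points = 0;
  let i = 0;
  while (i < unitOffset && i < s.length) {
    const c = s.charCodeAt(i);
    const next = s.charCodeAt(i + 1);              // NaN past the end of the string
    const isPair =
      0xD800 <= c && c <= 0xDBFF &&
      0xDC00 <= next && next <= 0xDFFF;
    i += isPair ? 2 : 1;
    points += 1;
  }
  return points;
}
```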
I think this would improve the runtime complexity of the |
It's constant-time, conveniently. |
Oh that's indeed convenient. Where did you find that information? I looked for it but couldn't find anything. |
There's some good discussion about string slicing here purescript/purescript-strings#120 |
You proposed |
Because If Unicode-correct parsing is what you want, then I recommend https://pursuit.purescript.org/packages/purescript-parsing/ Do you want to talk more? Let's chat on the PureScript discord #contrib channel https://discord.com/channels/864614189094928394/938253816862736405 |
#83 looks promising |
I made a benchmark program for purescript-parsing which benchmarks against this purescript-string-parsers package.
https://github.com/purescript-contrib/purescript-parsing/blob/main/bench/Main.purs
Here are the results of benchmarking
many anyDigit
Text.Parsing.StringParser.CodePoints
Text.Parsing.StringParser.CodeUnits
Text.Parsing.Parser.String
Data.String.Regex
There is something terribly wrong with `Text.Parsing.StringParser.CodePoints.anyDigit` and I think I know what. The problem is that `codePointAt` is linear, so `many anyDigit` then becomes quadratic.
(EDIT: deleted notes about stuff not related to this issue)
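The scaling argument can be modelled with a small JavaScript sketch. This is only a model under stated assumptions, not the purescript-strings implementation: `codePointAtLinear` is a hypothetical lookup by code point index that has to scan from the start of the string, and `totalSteps` counts how many scan steps a parser would spend if it performed one such lookup per consumed character.

```javascript
"use strict";

// Hypothetical model of a lookup by code point index n: it has to skip
// n code points from the start, so one call costs O(n) scan steps.
function codePointAtLinear(n, s) {
  let i = 0;
  let steps = 0;
  for (let seen = 0; seen < n; seen++) {
    const c = s.charCodeAt(i);
    i += (0xD800 <= c && c <= 0xDBFF) ? 2 : 1;     // skip one code point
    steps++;
  }
  return { codePoint: s.codePointAt(i), steps: steps };
}

// Hypothetical model of a `many`-style loop over a string of digits:
// one lookup per position, each starting over from index 0.
function totalSteps(len) {
  const s = "7".repeat(len);
  let total = 0;
  for (let n = 0; n < len; n++) {
    total += codePointAtLinear(n, s).steps;
  }
  return total;                                    // equals len * (len - 1) / 2
}
```

In this model, doubling the input length roughly quadruples the total work: `totalSteps(100)` is 4950 and `totalSteps(200)` is 19900.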