Efficient CodePoint indexing function? #155
Comments
Maybe would need an |
It's basically a code point version of |
I think if you want this unsafe function, then you should just forgo the
-- | Retrieves a code point and the next starting code unit index from a code unit index.
-- | This assumes the index points to the start of the code point, and will throw when out of bounds.
codePointAt :: Int -> String -> Tuple CodePoint Int
codePointAt = ... |
There is also a safe
Here is a reference implementation for the JS backend: purescript-contrib/purescript-string-parsers#77 (comment)
Is this library also meant for consumption by other backends? I'm not familiar with backends besides JS, and I notice that this package contains foreign JS code, which leads me to think that this package is perhaps JS-only. |
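For readers without access to the linked comment, here is a minimal sketch of what such a function could look like on the JS backend. It is not the linked reference implementation; the name `codePointAtUnit` and the pair-as-array return shape are assumptions made for illustration. It relies only on the standard `String.prototype.codePointAt`.

```javascript
// Hypothetical sketch, not the implementation linked in the thread.
// Returns [codePoint, nextIndex] for the code point that starts at
// code unit index i. Assumes i points at the start of a code point.
function codePointAtUnit(i, s) {
  if (i < 0 || i >= s.length) {
    throw new RangeError("code unit index out of bounds: " + i);
  }
  const cp = s.codePointAt(i);
  // Code points above U+FFFF occupy two UTF-16 code units.
  const next = i + (cp > 0xFFFF ? 2 : 1);
  return [cp, next];
}
```

Each call does a constant amount of work regardless of `i`, which is the property the issue asks for. If `i` points into the middle of a surrogate pair, the sketch returns the low surrogate itself rather than failing, so the caller is still responsible for the "start of a code point" invariant.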
Since this library binds to the platform String type, it's necessarily going to be a lot of FFI. I personally think it's a little odd to have a safe version of this function since it mixes code unit indexing and code points. If you wanted the most performance, I'd think you could just forgo the checks altogether, and maintain the invariant yourself. It's not clear to me what value this function has on its own without being used as a low-level primitive for implementing a lazy code-point producer. Maybe that would be a better high-level API that solves the issues in string-parsers? |
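To make the "lazy code-point producer" idea concrete, here is a rough sketch on the JS backend using a generator. The name `codePointsFrom` and the `[codePoint, nextIndex]` item shape are hypothetical; nothing like this is confirmed to exist in the library.

```javascript
// Hypothetical sketch of a lazy code-point producer.
// Yields [codePoint, nextIndex] pairs, starting at code unit index `start`,
// and does no work until the consumer asks for the next item.
function* codePointsFrom(s, start = 0) {
  let i = start;
  while (i < s.length) {
    const cp = s.codePointAt(i);
    // Code points above U+FFFF occupy two UTF-16 code units.
    const next = i + (cp > 0xFFFF ? 2 : 1);
    yield [cp, next];
    i = next;
  }
}
```

A consumer such as a parser would pull one item at a time and keep the returned `nextIndex` as its new position, so the code unit bookkeeping stays inside the producer.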
Fair enough! |
You can't maintain the invariant yourself because there are corner cases which make it impossible. For example, consider a malformed UTF-16 string which contains only one code unit, and that code unit is half of a surrogate pair. There is no way to read that code point successfully, and no way to know before you read that code point that reading it will fail.
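The corner case can be reproduced on the JS backend. The sketch below is illustrative only; `readScalarAt` is a hypothetical helper, not part of any library mentioned here. It treats an unpaired surrogate as a failed read, which is one possible policy among several.

```javascript
// Hypothetical helper: read the code point starting at code unit index i.
// Returns [codePoint, nextIndex], or null when the index is out of bounds
// or the position holds an unpaired surrogate.
function readScalarAt(i, s) {
  const cp = s.codePointAt(i);
  if (cp === undefined) return null; // out of bounds
  // codePointAt returns the surrogate itself when it is not part of a
  // well-formed pair starting at i, so this range check detects that case.
  if (cp >= 0xD800 && cp <= 0xDFFF) return null;
  return [cp, i + (cp > 0xFFFF ? 2 : 1)];
}
```

For the one-unit string `"\uD83D"`, `readScalarAt(0, "\uD83D")` returns `null`: the failure is only discovered by attempting the read, which is the point made above.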
string-parsers currently has no FFI functions and we'd like to keep it that way so that any backend which implements strings will also be able to use the string-parsers library. string-parsers also has this long-term problem with parsing code points which has resisted solution, see for example
A function like
|
Is the problem just with the internal representation of string-parsers? From what I remember, it uses an index pointer and the whole string to act like a pseudo slice, which is where the problems are coming from (this works really well for code units). It's not clear to me if that's even optimal in the presence of code points. Is this function then necessary because you are trying to work around those internals instead of designing new internals? |
... I meant to say “how do I read a character out of a string index in O(1) time?”
Yes, exactly. We could design new internals; for example, we could use normal string slicing. We've talked about that, and it would work. So if you think this addition to strings is not worth it, then we could do that instead. |
My personal opinion at first glance would be to benchmark an implementation that just tracks the tail of the string, and so something like anyDigit could be implemented with |
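As a rough illustration of the tail-tracking idea, here is a sketch on the JS backend in which the parser state is just the remaining input. The names `unconsCodePoint` and `anyDigit`, and the result shapes, are hypothetical and are not taken from string-parsers; the sketch accepts ASCII digits only.

```javascript
// Hypothetical sketch: parser state is the remaining input (the "tail").

// Split off the first code point, or return null on empty input.
function unconsCodePoint(s) {
  if (s.length === 0) return null;
  const cp = s.codePointAt(0);
  // Code points above U+FFFF occupy two UTF-16 code units.
  const width = cp > 0xFFFF ? 2 : 1;
  return { head: cp, tail: s.slice(width) };
}

// Consume one ASCII digit, returning it and the remaining input,
// or null when the input does not start with a digit.
function anyDigit(input) {
  const r = unconsCodePoint(input);
  if (r === null) return null;
  if (r.head < 0x30 || r.head > 0x39) return null;
  return { result: String.fromCodePoint(r.head), rest: r.tail };
}
```

Whether `slice` is cheap enough here depends on how the engine represents substrings, which is why benchmarking is suggested rather than assumed.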
I'm closing this issue, since the change of representation in |
Should this library expose a function which allows constant-time codepoint indexing in a string?
This would only work in situations where you have the codeunit index of the codepoint handy, so the signature would look like
See also purescript-contrib/purescript-string-parsers#77