Skip to content

Efficient CodePoint indexing function? #155

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
chtenb opened this issue Feb 9, 2022 · 11 comments
Closed

Efficient CodePoint indexing function? #155

chtenb opened this issue Feb 9, 2022 · 11 comments

Comments

@chtenb
Copy link

chtenb commented Feb 9, 2022

Should this library expose a function which allows constant-time codepoint indexing in string?
This would only work in situations where you have the codeunit index of the codepoint handy, so the signature would look like

codePointAtIndexUnit :: Int -> String -> Maybe (Tuple CodePoint Int)

See also purescript-contrib/purescript-string-parsers#77

@garyb
Copy link
Member

garyb commented Feb 9, 2022

Maybe would need an unsafe in the name too in case the passed index splits a codepoint in half? Maybe not, I guess that's an implied possibility whenever you have units in the mix.

@natefaubion
Copy link
Contributor

It's basically a code point version of charAt, which is in Unsafe.

@natefaubion
Copy link
Contributor

natefaubion commented Feb 9, 2022

I think if you want this unsafe function, then you should just forgo the Maybe, eg.

-- | Retrieves a code point and the next starting code unit index from a code unit index.
-- | This assumes the index points to the start of the code point, and will throw when out of bounds.
codePointAt :: Int -> String -> Tuple CodePoint Int
codePointAt = ...

@chtenb
Copy link
Author

chtenb commented Feb 9, 2022

There is also a safe charAt function isn't there? The signature with the Maybe would return Nothing in case of split code points.

Here is a reference implementation for the JS backend purescript-contrib/purescript-string-parsers#77 (comment)

Is this library also meant for consumption by other backends? I'm not familiar with backends beside JS and I notice that this package contains foreign JS code, which leads me to think that this package is perhaps JS only.

@natefaubion
Copy link
Contributor

natefaubion commented Feb 9, 2022

I notice that this package contains foreign JS code, which leads me to think that this package is perhaps JS only.

Since this library binds to the platform String type, it's necessarily going to be a lot of FFI.

I personally think it's a little odd to have a safe version of this function since it mixes code unit indexing and code points. If you wanted the most performance, I'd think you could just forgo the checks altogether, and maintain the invariant yourself. It's not clear to me what value this function has on it's own without being used as a low-level primitive for implementing a lazy code-point producer. Maybe that would be a better high-level API that solves the issues in string-parsers?

@chtenb
Copy link
Author

chtenb commented Feb 9, 2022

I personally think it's a little odd to have a safe version of this function since it mixes code unit indexing and code points. If you wanted the most performance, I'd think you could just forgo the checks altogether, and maintain the invariant yourself. It's not clear to me what value this function has on it's own without being used as a low-level primitive for implementing a lazy code-point producer. Maybe that would be a better high-level API that solves the issues in string-parsers?

Fair enough!

@jamesdbrock
Copy link

I personally think it's a little odd to have a safe version of this function since it mixes code unit indexing and code points. If you wanted the most performance, I'd think you could just forgo the checks altogether, and maintain the invariant yourself.

You can't maintain the invariant yourself because there are corner cases which make it impossible. For example, consider a malformed UTF-16 string which contains only one code unit, and that code unit is half of a surrogate pair. There is no way to read that code point successfully, and no way to know before you read that code point that reading it will fail.

It's not clear to me what value this function has on it's own without being used as a low-level primitive for implementing a lazy code-point producer. Maybe that would be a better high-level API that solves the issues in string-parsers?

string-parsers currently has no FFI functions and we'd like to keep it that way so that any backend which implements strings will also be able to use the string-parsers library.

string-parsers also has this long-term problem with parsing code points which has resisted solution, see for example

A function like codePointAtIndexUnit would allow this problem to be finally solved, and I suspect that that function would also be useful in other domains. There are no other good answers to the commonplace question “how do I read a character out of a string index in O(n) time?” Astral-plane characters show up everywhere these days and all functions which return a Char are partial in the presence of astral-plan characters.

codePointAtIndexUnit is a bit of an awkward function though, and it might be awkward to implement that function for other backends. And we should probably discourage beginners from using it by tucking it back in the Unsafe module or wherever.

@natefaubion
Copy link
Contributor

natefaubion commented Feb 9, 2022

Is the problem just with the internal representation of string-parsers? From what I remember, it uses an index pointer and the whole string to act like a pseudo slice, which is where the problems are coming from (this works really well for code units). It's not clear to me if that's even optimal in the presence of code points. Is this function then necessary because you are trying to work around those internals instead of designing new internals?

@jamesdbrock
Copy link

... I meant to say “how do I read a character out of a string index in O(1) time?”

Is this function then necessary because you are trying to work around those internals instead of designing new internals?

Yes, exactly.

We could design new internals, like for example we could use normal string slicing. We've talked about that, it would work. So if you think this addition to strings it not worth it then we could do that instead.

@natefaubion
Copy link
Contributor

natefaubion commented Feb 9, 2022

We could design new internals, like for example we could use normal string slicing. We've talked about that, it would work. So if you think this addition to strings it not worth it then we could do that instead.

My personal opinion at first glance would be to benchmark an implementation that just tracks the tail of the string, and so something like anyDigit could be implementing with uncons. AFAICT, the regex stuff is going to slice the input string anyway so you aren't gaining anything interesting by the current representation.

@chtenb
Copy link
Author

chtenb commented Feb 21, 2022

I'm closing this issue, since the change of representation in string-parsers seems to be a viable approach, which is what prompted this request. purescript-contrib/purescript-string-parsers#83

@chtenb chtenb closed this as completed Feb 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants