Efficient CodePoint indexing function? #155

chtenb · 2022-02-09T18:10:01Z

Should this library expose a function which allows constant-time codepoint indexing in string?
This would only work in situations where you have the codeunit index of the codepoint handy, so the signature would look like

codePointAtIndexUnit :: Int -> String -> Maybe (Tuple CodePoint Int)

See also purescript-contrib/purescript-string-parsers#77

garyb · 2022-02-09T18:12:45Z

Maybe would need an unsafe in the name too in case the passed index splits a codepoint in half? Maybe not, I guess that's an implied possibility whenever you have units in the mix.

natefaubion · 2022-02-09T18:14:00Z

It's basically a code point version of charAt, which is in Unsafe.

natefaubion · 2022-02-09T18:16:54Z

I think if you want this unsafe function, then you should just forgo the Maybe, eg.

-- | Retrieves a code point and the next starting code unit index from a code unit index.
-- | This assumes the index points to the start of the code point, and will throw when out of bounds.
codePointAt :: Int -> String -> Tuple CodePoint Int
codePointAt = ...

chtenb · 2022-02-09T18:21:57Z

There is also a safe charAt function isn't there? The signature with the Maybe would return Nothing in case of split code points.

Here is a reference implementation for the JS backend purescript-contrib/purescript-string-parsers#77 (comment)

Is this library also meant for consumption by other backends? I'm not familiar with backends beside JS and I notice that this package contains foreign JS code, which leads me to think that this package is perhaps JS only.

natefaubion · 2022-02-09T18:29:27Z

I notice that this package contains foreign JS code, which leads me to think that this package is perhaps JS only.

Since this library binds to the platform String type, it's necessarily going to be a lot of FFI.

I personally think it's a little odd to have a safe version of this function since it mixes code unit indexing and code points. If you wanted the most performance, I'd think you could just forgo the checks altogether, and maintain the invariant yourself. It's not clear to me what value this function has on it's own without being used as a low-level primitive for implementing a lazy code-point producer. Maybe that would be a better high-level API that solves the issues in string-parsers?

chtenb · 2022-02-09T18:31:38Z

I personally think it's a little odd to have a safe version of this function since it mixes code unit indexing and code points. If you wanted the most performance, I'd think you could just forgo the checks altogether, and maintain the invariant yourself. It's not clear to me what value this function has on it's own without being used as a low-level primitive for implementing a lazy code-point producer. Maybe that would be a better high-level API that solves the issues in string-parsers?

Fair enough!

jamesdbrock · 2022-02-09T19:05:04Z

I personally think it's a little odd to have a safe version of this function since it mixes code unit indexing and code points. If you wanted the most performance, I'd think you could just forgo the checks altogether, and maintain the invariant yourself.

You can't maintain the invariant yourself because there are corner cases which make it impossible. For example, consider a malformed UTF-16 string which contains only one code unit, and that code unit is half of a surrogate pair. There is no way to read that code point successfully, and no way to know before you read that code point that reading it will fail.

It's not clear to me what value this function has on it's own without being used as a low-level primitive for implementing a lazy code-point producer. Maybe that would be a better high-level API that solves the issues in string-parsers?

string-parsers currently has no FFI functions and we'd like to keep it that way so that any backend which implements strings will also be able to use the string-parsers library.

string-parsers also has this long-term problem with parsing code points which has resisted solution, see for example

A function like codePointAtIndexUnit would allow this problem to be finally solved, and I suspect that that function would also be useful in other domains. There are no other good answers to the commonplace question “how do I read a character out of a string index in O(n) time?” Astral-plane characters show up everywhere these days and all functions which return a Char are partial in the presence of astral-plan characters.

codePointAtIndexUnit is a bit of an awkward function though, and it might be awkward to implement that function for other backends. And we should probably discourage beginners from using it by tucking it back in the Unsafe module or wherever.

natefaubion · 2022-02-09T19:13:36Z

Is the problem just with the internal representation of string-parsers? From what I remember, it uses an index pointer and the whole string to act like a pseudo slice, which is where the problems are coming from (this works really well for code units). It's not clear to me if that's even optimal in the presence of code points. Is this function then necessary because you are trying to work around those internals instead of designing new internals?

jamesdbrock · 2022-02-09T19:22:44Z

... I meant to say “how do I read a character out of a string index in O(1) time?”

Is this function then necessary because you are trying to work around those internals instead of designing new internals?

Yes, exactly.

We could design new internals, like for example we could use normal string slicing. We've talked about that, it would work. So if you think this addition to strings it not worth it then we could do that instead.

natefaubion · 2022-02-09T19:32:16Z

We could design new internals, like for example we could use normal string slicing. We've talked about that, it would work. So if you think this addition to strings it not worth it then we could do that instead.

My personal opinion at first glance would be to benchmark an implementation that just tracks the tail of the string, and so something like anyDigit could be implementing with uncons. AFAICT, the regex stuff is going to slice the input string anyway so you aren't gaining anything interesting by the current representation.

chtenb · 2022-02-21T21:00:39Z

I'm closing this issue, since the change of representation in string-parsers seems to be a viable approach, which is what prompted this request. purescript-contrib/purescript-string-parsers#83

This was referenced Feb 19, 2022

Unify CodePoints and CodeUnits parsers purescript-contrib/purescript-string-parsers#48

Open

Use slices instead of cursors purescript-contrib/purescript-string-parsers#83

Merged

chtenb closed this as completed Feb 21, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Efficient CodePoint indexing function? #155

Efficient CodePoint indexing function? #155

chtenb commented Feb 9, 2022

garyb commented Feb 9, 2022

natefaubion commented Feb 9, 2022

natefaubion commented Feb 9, 2022 •

edited

Loading

chtenb commented Feb 9, 2022

natefaubion commented Feb 9, 2022 •

edited

Loading

chtenb commented Feb 9, 2022

jamesdbrock commented Feb 9, 2022

natefaubion commented Feb 9, 2022 •

edited

Loading

jamesdbrock commented Feb 9, 2022

natefaubion commented Feb 9, 2022 •

edited

Loading

chtenb commented Feb 21, 2022

Efficient CodePoint indexing function? #155

Efficient CodePoint indexing function? #155

Comments

chtenb commented Feb 9, 2022

garyb commented Feb 9, 2022

natefaubion commented Feb 9, 2022

natefaubion commented Feb 9, 2022 • edited Loading

chtenb commented Feb 9, 2022

natefaubion commented Feb 9, 2022 • edited Loading

chtenb commented Feb 9, 2022

jamesdbrock commented Feb 9, 2022

natefaubion commented Feb 9, 2022 • edited Loading

jamesdbrock commented Feb 9, 2022

natefaubion commented Feb 9, 2022 • edited Loading

chtenb commented Feb 21, 2022

natefaubion commented Feb 9, 2022 •

edited

Loading

natefaubion commented Feb 9, 2022 •

edited

Loading

natefaubion commented Feb 9, 2022 •

edited

Loading

natefaubion commented Feb 9, 2022 •

edited

Loading