Skip to content

Fix CodePoint.anyChar parser #46

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 15, 2019

Conversation

rintcius
Copy link
Contributor

What does this pull request do?

CodePoint.anyChar was not unicode-safe. I think it would be good to add a CodePoint.anyCodePoint parser too (I'll add that in a separate PR), but CodePoint.anyChar seems to be valuable as well.

@thomashoneyman thomashoneyman self-assigned this Sep 18, 2019
@thomashoneyman
Copy link
Contributor

This relates to #42 as well. @justinwoo do you have opinions on whether this should remain Char or if the anyChar should just be replaced wholesale with anyCodePoint?

@rintcius I read the implementation and it looks good to me; I'm not as familiar with this library as other members of -contrib, though, and I'd like to have one of them weigh in as well before merging. Thanks!

@hdgarrood
Copy link
Contributor

I don’t think this is quite right, as the pos in the parser state will presumably still represent an index into code units elsewhere, whereas it is now being treated as an index into code points here. I think this library probably needs a ground up rethink if it is to handle non BMP characters correctly.

@hdgarrood
Copy link
Contributor

(the difference between indexing into a sequence of code points and indexing into a sequence of code units will only be apparent if there are non-BMP code units before the pos we have reached in the string being parsed.)

@hdgarrood
Copy link
Contributor

Oh also, this stuff is all quite subtle, so please let me know if you haven’t followed everything I’ve just said - I’m very happy to clarify.

@rintcius
Copy link
Contributor Author

I don’t think this is quite right, as the pos in the parser state will presumably still represent an index into code units elsewhere, whereas it is now being treated as an index into code points here. I think this library probably needs a ground up rethink if it is to handle non BMP characters correctly.

I think the parsers in CodePoints.purs need to have pos to be an index into code points, like
https://github.com/purescript-contrib/purescript-string-parsers/blob/master/src/Text/Parsing/StringParser/CodePoints.purs#L67 does this already.
What's not handling non Chars correctly are the parsers in CodeUnits, but AFAIU these are not supposed to be used if you need full unicode support.

But yeah this code is pretty tricky indeed. I am curious if you can think of scenarios that go wrong for the parsers in CodePoints (for full unicode support) or CodeUnits (for strings that contain proper Chars only)

@hdgarrood
Copy link
Contributor

Oh yes of course, sorry. I will try to look in more detail soon.

@thomashoneyman thomashoneyman removed their assignment Sep 23, 2019
@thomashoneyman
Copy link
Contributor

Un-assigning myself as I know little of this topic and won't be of much help outside clicking the Merge button. I'll keep watching the pull request, however.

@hdgarrood
Copy link
Contributor

Ok, on a second look I think this PR is good to merge (and could be released as a patch-level change). I didn't realise until just now that we use the same Parser type for all parsers, regardless of whether they expect the index to count code units or code points, i.e. the type checker will let you mix and match parsers from both the CodeUnits and CodePoints modules, which is not great, as these are very likely to break in the presence of astral plane code points. But of course we can't fix that without a big breaking change.

@rintcius
Copy link
Contributor Author

That would be a big change indeed, but also if we want to take that road, then for consistency we may want to start with splitting String into CodeUnit.String and CodePoint.String, or something along those lines. Which is of course a huge breaking change.

Not saying that that is the road to take, just that the root of the problem isn't really in these Parsers but in String (allowing it to be interpreted both as a list of chars and as a list of codepoints)

@hdgarrood
Copy link
Contributor

I don’t think so - I would argue that it makes sense to be able to ask both “what is the nth code unit” and “what is the nth code point” of the same String value. I think the problem is in the parsers because it’s only in the parsers where the meaning of the pos in the parser state is interpreted inconsistently.

@rintcius
Copy link
Contributor Author

it makes sense to be able to ask both “what is the nth code unit” and “what is the nth code point” of the same String value.

... but by allowing to ask both at the same time, you also open the door to bugs like this. if you disallow this in string, then you'd avoid bugs like this altogether further downstream.

@hdgarrood
Copy link
Contributor

In my mind the problem is very definitely in this library, not upstream. The bug is not as a result of “get the nth code point” and “get the nth code unit” both being operations we allow on String, because they are both operations which String supports whether we like it or not. The bug should be considered to be here because this library defines the parser type and then fails to assign a consistent meaning to the pos field in the parser state.

@hdgarrood
Copy link
Contributor

Actually, perhaps the real answer is to just have one collection of parsers (rather than separate CodeUnits and CodePoints modules for parsers), and always have the pos field count code units, but also to arrange things so that astral plane characters are handled correctly by default. I don’t think using an index into code points is justifiable really, if we are considering performance (which we ought to, imo). For example, anyChar should probably fail instead of splitting up a surrogate pair. We should probably also nudge people towards using String and/or CodePoint over Char.

@rintcius
Copy link
Contributor Author

@hdgarrood Do you agree that this PR addresses the immediate problem and do you accept it as a short term fix?

I'm happy to discuss longer term solutions but may be good to do that separate from this PR?

@hdgarrood
Copy link
Contributor

Yes I do, I just hadn't gotten around to merging. Thanks!

@hdgarrood hdgarrood merged commit 348fdae into purescript-contrib:master Oct 15, 2019
@rintcius rintcius deleted the fix-codepoint-anychar branch October 15, 2019 07:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants