Fix CodePoint.anyChar parser #46

rintcius · 2019-09-18T09:36:51Z

What does this pull request do?

CodePoint.anyChar was not unicode-safe. I think it would be good to add a CodePoint.anyCodePoint parser too (I'll add that in a separate PR), but CodePoint.anyChar seems to be valuable as well.

thomashoneyman · 2019-09-18T15:50:58Z

This relates to #42 as well. @justinwoo do you have opinions on whether this should remain Char or if the anyChar should just be replaced wholesale with anyCodePoint?

@rintcius I read the implementation and it looks good to me; I'm not as familiar with this library as other members of -contrib, though, and I'd like to have one of them weigh in as well before merging. Thanks!

hdgarrood · 2019-09-19T19:33:13Z

I don’t think this is quite right, as the pos in the parser state will presumably still represent an index into code units elsewhere, whereas it is now being treated as an index into code points here. I think this library probably needs a ground up rethink if it is to handle non BMP characters correctly.

hdgarrood · 2019-09-19T19:36:13Z

(the difference between indexing into a sequence of code points and indexing into a sequence of code units will only be apparent if there are non-BMP code units before the pos we have reached in the string being parsed.)

hdgarrood · 2019-09-19T20:14:24Z

Oh also, this stuff is all quite subtle, so please let me know if you haven’t followed everything I’ve just said - I’m very happy to clarify.

rintcius · 2019-09-19T21:04:47Z

I don’t think this is quite right, as the pos in the parser state will presumably still represent an index into code units elsewhere, whereas it is now being treated as an index into code points here. I think this library probably needs a ground up rethink if it is to handle non BMP characters correctly.

I think the parsers in CodePoints.purs need to have pos to be an index into code points, like
https://github.com/purescript-contrib/purescript-string-parsers/blob/master/src/Text/Parsing/StringParser/CodePoints.purs#L67 does this already.
What's not handling non Chars correctly are the parsers in CodeUnits, but AFAIU these are not supposed to be used if you need full unicode support.

But yeah this code is pretty tricky indeed. I am curious if you can think of scenarios that go wrong for the parsers in CodePoints (for full unicode support) or CodeUnits (for strings that contain proper Chars only)

hdgarrood · 2019-09-20T14:00:28Z

Oh yes of course, sorry. I will try to look in more detail soon.

thomashoneyman · 2019-09-23T01:53:00Z

Un-assigning myself as I know little of this topic and won't be of much help outside clicking the Merge button. I'll keep watching the pull request, however.

hdgarrood · 2019-09-23T13:27:22Z

Ok, on a second look I think this PR is good to merge (and could be released as a patch-level change). I didn't realise until just now that we use the same Parser type for all parsers, regardless of whether they expect the index to count code units or code points, i.e. the type checker will let you mix and match parsers from both the CodeUnits and CodePoints modules, which is not great, as these are very likely to break in the presence of astral plane code points. But of course we can't fix that without a big breaking change.

rintcius · 2019-09-23T14:50:44Z

That would be a big change indeed, but also if we want to take that road, then for consistency we may want to start with splitting String into CodeUnit.String and CodePoint.String, or something along those lines. Which is of course a huge breaking change.

Not saying that that is the road to take, just that the root of the problem isn't really in these Parsers but in String (allowing it to be interpreted both as a list of chars and as a list of codepoints)

hdgarrood · 2019-09-23T15:04:37Z

I don’t think so - I would argue that it makes sense to be able to ask both “what is the nth code unit” and “what is the nth code point” of the same String value. I think the problem is in the parsers because it’s only in the parsers where the meaning of the pos in the parser state is interpreted inconsistently.

rintcius · 2019-09-23T19:28:28Z

it makes sense to be able to ask both “what is the nth code unit” and “what is the nth code point” of the same String value.

... but by allowing to ask both at the same time, you also open the door to bugs like this. if you disallow this in string, then you'd avoid bugs like this altogether further downstream.

hdgarrood · 2019-09-23T23:42:52Z

In my mind the problem is very definitely in this library, not upstream. The bug is not as a result of “get the nth code point” and “get the nth code unit” both being operations we allow on String, because they are both operations which String supports whether we like it or not. The bug should be considered to be here because this library defines the parser type and then fails to assign a consistent meaning to the pos field in the parser state.

hdgarrood · 2019-09-24T00:10:30Z

Actually, perhaps the real answer is to just have one collection of parsers (rather than separate CodeUnits and CodePoints modules for parsers), and always have the pos field count code units, but also to arrange things so that astral plane characters are handled correctly by default. I don’t think using an index into code points is justifiable really, if we are considering performance (which we ought to, imo). For example, anyChar should probably fail instead of splitting up a surrogate pair. We should probably also nudge people towards using String and/or CodePoint over Char.

rintcius · 2019-10-13T19:27:00Z

@hdgarrood Do you agree that this PR addresses the immediate problem and do you accept it as a short term fix?

I'm happy to discuss longer term solutions but may be good to do that separate from this PR?

hdgarrood · 2019-10-15T05:46:58Z

Yes I do, I just hadn't gotten around to merging. Thanks!

Fix CodePoint.anyChar parser

326cd30

thomashoneyman self-assigned this Sep 18, 2019

thomashoneyman removed their assignment Sep 23, 2019

hdgarrood merged commit 348fdae into purescript-contrib:master Oct 15, 2019

This was referenced Oct 15, 2019

Revert "Fix CodePoint.anyChar parser" #47

Closed

Unify CodePoints and CodeUnits parsers #48

Open

rintcius deleted the fix-codepoint-anychar branch October 15, 2019 07:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix CodePoint.anyChar parser #46

Fix CodePoint.anyChar parser #46

rintcius commented Sep 18, 2019

thomashoneyman commented Sep 18, 2019

hdgarrood commented Sep 19, 2019

hdgarrood commented Sep 19, 2019

hdgarrood commented Sep 19, 2019

rintcius commented Sep 19, 2019

hdgarrood commented Sep 20, 2019

thomashoneyman commented Sep 23, 2019

hdgarrood commented Sep 23, 2019

rintcius commented Sep 23, 2019

hdgarrood commented Sep 23, 2019

rintcius commented Sep 23, 2019

hdgarrood commented Sep 23, 2019

hdgarrood commented Sep 24, 2019

rintcius commented Oct 13, 2019

hdgarrood commented Oct 15, 2019

Fix CodePoint.anyChar parser #46

Fix CodePoint.anyChar parser #46

Conversation

rintcius commented Sep 18, 2019

What does this pull request do?

thomashoneyman commented Sep 18, 2019

hdgarrood commented Sep 19, 2019

hdgarrood commented Sep 19, 2019

hdgarrood commented Sep 19, 2019

rintcius commented Sep 19, 2019

hdgarrood commented Sep 20, 2019

thomashoneyman commented Sep 23, 2019

hdgarrood commented Sep 23, 2019

rintcius commented Sep 23, 2019

hdgarrood commented Sep 23, 2019

rintcius commented Sep 23, 2019

hdgarrood commented Sep 23, 2019

hdgarrood commented Sep 24, 2019

rintcius commented Oct 13, 2019

hdgarrood commented Oct 15, 2019