Fix CodePoint.anyChar parser #46
Conversation
This relates to #42 as well. @justinwoo do you have opinions on whether this should remain
@rintcius I read the implementation and it looks good to me; I'm not as familiar with this library as other members of -contrib, though, and I'd like to have one of them weigh in as well before merging. Thanks!
I don’t think this is quite right, as the |
(the difference between indexing into a sequence of code points and indexing into a sequence of code units will only be apparent if there are non-BMP code units before the |
Oh also, this stuff is all quite subtle, so please let me know if you haven’t followed everything I’ve just said - I’m very happy to clarify.
I think the parsers in But yeah this code is pretty tricky indeed. I am curious if you can think of scenarios that go wrong for the parsers in |
Oh yes of course, sorry. I will try to look in more detail soon.
Un-assigning myself as I know little of this topic and won't be of much help outside clicking the Merge button. I'll keep watching the pull request, however.
Ok, on a second look I think this PR is good to merge (and could be released as a patch-level change). I didn't realise until just now that we use the same |
That would be a big change indeed, but also if we want to take that road, then for consistency we may want to start with splitting Not saying that that is the road to take, just that the root of the problem isn't really in these Parsers but in String (allowing it to be interpreted both as a list of chars and as a list of codepoints) |
I don’t think so - I would argue that it makes sense to be able to ask both “what is the nth code unit” and “what is the nth code point” of the same String value. I think the problem is in the parsers because it’s only in the parsers where the meaning of the |
... but by allowing both questions to be asked at the same time, you also open the door to bugs like this. If you disallow this in String, then you'd avoid bugs like this altogether further downstream.
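To make the two questions from this exchange concrete ("what is the nth code unit" versus "what is the nth code point"), here is a minimal TypeScript sketch. It is not code from this library. It assumes the JavaScript backend, where a PureScript String is a JavaScript string of UTF-16 code units, and `nthCodeUnit`, `nthCodePoint` and `sample` are hypothetical names used only for illustration.

```typescript
// Hypothetical sample: 'a', U+1D400 (outside the BMP, so two UTF-16 code units), 'b'.
const sample = "a\uD835\uDC00b";

// "What is the nth code unit?" Indexes directly into the UTF-16 code units.
function nthCodeUnit(str: string, n: number): number | undefined {
  return n >= 0 && n < str.length ? str.charCodeAt(n) : undefined;
}

// "What is the nth code point?" Walks the string, treating a valid
// surrogate pair as a single code point.
function nthCodePoint(str: string, n: number): number | undefined {
  let unitIndex = 0;  // position counted in code units
  let pointIndex = 0; // position counted in code points
  while (unitIndex < str.length) {
    const hi = str.charCodeAt(unitIndex);
    let codePoint = hi;
    let width = 1;
    if (hi >= 0xd800 && hi <= 0xdbff && unitIndex + 1 < str.length) {
      const lo = str.charCodeAt(unitIndex + 1);
      if (lo >= 0xdc00 && lo <= 0xdfff) {
        codePoint = 0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00);
        width = 2;
      }
    }
    if (pointIndex === n) return codePoint;
    unitIndex += width;
    pointIndex += 1;
  }
  return undefined;
}

// Index 2 names different things under the two interpretations:
// the low surrogate 0xDC00 as a code unit, but 'b' (0x62) as a code point.
const unitAt2 = nthCodeUnit(sample, 2);
const pointAt2 = nthCodePoint(sample, 2);
```

The two answers agree for every index before the first non-BMP character and diverge after it, which is why a position has to be given one meaning consistently.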
In my mind the problem is very definitely in this library, not upstream. The bug is not as a result of “get the nth code point” and “get the nth code unit” both being operations we allow on String, because they are both operations which String supports whether we like it or not. The bug should be considered to be here because this library defines the parser type and then fails to assign a consistent meaning to the |
Actually, perhaps the real answer is to just have one collection of parsers (rather than separate CodeUnits and CodePoints modules for parsers), and always have the |
@hdgarrood Do you agree that this PR addresses the immediate problem, and do you accept it as a short-term fix? I'm happy to discuss longer-term solutions, but it may be good to do that separately from this PR.
Yes I do, I just hadn't gotten around to merging. Thanks!
What does this pull request do?
CodePoint.anyChar was not unicode-safe. I think it would be good to add a CodePoint.anyCodePoint parser too (I'll add that in a separate PR), but CodePoint.anyChar seems to be valuable as well.
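As an illustration of what "not unicode-safe" can mean for an any-character parser, the TypeScript sketch below contrasts a parser that consumes one UTF-16 code unit with one that consumes a whole code point. It is a sketch under stated assumptions, not the implementation in this PR: `State`, `Result`, `naiveAnyChar` and `wholeCodePointAnyChar` are hypothetical names, and the position is assumed to be counted in code units.

```typescript
// Hypothetical parser state: the input plus a position counted in UTF-16 code units.
interface State {
  input: string;
  pos: number;
}

type Result =
  | { ok: true; value: string; next: State }
  | { ok: false; error: string };

// Consumes exactly one code unit, so it can split a surrogate pair in half.
function naiveAnyChar(st: State): Result {
  if (st.pos >= st.input.length) return { ok: false, error: "Unexpected EOF" };
  return {
    ok: true,
    value: st.input.charAt(st.pos),
    next: { input: st.input, pos: st.pos + 1 },
  };
}

// Consumes one whole code point: two code units for a valid surrogate pair,
// otherwise one.
function wholeCodePointAnyChar(st: State): Result {
  if (st.pos >= st.input.length) return { ok: false, error: "Unexpected EOF" };
  const hi = st.input.charCodeAt(st.pos);
  let width = 1;
  if (hi >= 0xd800 && hi <= 0xdbff && st.pos + 1 < st.input.length) {
    const lo = st.input.charCodeAt(st.pos + 1);
    if (lo >= 0xdc00 && lo <= 0xdfff) width = 2;
  }
  return {
    ok: true,
    value: st.input.slice(st.pos, st.pos + width),
    next: { input: st.input, pos: st.pos + width },
  };
}

// Hypothetical input: U+1D400 followed by 'b'.
const start: State = { input: "\uD835\uDC00b", pos: 0 };
const naive = naiveAnyChar(start);          // yields a lone high surrogate
const whole = wholeCodePointAnyChar(start); // yields the full two-unit character
```

With the naive version, the next parser in a sequence starts on the low surrogate, so every later match and reported position is off for input containing non-BMP characters.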