Exposing surrogates in the string API leads to buggy code #24619
That would be an extremely breaking change. As you say ("changing the API to remove …"), I fully agree that it isn't safe. I think the real problem is that we don't have a safe alternative that you can use when you want to. Even then, you still have to pick the right iterator to use (UTF-8/UTF-16/UTF-32/grapheme cluster), and most simple programs aren't really up to using grapheme clusters as the unit of string content. It is, technically, a library-only change (no need to change the language spec). In practice, changing something as fundamental as the String class is a language change.
Sure. I'm only suggesting doing this for new embeddings, like Flutter. Obviously existing embeddings wouldn't change. Presumably, there will be far more new code in the future than exists today, so the earlier we do this, the less breakage there will be. My thesis is that most of the code that would be broken by this is already actually broken; the authors and users just don't realise it yet.
What about introducing a new class similar to String with UTF functionality?
A new class would be fine by me, so long as that's the class that the string literal syntax creates.
@lrhn Just "runes" is already much safer than UTF-16 code units. Just leaving that part of the API would already be a huge improvement even if it's not the perfect solution. We shouldn't let a desire to make this perfect paralyze us and prevent us from making it better.
If the underlying representation is still UTF-16, but you have an API that … The API and the underlying representation have to go together.
This is still a huge breaking change, similar to changing String itself.
@ErikCorryGoogle There's no way to get the nth character in O(1) if the representation is UTF-16. You can get the nth code unit, but that's really bad (see the first comment here; that's the whole point of this bug).

IMHO, the right way to implement this is to make the underlying representation a detail, and have the runes iterator be swapped in on the fly based on the representation. This is exactly like how in Dart an "int" has multiple kinds of representations depending on the value, but that is all hidden from the author. So for example, if the backing data is in UTF-8, you iterate with an efficient UTF-8 iterator. If the backing data is in UTF-16, you iterate with an efficient UTF-16 iterator. In both cases, if you want complete graphemes, you iterate with a grapheme cluster iterator layered on top of the codepoint iterator. If you want to iterate over the string while tracking the bidi state, you use a bidi-algorithm iterator layered on top of the codepoint iterator. And so on.

They're strings, though. You shouldn't be able to index into strings today; Unicode strings just aren't indexable in O(1) time. You can index into byte or word arrays, but if you want to do that you should have to ask for it. It shouldn't be the default. This is very analogous to the situation with numbers: in dart-to-js, numbers work differently than in the Dart VM. We should just do the same with Strings.

@zoechi The bug here is that the easiest string manipulation API is a bug-prone API. Breaking it is the entire point.
The Go approach is to use UTF-8 for string literals (really, it's a byte slice) and have string interpretations of that which allow you to iterate code points or even grapheme clusters. You can index the byte slice as you wish; it's just not a way to get "characters". We have a … This request is for an easy/short way to get the first code point of a Dart string as a string.
@lrhn I was also thinking about the Go approach. The people behind Go also did UTF-8 in the '90s, so one might think they know a thing or two about strings. On the other hand, when solving the problems Go was intended for, you often just pass strings around (say, over the network) without inspecting them. Dart is different in this respect.

In Go, strings are byte arrays (slices), without a specified encoding. String literals are turned into UTF-8 encoded byte arrays. Functions in the library are either encoding-agnostic (working with bytes), e.g. various searching and replacement functions, or (rarely) assume UTF-8 (capitalizing a string). Then there is a syntactic construct to iterate over runes, and a function to convert from byte arrays to int arrays of runes if you need constant-time addressing. But still, runes are not "characters". The article at https://blog.golang.org/strings explains the string handling better. There is also a followup at https://blog.golang.org/normalization.

I somehow dislike the idea that one needs to learn all these complicated things to be able to use strings in a programming language.
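To make the Go model described above concrete, here is a small sketch of those constructs: byte length versus rune iteration via `range`, and conversion to `[]rune` for constant-time addressing.

```go
package main

import "fmt"

func main() {
	s := "héllo" // string literal: UTF-8 encoded bytes

	// len gives the byte length, not the character count ('é' is 2 bytes).
	fmt.Println(len(s)) // 6

	// range over a string iterates code points (runes),
	// yielding the byte index where each one starts.
	for i, r := range s {
		fmt.Printf("%d: %c\n", i, r)
	}

	// For O(1) addressing by code point, convert to []rune first.
	runes := []rune(s)
	fmt.Println(len(runes), string(runes[1])) // 5 é
}
```

Note that even `runes[1]` addresses a code point, not a "character" in the grapheme-cluster sense - exactly the caveat the comment above makes.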
It is complicated. That's one of the reasons we have a simple-but-incomplete String class. You can work with it without understanding everything about Unicode; it just won't handle all cases - but to handle a case at all, you need to be aware that it exists. If the only strings we had were Unicode-safe, iterator-only structures, it would be quite a barrier to entry.

There is no royal road to writing Unicode-aware code. You do need to know that all these things exist; otherwise you will fail just as much as the code that only expects ASCII. What we need to do is provide the tools you need when you know which problem they are solving.
That's bug #24617. This request is to not expose UTF-16 code units (or UTF-8 bytes) as an API that is more intuitive than Unicode characters (runes) or graphemes.
This is not a problem for embedders that are already paying that cost, such as Flutter.
If the request is to not have a UTF-16 API that is more intuitive than runes/grapheme clusters, then the solution should be to improve the rune/grapheme-cluster APIs. Reducing the intuitiveness of an existing API is counter-productive (and breaking). It's basically saying that we have a good, usable API that doesn't cover all cases, so we should make it worse. :)
I disagree. Even if somehow we had a really intuitive runes API, … Again, this being a breaking change is the entire point. It's very likely that most usage of the current API is already broken.
I think it cannot be helped. It is like … There are plans in progress for Dart 2.0 (https://github.com/dart-lang/dart_enhancement_proposals/blob/master/Meetings/2015-08-26%20DEP%20Committee%20Meeting.md#dart-20), so now might be a good time to start proposing breaking changes in the Dart libraries as well. I do not feel qualified to write a DEP, but it seems to me this is what needs to be done if this discussion is to move anywhere.

One more thing that Go does is provide an automated tool to update existing code when the language or the libraries change. Dart already has a code formatter, which is the first prerequisite for doing this. If it were possible to move existing code to a new API by executing a single command when Dart 2.0 comes out, there would be much less friction in making significant changes to Dart and the Dart libraries.
Closing in favour of #28404 |
In embeddings that don't have a UTF-16 legacy to deal with, we should remove the String API affordances that expose the underlying UTF-16-ness of the String class.
With the UTF-16-ness exposed, it's very easy to mistakenly write code that works perfectly well in ASCII-only environments and even in naïve Unicode-aware testing ("Let me make sure it works with an 'é'!"), but that breaks as soon as the real users start sending emoji to each other. Furthermore, because the fix is to switch from using an O(1)-indexable API to an iterator API, it requires some non-trivial work to fix.
We should only expose UTF-16 code units when the author is explicitly trying to encode Strings to UTF-16 byte arrays, just like we only expose UTF-8 bytes when the author is encoding Strings to UTF-8.