Skip to content

Exposing surrogates in the string API leads to buggy code #24619

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Hixie opened this issue Oct 18, 2015 · 16 comments
Closed

Exposing surrogates in the string API leads to buggy code #24619

Hixie opened this issue Oct 18, 2015 · 16 comments
Labels
area-core-library SDK core library issues (core, async, ...); use area-vm or area-web for platform specific libraries.

Comments

@Hixie
Copy link
Contributor

Hixie commented Oct 18, 2015

In embeddings that don't have a UTF-16 legacy to deal with, we should remove the String API affordances that expose the underlying UTF-16-ness of the String class.

With the UTF-16-ness exposed, it's very easy to mistakenly write code that works perfectly well in ASCII-only environments and even in naïve Unicode-aware testing ("Let me make sure it works with an 'é'!"), but that break as soon as the real users start sending emoji to each other. Furthermore, because the fix is to switch from using an O(1)-indexable API to an iterator API, it requires some non-trivial work to fix.

We should only expose UTF-16 code units when the author is explicitly trying to encode Strings to UTF-16 byte arrays, just like we only expose UTF-8 bytes when the author is encoding Strings to UTF-8.

@sethladd sethladd added the area-core-library SDK core library issues (core, async, ...); use area-vm or area-web for platform specific libraries. label Oct 19, 2015
@lrhn
Copy link
Member

lrhn commented Oct 19, 2015

That would be an extremely breaking change - as you say: changing the API to remove length, operator[] and codeUnitAt. There is a lot of code out there using those. So, it's not something to be taken lightly, it would break every string manipulating library ever written.

I fully agree that it isn't safe. I think the real problem is that we don't have a safe alternative that you can use when you want to. Even then, you still have to pick the right iterator to use (UTF-8/UTF-16/UTF-32/Grapheme cluster), and most simple programs aren't really up to using grapheme clusters as the unit of string content.

It is, technically, a library only change (no need to change the language spec). In practice, it's a language change to change something so fundamental as the String class.

@Hixie
Copy link
Contributor Author

Hixie commented Oct 19, 2015

Sure. I'm only suggesting doing this for new embeddings, like Flutter. Obviously existing embeddings wouldn't change.

Presumably, there will be far more new code in the future than there already exists today. The earlier we do this, the less breakage there will be.

My thesis is that most of the code that would be broken by this is already actually broken. The authors and users just don't realise it yet.

@zoechi
Copy link
Contributor

zoechi commented Oct 19, 2015

What about introducing a new class similar to string with UTF functionality?

@Hixie
Copy link
Contributor Author

Hixie commented Oct 19, 2015

A new class would be fine by me, so long as that's the class that the string literal syntax creates.

@Hixie
Copy link
Contributor Author

Hixie commented Oct 19, 2015

@lrhn Just "runes" is already much safer than UTF-16 codepoints. Just leaving that part of the API would already be a huge improvement even if it's not the perfect solution. We shouldn't let a desire to make this perfect paralyze us and prevent us from making it better.

@ErikCorryGoogle
Copy link
Contributor

If the underlying representation is still UTF-16, but you have an API that
doesn't expose this, then several common operations go from O(1) to O(n).
For example, getting the nth character for some fuzzy definition of
character.

The API and the underlying representation have to go together. Similarly,
I don't think you'll find systems that store data in UTF-8 but don't allow
byte indexing.

On 19 October 2015 at 09:17, Ian Hickson [email protected] wrote:

@lrhn https://github.com/lrhn Just "runes" is already much safer than
UTF-16 codepoints. Just leaving that part of the API would already be a
huge improvement even if it's not the perfect solution. We shouldn't let a
desire to make this perfect paralyze us and prevent us from making it
better.


Reply to this email directly or view it on GitHub
#24619 (comment).

Erik Corry

Google Denmark ApS
Frederiksborggade 20B, 1 sal
1360 København K
Denmark
CVR nr. 28 86 69 84

@zoechi
Copy link
Contributor

zoechi commented Oct 19, 2015

@Hixie

A new class would be fine by me, so long as that's the class that the string literal syntax creates.

This is still a huge breaking change similar to changing string itself.
A minor addition to the syntax (like r u'some utf string', or backtick some utf string, ...) and a linter rule (if enabled) checks that only the UTF syntax is used could be close enough?

@Hixie
Copy link
Contributor Author

Hixie commented Oct 19, 2015

@ErikCorryGoogle There's no way to get the nth character in O(1) if the representation is UTF-16. You can get the nth codeunit, but that's really bad (see the first comment here, that's the whole point of this bug).

IMHO, the write way to implement this is to have the underlying representation be a detail, and have the runes iterator be swapped in on the fly based on the representation. This is exactly like how in Dart, an "int" has multiple kinds of representations depending on the value, but that is all hidden from the author. So for example, if the backing data is in UTF-8, you iterate with an efficient UTF-8 iterator. If the backing data is in UTF-16, you iterate with an efficient UTF-16 iterator. In both cases, if you want complete graphemes, then you iterate with a grapheme cluster iterator layered on top of the codepoint iterator. If you want to iterate over the string while tracking the bidi state, you use a bidi-algorithm iterator layered on top of the codepoint iterator. And so on.

They're strings, though. You shouldn't be able to index into strings today. Unicode strings just aren't indexable in O(1) time. You can index into byte or word arrays, but if you want to do that you should have to ask for that. It shouldn't be the default.

This is very analogous to the situation with numbers. In dart-to-js, numbers work differently than in the Dart VM. We should just do the same with Strings.

@zoechi The bug here is that the easiest string manipulation API is a bug-prone API. Breaking it is the entire point.

@lrhn
Copy link
Member

lrhn commented Oct 20, 2015

The Go approach is to use UTF-8 for string literals (really, it's a byte slice) and have string interpretations of that which allows you to iterate code points or even grapheme clusters. You can index the byte slice as you wish, it's just not a way to get "characters".

We have a String class which is not just an array of uint16 (although it is effectivley one), but can still have interpretations of it as code points or grapheme clusters. We have the "runes" iterator which allows iterating it as a sequence of code points (as integers or strings), but no grapheme cluster iterator.

This request is for an easy/short way to get the first code point of a Dart string as a string.
I'm not sure that's something that should go on String. It could go on Runes as firstAsString, but it seems a little too specialized for my taste.

@jiridanek
Copy link

@lrhn I was also thinking about the Go approach. People behind Go also did UTF8 in the 90's so one might think they know a thing or two about strings. On the other hand, when solving the problems Go was intended for, you often just pass strings around (say over the network) without inspecting them. Dart is different in this respect.

In Go, strings are byte arrays (slices), without specifying encoding. String literals are turned into utf-8 encoded byte arrays. Function in the library are either encoding agnostic (work with bytes), e.g. various searching and replacement functions or (rarely) assume utf-8 (capitalize a string) Then there is a syntactic construct to iterate over runes and a function to convert from byte arrays to int arrays of runes, if you need constant time addressing. But still, runes are not "characters".

The article at https://blog.golang.org/strings explains the string handling better. There is also a followup https://blog.golang.org/normalization. I somehow dislike the idea that one needs to learn all these complicated things to be able to use Strings in a programming language.

@lrhn
Copy link
Member

lrhn commented Oct 20, 2015

It is complicated. That's one of the reasons we have a simple-but-incomplete String class. You can work with that without understanding everything about Unicode, it just won't handle all cases - but to all handle a case, you need to be aware that they exist. If the only strings we have were Unicode-safe iterator only structures, it would be quite a barrier of entry.

There is no royal road to writing Unicode aware code. You do need to know that all these things exist, otherwise you will fail just as much as the code that only expects ASCII. What we need to do is to provide the tools you need when you know which problem they are solving.
The biggest problem is that most Unicode functionality needs large tables. If we could rely on the Unicode functionality being provided by the OS or platform, it would be much easier, but fx JavaScript doesn't have Unicode support except for toUpperCase/toLowerCase. Even those require significant internal tables.

@Hixie
Copy link
Contributor Author

Hixie commented Oct 20, 2015

This request is for an easy/short way to get the first code point of a Dart string as a string.

That's bug #24617. This request is to not expose UTF-16 code units (or UTF-8 bytes) as an API that is more intuitive than Unicode characters (runes) or graphemes.

The biggest problem is that most Unicode functionality needs large tables.

This is not a problem for embedders that are already paying that cost, such as Flutter.

@lrhn
Copy link
Member

lrhn commented Oct 21, 2015

That's bug #24617
Ack, my mistake.

If the request is to not have an UTF-16 API that is more intuitive than runes/grapheme-clusters, then the solution should be to improve the rune/grapheme-cluster APIs. Reducing the intuitiveness of an existing API is counter-productive (and breaking). It's basically saying that we have a good, usable API that doesn't cover all cases, so we should make it worse. :)

@Hixie
Copy link
Contributor Author

Hixie commented Oct 21, 2015

I disagree. Even if somehow we had a really intuitive runes API, s[0] would still look like it was getting the first character of a String s, which looks like it's a sequence of characters. That's the bug.

Again, this being a breaking change is the entire point. It's very likely that most usage of the current API is already broken.

@jiridanek
Copy link

s[0] would still look like it was getting the first character of a String s, which looks like it's a sequence of characters.

I think it cannot be helped. It is like while (condition); {do stuff();} looks like it does stuff when it does not. The solution is education. I head once that debugging the semicolon example usually takes a beginning CS student about half an hour. But they learn. ;)

There are plans in progress about Dart 2.0. https://github.com/dart-lang/dart_enhancement_proposals/blob/master/Meetings/2015-08-26%20DEP%20Committee%20Meeting.md#dart-20 so now might be a good time to start proposing breaking changes in the Dart Libraries as well.

I do not feel qualified to write a DEP, but it seems to me this is what needs to be done if this discussion is to move anywhere.

One more thing that Go does is that they have an automated tool to update existing code when the language or the libraries change. Dart already has a code formatter, which is the first prerequisite to doing this. It it was possible to move existing code to a new API by executing a single command when Dart 2.0 comes out, there would be much less friction in making significant changes to Dart and the Dart Libraries.

@Hixie
Copy link
Contributor Author

Hixie commented Mar 7, 2017

Closing in favour of #28404

@Hixie Hixie closed this as completed Mar 7, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-core-library SDK core library issues (core, async, ...); use area-vm or area-web for platform specific libraries.
Projects
None yet
Development

No branches or pull requests

6 participants