Exposing surrogates in the string API leads to buggy code #24619

Hixie · 2015-10-18T18:35:47Z

In embeddings that don't have a UTF-16 legacy to deal with, we should remove the String API affordances that expose the underlying UTF-16-ness of the String class.

With the UTF-16-ness exposed, it's very easy to mistakenly write code that works perfectly well in ASCII-only environments and even in naïve Unicode-aware testing ("Let me make sure it works with an 'é'!"), but that break as soon as the real users start sending emoji to each other. Furthermore, because the fix is to switch from using an O(1)-indexable API to an iterator API, it requires some non-trivial work to fix.

We should only expose UTF-16 code units when the author is explicitly trying to encode Strings to UTF-16 byte arrays, just like we only expose UTF-8 bytes when the author is encoding Strings to UTF-8.

lrhn · 2015-10-19T06:48:33Z

That would be an extremely breaking change - as you say: changing the API to remove length, operator[] and codeUnitAt. There is a lot of code out there using those. So, it's not something to be taken lightly, it would break every string manipulating library ever written.

I fully agree that it isn't safe. I think the real problem is that we don't have a safe alternative that you can use when you want to. Even then, you still have to pick the right iterator to use (UTF-8/UTF-16/UTF-32/Grapheme cluster), and most simple programs aren't really up to using grapheme clusters as the unit of string content.

It is, technically, a library only change (no need to change the language spec). In practice, it's a language change to change something so fundamental as the String class.

Hixie · 2015-10-19T07:13:08Z

Sure. I'm only suggesting doing this for new embeddings, like Flutter. Obviously existing embeddings wouldn't change.

Presumably, there will be far more new code in the future than there already exists today. The earlier we do this, the less breakage there will be.

My thesis is that most of the code that would be broken by this is already actually broken. The authors and users just don't realise it yet.

zoechi · 2015-10-19T07:13:24Z

What about introducing a new class similar to string with UTF functionality?

Hixie · 2015-10-19T07:15:15Z

A new class would be fine by me, so long as that's the class that the string literal syntax creates.

Hixie · 2015-10-19T07:17:55Z

@lrhn Just "runes" is already much safer than UTF-16 codepoints. Just leaving that part of the API would already be a huge improvement even if it's not the perfect solution. We shouldn't let a desire to make this perfect paralyze us and prevent us from making it better.

ErikCorryGoogle · 2015-10-19T07:29:28Z

If the underlying representation is still UTF-16, but you have an API that
doesn't expose this, then several common operations go from O(1) to O(n).
For example, getting the nth character for some fuzzy definition of
character.

The API and the underlying representation have to go together. Similarly,
I don't think you'll find systems that store data in UTF-8 but don't allow
byte indexing.

On 19 October 2015 at 09:17, Ian Hickson [email protected] wrote:

@lrhn https://github.com/lrhn Just "runes" is already much safer than
UTF-16 codepoints. Just leaving that part of the API would already be a
huge improvement even if it's not the perfect solution. We shouldn't let a
desire to make this perfect paralyze us and prevent us from making it
better.

—
Reply to this email directly or view it on GitHub
#24619 (comment).

Erik Corry

Google Denmark ApS
Frederiksborggade 20B, 1 sal
1360 København K
Denmark
CVR nr. 28 86 69 84

zoechi · 2015-10-19T08:23:37Z

@Hixie

A new class would be fine by me, so long as that's the class that the string literal syntax creates.

This is still a huge breaking change similar to changing string itself.
A minor addition to the syntax (like r u'some utf string', or backtick some utf string, ...) and a linter rule (if enabled) checks that only the UTF syntax is used could be close enough?

Hixie · 2015-10-19T15:42:20Z

@ErikCorryGoogle There's no way to get the nth character in O(1) if the representation is UTF-16. You can get the nth codeunit, but that's really bad (see the first comment here, that's the whole point of this bug).

IMHO, the write way to implement this is to have the underlying representation be a detail, and have the runes iterator be swapped in on the fly based on the representation. This is exactly like how in Dart, an "int" has multiple kinds of representations depending on the value, but that is all hidden from the author. So for example, if the backing data is in UTF-8, you iterate with an efficient UTF-8 iterator. If the backing data is in UTF-16, you iterate with an efficient UTF-16 iterator. In both cases, if you want complete graphemes, then you iterate with a grapheme cluster iterator layered on top of the codepoint iterator. If you want to iterate over the string while tracking the bidi state, you use a bidi-algorithm iterator layered on top of the codepoint iterator. And so on.

They're strings, though. You shouldn't be able to index into strings today. Unicode strings just aren't indexable in O(1) time. You can index into byte or word arrays, but if you want to do that you should have to ask for that. It shouldn't be the default.

This is very analogous to the situation with numbers. In dart-to-js, numbers work differently than in the Dart VM. We should just do the same with Strings.

@zoechi The bug here is that the easiest string manipulation API is a bug-prone API. Breaking it is the entire point.

lrhn · 2015-10-20T08:21:01Z

The Go approach is to use UTF-8 for string literals (really, it's a byte slice) and have string interpretations of that which allows you to iterate code points or even grapheme clusters. You can index the byte slice as you wish, it's just not a way to get "characters".

We have a String class which is not just an array of uint16 (although it is effectivley one), but can still have interpretations of it as code points or grapheme clusters. We have the "runes" iterator which allows iterating it as a sequence of code points (as integers or strings), but no grapheme cluster iterator.

This request is for an easy/short way to get the first code point of a Dart string as a string.
I'm not sure that's something that should go on String. It could go on Runes as firstAsString, but it seems a little too specialized for my taste.

jiridanek · 2015-10-20T09:12:59Z

@lrhn I was also thinking about the Go approach. People behind Go also did UTF8 in the 90's so one might think they know a thing or two about strings. On the other hand, when solving the problems Go was intended for, you often just pass strings around (say over the network) without inspecting them. Dart is different in this respect.

In Go, strings are byte arrays (slices), without specifying encoding. String literals are turned into utf-8 encoded byte arrays. Function in the library are either encoding agnostic (work with bytes), e.g. various searching and replacement functions or (rarely) assume utf-8 (capitalize a string) Then there is a syntactic construct to iterate over runes and a function to convert from byte arrays to int arrays of runes, if you need constant time addressing. But still, runes are not "characters".

The article at https://blog.golang.org/strings explains the string handling better. There is also a followup https://blog.golang.org/normalization. I somehow dislike the idea that one needs to learn all these complicated things to be able to use Strings in a programming language.

lrhn · 2015-10-20T10:46:23Z

It is complicated. That's one of the reasons we have a simple-but-incomplete String class. You can work with that without understanding everything about Unicode, it just won't handle all cases - but to all handle a case, you need to be aware that they exist. If the only strings we have were Unicode-safe iterator only structures, it would be quite a barrier of entry.

There is no royal road to writing Unicode aware code. You do need to know that all these things exist, otherwise you will fail just as much as the code that only expects ASCII. What we need to do is to provide the tools you need when you know which problem they are solving.
The biggest problem is that most Unicode functionality needs large tables. If we could rely on the Unicode functionality being provided by the OS or platform, it would be much easier, but fx JavaScript doesn't have Unicode support except for toUpperCase/toLowerCase. Even those require significant internal tables.

Hixie · 2015-10-20T15:04:51Z

This request is for an easy/short way to get the first code point of a Dart string as a string.

That's bug #24617. This request is to not expose UTF-16 code units (or UTF-8 bytes) as an API that is more intuitive than Unicode characters (runes) or graphemes.

The biggest problem is that most Unicode functionality needs large tables.

This is not a problem for embedders that are already paying that cost, such as Flutter.

lrhn · 2015-10-21T12:07:39Z

That's bug #24617
Ack, my mistake.

If the request is to not have an UTF-16 API that is more intuitive than runes/grapheme-clusters, then the solution should be to improve the rune/grapheme-cluster APIs. Reducing the intuitiveness of an existing API is counter-productive (and breaking). It's basically saying that we have a good, usable API that doesn't cover all cases, so we should make it worse. :)

Hixie · 2015-10-21T14:48:02Z

I disagree. Even if somehow we had a really intuitive runes API, s[0] would still look like it was getting the first character of a String s, which looks like it's a sequence of characters. That's the bug.

Again, this being a breaking change is the entire point. It's very likely that most usage of the current API is already broken.

jiridanek · 2015-10-21T15:37:48Z

s[0] would still look like it was getting the first character of a String s, which looks like it's a sequence of characters.

I think it cannot be helped. It is like while (condition); {do stuff();} looks like it does stuff when it does not. The solution is education. I head once that debugging the semicolon example usually takes a beginning CS student about half an hour. But they learn. ;)

There are plans in progress about Dart 2.0. https://github.com/dart-lang/dart_enhancement_proposals/blob/master/Meetings/2015-08-26%20DEP%20Committee%20Meeting.md#dart-20 so now might be a good time to start proposing breaking changes in the Dart Libraries as well.

I do not feel qualified to write a DEP, but it seems to me this is what needs to be done if this discussion is to move anywhere.

One more thing that Go does is that they have an automated tool to update existing code when the language or the libraries change. Dart already has a code formatter, which is the first prerequisite to doing this. It it was possible to move existing code to a new API by executing a single command when Dart 2.0 comes out, there would be much less friction in making significant changes to Dart and the Dart Libraries.

Hixie · 2017-03-07T00:35:52Z

Closing in favour of #28404

Hixie mentioned this issue Oct 18, 2015

Getting the first character of a string is way too verbose #24617

Closed

sethladd added the area-core-library SDK core library issues (core, async, ...); use area-vm or area-web for platform specific libraries. label Oct 19, 2015

Hixie closed this as completed Mar 7, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exposing surrogates in the string API leads to buggy code #24619

Exposing surrogates in the string API leads to buggy code #24619

Hixie commented Oct 18, 2015

lrhn commented Oct 19, 2015

Hixie commented Oct 19, 2015

zoechi commented Oct 19, 2015

Hixie commented Oct 19, 2015

Hixie commented Oct 19, 2015

ErikCorryGoogle commented Oct 19, 2015

zoechi commented Oct 19, 2015

Hixie commented Oct 19, 2015

lrhn commented Oct 20, 2015

jiridanek commented Oct 20, 2015

lrhn commented Oct 20, 2015

Hixie commented Oct 20, 2015

lrhn commented Oct 21, 2015

Hixie commented Oct 21, 2015

jiridanek commented Oct 21, 2015

Hixie commented Mar 7, 2017

Exposing surrogates in the string API leads to buggy code #24619

Exposing surrogates in the string API leads to buggy code #24619

Comments

Hixie commented Oct 18, 2015

lrhn commented Oct 19, 2015

Hixie commented Oct 19, 2015

zoechi commented Oct 19, 2015

Hixie commented Oct 19, 2015

Hixie commented Oct 19, 2015

ErikCorryGoogle commented Oct 19, 2015

zoechi commented Oct 19, 2015

Hixie commented Oct 19, 2015

lrhn commented Oct 20, 2015

jiridanek commented Oct 20, 2015

lrhn commented Oct 20, 2015

Hixie commented Oct 20, 2015

lrhn commented Oct 21, 2015

Hixie commented Oct 21, 2015

jiridanek commented Oct 21, 2015

Hixie commented Mar 7, 2017