Dart Strings should support Unicode grapheme cluster operations #34
Comments
We hope to decide on an approach for this during Q4 2018, so input is welcome. There are many possible approaches, but we need a design that is consistent and usable, efficiently implementable (even in JavaScript), possible to migrate existing code to, and not too blocking for other possible future language features (for example, let's not use all remaining ASCII characters for a new string syntax; we want some left for other things too!). If we need breaking language changes to fully use the new string features we want to introduce, one way to make that breakage migratable could be to make the change opt-in on a per-library basis at first, as long as non-migrated and migrated code can still coexist in the same program. (This kind of language-change opt-in feature is something we will likely need for other changes too, so it's on our roadmap anyway.) |
cc @rakudrama |
My vote: make a new class called […]. Why: correct Unicode processing must be easier than incorrect Unicode processing. If grapheme cluster processing is anything less than a first-class citizen, the default for developers will be to perform incorrect UTF-16 processing, and we will continue to see preventable bugs for years to come. |
|
One problem that will have to be addressed with a brand-new class is how it works with JavaScript interop. It would be a pain to require users to wrap and unwrap strings, and it is not clear we could do that automatically. |
I like the idea of a low level and high level interface for strings, and I agree with @sffc that it has to be easier to do things correctly than incorrectly. Maybe the low level parts of the current String interface (all the code unit calls, This would definitely be a breaking change, but the vast majority of call sites wouldn't have to stop using String, only those that used API that moved to the low level |
If we're comfortable making a breaking change, then the low-hanging fruit would be to eliminate the If we did this, then adding the This being said, although
|
I think we need a much shorter phrase than
With operations that match elements (split, replace, etc), we can put the extended grapheme cluster versions on |
Do we have any data on how many uses of the existing code-point based […]? Do we know what fraction of strings in an application contain human text versus other things (database column names, JSON map keys, enum names, HTML tag names, etc.)? |
As a Flutter developer, I personally make heavy use of trim, substring, and other methods many of you have labeled as "wrong" with many strings in my app, so the words "breaking change" make me nervous to read; but I'm all for getting grapheme clusters to work. Also, @munificent, for whatever my anecdote is worth, most of the strings I deal with are human readable. Thanks for your efforts! I'm eagerly waiting for a solution for this, breaking change or not, as I need it in order to release. |
I agree with @rakudrama: Code like […]. The way to handle the issue of performance is not to limit the API drastically, but to give the developer the choice: use a string type/view that is appropriate for the task. For example, […]. We don't stop people from doing a […] |
I agree with everything Greg said. We should get users using best practices by making those practices easier, not by deliberately making other practices harder. Deliberately lowering usability of an API almost never works out well. There was a time in Dart's history where […]. A key virtue of Dart is approachability. The idioms you know from other mainstream languages often work in Dart too, so you can get up and running quickly without having to learn the "Dart way" of doing something. We should be very cautious about sacrificing that. |
Although it can be a pain to give up integer indexing for strings, the Swift language did seem to get grapheme clusters right, and it would be worth considering following the same approach. The advantage of being able to iterate grapheme clusters and avoid bugs related to integer indexing is well worth the pain in my opinion. See https://oleb.net/blog/2017/11/swift-4-strings/ |
#685 is our current attempt at supporting this via a package |
One other thing worth mentioning: the definition for code point is well-defined and very unlikely to change, but the definition for grapheme cluster changes from Unicode release to Unicode release. So, different versions of Dart would have different behaviors depending on which Unicode version was bundled. In my opinion, it would not be unreasonable for Dart strings to obey code point boundaries by default, which handles many use cases, and defer to third-party libraries to handle grapheme cluster boundaries. |
We now have a |
The Dart `String` class is a sequence of UTF-16 code units (i.e., 16-bit integers). It has a `runes` getter which provides a way to iterate the string as code points. However, that is not sufficient to perform operations which treat the string content as human-readable Unicode text, because the unit of representation for that is an extended grapheme cluster, which can be more than one code point (and therefore more than one code unit). The most traditional example is the string `"e\u0301"`, which contains only one grapheme cluster (the U+0301 code point is the combining acute accent, which combines with the prior `e` to designate the glyph `é`). More complicated examples include combining emojis or country codes (flag emojis).

Users currently cannot work with strings at the grapheme cluster level. This leads to tricky bugs where tests work for simple examples, but the program fails badly when it encounters real-life text.
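
To make the mismatch concrete, here is a small sketch using only existing `dart:core` members (`length`, `runes`, `substring`). The `summarize` helper is a hypothetical name introduced just for this illustration.

```dart
// Hypothetical helper (not part of any Dart API): reports how many UTF-16
// code units and how many code points a string contains.
String summarize(String s) => 'units=${s.length} points=${s.runes.length}';

void main() {
  // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one user-perceived
  // character, but two code units and two code points.
  const accented = 'e\u0301';
  print(summarize(accented)); // units=2 points=2

  // Integer-indexed operations can split the cluster and drop the accent.
  print(accented.substring(0, 1) == 'e'); // true

  // A flag emoji is two regional-indicator code points (here U+1F1E9 and
  // U+1F1F0), each stored as a surrogate pair in UTF-16.
  const flag = '\u{1F1E9}\u{1F1F0}';
  print(summarize(flag)); // units=4 points=2
}
```

Neither count matches the single character a reader sees, which is why iterating by code unit or by code point is not enough for human-readable text.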
The Dart `String` class should, at the very least, provide a way to iterate the string as a sequence of grapheme clusters. There should probably also be other operations on the grapheme cluster sequence, so users won't have to do everything manually. The exact operations and API will need to be designed.

It might also be useful to make some changes to the `String` class, or add other related functionality in separate libraries or packages.

I've collected a number of ideas, wishes and concerns about such changes in a document.
A minimal solution to this issue would be a `graphemeClusters` getter on `String` which provides an iterable over "grapheme clusters". We believe this to be practically possible, even when compiled to JavaScript.