Bring Dart's String support into the modern age #28404
Most of these are perfectly good suggestions for arbitrary Unicode strings, but they make it much harder to work with text that you know is in a more limited format (typically ASCII, maybe parsing JSON with a known format, or Dart identifiers, or something else with a simple format). As always, it's a trade-off between making simple things easy and complex things possible. Going all Unicode-grapheme-cluster only will make simple things harder, but will also make some complex things easier - and force users to be aware that there is an issue. |
That's why I added the fourth check box. I agree that we need to make it easy to deal with byte strings, including building byte strings from string literals that only use ASCII-compatible characters (ord<128), maybe even implicitly. It's everything between straight ASCII and full Unicode that's the problem. :-)
That's a security vulnerability waiting to happen, usually. |
Being as good as Swift, I guess, also means publishing an article "Why Dart String API is so hard?" on dartlang.org similar to "Why is Swift's String API So Hard?". |
Yes. Or we can do even better, as suggested by my fourth checkbox above. We can have ways to initialize constant byte arrays from string literals and let these be accessed by index, for example. The path API could exist and work with strings as well as byte arrays, and so forth. I'm happy to try and design a comprehensive API here if this is something that would help. |
These sound nice, but there are an awful lot of minefields. Do we want
and if so, which normalization scheme do we use? |
We should compare grapheme clusters using NFD or NFC (it doesn't matter, they have the same result). The only other plausible option would be NFKC/NFKD but those aren't really reasonable for string comparisons. So I don't really see that as a minefield. The current behaviour (without any normalisation) is the minefield. |
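To make the "current behaviour" concrete, here is a minimal sketch using only dart:core. The constant names are illustrative only. It shows that two canonically equivalent spellings of "é" compare unequal today, because String equality compares UTF-16 code units without any normalization.

```dart
// Two spellings of "é" that render identically but differ in code points.
const precomposed = '\u00E9'; // U+00E9, the NFC form
const decomposed = 'e\u0301'; // U+0065 followed by U+0301, the NFD form

void main() {
  // String equality compares code units, so no normalization is applied.
  print(precomposed == decomposed); // false
  print(precomposed.length); // 1
  print(decomposed.length); // 2
}
```

Under the comparison rule proposed in the comment above, both strings would be treated as the same single grapheme cluster.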
Removing the ability to index into strings probably makes it possible to implement this efficiently on JS too. |
Language proposal (this is a rough draft I just hammered out and I'm sure others will have opinions on making this better): There are two kinds of string literals, String literals can be prefixed by characters that control the syntax and kind of string object generated from the literal. The String literals as described so far create a String object, described below. The String literals can be placed adjacent and will be combined to form a single object. However, all the parts of this sequence must agree on the presence or absence of the The String class is entirely replaced. It does not have a
StringIterator implements StringPosition. The StringPosition class represents a position in a particular String. In checked mode, when a StringPosition is applied to another String (e.g. applying the
StringPosition has a For example, There are classes that take String objects and return Uint8List objects by encoding the String accordingly. There are classes that do the reverse, also. |
Some random notes: What does it look like to use the API?
Do you have a model implementation in some language?

Mention there is no .codeUnitAt(). In practice we use this instead of [] since it does not allocate a small string.

Do trim(), trimLeft(), trimRight() need any changes?

Are StringPositions constants, e.g. can I write const x = 'abc'.end; ?

Are there different normalizations?

Any tie-in with for-in? Today I can say RegExp will probably have to change - they are part of the larger string API. ES6 has

I work on dart2js and I am concerned about how this can be implemented with reasonable efficiency. I understand that targeting JS is not your concern, but it would be great if we could make it work, and things which are hard to optimize on one platform tend to be hard to optimize on other platforms.

If we can make it that an iterator can be reduced to a JavaScript UTF-16 code unit index and some helper code in static functions to find the next/previous index, it might be possible for the compiler to do some magic and reduce local iterators to scanning indexes. To pull this off would require the iterator API to be easily analyzed, for example, to know something is monotonic and bounded.

Is there a need for a single iterator instance that can go forwards and backwards? It would be easier to optimize if there was exactly one class implementing StringPosition, so operations do not need polymorphic calls in tight loops, and hopefully it could reliably be exploded into a backing string + index.

Could the API be written entirely with positions? e.g. pos = pos.next();

I'm not sure I understand StringPosition.+

We could experiment with the API by having a class Characters with a static method of and putting all the code we want to experiment with on class Characters, i.e
|
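The idea of reducing an iterator to a UTF-16 code unit index plus static helper functions could look roughly like the sketch below. The names nextIndex and previousIndex are hypothetical and not part of any proposal in this thread. They also step by code point (surrogate pair) only; stepping by grapheme cluster would additionally need Unicode segmentation data.

```dart
// Hypothetical static helpers: a "position" is just a code unit index.

bool _isLeadSurrogate(int unit) => (unit & 0xFC00) == 0xD800;
bool _isTrailSurrogate(int unit) => (unit & 0xFC00) == 0xDC00;

/// Returns the index just past the code point that starts at [i].
int nextIndex(String s, int i) {
  if (i < 0 || i >= s.length) {
    throw RangeError.range(i, 0, s.length - 1, 'i');
  }
  if (_isLeadSurrogate(s.codeUnitAt(i)) &&
      i + 1 < s.length &&
      _isTrailSurrogate(s.codeUnitAt(i + 1))) {
    return i + 2; // A well-formed surrogate pair occupies two code units.
  }
  return i + 1; // A BMP character or a lone surrogate.
}

/// Returns the index at which the code point ending just before [i] starts.
int previousIndex(String s, int i) {
  if (i <= 0 || i > s.length) {
    throw RangeError.range(i, 1, s.length, 'i');
  }
  if (i >= 2 &&
      _isTrailSurrogate(s.codeUnitAt(i - 1)) &&
      _isLeadSurrogate(s.codeUnitAt(i - 2))) {
    return i - 2;
  }
  return i - 1;
}
```

For the string 'a\u{1F600}b', nextIndex steps 0, 1, 3, 4 and previousIndex walks the same indices in reverse. Both helpers are monotonic and bounded by the string length, which is the property the comment above asks for.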
(repost of mail-comment from March 14th, now with formatting)
var x = "\uDC00"; // Invalid?
var y = "\uD800$x"; // Valid?
If the string is a one-byte String, then it might even be efficient. Do you still have the String.fromCharCodes constructor. Should it throw on
Does the If it's opaque, it's probably a bad design. Opaque classes that just So, I don't think a StringPosition by itself is necessarily useful, maybe
I'm not sure mixing them too much is a good idea, but I can see that either
Then you can do:

var buffer = new StringBuffer();
for (StringIterator s = string.start.iterator; s.moveNext();) {
Char c = s.current; // GraphemeCluster, really.
if (!isWhiteSpace(c)) {
buffer.write(c);
}
}
return buffer.toString();
StringIterator indexOf(String s, {StringIterator start}) =>
(start ?? this.start).indexOf(s); If StringPosition indexOf(String s, {StringPosition start}) =>
(start ?? this.start).indexOf(s); (Yes,
You have strings Again, doing prefix checks on the input strings is not viable. Even doing What you need is a way to convert a string position from one string to StringPosition after(StringPosition other) That will check that the position of other has a string where the prefix of Just doing addition and lazily checking later requires far too much So, for the use-case above, that means something like:
That's annoying. We can probably do some shorthands for when it's relative var newPos = pos.after(a.in(sum)); This will check that sum starts with
What is 'aaa'.lastIndexOf('aa') + 'bbb'.end ? (something that checks for Why is it different from 'bbb'.end + 'aaa'.lastIndexOf('aa')? (because It's not that '+' can't be non-commutative (like
So, a string grapheme cluster (no matter how it's represented, as a single All in all, this sounds heavy. Then there is normalization. accent(bool up) => "a${up ? "\u0301" : "\u0300"}";
eccent(bool up) => "e" + (up ? "\u0301" : "\u0300") (That's adding the accent to a base letter computationally, which is not an Not using JS string functions makes this a very expensive change for JS |
I'd just like to put in a word of support for Ian's request: if Dart doesn't have this kind of support, it's extremely hard to support multilingual programs, or even just support entering emoji in a program. It's pretty hard to argue that it isn't in the wheelhouse for Dart (and that programmers should just code their own solution) because it's very non-trivial to code, requiring lookup databases, etc., and is widely applicable (many internationalized programs could use this). If Dart is to be batteries-included, then some kind of character-level (grapheme cluster) manipulation is needed. As a simple concrete example, it's not possible to implement an input field that limits the number of user-visible "characters" without the ability to count them and truncate the input properly. |
Ok, new proposal.

String foo = 'Hello world';
var space = foo.indexOf(' ');
var hello = foo.substring(foo.start, space);
var world = foo.substring(space + 1, foo.end);
// Count number of extended grapheme clusters in a string.
int lengthOf(String s) {
int result = 0;
for (String character in s.characters)
result += 1;
return result;
}

String zalgo = 'D̸̛͇̻̼̜̲a̤̕r̟͚̥͍̲̬ṯ̘̕͞';
for (String character in zalgo.characters)
print('The character $character begins with ${character.firstRune.currentAsString}.');
// prints:
// The character D̸̛͇̻̼̜̲ begins with D.
// The character a̤̕ begins with a.
// The character r̟͚̥͍̲̬ begins with r.
// The character ṯ̘̕͞ begins with t.

// Naive implementation of indexOf (native implementation could just compare
// the underlying buffers and construct the resulting iterator artificially).
RuneIterator indexOf(String s, String pattern) {
bool match(RuneIterator position1, RuneIterator position2) {
while (position1.moveNext() || position2.moveNext()) {
if (position1.current != position2.current)
return false;
}
return true;
}
RuneIterator position = s.runes.first;
while (position.moveNext()) {
if (match(position.clone(), pattern.start))
return position;
}
return null;
}

UNICODE STRINGS

Syntax

The syntax for String literals in Dart is unchanged by this proposal, except that string literals that would not be valid Unicode are compile-time syntax errors.

API

The constructors on the String class, its isEmpty, isNotEmpty, runes, and hashCode properties, its contains, endsWith, replaceAll, replaceAllMapped, split, splitMapJoin, trim, trimLeft, trimRight methods, and the *, +, and == operators, are left as today.

The String class codeUnits property, codeUnitAt method, the [] operator, and the length property are removed entirely.

RuneIterator's at named constructor, and its currentSize and rawIndex properties, are removed entirely. The argument to its reset method is also removed. The property with the name last on the class Runes is replaced with a property described below.

A new property is introduced, characters, which returns a Characters object. Characters is like Runes but implements

CharacterIterator has a property "firstRune" that returns a RuneIterator that points to the first rune of the substring pointed to by the CharacterIterator.

CharacterIterator and RuneIterator both implement StringPosition. StringPosition has a property that returns a RuneIterator. It returns "this" for a RuneIterator and "firstRune" for a CharacterIterator.

CharacterIterator and RuneIterator also both implement two new properties, previous and next, which return new iterators that point to the previous or next extended grapheme cluster or rune respectively, or throw if they are at the start or end of the string respectively.

CharacterIterator and RuneIterator also both implement the binary + and - operators, with int operands. The - operator is expressed in terms of the + operator with the operand negated. The + operator creates a clone of the iterator and then advances (or retreats, for negative operands) that new iterator as many times as specified by the operand. It then returns the new iterator.

The replaceFirst, replaceFirstMapped, replaceRange, startsWith, and substring methods are changed to take a StringPosition instead of an int for any parameter that refers to a position in a string. If the StringPosition is an iterator that refers to a different string than the one passed to the method, then in debug mode the method asserts (in release mode behaviour is undefined).

The indexOf and lastIndexOf methods return a RuneIterator instead of an int. They return null if the pattern isn't found.

The padLeft and padRight width arguments are changed to refer to runes. Two new methods padLeftByCharacters and padRightByCharacters are introduced that are identical but whose width arguments refer to extended grapheme clusters.

Runes and Characters get two new properties, first and last, that return RuneIterators and CharacterIterators respectively that point to the first and last rune and extended grapheme cluster in the string respectively. String also gets start and end properties that return the same values as Runes.first and Runes.last respectively.

RuneIterator and CharacterIterator get a new method, clone(), which returns a new, identically-configured, iterator.

The toLowerCase and toUpperCase methods take a Locale object and perform the conversion according to the relevant locale.

String is given a new method, toUtf8(), which returns a Uint8List that represents the same string, encoded as UTF-8. String is also given a new constructor, fromUtf8, which takes a Uint8List and decodes it as UTF-8.

There is no way to construct a String object with invalid Unicode. When strings are constructed, they apply NFC normalization.

The actual buffer of a String, and in particular its internal encoding, cannot be determined from Dart code.

BYTE STRINGS

Syntax

A "b" prefixed in front of a string literal changes it into a byte string literal.

Byte strings must not contain \u escapes and must not contain any literal characters beyond U+007F.

Byte strings can't be combined with Unicode strings using the adjacent string syntax ("foo" "bar").

API

A byte string literal creates a Uint8List whose buffer contains the scalar values of each character in the literal.

The dart:io libraries that deal with filenames are changed to use Uint8List rather than String. |
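As a rough approximation of what a b"..." literal would produce, the following sketch builds a Uint8List from an ordinary string at run time. The function name byteString is hypothetical. Unlike the proposed literal, it can only reject non-ASCII characters at run time, and it cannot tell whether a character was written with a \u escape.

```dart
import 'dart:typed_data';

/// Hypothetical stand-in for a b"..." literal: accepts only characters
/// up to U+007F and returns their scalar values as bytes.
Uint8List byteString(String literal) {
  for (final unit in literal.codeUnits) {
    if (unit > 0x7F) {
      throw FormatException('Byte strings must be ASCII-only', literal);
    }
  }
  // Every code unit is below 0x80, so each one fits in a single byte.
  return Uint8List.fromList(literal.codeUnits);
}

void main() {
  final bytes = byteString('foo');
  print(bytes); // [102, 111, 111]
  print(bytes[0]); // Indexing a byte string is unambiguous: 102
}
```

The point of the separate type is that indexing and length stay simple and cheap for data known to be ASCII, while String itself no longer has to offer them.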
(Mostly I intend these proposals to demonstrate feasibility, not to be final concrete proposals. I'm sure Dart language and library experts can come up with better things with their holistic knowledge of the platform.) |
If you want to test the proposal (is it nice to use? is it fast enough? etc) I suggest that you put all the new and changed String methods on Characters, and have an 'of' constructor. Then you can experiment without changing String:

var foo = Characters.of('Hello world');
var space = foo.indexOf(' ');
var hello = foo.substring(foo.start, space);
var world = foo.substring(space + 1, foo.end);

// Count number of extended grapheme clusters in a string.
int lengthOf(String s) {
  return Characters.of(s).length;
}
|
Hopefully not actually |
I'm very interested in seeing alternative proposals, too. The current state of String is IMHO a non-contender. |
Yet another incident of needing grapheme cluster support: Android's TalkBack allows the user to indicate that they want to move forward and backward by a "character". Without grapheme cluster support, support for that is not (easily) implementable in Flutter. |
This is a bit of a blocker on my Flutter app. Any word on progress with this? |
This is definitely on the Dart team's radar. See this discussion: dart-lang/language#34 |
We now have an experimental version of a new package API example (full API docs):

import 'package:characters/characters.dart';
main() {
String hi = 'Hi 🇩🇰';
print('String is "$hi"\n');
// Length.
print('String.length: ${hi.length}');
print('Characters.length: ${Characters(hi).length}\n');
// Skip last character.
print('String.substring: "${hi.substring(0, hi.length - 1)}"');
print('Characters.skipLast: "${Characters(hi).skipLast(1)}"\n');
// Replace characters.
Characters newHi =
Characters(hi).replaceAll(Characters('🇩🇰'), Characters('🇺🇸'));
print('Change flag: "$newHi"');
}

Output when run:

$ dart example/main.dart
String is "Hi 🇩🇰"
String.length: 7
Characters.length: 4
String.substring: "Hi 🇩���"
Characters.skipLast: "Hi "
Change flag: "Hi 🇺🇸"

Feedback most welcome! |
Nice, any plans of also supporting collation/string comparison/sorting, or is that out of scope of that library? |
No current plans to extend this package's scope to something requiring full Unicode data tables. |
Will the LengthLimitingTextInputFormatter() be updated to support counting the characters in emojis correctly? |
Closing this issue: With the |
Could someone here perhaps shed some light on the thinking on having the As a package maintainer, I do go "sigh" on having to add a package dependency for correct string handling. Is there a roadmap for this being incorporated into the standard library? |
If by standard library, you mean incorporate these APIs into the ones on We could have shipped it as a new @lrhn may have additional comments on this topic. |
I too hate adding dependencies to my packages unnecessarily (especially when they come with a multitude of transitive dependencies, at least That said, putting the feature into a package does indeed allow us to iterate on the API much more easily than if it was in the platform libraries. The rules against breaking changes in the platform libraries are very strict. Adding a member to a class is potentially breaking. Packages generally consider adding members to a class which is not intended as a reusable interface, to be non-breaking. Even if it does break someone, they can just stay on an earlier version of the package until things are fixed. That's not an option for platform libraries, you get the ones in the current SDK and that's it. We might eventually decide that the package is mature enough, and move it into the platform libraries. That depends on a lot of things, including how it's being used, and by how many, and how often we need to make changes. We don't know any of that yet. Adding the current package to the platform libraries could turn out to cause premature lock-in, and then we'll be stuck with it forever. |
Yes, that's what I meant. One would hope to see baseline functionality incorporated into the platform libraries going forward.
I thought the 1.0 release was an indication of that. I suppose I'm mostly wondering what differentiates this from all the churn in the platform libraries themselves in the past year or two. (I've been on this ride since 2017 or so.) It would seem that the answer is that perhaps the problem domain is somewhat novel? As in, we're all used to our bad old ways, in most contemporary programming languages, of treating strings as sequences of bytes and/or code points, and how to move to thinking of grapheme clusters isn't entirely obvious in terms of its API implications? |
Being Grapheme cluster APIs are indeed not that well studied. The only other modern string API is Swift's. They too need to consider backwards compatibility with the 16-bit based |
@lrhn All right, that all makes sense. Thank you kindly for elaborating. |
The previous aside, this seems as good a place as any to state for the record that while Dart's evolution has been impressive, perhaps even singularly impressive, Dart strings' internal UTF-16 encoding is surely one of Dart's remaining cardinal sins. Given the a priori unlikely retrofits already successfully made to the language (all the churn), is there perhaps any kind of long-term plan to move Dart (in Dart 3+, say) towards a UTF-8 basis? (Create a UTF-8

Other than the obvious performance implications of tons of unnecessary encoding conversions, particularly when working with native libraries via FFI, which I do frequently, this sometimes bites people in unexpected ways. I want to share here a brief recent example that I found instructive in terms of the pitfalls facing a Dart novice coming from the external UTF-8 world.

At first glance, this seems an innocent HTTP response handler snippet, not that dissimilar from how you would write it in any number of languages in use today:

final html = await rootBundle.loadString(assetKey);
httpResponse
..headers.add("Content-Type", "text/html;charset=UTF-8")
..headers.add("Content-Length", html.length.toString())
..write(html);

But, of course, the code above is not actually well-formulated at all:
I trust the reason is obvious to all participants here, but it certainly wasn't to everyone. The corrected code is:

final html = utf8.encode(await rootBundle.loadString(assetKey));
httpResponse
..headers.add("Content-Type", "text/html;charset=UTF-8")
..headers.add("Content-Length", html.length.toString())
..add(html);

Now, I realize, of course, all the caveats here, particularly in light of the preceding discussion. Indeed, beginning with the assumption that |
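The difference between the two snippets above comes down to which length is measured. A minimal sketch, using only dart:convert, with contentLength as a hypothetical helper name: String.length counts UTF-16 code units, while Content-Length must count the bytes actually sent, which for a UTF-8 response is the length of the encoded body.

```dart
import 'dart:convert';

/// Hypothetical helper: number of bytes the body occupies when sent as UTF-8.
int contentLength(String body) => utf8.encode(body).length;

void main() {
  // "Hi " followed by the two regional indicators of the Danish flag.
  const body = 'Hi \u{1F1E9}\u{1F1F0}';

  print(body.length); // 7 UTF-16 code units
  print(body.runes.length); // 5 code points
  print(contentLength(body)); // 11 UTF-8 bytes
}
```

For a pure ASCII body all three numbers coincide, which is why the first version of the handler can appear to work until the asset contains non-ASCII text.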
AFAICT Is there some recommended way to do Unicode normalization? |
Admin comment: For current support for working on strings containing Unicode (extended) grapheme clusters, please see https://pub.dev/packages/characters and https://medium.com/dartlang/dart-2-7-a3710ec54e97.
For example, consider this discussion:
http://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
We should be at least as good as Swift and Perl 6 when it comes to dealing with strings.
Things we should do:
cc @sethladd, since you were asking what improvements we can make to Dart to bring it into the modern age. These changes would have massively more meaningful impact than making semicolons optional or removing other punctuation.