Skip to content

Minimal non-integer-index Unicode grapheme cluster support. #52

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
lrhn opened this issue Oct 17, 2018 · 0 comments
Open

Minimal non-integer-index Unicode grapheme cluster support. #52

lrhn opened this issue Oct 17, 2018 · 0 comments
Assignees
Labels
feature Proposed language feature that solves one or more problems

Comments

@lrhn
Copy link
Member

lrhn commented Oct 17, 2018

Strawman proposal for #34

As a minimal solution, we add one getter to the String class, get graphemeClusters => GraphemClusters(this);, returning an instance of a new class GrahemeClusters, which is an iterable of GraphemeCluster, which represents an extended grapheme cluster.

Unlike #49, this approach does not use integer indices, but instead relies on mutable index objects,
the GraphemeClusterIterator objects. This design is highly speculative. There is likely many things that can be done to make it better.

The classes are defined something like:

/// A single grapheme cluster, or an invalid code unit.
/// 
/// Represents a single grapheme cluster as a sequence of Unicode
/// scalar values.
/// If the source contains things other than valid Unicode scalar
/// values, each code unit that is not part of a valid scalar value
/// gets represented as a single [GraphemeCluster] with [isValid] being
/// false.
abstract class GraphemeCluster {
  /// Whether the cluster represents a sequence of Unicode scalar values.
  /// 
  /// If not, the [runes] will contain only one (invalid) code unit.
  bool get isValid;

  /// The code points making up this grapheme cluster.
  Runes get runes;

  /// Returns a string containing just this grapheme cluster.
  String toString();

  /// Whether [other] is another grapheme cluster with the same runes.
  /// 
  /// Returns `true` if [other] is a grapheme cluster, and it has the
  /// same value for [isValid] and the same sequence of [runes].
  bool operator==(Object other);
  int get hashCode;
}

/// A sequence of [GraphemeCluster]s.
///
/// Also provides operations to work at the [GraphemeClusters] level.
/// 
/// Represents all positions in a [GraphemeClusters] as iterable by
/// a [GraphemeClusterIterator]. Operations like [indexOf] returns
/// a [GraphemeClusterIterator] which can be used to find more
/// instances of the pattern.
abstract class GraphemeClusters implements Iterable<GraphemeCluster> {
  // Iterate over all the grapheme clusters.
  GraphemeClusterIterator get iterator;
  
  /// Whether this is an empty sequence of grapheme clusters.
  bool get isEmpty;
  /// Whether this is not an empty sequence of grapheme clusters.
  bool get isNotEmpty;

  /// Returns an iterator pointing before the first occurence of [clusters].
  /// 
  /// The iterator has no [GraphemeClusterIterator.current].
  /// If [clusters] does not occur in this, then the returned iterator is
  /// at the end of the grapheme cluster sequence.
  GraphemeClusterIterator indexOf(GraphemeClusters clusters) =>
      iterator..moveTo(clusters);

  /// Returns an iterator pointing after the first occurence of [clusters].
  /// 
  /// The iterator has no [GraphemeClusterIterator.current].
  /// If [clusters] does not occur in this, then the returned iterator is
  /// at the end of the grapheme cluster sequence.
  /// 
  /// Equivalent to `gc.indexOf(clusters)..skipPrefix(clusters)`.
  GraphemeClusterIterator indexAfter(GraphemeClusters clusters) =>
      iterator..moveAfter(clusters);

  /// Returns an iterator pointing to the last occurence of [clusters].
  /// 
  /// The iterator has no [GraphemeClusterIterator.current].
  /// If [clusters] does not occur in this, then the returned iterator is
  /// at the end of the grapheme cluster sequence.
  GraphemeClusterIterator lastIndexOf(GraphemeClusters clusters) =>
      iterator..moveToLast(clusters);

  /// Returns an iterator pointing after the last occurence of [clusters].
  /// 
  /// The iterator has no [GraphemeClusterIterator.current].
  /// If [clusters] does not occur in this, then the returned iterator is
  /// at the end of the grapheme cluster sequence.
  GraphemeClusterIterator lastIndexAfter(GraphemeClusters clusters) =>
      iterator..moveToLast(clusters);

  /// Whether this sequence of grapheme clusters contains [clusters].
  bool containsAll(GraphemeClusters clusters) => iterator.moveAfter(clusters);

  /// Whether this sequence starts with [clusters].
  bool startsWith(GraphemeClusters clusters) => iterator.startsWith(clusters);

  /// Whether this sequence ends with [clusters];
  bool endsWith(GraphemeClusters clusters) {
    var iterator = this.iterator;
    if (!iterator.moveAfterLast(clusters)) return false;
    return !iterator.moveNext();
  }

  /// Replaces the first occurrence of [pattern] with [replacement].
  /// 
  /// Returns a new grapheme cluster sequence where the first occurences of [pattern]
  /// has been replaced by [replacement].
  GraphemeClusters replaceFirst(GraphemeClusters pattern, GraphemeClusters replacement);
  
  /// Replaces all occurrences of [pattern] with [replacement].
  /// 
  /// Returns a new grapheme cluster sequence where the occurences of [pattern]
  /// has been replaced by [replacement]. If occurences of [pattern] overlap, 
  /// the first occurence is replaced, and further occurences are only searched
  /// for after the end of the replacement.
  GraphemeClusters replaceAll(GraphemeClusters pattern, GraphemeClusters replacement);

  /// Splits on [pattern] and emits the contents around the patterns.
  /// 
  /// If [pattern] does not occur in this, the returned iterable contains only
  /// one element with the same grapheme clusters as this.
  /// Otherwise the emitted grapheme cluster sequences are those before the
  /// first occurence of [pattern], after the last occurence and between 
  /// two consequtive occurences.
  Iterable<GraphemeClusters> split(GraphemeClusters pattern);

  /// Returns a string containing the same runes as this.
  ///
  /// If this is backed by a string, that string will be returned again.
  /// For another implementation of [GraphemeCluster] that is not backed
  /// directly by a string, a new string is created which contains 
  /// the same valid runes as the grapheme clusters of this,
  /// and with the same invalid code units *if possible*.
  /// If [ensureValid] is true, then invalid code units are replaced by
  /// U+FFFD.
  /// Otherwise, if an invalid code point cannot be represented by the
  /// string, then this call throws.
  String toString({bool ensureValid = false});
}

/// An extended [Iterator] over [GraphemeCluster].
/// 
/// Allows operations on the remaining grapheme clusters, 
/// similar to the operations of [GraphemeClusters].
/// 
/// A [GraphemeClusterIterator] represents either a single grapheme cluster,
/// or a position between grapheme clusters. In the latter case, [current] 
/// is null. The extra operations that work on multiple grapheme clusters
/// will typically end up with a position between two grapheme clusters.
abstract class GraphemeClusterIterator implements Iterator<GraphemeCluster> {

  /// The end point of this iteration.
  /// 
  /// Is initially an iterator at the end of the grapheme cluster sequence.
  /// 
  /// Using [resetEnd] allows a user of this iterator to set an end
  /// point for iteration of this iterator.
  GraphemeClusterIterator get end;

  /// Sets the [end] position of this iterator.
  /// 
  /// Calls [setEnd] with the current iterator, after remembering
  /// its current position.
  /// When [setEnd] returns, the current position of the iterator
  /// (the position before its current grapheme cluster or the current
  /// position between two grapheme clusters) becomes the new end position
  /// of this iterator, and the rembered current position is restored.
  /// 
  /// This can be used to limit iteration to a range, e.g.:
  /// ```dart
  /// var gcs = "<foo>".graphemeClusters;
  /// var iter = gcs.moveAfter("<".graphemeClusters)
  ///   ..resetEnd((i) => i.moveBeforeLast(">".graphemeClusters));
  /// ```
  /// This produces an iterator that will only iterate from after the first 
  /// `<` to before the last `>` (the grapheme clusters of "foo").
  /// 
  /// If [setEnd] is omitted, the [end] is set to the end of the
  /// [GraphemeClusters] being iterated.
  /// 
  /// Returns true if the end positions changed and false if it did not.
  bool resetEnd([void Function(GraphemeClusterIterator) setEnd]);

  /// A [GraphemeClusters] containing the remainder of the clusters of this.
  /// 
  /// If [current] is not null, it is included in the remainder, as are all
  /// values of [current] reachable by calling [moveNext] until it returns false.
  GraphemeClusters get graphemeClusters;

  /// Iterate until [clusters] is a prefix of the remaining grapheme clusters.
  /// 
  /// Repeatedly does a [moveNext] until either that returns false, in which case
  /// this call also returns false, or until the remaining clusters start
  /// with [clusters]. At that point, [current] will be null, and the next call
  /// to [moveNext] will select the first grapheme cluster.
  bool moveTo(GraphemeClusters clusters, {int skip = 0});

  /// Iterate until [clusters] is a suffixed of the iterated grapheme clusters.
  /// 
  /// Repeatedly does a [moveNext] until either that returns false, in which case
  /// this call also returns false, or until the already iterated clusters end
  /// with [clusters]. At that point, [current] will be null, and the next call
  /// to [moveNext] will select the next remaining grapheme cluster.
  bool moveAfter(GraphemeClusters clusters, {int skip = 0});

  /// Iterate until after the last occurence of [clusters].
  /// 
  /// If [skip] is provided, only iterate until after the [skip]+1th
  /// last occurence of [clusters].
  /// 
  /// If there are not enough
  bool moveToLast(GraphemeClusters clusters, {int skip = 0});
  
  /// Find the last occurrence of [clusters] in this.
  /// 
  /// If [skip] is provided, 
  /// instead find the [skip]th last occurence of [clusters].
  /// 
  /// Then set the current position of this iterator to just
  /// after that occurence, with no [current] being selected.
  /// 
  /// If there are not [skip] + 1 occurrences of [clusters]
  /// in the remaining grapheme clusters of this iterator,
  /// this iteration will end and return null, as if by
  /// repeatedly calling [moveNext].
  bool moveAfterLast(GraphemeClusters clusters, {int skip = 0});

  /// Whether [cluster] is a prefix of the remaining grapheme clusters.
  /// 
  /// The remaining grapheme clusters includes [current] if [current] is 
  /// not null. Does not move the current position.
  bool startsWith(GraphemeClusters clusters);
    
  /// If this iterator [startsWith] [clusters], move to after it.
  /// 
  /// Returns true if [clusters] was skipped. After this,
  /// [current] is unset.
  bool skipPrefix(GraphemeClusters clusters);

  /// Resets the position to before the [current] grapheme cluster.
  /// 
  /// If [current] is null, then nothing happens. 
  /// Otherwise sets [current] to null, 
  /// and the next call to [moveNext] will select [current] again.
  /// 
  /// If [skipCurrent] is true, the position is reset to *after* the [current]
  /// grapheme cluster. This sets [current] to null, and the next call 
  /// to [moveNext] will select the next grapheme cluster after [current].
  void reset({bool skipCurrent = false});
  
  /// Creates a copy of this iterator.
  /// 
  /// The copy is at the exact same position, but is otherwise unrelated to
  /// this iterator. The two iterators can be changed independently.
  GraphemeClusterIterator clone();
 
  // Make these return GraphemeClusterIterator too.
  GraphemeClusterIterator skip(int count);
  GraphemeClusterIterator take(int count);
  GraphemeClusterIterator skipWhile(bool test(GraphemeClusters));
  GraphemeClusterIterator takeWhile(bool test(GraphemeClusters));
}

There are no Patterns on grapheme clusters. We can define a ClusterPattern if necessary, but RegExp won't implement it.

This design has no support for:

  • Normalization
  • Localization

All it needs to be implemented is enough information to recognize Unicode extended grapheme clusters when scanning a string from left to right.

The API is designed so that you never give an iterator to a function on GraphemeClusters, so you never need to check if it's an iterator for that object. Instead you always use the iterator itself to do the operation.

Since indices are no longer exposed as integers, you cannot do arithmetic operations on them. That includes comparisons, so there is no way to figure out whether two grapheme cluster iterators are iterating the same GraphemeClusters object, and if they are, which is further ahead.

@lrhn lrhn added the feature Proposed language feature that solves one or more problems label Oct 17, 2018
@lrhn lrhn self-assigned this Oct 17, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Proposed language feature that solves one or more problems
Projects
None yet
Development

No branches or pull requests

2 participants