Minimal Unicode grapheme cluster support #49

Closed
@lrhn

Strawman proposal for #34

As a minimal solution, we add one getter to the String class, get graphemeClusters => GraphemeClusters(this);, returning an instance of a new class, GraphemeClusters. It is an iterable of GraphemeCluster objects, each representing an extended grapheme cluster.
The GraphemeCluster and GraphemeClusters classes are defined something like:

abstract class GraphemeCluster {
  /// Whether the cluster represents a sequence of Unicode scalar values.
  /// 
  /// If not, the [runes] will contain only one (invalid) code unit.
  bool get isValid;

  /// The code points making up this grapheme cluster.
  Runes get runes;

  /// Returns a string containing just this grapheme cluster.
  ///
  /// This is the substring of the source string covered by this cluster.
  String toString();

  /// The length of this grapheme cluster.
  /// 
  /// Returns a number that can be added to the start index of this
  /// grapheme (like one returned by [GraphemeClusters.indexOf]), 
  /// to produce an index just after this grapheme cluster.
  int get length;

  /// Whether [other] is another grapheme cluster with the same runes.
  /// 
  /// Returns `true` if [other] is a grapheme cluster, and it has the
  /// same value for [isValid] and the same sequence of [runes].
  bool operator==(Object other);
  int get hashCode;
}


/// A view of a `String` as a sequence of grapheme clusters or invalid code units.
///
/// Many operations are based on "indices". These indices should always be
/// values returned or provided by other operations on this class, with `0` 
/// being the start of the grapheme clusters, and [length] being the end.
/// The [iterator] provides access to start and end indices of the current 
/// grapheme cluster, and `indexOf` or `replaceAllMapped` provides indices.
/// These indices represent *grapheme cluster boundaries*.
/// If an index is used that does not represent a grapheme cluster boundary,
/// then the behavior of the methods is unspecified.
abstract class GraphemeClusters extends Iterable<GraphemeCluster> {
  const factory GraphemeClusters(String string) = _SomeImplementationClass;

  /// The extended grapheme clusters of this string.
  /// 
  /// The grapheme clusters are found progressively starting at the
  /// beginning of the string. If the string contains invalid encodings,
  /// they are represented by a [GraphemeCluster] with [GraphemeCluster.isValid]
  /// returning false.
  SliceIterator<GraphemeCluster> get iterator;

  /// Whether the grapheme clusters of this contain [other].
  /// 
  /// If [start] and [end] are provided, they must be valid indices,
  /// and then only the slice from start to end is checked for [other].
  /// The default value of [end] is [length].
  bool contains(GraphemeClusters other, [int start = 0, int end]);

  /// Finds the first position of [other] in the grapheme clusters of this.
  /// 
  /// Returns the integer position of the match. This index is a valid
  /// argument to other methods that take indices, including as their
  /// [start] and [end] parameters.
  /// 
  /// With a [start], which should be a valid grapheme cluster index,
  /// the search starts at that index instead of at the start of the 
  /// [GraphemeClusters].
  /// With an [end], which should also be a valid grapheme cluster index,
  /// the search ends when reaching that position. The default value
  /// for [end] is [length].
  int indexOf(GraphemeClusters other, [int start = 0, int end]);

  /// Whether [other] is a prefix of the grapheme clusters of this.
  ///
  /// If [start] is provided, then it must be a valid index, and
  /// the check is made at that index instead of at the start.
  bool startsWith(GraphemeClusters other, [int start]);

  /// Whether [other] is a suffix of the grapheme clusters of this.
  bool endsWith(GraphemeClusters other);

  /// Creates a new [GraphemeClusters] containing a slice of this.
  /// 
  /// The returned clusters contain all the clusters between the [start] 
  /// and [end] index positions. Both positions must be valid indices
  /// returned by methods on this class.
  GraphemeClusters getRange(int start, [int end]);

  /// The index position after the last grapheme cluster.
  /// 
  /// This is a valid index position for functions like [getRange].
  /// It represents the position after the last [GraphemeCluster] of
  /// [iterator].
  int get length;

  bool get isEmpty => length == 0;
  bool get isNotEmpty => length != 0;

  /// Replaces a section of the grapheme clusters with a [replacement].
  GraphemeClusters replaceRange(int start, int end, 
      GraphemeClusters replacement);

  /// Replaces the first occurrence of [pattern] with [replacement].
  GraphemeClusters replaceFirst(
      GraphemeClusters pattern, GraphemeClusters replacement, [int start]);
  /// ...
  GraphemeClusters replaceAll(
      GraphemeClusters pattern, GraphemeClusters replacement);
  /// ...
  GraphemeClusters replaceFirstMapped(
      GraphemeClusters pattern, GraphemeClusters replace(int start, int end));
  /// ...
  GraphemeClusters replaceAllMapped(
      GraphemeClusters pattern, GraphemeClusters replace(int start, int end));

  /// Whether [other] is the same sequence of [GraphemeCluster]s.
  /// 
  /// Returns `true` if [other] is a [GraphemeClusters] and the
  /// [GraphemeClusters.iterator] produces the same number of grapheme
  /// clusters that are pairwise equal according to
  /// [GraphemeCluster.operator==].
  bool operator==(Object other);
  int get hashCode;
}

/// An iterator moving over slices of some integer-indexable collection.
abstract class SliceIterator<T> implements Iterator<T> {
  // RuneIterator could implement this interface.

  /// Finds the next slice.
  /// 
  /// Finds the next slice after [end], then moves [start] to the start
  /// of that slice and [end] to its end.
  /// If there is no next slice, [moveNext] returns false and
  /// then [start] and [end] will have the same value.
  bool moveNext();

  /// The start index of [current].
  /// 
  /// Is equal to [end] before the first call to [moveNext] and after
  /// [moveNext] has returned false.
  int get start;

  /// The end index of [current].
  int get end;
}
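
The comment in SliceIterator notes that RuneIterator could implement the interface. As a rough illustration of the start/end contract, here is a minimal sketch of a code point iterator that does so. It is written in current null-safe Dart, restates the interface so that it stands alone, and the class name RuneSliceIterator is hypothetical, not part of the proposal.

```dart
/// The interface from the proposal, restated so the example is
/// self-contained.
abstract class SliceIterator<T> implements Iterator<T> {
  int get start;
  int get end;
}

/// Hypothetical example class: iterates over the code points of [source],
/// exposing each code point's UTF-16 code unit range as [start] and [end].
class RuneSliceIterator implements SliceIterator<int> {
  final String source;
  int _start = 0;
  int _end = 0;
  int? _current;

  RuneSliceIterator(this.source);

  @override
  int get start => _start;

  @override
  int get end => _end;

  @override
  int get current => _current ?? (throw StateError('No current element'));

  @override
  bool moveNext() {
    // The next slice begins where the previous one ended.
    _start = _end;
    if (_end >= source.length) {
      _current = null;
      return false; // Here start == end == source.length.
    }
    final unit = source.codeUnitAt(_end);
    // A lead surrogate followed by a trail surrogate forms one code point.
    if ((unit & 0xFC00) == 0xD800 && _end + 1 < source.length) {
      final next = source.codeUnitAt(_end + 1);
      if ((next & 0xFC00) == 0xDC00) {
        _current = 0x10000 + ((unit & 0x3FF) << 10) + (next & 0x3FF);
        _end += 2;
        return true;
      }
    }
    // A BMP code point, or an unpaired surrogate reported as-is.
    _current = unit;
    _end += 1;
    return true;
  }
}

void main() {
  final it = RuneSliceIterator('a\u{1F600}b');
  while (it.moveNext()) {
    final hex = it.current.toRadixString(16).toUpperCase();
    print('${it.start}..${it.end}: U+$hex');
  }
  // Prints:
  // 0..1: U+61
  // 1..3: U+1F600
  // 3..4: U+62
}
```

A grapheme cluster iterator would follow the same shape, with moveNext advancing to the next cluster boundary instead of the next code point boundary.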

There are no Patterns on grapheme clusters. We can define a ClusterPattern if necessary, but RegExp won't implement it.
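
Because searches take a GraphemeClusters argument rather than a Pattern, a match only counts if it starts and ends on grapheme cluster boundaries. The following sketch illustrates that difference from String.indexOf. The function indexOfOnBoundaries and its boundaries parameter are hypothetical; the list of boundary indices is assumed to be computed elsewhere.

```dart
/// Hypothetical helper: finds [needle] in [text], accepting only matches
/// that start and end on one of the given grapheme cluster [boundaries].
///
/// [boundaries] is assumed to hold the code unit indices at which the
/// clusters of [text] start or end, including `0` and `text.length`.
/// Returns -1 if there is no such match.
int indexOfOnBoundaries(String text, List<int> boundaries, String needle,
    [int start = 0]) {
  final valid = boundaries.toSet();
  var from = start;
  while (from <= text.length) {
    final match = text.indexOf(needle, from);
    if (match < 0) return -1;
    if (valid.contains(match) && valid.contains(match + needle.length)) {
      return match;
    }
    // The code unit match cuts through a cluster, so keep searching.
    from = match + 1;
  }
  return -1;
}

void main() {
  // 'e' + U+0301 (combining acute accent) is one cluster; a plain 'e' follows.
  const text = 'e\u0301e';
  const boundaries = [0, 2, 3];
  print(text.indexOf('e')); // 0: a match inside the first cluster
  print(indexOfOnBoundaries(text, boundaries, 'e')); // 2: the standalone 'e'
}
```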

This design has no support for:

  • Normalization
  • Localization

All that is needed to implement it is enough information to recognize Unicode extended grapheme clusters when scanning a string from left to right.
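
To give a feel for what such a left-to-right scan looks like, here is a deliberately simplified sketch. It is not the full Unicode segmentation algorithm: it only keeps CR LF together and attaches a few ranges of combining code points to the preceding code point, and it does not join the code point after a zero width joiner. The names isExtendLike and approximateBoundaries are hypothetical, and a real implementation would need the complete Unicode property data.

```dart
/// Simplified stand-in (an assumption for this sketch) for the Unicode
/// properties that make a code point attach to the preceding one.
bool isExtendLike(int codePoint) =>
    (codePoint >= 0x0300 && codePoint <= 0x036F) || // combining diacritics
    (codePoint >= 0xFE00 && codePoint <= 0xFE0F) || // variation selectors
    codePoint == 0x200D; // zero width joiner

/// Returns the code unit indices of the approximate cluster boundaries
/// of [text], including `0` and `text.length`.
List<int> approximateBoundaries(String text) {
  final boundaries = <int>[0];
  var index = 0; // Code unit index of the code point being examined.
  var previous = -1; // Previous code point, or -1 at the start.
  for (final codePoint in text.runes) {
    final staysInCluster = previous >= 0 &&
        (isExtendLike(codePoint) ||
            (previous == 0x0D && codePoint == 0x0A)); // CR LF
    if (index > 0 && !staysInCluster) boundaries.add(index);
    // Code points above U+FFFF take two UTF-16 code units.
    index += codePoint > 0xFFFF ? 2 : 1;
    previous = codePoint;
  }
  if (index > 0) boundaries.add(index);
  return boundaries;
}

void main() {
  print(approximateBoundaries('e\u0301e')); // [0, 2, 3]
  print(approximateBoundaries('a\r\nb')); // [0, 1, 3, 4]
}
```

The scan needs only the previous code point as state here; the full rules need somewhat more context, but can still be applied in a single forward pass.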

Labels: feature (Proposed language feature that solves one or more problems)