Minimal Unicode grapheme cluster support #49
Comments
People will depend on whatever the implementation does. I strongly recommend not leaving details like this unspecified, because what that really means is you're specifying it randomly based on the whim of the first popularly-used implementation. I think I would strongly recommend against exposing any |
Using invalid indices can be specified as just assuming it is a grapheme cluster boundary, and acting as if the string started at that point (including emitting whatever invalid code-unit clusters is needed to get to an actual valid grapheme cluster start). That is likely what will happen anyway, so yes, we could define it. I considered The API is designed so that it will still work if the underlying data is UTF-8 encoded. It is highly questionable that a UTF-8 based |
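To make the "treat an invalid index as a boundary" behaviour concrete, here is a minimal Go sketch over UTF-8 bytes. It is an illustration only: it works at the code-point level rather than on full grapheme clusters, and the helper name `unitsFrom` is hypothetical, not part of any proposed API. Starting in the middle of a multi-byte sequence, the decoder emits the stray byte as its own invalid unit and then resynchronises on the next valid start.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// unitsFrom (hypothetical helper) decodes s as if the string began at byte
// offset start, even when start falls inside a multi-byte sequence.
// Invalid bytes come out as one-byte units until decoding resynchronises.
func unitsFrom(s string, start int) []string {
	var out []string
	for i := start; i < len(s); {
		// For an invalid or stray byte this returns (utf8.RuneError, 1).
		_, size := utf8.DecodeRuneInString(s[i:])
		out = append(out, s[i:i+size])
		i += size
	}
	return out
}

func main() {
	s := "a\u00e9b" // "é" occupies bytes 1 and 2
	// Offset 2 points at the continuation byte of "é".
	fmt.Printf("%q\n", unitsFrom(s, 2))
}
```

A grapheme-cluster version would apply the same idea one level up: emit whatever invalid units are needed, then continue with ordinary cluster scanning.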
I don't think a GraphemeClusters should have an encoding implied by the interface. Suppose, in some far future, I have a UTF-16-backed string and I concatenate it with a UTF-8-backed string. It's quite possible that a grapheme cluster will cross the boundary from one buffer to the next. IMHO, we should make sure our API doesn't prevent us from implementing such a feature in the future.
(One way to do that would be to have a new type that is an integer internally, but doesn't expose a way to convert it to an |
Sadly, opaque type aliases don't work well with dynamic code.
Fixed that for you. :) Seriously - I immediately had the same thought as @Hixie there. Using an opaque type doesn't support dynamism, it's true, but that feels like the 0.1%-or-less use case. @rakudrama had a different approach to avoiding indices, based on slices; I wonder if he has any comment here?
Some prior art: https://golang.org/ref/spec#Type_declarations - particularly the distinction between "alias declarations" and "type definitions". Judging from the discussion above, we want "type definitions", which do support dynamic code. This program:

	package main

	import "fmt"

	// foo is a type definition: a distinct type whose underlying type is int.
	type foo int

	func main() {
		a := 12
		describe(a)
		b := foo(a)
		describe(b)
		var c interface{}
		c = b
		describe(c)
		c = "hello"
		describe(c)
	}

	// describe prints a value together with its dynamic type.
	func describe(value interface{}) {
		fmt.Printf("%v of type %T\n", value, value)
	}

Prints:

	12 of type int
	12 of type main.foo
	12 of type main.foo
	hello of type string

Go implements it by representing |
Our current experiments with support for this are being carried out in this new |
Closing the present issue in favor of the more recent, more specific issue #685.
Strawman proposal for #34
As a minimal solution, we add one getter to the `String` class, `get graphemeClusters => GraphemeClusters(this);`, returning an instance of a new class `GraphemeClusters`, which is an iterable of `GraphemeCluster`, which represents an extended grapheme cluster.
The `GraphemeCluster` and `GraphemeClusters` classes are defined something like:
There are no `Pattern`s on grapheme clusters. We can define a ClusterPattern if necessary, but `RegExp` won't implement it.
This design has no support for:
All that is needed to implement it is enough information to recognize Unicode extended grapheme clusters when scanning a string from left to right.
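As a rough illustration of what a single left-to-right scan looks like, here is a minimal Go sketch. It is deliberately simplified and is not the full UAX #29 algorithm: it assumes that a cluster is a base code point followed by combining marks (categories Mn, Me, Mc), that ZERO WIDTH JOINER glues the next code point onto the cluster, and that CR LF is one cluster. Hangul syllables, regional-indicator pairs, prepend characters and the emoji-specific rules are ignored, and the names `clusters` and `extends` are hypothetical.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

const zwj = '\u200D' // ZERO WIDTH JOINER

// extends reports whether r continues the current cluster in this
// simplified model: combining marks and ZWJ only.
func extends(r rune) bool {
	return r == zwj || unicode.In(r, unicode.Mn, unicode.Me, unicode.Mc)
}

// clusters splits s into approximate grapheme clusters in one forward pass.
func clusters(s string) []string {
	var out []string
	start := 0
	for start < len(s) {
		r, size := utf8.DecodeRuneInString(s[start:])
		end := start + size
		if r == '\r' && end < len(s) && s[end] == '\n' {
			end++ // CR LF forms a single cluster
		} else if r != '\r' && r != '\n' {
			prev := r
			for end < len(s) {
				next, n := utf8.DecodeRuneInString(s[end:])
				// Line breaks always start a new cluster; otherwise keep
				// going while next is a mark/ZWJ or follows a ZWJ.
				if next == '\r' || next == '\n' || (!extends(next) && prev != zwj) {
					break
				}
				end += n
				prev = next
			}
		}
		out = append(out, s[start:end])
		start = end
	}
	return out
}

func main() {
	// "e" + COMBINING ACUTE ACCENT, "a", CR LF, "b"
	fmt.Println(len(clusters("e\u0301a\r\nb"))) // prints 4
}
```

The scan only ever looks at the current code point and the one before it, which is why no backward iteration or index arithmetic is needed for this minimal feature set.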