
RFC: Rename char to make it clearer that it is a unicode codepoint/scalar value #12730

Closed
huonw opened this issue Mar 6, 2014 · 21 comments
Labels
A-Unicode Area: Unicode

Comments

@huonw
Member

huonw commented Mar 6, 2014

Our char type is a Unicode scalar value (codepoint excluding the surrogate range), which can lead to confusion because (a) it differs from other languages and (b) it doesn't directly encourage good Unicode hygiene ("Oh, a character? That's what the user sees").
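As a rough illustration of point (b), here is a minimal sketch in present-day Rust. The helper name `scalar_count` is made up for this example, and the standard-library calls are from current stable Rust rather than the 2014 toolchain under discussion. It shows that one user-perceived character can consist of more than one char.

```rust
/// Hypothetical helper: counts the Unicode scalar values (Rust `char`s) in a string.
fn scalar_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // the user sees one character, but it is made of two scalar values.
    let decomposed = "e\u{301}";
    assert_eq!(scalar_count(decomposed), 2);

    // The precomposed form U+00E9 is a single scalar value.
    let precomposed = "\u{e9}";
    assert_eq!(scalar_count(precomposed), 1);

    // A `char` is always four bytes wide, whatever value it holds.
    assert_eq!(std::mem::size_of::<char>(), 4);

    println!(
        "decomposed: {} chars, precomposed: {} char",
        scalar_count(decomposed),
        scalar_count(precomposed)
    );
}
```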

Possible names include codepoint, ucs4, or rune like Go.

Other languages' names for a Unicode scalar value/what char means:

  • Haskell: Char is a codepoint (although surrogates are allowed)
  • D: dchar (char is a "UTF-8 code unit" and wchar is a "UTF-16 code-unit" (i.e. aliases for u8 and u16?): http://dlang.org/type.html)
  • Go: rune
  • C#/Java/Scala etc.: char is a 16-bit integer (i.e. UTF-16 code unit)
  • C/C++: char is (normally) a byte, i.e. a UTF-8 code unit.

(Other languages like Python don't have a type for a single character and don't have a type called char, and so aren't meaningful for this comparison.)
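The three granularities in that list (UTF-8 code units, UTF-16 code units, scalar values) can be compared side by side. This is only a sketch: `unit_counts` is a hypothetical helper, and it relies on `str::len`, `str::encode_utf16` and `str::chars` from current stable Rust.

```rust
/// Hypothetical helper: returns (UTF-8 code units, UTF-16 code units, scalar values).
fn unit_counts(s: &str) -> (usize, usize, usize) {
    (s.len(), s.encode_utf16().count(), s.chars().count())
}

fn main() {
    // ASCII: one unit at every level.
    assert_eq!(unit_counts("a"), (1, 1, 1));

    // U+20AC EURO SIGN: three UTF-8 bytes, one UTF-16 unit, one scalar value.
    assert_eq!(unit_counts("\u{20AC}"), (3, 1, 1));

    // U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane:
    // four UTF-8 bytes, a surrogate pair in UTF-16, still one scalar value.
    assert_eq!(unit_counts("\u{1D11E}"), (4, 2, 1));
}
```

A C-style char would index the first number, a Java-style char the second, and a Rust char the third.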

(This issue brought to you by reddit.)

@huonw
Member Author

huonw commented Mar 6, 2014

(I'm personally against calling it "rune" since that word feels like a glyph/grapheme rather than a codepoint to me... but there's precedent in Go and in BSD for that name.)

@lifthrasiir
Contributor

Also note that UCS-4 was historically not the same as UTF-32. Wikipedia says they are now identical, but the Unicode FAQ seems to suggest otherwise. ucs (possibly re-acronymed as "Unicode Code, Scalar") might work if correctness is important.

@Kimundi
Member

Kimundi commented Mar 6, 2014

One argument against renaming char would be the precedent that other languages already assign different meanings to it. So the name char would be consistent in the sense that it has no real meaning already. ;)

@lucab
Contributor

lucab commented Mar 6, 2014

@Kimundi OTOH it looks like nobody (from the above list) is using char to describe a Unicode scalar value, so we would be adding one more meaning to the list.

@thestinger
Contributor

I really don't like rune because it strongly implies a 1:1 relation with a grapheme/glyph.

@liigo
Contributor

liigo commented Mar 6, 2014

char is a good name already, and you can't provide a better one. I'd like to keep it as is.

@liigo
Contributor

liigo commented Mar 6, 2014

I don't like rune, and I hate usv for its meaninglessness. +1 for char.

On March 6, 2014 at 11:21 PM, "Luca Bruno" wrote:

@Kimundi OTOH it looks like nobody (from the above list) is using char to describe a Unicode scalar value, so we would be adding one more meaning to the list.

I'm in favor of re-using rune as:

  1. developers already know it
  2. describes exactly our case
  3. avoids NIH.

Otherwise some acronym like usv or usv32, which are short and hygienic to the standard but pose barriers to new-comers.

@brson
Contributor

brson commented Mar 6, 2014

This is a minor wart that I'm not inclined to change.

Why does char not allow surrogate code points?

@thestinger
Contributor

thestinger commented Mar 6, 2014

@brson: Surrogate code points aren't Unicode scalar values. They're just an implementation detail of UTF-16. The UTF-8 standard explicitly forbids encoding them too.
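A small sketch of what that exclusion looks like in practice, assuming current stable Rust (`is_scalar_value` is a made-up helper name, not a standard-library function): the checked conversion to char refuses surrogates, and the UTF-8 validator refuses their would-be byte encoding.

```rust
/// Hypothetical helper: true if `cp` is a Unicode scalar value, i.e. a code
/// point in 0..=0x10FFFF outside the surrogate range 0xD800..=0xDFFF.
fn is_scalar_value(cp: u32) -> bool {
    char::from_u32(cp).is_some()
}

fn main() {
    // The code points on either side of the surrogate range are fine.
    assert!(is_scalar_value(0xD7FF));
    assert!(is_scalar_value(0xE000));

    // The surrogate range itself is rejected.
    assert!(!is_scalar_value(0xD800));
    assert!(!is_scalar_value(0xDFFF));

    // So is anything past the end of the Unicode code space.
    assert!(!is_scalar_value(0x11_0000));

    // ED A0 80 is what U+D800 would look like if it were encoded the UTF-8 way;
    // the validator treats it as invalid.
    assert!(std::str::from_utf8(&[0xED, 0xA0, 0x80]).is_err());
}
```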

bors added a commit that referenced this issue Mar 9, 2014
This is mostly a reaction to #12730. If we are going to keep calling them `char`, at least make it clear that they aren't characters but codepoint/scalar.
@SimonSapin
Contributor

I think that various names on the table here are inadequate for different reasons:

  • char means different things in different contexts, and what users think of as "characters" is closer to grapheme clusters, which can be made of multiple code points
  • codepoint is not quite adequate as we exclude surrogate code points
  • ucs4 I think should refer to [char] strings/vectors rather than a single char unit

That leaves rune (from Go), which I think is the best by elimination.

In Go it is exactly an alias for int32, and only represents a code point or Unicode scalar value by convention. This differs from Rust where we restrict char values to the range of Unicode scalar values, but that difference is consistent with the difference between Rust’s str that is strictly UTF-8 (unless you mess it up with unsafe code, which we would consider a bug) and Go’s string type that’s a sequence of bytes, and only by convention often contains UTF-8.
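The "by convention" versus "enforced" contrast can be sketched in Rust alone. Everything named here is an assumption for illustration: `GoStyleRune` merely imitates Go's alias, and `to_char` is a hypothetical helper. The standard-library calls are from current stable Rust.

```rust
/// Hypothetical stand-in for Go's `rune`: a plain 32-bit integer,
/// meaningful as a code point only by convention.
type GoStyleRune = i32;

/// Hypothetical helper: converts a Go-style rune into a Rust `char`, failing
/// for negative values, surrogates, and values above U+10FFFF.
fn to_char(r: GoStyleRune) -> Option<char> {
    if r < 0 {
        None
    } else {
        char::from_u32(r as u32)
    }
}

fn main() {
    // Any i32 is an acceptable GoStyleRune, but not every one is a char.
    assert_eq!(to_char(0x41), Some('A'));
    assert_eq!(to_char(0xD800), None);
    assert_eq!(to_char(-1), None);

    // A byte vector can hold anything, much like a Go string...
    let bytes: Vec<u8> = vec![0x66, 0x6f, 0xff];
    // ...but the checked conversion to String refuses invalid UTF-8.
    assert!(String::from_utf8(bytes).is_err());
    assert!(String::from_utf8(vec![0x66, 0x6f, 0x6f]).is_ok());
}
```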

So, proposal:

  • Rename char to rune (rune being a shorter name for Unicode scalar value)
  • Rename functions and methods that have "char" in their name accordingly.
  • Possibly have type ucs4 = [rune] (assuming DST)
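For readers wondering what the proposed names would look like in use, here is a purely hypothetical sketch written with type aliases in current stable Rust. Neither `Rune` nor `Ucs4` exists in the language or standard library, they are capitalized only to follow today's naming lints, and `to_ucs4` and `scalar_len` are made-up helpers.

```rust
/// Hypothetical alias mirroring the proposed rename; Rust kept `char`.
type Rune = char;

/// Hypothetical alias for a dynamically sized slice of scalar values.
type Ucs4 = [Rune];

/// Hypothetical helper: decodes a UTF-8 string into owned scalar values.
fn to_ucs4(s: &str) -> Vec<Rune> {
    s.chars().collect()
}

/// Hypothetical helper: number of scalar values in a borrowed Ucs4 slice.
fn scalar_len(text: &Ucs4) -> usize {
    text.len()
}

fn main() {
    // 'a' plus U+1D11E: five bytes of UTF-8, but two scalar values.
    let decoded = to_ucs4("a\u{1D11E}");
    assert_eq!(scalar_len(&decoded), 2);
    assert_eq!(decoded[1], '\u{1D11E}');
}
```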

@liigo
Contributor

liigo commented Mar 20, 2014

Several people, including core team members, don't like rune. I don't like it either. char is a good name.

@SimonSapin
Contributor

@liigo Could you explain why you don’t like "rune" and why "character" being ambiguous is not a problem, as you see it?

@liigo
Contributor

liigo commented Mar 20, 2014

Others have already answered your questions; see the comments above.

@huonw
Member Author

huonw commented Mar 20, 2014

"I think that various names on the table here are inadequate for different reasons:"

To be pedantic, rune was a name on the table here. :P


Also, this should be an RFC in rust-lang/rfcs, now that we have that process. Closing. (If someone else doesn't step up to write it up, I'm happy to do it... eventually.)

@huonw huonw closed this as completed Mar 20, 2014
@codeyash

codeyash commented Feb 23, 2018

The char name cost me many days, because I assumed it was a plain C++ char. Too bad for anyone who, like me, assumes its meaning from the name. Now I read a type's definition carefully before using it.

My point: char is a confusing name.

If the core team doesn't like the names above, invent a new one, but at least not char. It will save someone's time.

It's an old topic, but it hurt me; that's why I'm adding my comment.

@codeyash

c8 is a good name; at least it would force us to look up what c8 is, like i8, u8, i32, etc.

@ZenLiuCN

I agree with @codeyash's comment above. Giving the commonly used name "char" a different meaning causes misunderstanding and makes Rust much harder to get into. A language that aims to be usable should perhaps respect the "general knowledge" established by most other languages, which have already made the name something like a common noun.

@thestinger
Contributor

It was the least bad option among everything considered, and it's highly unlikely that it would change at this point with the language stable. Since it's a Unicode scalar value (not just any code point), there's always a 1:1 mapping between strings and [char].
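The 1:1 mapping can be checked with a round trip. This is a minimal sketch with hypothetical helper names (`decode`, `encode`), assuming current stable Rust: decoding a string to scalar values and re-encoding gives back the same string, and every sequence of chars encodes to a valid string that decodes to the same sequence.

```rust
/// Hypothetical helper: decodes a string into its scalar values.
fn decode(s: &str) -> Vec<char> {
    s.chars().collect()
}

/// Hypothetical helper: encodes scalar values back into a UTF-8 string.
/// This cannot fail, because every `char` has a UTF-8 encoding.
fn encode(scalars: &[char]) -> String {
    scalars.iter().collect()
}

fn main() {
    // String -> [char] -> String gives back the original.
    let original = "h\u{e9}llo \u{1D11E}";
    assert_eq!(encode(&decode(original)), original);

    // [char] -> String -> [char] gives back the original as well.
    let scalars = vec!['a', '\u{20AC}', '\u{10FFFF}'];
    assert_eq!(decode(&encode(&scalars)), scalars);
}
```

With a type that admitted surrogates, the second direction would break, since a lone surrogate has no valid UTF-8 encoding.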

C has signed char, unsigned char, and char, which may or may not be signed but is always a distinct type from both signed char and unsigned char, with special rules. It also has wchar_t (which varies in size based on platform choices: it's a code point on Linux and a UTF-16 code unit on Windows), char16_t and char32_t. C++ is also adding char8_t and it may come to C too. Even though char16_t implies 16-bit, it's the same type as uint_least16_t and can be larger. It's only guaranteed to be Unicode if the platform defines __STDC_UTF_16__. Similarly, char32_t is only guaranteed to be Unicode if the platform defines __STDC_UTF_32__. Since a lot of this has not existed historically, there are many alternatives broadly used in language / library ecosystems.

Coming from this in C and C++, I struggle to see how the naming of char makes it much harder to get into the language. It takes someone a moment of thought to read and absorb the chosen definition.

On that note, how do I unsubscribe from all threads in a repository? ...

@rust-lang rust-lang locked and limited conversation to collaborators Feb 1, 2021