Commit 09202e5

Update ranges to Unicode 17
1 parent b6c06dd commit 09202e5

4 files changed, +40 -40 lines changed

implementers-tips.md

Lines changed: 3 additions & 3 deletions
@@ -8,11 +8,11 @@
 - ㊗ (U+3297)
 - ㊙ (U+3299)
 - Do not treat every character in [emoji-data.txt](https://www.unicode.org/Public/UCD/latest/ucd/emoji/emoji-data.txt) in the below data list as emoji. It includes ASCII digits, ASCII asterisk, ASCII hash sign, copyright symbol, trademark symbol, and so on. They should not be treated as emoji unless followed by a U+FE0F. We have to extract only characters with the `Emoji_Presentation` label.
-- You can use `/^\p{Emoji_Presentation}/u`, or `/^\p{Basic_Emoji}/v` or `/^\p{RGI_Emoji}/v` in JavaScript to check if a code point is an emoji (as a default emoji presentation character or in the RGI emoji set). __`RGI_Emoji` characters other than `Basic_Emoji`__ ([basic emoji set](https://www.unicode.org/reports/tr51/#def_basic_emoji_set)) __have multiple code points and are not CJK as of Unicode 16. Never use `/^\p{Emoji}/u`__ instead of them because it is useless due to the fact that `/^\p{Emoji}/u.test("1")` is `true` (who on earth would insist that `1` is an emoji?). The `v` flag is available since ES2024 and supported by Node >= 20, Chrome (Edge) >= 112, Firefox >= 116, and Safari >= 17.
+- You can use `/^\p{Emoji_Presentation}/u`, or `/^\p{Basic_Emoji}/v` or `/^\p{RGI_Emoji}/v` in JavaScript to check if a code point is an emoji (as a default emoji presentation character or in the RGI emoji set). __`RGI_Emoji` characters other than `Basic_Emoji`__ ([basic emoji set](https://www.unicode.org/reports/tr51/#def_basic_emoji_set)) __have multiple code points and are not CJK as of Unicode 17. Never use `/^\p{Emoji}/u`__ instead of them because it is useless due to the fact that `/^\p{Emoji}/u.test("1")` is `true` (who on earth would insist that `1` is an emoji?). The `v` flag is available since ES2024 and supported by Node >= 20, Chrome (Edge) >= 112, Firefox >= 116, and Safari >= 17.
 - `"ES2024"` as `"target"` and `"lib"` in `tsconfig.json` is supported by TypeScript >= 5.7, Vite >= 6, and Vitest >= 3. You should use `"ESNext"` instead of `"ES2024"` for older ecosystems.
-- There are no emojis whose East Asian Width is `F` or `H` as of Unicode 16.
+- There are no emojis whose East Asian Width is `F` or `H` as of Unicode 17.
 - The East Asian Width of Ideographic Variation Selector and Standard Variation Selector is `A`.
-- The East Asian Width of characters whose Script is Hangul can be `N` (U+1160–U+11FF). However, there are no characters whose Script is Hangul and East Asian Width is `A` or `Na` as of Unicode 16.
+- The East Asian Width of characters whose Script is Hangul can be `N` (U+1160–U+11FF). However, there are no characters whose Script is Hangul and East Asian Width is `A` or `Na` as of Unicode 17.
 - You can use `/^\p{sc=Hangul}/u` in JavaScript to check if the Script of a character is Hangul.
 - The East Asian Width of unassigned characters (e.g. U+3097) is undefined. You should follow the [guideline by Unicode](https://www.unicode.org/reports/tr11/#Unassigned). Note that U+2FFFE–U+2FFFF and U+3FFFE–U+3FFFF are Noncharacter, not Reserved (Unassigned). The East Asian Width of Noncharacter does not seem to be mentioned in the specifications of the East Asian Width property. Therefore, you can treat them as `W` to join two product terms for U+20000–U+2FFFD and U+30000–U+3FFFD.
 - The Unicode category of Ideographic Variation Selector and Standard Variation Selector is `Mn`, not `P` or `S`. It means there is no [Unicode punctuation character](https://spec.commonmark.org/0.31.2/#unicode-punctuation-character) or [non-CJK punctuation character](#non-cjk-punctuation-character) that is also Standard Variation Selector or Ideographic Variation Selector.
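
The property escapes quoted in the tips above can be tried out with a small sketch like the one below. It uses only the `u` flag, so it should not depend on `v`-flag support. The helper names (`isDefaultEmojiPresentation`, `isHangulScript`) are hypothetical and are not part of the repository.

```js
// Hypothetical helpers (not from the repository) wrapping the regexes quoted in the tips.
const emojiPresentationRegex = /^\p{Emoji_Presentation}/u;
const emojiRegex = /^\p{Emoji}/u; // too broad: it also matches ASCII digits
const hangulRegex = /^\p{sc=Hangul}/u;

// True if the string starts with a default emoji presentation character.
function isDefaultEmojiPresentation(text) {
  return emojiPresentationRegex.test(text);
}

// True if the string starts with a character whose Script is Hangul.
function isHangulScript(text) {
  return hangulRegex.test(text);
}

console.log(emojiRegex.test("1")); // the tips state that this is true
console.log(isDefaultEmojiPresentation("1")); // ASCII digit
console.log(isDefaultEmojiPresentation("\u3297")); // ㊗ (U+3297)
console.log(isDefaultEmojiPresentation("\u{1F600}")); // 😀 (U+1F600)
console.log(isHangulScript("\ud55c")); // 한 (U+D55C)
```

Comparing the first two calls shows why `\p{Emoji}` alone is a poor test for "is this an emoji".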

ranges.md

Lines changed: 26 additions & 26 deletions
@@ -59,10 +59,10 @@ node --run print-ranges -- -h
 - U+FFE0..U+FFE6 (¢..₩)
 - U+FFE8..U+FFEE (│..○)
 - U+16FE0..U+16FE4 (𖿠..𖿤)
-- U+16FF0..U+16FF1 (𖿰..𖿱)
-- U+17000..U+187F7 (𗀀..𘟷)
-- U+18800..U+18CD5 (𘠀..𘳕)
-- U+18CFF..U+18D08 (𘳿..𘴈)
+- U+16FF0..U+16FF6 (𖿰..𖿶)
+- U+17000..U+18CD5 (𗀀..𘳕)
+- U+18CFF..U+18D1E (𘳿..𘴞)
+- U+18D80..U+18DF2 (𘶀..𘷲)
 - U+1AFF0..U+1AFF3 (𚿰..𚿳)
 - U+1AFF5..U+1AFFB (𚿵..𚿻)
 - U+1AFFD..U+1AFFE (𚿽..𚿾)
@@ -124,10 +124,10 @@ const bool is_cjk = 0x1100 <= cp && cp <= 0x11ff
 || 0xffe0 <= cp && cp <= 0xffe6
 || 0xffe8 <= cp && cp <= 0xffee
 || 0x16fe0 <= cp && cp <= 0x16fe4
-|| 0x16ff0 <= cp && cp <= 0x16ff1
-|| 0x17000 <= cp && cp <= 0x187f7
-|| 0x18800 <= cp && cp <= 0x18cd5
-|| 0x18cff <= cp && cp <= 0x18d08
+|| 0x16ff0 <= cp && cp <= 0x16ff6
+|| 0x17000 <= cp && cp <= 0x18cd5
+|| 0x18cff <= cp && cp <= 0x18d1e
+|| 0x18d80 <= cp && cp <= 0x18df2
 || 0x1aff0 <= cp && cp <= 0x1aff3
 || 0x1aff5 <= cp && cp <= 0x1affb
 || 0x1affd <= cp && cp <= 0x1affe
@@ -192,10 +192,10 @@ const isCjk = 0x1100 <= cp && cp <= 0x11ff
 || 0xffe0 <= cp && cp <= 0xffe6
 || 0xffe8 <= cp && cp <= 0xffee
 || 0x16fe0 <= cp && cp <= 0x16fe4
-|| 0x16ff0 <= cp && cp <= 0x16ff1
-|| 0x17000 <= cp && cp <= 0x187f7
-|| 0x18800 <= cp && cp <= 0x18cd5
-|| 0x18cff <= cp && cp <= 0x18d08
+|| 0x16ff0 <= cp && cp <= 0x16ff6
+|| 0x17000 <= cp && cp <= 0x18cd5
+|| 0x18cff <= cp && cp <= 0x18d1e
+|| 0x18d80 <= cp && cp <= 0x18df2
 || 0x1aff0 <= cp && cp <= 0x1aff3
 || 0x1aff5 <= cp && cp <= 0x1affb
 || 0x1affd <= cp && cp <= 0x1affe
@@ -222,7 +222,7 @@ const isCjk = 0x1100 <= cp && cp <= 0x11ff
 regexp version
 
 ```js
-const isCjkRegex = /^[\u1100-\u11ff\u20a9\u2329-\u232a\u2630-\u2637\u268a-\u268f\u2e80-\u2e99\u2e9b-\u2ef3\u2f00-\u2fd5\u2ff0-\u303e\u3041-\u3096\u3099-\u30ff\u3105-\u312f\u3131-\u318e\u3190-\u31e5\u31ef-\u321e\u3220-\u3247\u3250-\ua48c\ua490-\ua4c6\ua960-\ua97c\uac00-\ud7a3\ud7b0-\ud7c6\ud7cb-\ud7fb\uf900-\ufaff\ufe10-\ufe19\ufe30-\ufe52\ufe54-\ufe66\ufe68-\ufe6b\uff01-\uffbe\uffc2-\uffc7\uffca-\uffcf\uffd2-\uffd7\uffda-\uffdc\uffe0-\uffe6\uffe8-\uffee\u{16fe0}-\u{16fe4}\u{16ff0}-\u{16ff1}\u{17000}-\u{187f7}\u{18800}-\u{18cd5}\u{18cff}-\u{18d08}\u{1aff0}-\u{1aff3}\u{1aff5}-\u{1affb}\u{1affd}-\u{1affe}\u{1b000}-\u{1b122}\u{1b132}\u{1b150}-\u{1b152}\u{1b155}\u{1b164}-\u{1b167}\u{1b170}-\u{1b2fb}\u{1d300}-\u{1d356}\u{1d360}-\u{1d376}\u{1f200}\u{1f202}\u{1f210}-\u{1f219}\u{1f21b}-\u{1f22e}\u{1f230}-\u{1f231}\u{1f237}\u{1f23b}\u{1f240}-\u{1f248}\u{1f260}-\u{1f265}\u{20000}-\u{3fffd}]/u;
+const isCjkRegex = /^[\u1100-\u11ff\u20a9\u2329-\u232a\u2630-\u2637\u268a-\u268f\u2e80-\u2e99\u2e9b-\u2ef3\u2f00-\u2fd5\u2ff0-\u303e\u3041-\u3096\u3099-\u30ff\u3105-\u312f\u3131-\u318e\u3190-\u31e5\u31ef-\u321e\u3220-\u3247\u3250-\ua48c\ua490-\ua4c6\ua960-\ua97c\uac00-\ud7a3\ud7b0-\ud7c6\ud7cb-\ud7fb\uf900-\ufaff\ufe10-\ufe19\ufe30-\ufe52\ufe54-\ufe66\ufe68-\ufe6b\uff01-\uffbe\uffc2-\uffc7\uffca-\uffcf\uffd2-\uffd7\uffda-\uffdc\uffe0-\uffe6\uffe8-\uffee\u{16fe0}-\u{16fe4}\u{16ff0}-\u{16ff6}\u{17000}-\u{18cd5}\u{18cff}-\u{18d1e}\u{18d80}-\u{18df2}\u{1aff0}-\u{1aff3}\u{1aff5}-\u{1affb}\u{1affd}-\u{1affe}\u{1b000}-\u{1b122}\u{1b132}\u{1b150}-\u{1b152}\u{1b155}\u{1b164}-\u{1b167}\u{1b170}-\u{1b2fb}\u{1d300}-\u{1d356}\u{1d360}-\u{1d376}\u{1f200}\u{1f202}\u{1f210}-\u{1f219}\u{1f21b}-\u{1f22e}\u{1f230}-\u{1f231}\u{1f237}\u{1f23b}\u{1f240}-\u{1f248}\u{1f260}-\u{1f265}\u{20000}-\u{3fffd}]/u;
 ```
 
 </details>
@@ -268,10 +268,10 @@ let is_cjk = matches!(
 | 0xffe0..=0xffe6
 | 0xffe8..=0xffee
 | 0x16fe0..=0x16fe4
-| 0x16ff0..=0x16ff1
-| 0x17000..=0x187f7
-| 0x18800..=0x18cd5
-| 0x18cff..=0x18d08
+| 0x16ff0..=0x16ff6
+| 0x17000..=0x18cd5
+| 0x18cff..=0x18d1e
+| 0x18d80..=0x18df2
 | 0x1aff0..=0x1aff3
 | 0x1aff5..=0x1affb
 | 0x1affd..=0x1affe
@@ -338,10 +338,10 @@ var isCjk =
 or >= 0xffe0 and <= 0xffe6
 or >= 0xffe8 and <= 0xffee
 or >= 0x16fe0 and <= 0x16fe4
-or >= 0x16ff0 and <= 0x16ff1
-or >= 0x17000 and <= 0x187f7
-or >= 0x18800 and <= 0x18cd5
-or >= 0x18cff and <= 0x18d08
+or >= 0x16ff0 and <= 0x16ff6
+or >= 0x17000 and <= 0x18cd5
+or >= 0x18cff and <= 0x18d1e
+or >= 0x18d80 and <= 0x18df2
 or >= 0x1aff0 and <= 0x1aff3
 or >= 0x1aff5 and <= 0x1affb
 or >= 0x1affd and <= 0x1affe
@@ -406,10 +406,10 @@ is_cjk = 0x1100 <= cp <= 0x11ff \
 or 0xffe0 <= cp <= 0xffe6 \
 or 0xffe8 <= cp <= 0xffee \
 or 0x16fe0 <= cp <= 0x16fe4 \
-or 0x16ff0 <= cp <= 0x16ff1 \
-or 0x17000 <= cp <= 0x187f7 \
-or 0x18800 <= cp <= 0x18cd5 \
-or 0x18cff <= cp <= 0x18d08 \
+or 0x16ff0 <= cp <= 0x16ff6 \
+or 0x17000 <= cp <= 0x18cd5 \
+or 0x18cff <= cp <= 0x18d1e \
+or 0x18d80 <= cp <= 0x18df2 \
 or 0x1aff0 <= cp <= 0x1aff3 \
 or 0x1aff5 <= cp <= 0x1affb \
 or 0x1affd <= cp <= 0x1affe \
@@ -442,7 +442,7 @@ is_cjk = 0x1100 <= cp <= 0x11ff \
 ## EAW is treated as "W" if unassigned (defined by Unicode)
 
 > [!NOTE]
-> The following result is extracted from https://www.unicode.org/Public/16.0.0/ucd/EastAsianWidth.txt. It is slightly different from https://www.unicode.org/reports/tr11/#Unassigned. U+2FFFE, U+2FFFF, U+3FFFE, and U+3FFFF are missing, but [they are "Noncharacter"](https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-23/#G19653), not ["Unassigned" (or "Reserved")](https://www.unicode.org/glossary/#reserved_code_point). This shows that we do not have to care about whether they are included in the list of CJK code points or not. To simplify the ranges, U+2FFFE and U+2FFFF are merged to U+20000–U+2FFFD here.
+> The following result is extracted from https://www.unicode.org/Public/17.0.0/ucd/EastAsianWidth.txt. It is slightly different from https://www.unicode.org/reports/tr11/#Unassigned. U+2FFFE, U+2FFFF, U+3FFFE, and U+3FFFF are missing, but [they are "Noncharacter"](https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-23/#G19653), not ["Unassigned" (or "Reserved")](https://www.unicode.org/glossary/#reserved_code_point). This shows that we do not have to care about whether they are included in the list of CJK code points or not. To simplify the ranges, U+2FFFE and U+2FFFF are merged to U+20000–U+2FFFD here.
 
 - U+3400..U+4DBF (㐀..䶿)
 - U+4E00..U+9FFF (一..鿿)
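
To see what the range update in ranges.md means in practice, the following sketch compares the four removed ranges with the four added ones and lists the code points that only the new ranges cover. The range values are copied from the diff above; the helper names (`inRanges`, `newlyCoveredCodePoints`) are hypothetical and do not exist in the repository.

```js
// Ranges copied from the removed (Unicode 16) and added (Unicode 17) lines of the diff.
const unicode16Ranges = [
  [0x16ff0, 0x16ff1],
  [0x17000, 0x187f7],
  [0x18800, 0x18cd5],
  [0x18cff, 0x18d08],
];
const unicode17Ranges = [
  [0x16ff0, 0x16ff6],
  [0x17000, 0x18cd5],
  [0x18cff, 0x18d1e],
  [0x18d80, 0x18df2],
];

// Hypothetical helper: true if cp falls inside any inclusive [start, end] pair.
function inRanges(ranges, cp) {
  return ranges.some(([start, end]) => start <= cp && cp <= end);
}

// Collect code points matched by the new ranges but not by the old ones.
function newlyCoveredCodePoints() {
  const result = [];
  for (let cp = 0x16ff0; cp <= 0x18df2; cp++) {
    if (inRanges(unicode17Ranges, cp) && !inRanges(unicode16Ranges, cp)) {
      result.push(cp);
    }
  }
  return result;
}

const added = newlyCoveredCodePoints();
console.log(added.length);
console.log(added.slice(0, 5).map((cp) => "U+" + cp.toString(16).toUpperCase()));
```

Note that merging U+17000..U+187F7 and U+18800..U+18CD5 into one range also covers the former gap U+187F8..U+187FF.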

scripts/cjk-ranges.ts

Lines changed: 1 addition & 1 deletion
@@ -112,7 +112,7 @@ function completeUnicodeVersion(version: string): string | undefined {
 }
 }
 
-const defaultUnicodeVersion = "16";
+const defaultUnicodeVersion = "17";
 
 // Unicode version & output type (conditional expression (&& , || , <=) / Rust match)
 const args = parseArgs({

specification.md

Lines changed: 10 additions & 10 deletions
@@ -33,7 +33,7 @@ A <a href="#cjk-punctuation-sequence" id="cjk-punctuation-sequence">CJK punctuat
 
 A <a href="#non-cjk-punctuation-sequence" id="non-cjk-punctuation-sequence">Non-CJK punctuation sequence</a> is a [Non-CJK punctuation character](#non-cjk-punctuation-character) or a sequence of 2 [characters](https://spec.commonmark.org/0.31.2/#character) where the first one is [Non-CJK punctuation character](#non-cjk-punctuation-character) and the second one is [Non-emoji General-use Variation Selector](#non-emoji-general-use-variation-selector).
 
-[^svs-range]: The range except for U+FE0E is computed from https://www.unicode.org/Public/16.0.0/ucd/StandardizedVariants.txt (as of Unicode 16) by extracting those that can follow CJK characters. Also, https://unicode.org/Public/16.0.0/ucd/emoji/emoji-variation-sequences.txt shows that U+FE0E can follow some CJK characters.
+[^svs-range]: The range except for U+FE0E is computed from https://www.unicode.org/Public/17.0.0/ucd/StandardizedVariants.txt (as of Unicode 17) by extracting those that can follow CJK characters. Also, https://unicode.org/Public/17.0.0/ucd/emoji/emoji-variation-sequences.txt shows that U+FE0E can follow some CJK characters.
 
 > [!NOTE]
 > To see the concrete ranges of each definition, see [ranges.md](ranges.md).
@@ -64,13 +64,13 @@ See [implementers-tips.md](implementers-tips.md).
 
 ## Unicode data list
 
-| Data name | Latest | Unicode 16 |
+| Data name | Latest | Unicode 17 |
 | --- | --- | --- |
-| East Asian Width | https://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt | https://www.unicode.org/Public/16.0.0/ucd/EastAsianWidth.txt |
-| Script | https://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt | https://www.unicode.org/Public/16.0.0/ucd/Scripts.txt |
-| Block | https://www.unicode.org/Public/UCD/latest/ucd/Blocks.txt | https://www.unicode.org/Public/16.0.0/ucd/Blocks.txt |
-| Characters followed by Non-emoji General-use Variation Selector Variation Selector | https://www.unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt | https://www.unicode.org/Public/16.0.0/ucd/StandardizedVariants.txt |
-| Default emoji presentation characters | https://www.unicode.org/Public/UCD/latest/ucd/emoji/emoji-data.txt | https://www.unicode.org/Public/16.0.0/ucd/emoji/emoji-data.txt |
-| Characters followed by U+FE0E/U+FE0F | https://unicode.org/Public/UCD/latest/ucd/emoji/emoji-variation-sequences.txt | https://unicode.org/Public/16.0.0/ucd/emoji/emoji-variation-sequences.txt |
-| Fully-qualified Emojis (without ZWJ) | https://unicode.org/Public/emoji/latest/emoji-sequences.txt | https://unicode.org/Public/16.0.0/emoji/emoji-sequences.txt |
-| Emoji qualification test | https://unicode.org/Public/emoji/latest/emoji-test.txt | https://unicode.org/Public/16.0.0/emoji/emoji-test.txt |
+| East Asian Width | https://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt | https://www.unicode.org/Public/17.0.0/ucd/EastAsianWidth.txt |
+| Script | https://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt | https://www.unicode.org/Public/17.0.0/ucd/Scripts.txt |
+| Block | https://www.unicode.org/Public/UCD/latest/ucd/Blocks.txt | https://www.unicode.org/Public/17.0.0/ucd/Blocks.txt |
+| Characters followed by Non-emoji General-use Variation Selector Variation Selector | https://www.unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt | https://www.unicode.org/Public/17.0.0/ucd/StandardizedVariants.txt |
+| Default emoji presentation characters | https://www.unicode.org/Public/UCD/latest/ucd/emoji/emoji-data.txt | https://www.unicode.org/Public/17.0.0/ucd/emoji/emoji-data.txt |
+| Characters followed by U+FE0E/U+FE0F | https://unicode.org/Public/UCD/latest/ucd/emoji/emoji-variation-sequences.txt | https://unicode.org/Public/17.0.0/ucd/emoji/emoji-variation-sequences.txt |
+| Fully-qualified Emojis (without ZWJ) | https://unicode.org/Public/emoji/latest/emoji-sequences.txt | https://unicode.org/Public/17.0.0/emoji/emoji-sequences.txt |
+| Emoji qualification test | https://unicode.org/Public/emoji/latest/emoji-test.txt | https://unicode.org/Public/17.0.0/emoji/emoji-test.txt |
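
The versioned UCD URLs in the table follow a visible pattern, which a sketch like the following can reproduce. The helper `ucdUrl` is hypothetical (it is not the repository's `completeUnicodeVersion`), and it assumes that only a major version such as `"17"` is passed and that the matching directory is `<major>.0.0`.

```js
// Hypothetical helper (not from the repository): builds a versioned UCD file URL.
// Assumption: majorVersion is a bare major version like "17", expanded to "17.0.0".
function ucdUrl(majorVersion, fileName) {
  return `https://www.unicode.org/Public/${majorVersion}.0.0/ucd/${fileName}`;
}

console.log(ucdUrl("17", "EastAsianWidth.txt"));
console.log(ucdUrl("17", "emoji/emoji-data.txt"));
```

The sketch covers only the files under `ucd/`; the emoji sequence and test files in the last two rows use a different path and are left out.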
