Partial update to unicode 11 #226

Alexendoo · 2018-07-31T21:31:46Z

I was attempting to update the repo to use Unicode 11 data but ran into a few issues. I'm setting it aside for the time being, since it turned out to be a bigger job than I expected, but I opened this PR in case it saves some time for anybody attempting the same.

Outstanding issues:

unic-segment - https://unicode.org/reports/tr29/#Modifications

This was the part that tripped me up the most. You can see my attempt to tackle it in Alexendoo/rust-unic@unicode-11-partial...unicode-11. I think the forward word boundaries might be correct, but the others are not.

unic-ident - https://www.unicode.org/reports/tr31/#Modifications

There were a few changes to this spec. I didn't get round to checking whether any code changes are required, but the tests do pass, so I don't know if anything needs to be done.

IDNA conformance

With the IDNA conformance test updated to use IdnaTestV2.txt, I was able to add a test for unic_idna::to_unicode. However, it doesn't return an Err when a status of X4_2 is expected. In this PR the test will fail, but X4_2 can easily be added to the ignored statuses, as V2 and C... already are, if that's intentional behaviour.
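
As a rough illustration of what ignoring a status could look like, here is a minimal sketch. It assumes a status field is written as a bracketed, comma-separated list such as `[V2, X4_2]`, with an empty field meaning no error is expected. The helper names `parse_status`, `is_ignored` and `expects_error` are hypothetical and are not taken from the unic test code, and the ignore set shown (V2, X4_2 and anything starting with C) is only one reading of the paragraph above.

```rust
/// Split a status field such as "[V2, X4_2]" into its individual codes.
/// An empty field (or "[]") yields no codes.
fn parse_status(field: &str) -> Vec<&str> {
    field
        .trim()
        .trim_start_matches('[')
        .trim_end_matches(']')
        .split(',')
        .map(str::trim)
        .filter(|code| !code.is_empty())
        .collect()
}

/// Hypothetical ignore list: V2, X4_2 and every code starting with 'C'.
/// The set actually ignored by the repository's test is an assumption here.
fn is_ignored(code: &str) -> bool {
    code == "V2" || code == "X4_2" || code.starts_with('C')
}

/// A line expects an Err only if at least one of its codes is not ignored.
fn expects_error(status_field: &str) -> bool {
    parse_status(status_field)
        .iter()
        .any(|code| !is_ignored(code))
}

fn main() {
    // Sample status fields, made up for illustration.
    for field in &["", "[V2]", "[C1, C2]", "[X4_2]", "[P1, V6]"] {
        println!("{:10} expects error: {}", field, expects_error(field));
    }
}
```

With this shape, dropping X4_2 from `is_ignored` restores the stricter check if the current behaviour turns out to be a bug rather than intentional.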

What this PR does accomplish:

Updates all data to Unicode 11, with the exception of unic-ucd-segment and unic-segment
Temporarily #[ignore]s the Unicode version test of unic-ucd-segment
Adds a char_property for the new Extended_Pictographic emoji property
Updates the IDNA conformance test to use the new format
Makes a small change to the unic-gen segmentation test generator, since the format of the test file comments changed slightly
Updates the grapheme conformance tests to be more like the word boundary conformance tests

eyeplum · 2018-08-01T09:57:09Z

gen/src/writer/ucd/segment_tests.rs

+                "ExtPict" => continue,
+
+                "Extend_ExtCccZwj" => "Extend",
+                "ZWJ_ExtCccZwj" => "ZWJ",


👍 for this. I had this issue when I was trying to get my fork working for Unicode 11.0. This is a clever fix!

Out of curiosity, there seems to be no documentation about Extend_ExtCccZwj and ZWJ_ExtCccZwj (they seem to appear only in the grapheme cluster break test data). Any idea what the purpose of these two values is?

I noticed @behnam was also asking about this in http://www.unicode.org/review/pri372/.

Alexendoo · 2018-08-01

I'm not entirely sure where these names came from. In the case of ZWJ, I noticed that all of them were renamed to ZWJ_ExtCccZwj, so it seemed to make sense to just map the value back to ZWJ.

In the case of Extend, though, there are both Extend and Extend_ExtCccZwj in the 11.0 test file, and the choice of which is used seems to depend on the code point and not the GB# rule. A bit of grepping + sort showed the different rule and character combinations:

[9.0] EMOJI MODIFIER FITZPATRICK TYPE-6 (Extend)
[0.2] COMBINING GRAPHEME JOINER (Extend)
[4.0] COMBINING GRAPHEME JOINER (Extend)
[9.0] COMBINING GRAPHEME JOINER (Extend)
[4.0] COMBINING DIAERESIS (Extend_ExtCccZwj)
[9.0] COMBINING DIAERESIS (Extend_ExtCccZwj)
[0.2] COMBINING GRAVE ACCENT (Extend_ExtCccZwj)
[4.0] COMBINING GRAVE ACCENT (Extend_ExtCccZwj)
[9.0] COMBINING GRAVE ACCENT (Extend_ExtCccZwj)

I don't know why they're distinguished myself; they're all still Extend in https://www.unicode.org/Public/11.0.0/ucd/auxiliary/GraphemeBreakProperty.txt
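
The mapping discussed in this thread can be sketched in isolation as follows. This is only an illustration of the three match arms quoted from gen/src/writer/ucd/segment_tests.rs, not the generator's actual code. The function names `name_in_parens` and `normalize_break_name` are hypothetical, and returning `None` stands in for the `continue` in the diff.

```rust
/// Extract the name inside the trailing parentheses of a test comment
/// fragment, e.g. "COMBINING DIAERESIS (Extend_ExtCccZwj)".
fn name_in_parens(fragment: &str) -> Option<&str> {
    let open = fragment.rfind('(')?;
    let close = fragment.rfind(')')?;
    if open < close {
        Some(&fragment[open + 1..close])
    } else {
        None
    }
}

/// Map a name from the 11.0 test file comments back to a plain property
/// value name. `None` means "skip this entry", mirroring `continue` in the diff.
fn normalize_break_name(name: &str) -> Option<&str> {
    match name {
        "ExtPict" => None,
        "Extend_ExtCccZwj" => Some("Extend"),
        "ZWJ_ExtCccZwj" => Some("ZWJ"),
        other => Some(other),
    }
}

fn main() {
    // Fragments taken from the grep output above.
    let fragments = [
        "COMBINING GRAPHEME JOINER (Extend)",
        "COMBINING DIAERESIS (Extend_ExtCccZwj)",
        "COMBINING GRAVE ACCENT (Extend_ExtCccZwj)",
    ];
    for fragment in &fragments {
        let raw = name_in_parens(fragment);
        let normalized = raw.and_then(normalize_break_name);
        println!("{:?} -> {:?}", raw, normalized);
    }
}
```

Since every Extend_ExtCccZwj code point is still Extend in GraphemeBreakProperty.txt, collapsing the suffixed names loses nothing the property data itself distinguishes.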

behnam · 2018-08-20T06:14:33Z

Thanks, @Alexendoo, for submitting this!

We had build issues on the CI. Could you please rebase this so we can see if there are any problems?

Also, there were some non-straightforward changes to UAX #29 in the Unicode 11.0.0 release (https://unicode.org/reports/tr29/#Modifications). How are we incorporating those into the implementation here?

Alexendoo · 2018-08-20T11:11:57Z

@behnam this PR doesn't handle it. I spent some time on it but didn't really get anywhere.

The tests are expected to fail on the cases where X4_2 is expected from unic_idna::to_unicode, since I wasn't sure whether the current behaviour is intentional (the failure comes from added test coverage, not a change in implementation).

I don't know if you'd really want to merge this directly, since it's not complete. It's more for visibility: if somebody else wants to build the tr29 changes on top of it, it will save them a bit of time.

Alexendoo added 13 commits July 22, 2018 20:06

Update emoji data to 11.0

9599b43

Add Extended_Pictographic to emoji gen

910c701

Generate emoji data

d70f90c

Add Extended_Pictographic to unic_char_emoji

2cb083b

Update UCD data to 11.0.0

6f49fac

Partially generate UCD 11 data

8373d8a

unic-segment and unic-ucd-segment are still on data from Unicode 10

Update IDNA data to 11.0

4879a04

Allow lowercase in UnicodeVersion from_str

4444a68

The 11.0 IDNA ReadMe.txt has a lowercase "version"

Generate IDNA data

162f2cc

Update IDNA conformance test for v2 test data

38ea9f4

Additionally, fix a bug in the previous implementation where the test returns early in many situations before reaching the end of the file

Update comments with Unicode 10 references

8fcdf1b

Add an assert macro for grapheme_cluster_conformance_tests

0dd20e2

Similar to unic/segment/tests/words_conformance_tests.rs

Update gen for ucd 11.0.0 segmentation

acbf6cb

Alexendoo force-pushed the unicode-11-partial branch 2 times, most recently from 106a8bb to acbf6cb on July 31, 2018 21:41

eyeplum reviewed Aug 1, 2018

behnam added the labels C: segmentation (Unicode Text Segmentation), C: ucd (Unicode Character Database), C: emoji (Unicode Emoji), A: lib-impl (Library Implementation), and X: wait-on-data-sources on Aug 20, 2018

eyeplum mentioned this pull request Mar 6, 2019

Upgrade to Unicode 11.0 #259

Open

8 tasks

ctrlcctrlv mentioned this pull request May 29, 2021

Forked library; and some thoughts about whether it's worth it to keep all modules at same Unicode version #279

Open

Alexendoo closed this Apr 20, 2022
