Skip to content

Partial update to unicode 11 #226

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 13 commits into from

Conversation

Alexendoo
Copy link

@Alexendoo Alexendoo commented Jul 31, 2018

I was attempting to update the repo to use unicode 11 data but ran into a few issues, for the time being I'm setting it aside since it was a bigger job than I was expecting, but I opened this in case it's helpful to save some time for anybody attempting to do the same.

Outstanding issues:

unic-segment - https://unicode.org/reports/tr29/#Modifications

This was the part that tripped me up the most, you can see my attempt to tackle it in Alexendoo/rust-unic@unicode-11-partial...unicode-11. I think the forward word boundaries might be correct, but the others are not.

unic-ident - https://www.unicode.org/reports/tr31/#Modifications

There was a few changes to this spec, I didn't get round to seeing if any code changes are required but the tests do pass, so I don't know if anything is needed to be done

IDNA conformance

With the IDNA conformance test updated to use IdnaTestV2.txt I was able to add a test for unic_idna::to_unicode, however it doesn't return an Err when a status of X4_2 is expected - in this PR the test will fail but it can be easily added to be ignored as V2 and C... are if that's intentional behaviour

What this PR does accomplish:

  • The data, with the exception of unic-ucd-segment and unic-segment are all updated to unicode 11
  • Temporarily #[ignore]s the test for unicode version of unic-ucd-segment
  • Adds a char_property for the new Extended_Pictographic emoji property
  • Updates the IDNA conformance test to use the new format
  • A small change to the unic-gen segmentation test generator, the format of the test file comments changed slightly
  • The grapheme conformance tests were updated to be more like the word boundary conformance tests

This change is Reviewable

@Alexendoo Alexendoo force-pushed the unicode-11-partial branch 2 times, most recently from 106a8bb to acbf6cb Compare July 31, 2018 21:41
"ExtPict" => continue,

"Extend_ExtCccZwj" => "Extend",
"ZWJ_ExtCccZwj" => "ZWJ",
Copy link
Member

@eyeplum eyeplum Aug 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for this. I had this issue when I was trying to get my fork working for Unicode 11.0, this is a clever fix!

Out of curiosity, there seems to be no documentation about Extend_ExtCccZwj and ZWJ_ExtCccZwj (it seems they only appear in grapheme cluster break test data). Any idea of the purpose of these two values?

I noticed @behnam was also asking about this in http://www.unicode.org/review/pri372/ .

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure where these names came from, in the case of ZWJ I noticed that all of them were renamed to ZWJ_ExtCccZwj so it seemed to make sense to just map the value back to ZWJ

In the case of Extend though there are both Extend and Extend_ExtCccZwj in the 11.0 test file, the choice of which are used seems to depend on the codepoint and not the GB# rule. A bit of grepping + sort showed the different rules and character combos

[9.0] EMOJI MODIFIER FITZPATRICK TYPE-6 (Extend)
[0.2] COMBINING GRAPHEME JOINER (Extend)
[4.0] COMBINING GRAPHEME JOINER (Extend)
[9.0] COMBINING GRAPHEME JOINER (Extend)

[4.0] COMBINING DIAERESIS (Extend_ExtCccZwj)
[9.0] COMBINING DIAERESIS (Extend_ExtCccZwj)
[0.2] COMBINING GRAVE ACCENT (Extend_ExtCccZwj)
[4.0] COMBINING GRAVE ACCENT (Extend_ExtCccZwj)
[9.0] COMBINING GRAVE ACCENT (Extend_ExtCccZwj)

I don't know why they're distinguished myself, they're all still Extend in https://www.unicode.org/Public/11.0.0/ucd/auxiliary/GraphemeBreakProperty.txt

@behnam
Copy link
Member

behnam commented Aug 20, 2018

Thanks, @Alexendoo, for submitting this!

We had build issues on the CI. Could you please rebase this so we can see if there are any problems?

Also, they were some non-straight-forward changes in UAX #29 in Unicode 11.0.0 release. (https://unicode.org/reports/tr29/#Modifications) How are we incorporating that to the implementation here?

@behnam behnam added C: segmentation Unicode Text Segmentation C: ucd Unicode Character Database C: emoji Unicode Emoji A: lib-impl Library Implementation X: wait-on-data-sources labels Aug 20, 2018
@Alexendoo
Copy link
Author

@behnam this PR doesn't handle it, I spent some time on it but I didn't really get anywhere

The tests are expected to fail with expected X4_2 in unic_idna::to_unicode since I wasn't sure if that was intentional or not (it came from added test coverage, not a change in implementation)

I don't know if you'd really want to merge this directly since it's not complete. It's more for visibility if somebody else wants to build the tr29 changes on top of it it will save them a bit of time

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: lib-impl Library Implementation C: emoji Unicode Emoji C: segmentation Unicode Text Segmentation C: ucd Unicode Character Database X: wait-on-data-sources
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants