-
Notifications
You must be signed in to change notification settings - Fork 24
Partial update to unicode 11 #226
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
unic-segment and unic-ucd-segment are still on data from Unicode 10
The 11.0 IDNA ReadMe.txt has a lowercase "version"
Additionally, fix a bug in the previous implemantation where the test returns early in many situations before reaching the end of file
Similar to unic/segment/tests/words_conformance_tests.rs
106a8bb
to
acbf6cb
Compare
"ExtPict" => continue, | ||
|
||
"Extend_ExtCccZwj" => "Extend", | ||
"ZWJ_ExtCccZwj" => "ZWJ", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 for this. I had this issue when I was trying to get my fork working for Unicode 11.0, this is a clever fix!
Out of curiosity, there seems to be no documentation about Extend_ExtCccZwj
and ZWJ_ExtCccZwj
(it seems they only appear in grapheme cluster break test data). Any idea of the purpose of these two values?
I noticed @behnam was also asking about this in http://www.unicode.org/review/pri372/ .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not entirely sure where these names came from, in the case of ZWJ
I noticed that all of them were renamed to ZWJ_ExtCccZwj
so it seemed to make sense to just map the value back to ZWJ
In the case of Extend
though there are both Extend
and Extend_ExtCccZwj
in the 11.0 test file, the choice of which are used seems to depend on the codepoint and not the GB# rule. A bit of grepping + sort showed the different rules and character combos
[9.0] EMOJI MODIFIER FITZPATRICK TYPE-6 (Extend)
[0.2] COMBINING GRAPHEME JOINER (Extend)
[4.0] COMBINING GRAPHEME JOINER (Extend)
[9.0] COMBINING GRAPHEME JOINER (Extend)
[4.0] COMBINING DIAERESIS (Extend_ExtCccZwj)
[9.0] COMBINING DIAERESIS (Extend_ExtCccZwj)
[0.2] COMBINING GRAVE ACCENT (Extend_ExtCccZwj)
[4.0] COMBINING GRAVE ACCENT (Extend_ExtCccZwj)
[9.0] COMBINING GRAVE ACCENT (Extend_ExtCccZwj)
I don't know why they're distinguished myself, they're all still Extend
in https://www.unicode.org/Public/11.0.0/ucd/auxiliary/GraphemeBreakProperty.txt
Thanks, @Alexendoo, for submitting this! We had build issues on the CI. Could you please rebase this so we can see if there are any problems? Also, they were some non-straight-forward changes in UAX #29 in Unicode 11.0.0 release. (https://unicode.org/reports/tr29/#Modifications) How are we incorporating that to the implementation here? |
@behnam this PR doesn't handle it, I spent some time on it but I didn't really get anywhere The tests are expected to fail with expected I don't know if you'd really want to merge this directly since it's not complete. It's more for visibility if somebody else wants to build the tr29 changes on top of it it will save them a bit of time |
I was attempting to update the repo to use unicode 11 data but ran into a few issues, for the time being I'm setting it aside since it was a bigger job than I was expecting, but I opened this in case it's helpful to save some time for anybody attempting to do the same.
Outstanding issues:
unic-segment - https://unicode.org/reports/tr29/#Modifications
This was the part that tripped me up the most, you can see my attempt to tackle it in Alexendoo/rust-unic@unicode-11-partial...unicode-11. I think the forward word boundaries might be correct, but the others are not.
unic-ident - https://www.unicode.org/reports/tr31/#Modifications
There was a few changes to this spec, I didn't get round to seeing if any code changes are required but the tests do pass, so I don't know if anything is needed to be done
IDNA conformance
With the IDNA conformance test updated to use IdnaTestV2.txt I was able to add a test for
unic_idna::to_unicode
, however it doesn't return anErr
when a status ofX4_2
is expected - in this PR the test will fail but it can be easily added to be ignored asV2
andC...
are if that's intentional behaviourWhat this PR does accomplish:
#[ignore]
s the test for unicode version of unic-ucd-segmentThis change is