Normalize Chinese by Z, Simplified, Semantic, Old, and Wrong variants #162

choznerol · 2022-11-12T04:02:37Z

Pull Request

Related issue

Fixes #144

What does this PR do?

As titled, use kVariants.txt as a dictionary to enhance Chinese normalization.

TBD

1. Also normalizing `old` and `wrong!` variants

.. , it is relevant to normalize Chinese characters by unifying Z Simplified and Semantic variants before transliterating them into Pinyin.

There are also old and wrong! variants in kVariants.txt. I didn't see a reason not to handle them as well, so they are also converted.

2. Confirm direction of conversion

For =, old, sem, and wrong! variants, I think it's obvious we want to convert from Source Ideograph to Destination Ideograph. For simp I personally think the same, but I'm not 100% sure whether there are other considerations. My reasons for thinking traditional variants should be the normalized form:

Traditional variants seem to be the source of truth, just as the Destination Ideograph in =, old, sem, and wrong! entries represents the source of truth.
A lot of simplified variants seem to fail to render (they show up as boxes of Unicode codepoints, like 𧦛). I would worry about whether ToPinyin could handle these simplified variants correctly if they were chosen as the normalized form.

3. Alternatives to copying and embedding the dictionary

Import and Rework the dictionary to be a key-value binding of each variant, ...

Does the import here mean something like embedding the kVariants.txt inside dictionaries/txt/cjk/... directly?

@choznerol, at least yes! 😄

#144 (comment)

If there is a preferred way to improve vendoring the dictionary (e.g. create a crate for this?), I'd love to look into it, but probably in a separate follow-up PR.

PR checklist

Please check if your PR fulfills the following requirements:

Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
Have you read the contributing guidelines?
Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

Close meilisearch#144

ManyTheFish · 2022-11-14T13:24:58Z

Hello @choznerol, Nice PR!
The overall code logic is good, so I'll only suggest small things.

1. Also normalizing old and wrong! variants

There are also old and wrong! variants in kVariants.txt. I didn't see a reason not to handle them as well, so they are also converted.

I'm trusting you on this. 😄

2. Confirm direction of conversion

A lot of simplified variants seem to fail to render (they show up as boxes of Unicode codepoints, like 𧦛). I would worry about whether ToPinyin could handle these simplified variants correctly if they were chosen as the normalized form.

I tried to convert every codepoint of the dictionary into Pinyin to see what proportion we manage to convert. Most of the time, the pinyin normalizer manages to convert the output variants, but sometimes the output has no conversion where the input does.
If we want to maximize the Pinyin conversion, then I suggest changing the normalizer into:

        // Normalize Z, Simplified, Semantic, Old, and Wrong variants
        let kvariant = match KVARIANTS.get(&c) {
            Some(kvariant) => kvariant.destination_ideograph,
            None => c,
        };

        // Normalize to Pinyin
        // If we don't manage to convert the kvariant, we try to convert the original character.
        // If none of them are converted, we return the kvariant.
        match kvariant.to_pinyin().or_else(|| c.to_pinyin()) {
            Some(converted) => {
                let with_tone = converted.with_tone();

                Some(with_tone.to_string().into())
            }
            None => Some(kvariant.into()),
        }

However, if you think that maximizing the Pinyin conversion is a bad idea, let me know.
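The fallback order in the snippet above can be tried in isolation with a minimal, std-only sketch. The `toy_kvariant` and `toy_pinyin` functions and their ASCII entries are hypothetical placeholders standing in for `KVARIANTS` and `to_pinyin()`; they are not real dictionary data.

```rust
/// Toy stand-in for the kVariants lookup (hypothetical entries, not real data).
fn toy_kvariant(c: char) -> Option<char> {
    match c {
        'a' => Some('A'),
        'b' => Some('B'),
        'c' => Some('C'),
        _ => None,
    }
}

/// Toy stand-in for a Pinyin conversion (hypothetical entries, not real data).
fn toy_pinyin(c: char) -> Option<&'static str> {
    match c {
        'A' => Some("pa"), // only the destination converts
        'b' => Some("pb"), // only the source converts
        _ => None,
    }
}

/// Normalize the variant first, then try Pinyin on the destination,
/// then on the original character, and finally fall back to the destination.
fn normalize(c: char) -> String {
    let kvariant = toy_kvariant(c).unwrap_or(c);
    match toy_pinyin(kvariant).or_else(|| toy_pinyin(c)) {
        Some(pinyin) => pinyin.to_string(),
        None => kvariant.to_string(),
    }
}
```

With these toy tables, `normalize('b')` falls back to the source character's conversion, while `normalize('c')` returns the destination character unchanged because neither side converts.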

3. Alternatives to copying and embedding the dictionary

If there is a preferred way to improve vendoring the dictionary (e.g. create a crate for this?), I'd love to look into it, but probably in a separate follow-up PR.

Yes! I'd prefer a dictionary that keeps only the useful data, in order to reduce the crate size.
So converting each line from 釠 (U+91E0) wrong! 亂 (U+4E82) to something like 釠,wrong!,亂, or an even more compressed form, would be better.
Your suggestion of creating a separate crate is a good idea. Note that we already have several dictionaries that were manually converted and pushed raw into Charabia, like the nonspacing-marks.
If you are interested in creating it, I just ask that you do it in a separate PR! 😬
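As a rough illustration of the compressed line format suggested above, here is a minimal sketch that drops the codepoint annotations. The function name `compress_line` is hypothetical, and the sketch assumes each entry is whitespace-separated in the order ideograph, codepoint, relation, ideograph, codepoint.

```rust
/// Turn "釠 (U+91E0) wrong! 亂 (U+4E82)" into "釠,wrong!,亂".
/// Returns None when the line has fewer fields than expected.
fn compress_line(line: &str) -> Option<String> {
    let mut fields = line.split_whitespace();
    let source = fields.next()?;
    let _source_codepoint = fields.next()?; // e.g. "(U+91E0)", dropped
    let relation = fields.next()?;
    let destination = fields.next()?;
    // The trailing destination codepoint, if present, is dropped as well.
    Some(format!("{},{},{}", source, relation, destination))
}
```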

ManyTheFish · 2022-11-14T13:27:00Z

src/normalizer/chinese.rs

@@ -4,6 +4,9 @@ use super::CharNormalizer;
 use crate::detection::{Language, Script};
 use crate::normalizer::CharOrStr;
 use crate::Token;
+use kvariants::KVARIANTS;
+
+mod kvariants;

 /// Normalize Chinese characters by converting them into Pinyin characters.


We should change this documentation

Added b25c6d3


choznerol · 2022-11-19T04:59:21Z

I tried to convert every codepoint of the dictionary into Pinyin to see what proportion we manage to convert. Most of the time, the pinyin normalizer manages to convert the output variants, but sometimes the output has no conversion where the input does.
If we want to maximize the Pinyin conversion, then I suggest changing the normalizer into:

However, if you think that maximizing the Pinyin conversion is a bad idea, let me know.

I also tried the to_pinyin() conversion in fae1c05 for both the source ideograph and destination ideograph. The result can be found on demo_(simplified|all)_kvariant_to_pinyin_converiton:

Here is my interpretation of demo_simplified_kvariant_to_pinyin_converiton:

2541 SAME_PINYINs: Most kvariant normalization actually has no effect on the final pinyin.
1053 SOURCE_NO_PINYINs: These simplified variants can now benefit from the kvariant normalization. That's great.
211 BOTH_NO_PINYINs: The kvariant normalization has no effect on the final result. There's nothing we can do here.
156 DIFFERENT_PINYINs: From roughly browsing through the cases I can recognize, I can't easily tell which side consistently gives the better pinyin. Sometimes it's just due to lack of context. For example, 辗/輾 can be either zhǎn or niǎn depending on whether it means "toss and turn" or "rolling" in the current context.
8 DESTINATION_NO_PINYINs: Your snippet from 2. of Normalize Chinese by Z, Simplified, Semantic, Old, and Wrong variants #162 (comment) can fix this! Added e3fb3fd.

This interpretation also applies to demo_all_kvariant_to_pinyin_converiton
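The five categories above can be expressed as a small classification over the two optional conversion results. This is only a sketch of the idea: the category names come from the list above, while the `Outcome` enum and `classify` function are hypothetical and not the code from the demo commit.

```rust
#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    SamePinyin,
    SourceNoPinyin,
    BothNoPinyin,
    DifferentPinyin,
    DestinationNoPinyin,
}

/// Compare the Pinyin of the source ideograph with the Pinyin of the
/// destination ideograph; `None` means the character has no conversion.
fn classify(source: Option<&str>, destination: Option<&str>) -> Outcome {
    match (source, destination) {
        (None, None) => Outcome::BothNoPinyin,
        (None, Some(_)) => Outcome::SourceNoPinyin,
        (Some(_), None) => Outcome::DestinationNoPinyin,
        (Some(s), Some(d)) if s == d => Outcome::SamePinyin,
        (Some(_), Some(_)) => Outcome::DifferentPinyin,
    }
}
```

Only the `DestinationNoPinyin` case loses information when normalizing to the destination first, which is the case the source-ideograph fallback covers.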


choznerol · 2022-11-19T05:14:53Z

3. Alternatives to copying and embedding the dictionary

If there is a preferred way to improve vendoring the dictionary (e.g. create a crate for this?), I'd love to look into it, but probably in a separate follow-up PR.

Yes! I'd prefer a dictionary that keeps only the useful data, in order to reduce the crate size. So converting each line from 釠 (U+91E0) wrong! 亂 (U+4E82) to something like 釠,wrong!,亂, or an even more compressed form, would be better. Your suggestion of creating a separate crate is a good idea. Note that we already have several dictionaries that were manually converted and pushed raw into Charabia, like the nonspacing-marks. If you are interested in creating it, I just ask that you do it in a separate PR! 😬

Ah, I see, that makes sense! I am interested in addressing this. Will do in another follow-up PR.


ManyTheFish

Well, what an excellent job!
Let's merge it. Then, if you want to dig deeper into creating an isolated crate for the dictionaries, I'll let you open a new PR for it, and I'll create a new crate name accordingly.

Thanks again for your work on this!

Bors merge

bors · 2022-11-21T16:07:51Z

Build succeeded:

tests

choznerol · 2022-11-22T12:56:47Z

src/normalizer/chinese/kvariants.rs

+    //   㓻 (U+34FB)	sem	    剛 (U+525B)
+    //   ...
+    //
+    let file = fs::File::open("dictionaries/txt/chinese/kVariants.tsv").unwrap();


Hi @ManyTheFish, I just realized (during the dictionary compression survey) that I probably should have used include_bytes! instead of File::open here. I'm not sure whether this will break when the crate is packaged and released. I'm working on a follow-up PR to fix it.

Hello @choznerol, nice! Thank you!
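For context, the difference is that `File::open` resolves its path at run time relative to the working directory, while `include_str!` and `include_bytes!` embed the file contents at compile time. A minimal sketch of parsing an embedded dictionary follows; `SAMPLE_TSV` and `parse_kvariants` are hypothetical names, and the inline constant stands in for what `include_str!` would return so the example compiles without the dictionary file.

```rust
use std::collections::HashMap;

// In a real crate this string would come from `include_str!` with a path
// relative to the source file. An inline two-line sample is used here instead.
const SAMPLE_TSV: &str = "㓻 (U+34FB)\tsem\t剛 (U+525B)\n釠 (U+91E0)\twrong!\t亂 (U+4E82)\n";

/// Build a source -> destination map from tab-separated lines of the form
/// "<ideograph> (U+XXXX)<TAB><relation><TAB><ideograph> (U+XXXX)".
fn parse_kvariants(tsv: &str) -> HashMap<char, char> {
    let mut map = HashMap::new();
    for line in tsv.lines() {
        let mut fields = line.split('\t');
        if let (Some(src), Some(_relation), Some(dst)) =
            (fields.next(), fields.next(), fields.next())
        {
            // The first character of each field is the ideograph itself.
            if let (Some(s), Some(d)) =
                (src.trim().chars().next(), dst.trim().chars().next())
            {
                map.insert(s, d);
            }
        }
    }
    map
}
```

Because the data is part of the compiled binary, no file lookup happens at run time.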

165: Fix incorrect File::read for kVariants.tsv r=ManyTheFish a=choznerol

# Pull Request

## Related issue
Fixes https://github.com/meilisearch/charabia/pull/162/files#r1029294766

## What does this PR do?
In #162, I used `File::open` to import `kVariants.tsv`, and I'm not sure it will still work once packaged into a crate. In this PR I switch to `include_str!` instead.

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

Co-authored-by: Lawrence Chou <[email protected]>

Normalize Chinese by Z, Simplified, Semantic, Old, and Wrong variants

4a8e204

Close meilisearch#144

ManyTheFish requested changes Nov 14, 2022


choznerol added 3 commits November 15, 2022 21:24

Update comment to mention kvariant normalization

b25c6d3

Demo pinyin conversion with or without kvariant normalization

fae1c05

Revert "Demo pinyin conversion with or without kvariant normalization"

7c10087

This reverts commit fae1c05.

Fallback to source idograph for pinyin after kvariant normalization

e3fb3fd

Co-authored-by: ManyTheFish <[email protected]>

Merge branch 'main' into 144/enhance-chinese-normalizer-by-unifying-z…

70e15c4

…-simplified-and-semantic-variants

ManyTheFish approved these changes Nov 21, 2022


bors bot merged commit 514ae5c into meilisearch:main Nov 21, 2022

choznerol deleted the 144/enhance-chinese-normalizer-by-unifying-z-simplified-and-semantic-variants branch November 22, 2022 01:28

choznerol commented Nov 22, 2022


ns-ychou mentioned this pull request Nov 22, 2022

Import kVariants.tsv correctly with include_str! instead of File::open #164

Closed

3 tasks

choznerol mentioned this pull request Nov 22, 2022

Fix incorrect File::read for kVariants.tsv #165

Merged

3 tasks
