Normalize Chinese by Z, Simplified, Semantic, Old, and Wrong variants #162

Contributor

@choznerol choznerol commented Nov 12, 2022

Pull Request

Related issue

Fixes #144

What does this PR do?

As titled, use kVariants.txt as a dictionary to enhance Chinese normalization.

TBD

1. Also normalizing old and wrong! variants

.. , it is relevant to normalize Chinese characters by unifying Z Simplified and Semantic variants before transliterating them into Pinyin.

There are also old and wrong! variants in kVariants.txt. I didn't see a reason not to handle them as well, so they are also converted.

2. Confirm direction of conversion

For =, old, sem, and wrong! variants, I think it's obvious that we want to convert from the Source Ideograph to the Destination Ideograph. For simp, I think the same but am not 100% sure whether there are other considerations. My reasons for choosing traditional variants as the normalized form are:

  1. Traditional variants seem to be the source of truth, just as the Destination Ideograph in =, old, sem, and wrong! entries represents the source of truth.
  2. A lot of simplified variants fail to render (they show up as Unicode codepoint boxes, like 𧦛). I worry that ToPinyin might not handle these simplified variants correctly if they were chosen as the normalized form.

3. Alternatives to copying and embedding the dictionary

Import and Rework the dictionary to be a key-value binding of each variant, ...

Does the import here mean something like embedding the kVariants.txt inside dictionaries/txt/cjk/... directly?

@choznerol, at least yes! 😄

#144 (comment)

If there is a preferred way to improve vendoring the dictionary (e.g. create a crate for this?), I'd love to look into it, but probably in a separate follow-up PR.

PR checklist

Please check if your PR fulfills the following requirements:

  • Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
  • Have you read the contributing guidelines?
  • Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

@ManyTheFish
Member

ManyTheFish commented Nov 14, 2022

Hello @choznerol, Nice PR!
The whole code logic is good, so I'll only suggest small things.

1. Also normalizing old and wrong! variants

There are also old and wrong! variants in kVariants.txt. I didn't see a reason not to handle them as well, so they are also converted.

I'm trusting you on this. 😄

2. Confirm direction of conversion

A lot of simplified variants fail to render (they show up as Unicode codepoint boxes, like 𧦛). I worry that ToPinyin might not handle these simplified variants correctly if they were chosen as the normalized form.

I tried converting every codepoint in the dictionary into Pinyin to see what proportion we manage to convert. Most of the time, the Pinyin normalizer can convert the output variant, but sometimes the output has no conversion where the input does.
If we want to maximize the Pinyin conversion, then I suggest changing the normalizer to:

        // Normalize Z, Simplified, Semantic, Old, and Wrong variants
        let kvariant = match KVARIANTS.get(&c) {
            Some(kvariant) => kvariant.destination_ideograph,
            None => c,
        };

        // Normalize to Pinyin
        // If we don't manage to convert the kvariant, we try to convert the original character.
        // If none of them are converted, we return the kvariant.
        match kvariant.to_pinyin().or_else(|| c.to_pinyin()) {
            Some(converted) => {
                let with_tone = converted.with_tone();

                Some(with_tone.to_string().into())
            }
            None => Some(kvariant.into()),
        }

However, if you think that maximizing the Pinyin conversion is a bad idea, let me know.
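To make the fallback order in the snippet above easier to test in isolation, here is a minimal, dependency-free sketch. It assumes the two lookups (dictionary and Pinyin) can be passed in as plain functions; `normalize_char`, `demo_variant_of`, and `demo_to_pinyin` are hypothetical names, and the demo tables use ASCII placeholders rather than real kVariants or Pinyin data.

```rust
/// Sketch of the fallback order: try the variant's Pinyin first, then the
/// original character's Pinyin, and finally return the variant itself.
/// `variant_of` and `to_pinyin` are hypothetical stand-ins for the dictionary
/// lookup and the Pinyin conversion.
fn normalize_char(
    c: char,
    variant_of: impl Fn(char) -> Option<char>,
    to_pinyin: impl Fn(char) -> Option<String>,
) -> String {
    // Step 1: map to the destination ideograph when the dictionary has an entry.
    let kvariant = variant_of(c).unwrap_or(c);

    // Step 2: prefer the variant's Pinyin, fall back to the original's Pinyin.
    to_pinyin(kvariant)
        .or_else(|| to_pinyin(c))
        // Step 3: if neither converts, keep the normalized variant.
        .unwrap_or_else(|| kvariant.to_string())
}

/// Placeholder dictionary over ASCII letters (not real kVariants data).
fn demo_variant_of(c: char) -> Option<char> {
    match c {
        'a' => Some('A'),
        'b' => Some('B'),
        'c' => Some('C'),
        _ => None,
    }
}

/// Placeholder Pinyin table (not real Pinyin data): only 'A' and 'b' convert.
fn demo_to_pinyin(c: char) -> Option<String> {
    match c {
        'A' => Some("x".to_string()),
        'b' => Some("y".to_string()),
        _ => None,
    }
}
```

With these placeholders, `'a'` resolves through its variant, `'b'` falls back to the original character because its variant has no conversion, and `'c'` is returned as its variant `'C'` because neither form converts.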

3. Alternatives to copying and embedding the dictionary

If there is a preferred way to improve vendoring the dictionary (e.g. create a crate for this?), I'd love to look into it, but probably in a separate follow-up PR.

Yes! I'd prefer a dictionary that keeps only the useful data, in order to reduce the crate size.
So converting each line from 釠 (U+91E0) wrong! 亂 (U+4E82) to something like 釠,wrong!,亂 or an even more compressed form would be better.
Your suggestion of creating a separate crate is a good idea; note that we already have several dictionaries that were manually converted and pushed raw into Charabia, like the nonspacing-marks.
If you are interested in creating it, I'd just ask that you do it in a separate PR! 😬
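As a rough illustration of the line conversion suggested above, the sketch below drops the redundant `(U+XXXX)` fields and emits the `source,relation,destination` form. It assumes every line has exactly the whitespace-separated shape shown in the example; the function names are hypothetical and not part of Charabia.

```rust
/// Parse one line shaped like `釠 (U+91E0) wrong! 亂 (U+4E82)` into
/// `(source, relation, destination)`. Returns `None` for malformed lines.
fn parse_kvariant_line(line: &str) -> Option<(char, &str, char)> {
    let mut fields = line.split_whitespace();
    let source = single_char(fields.next()?)?;
    let _source_codepoint = fields.next()?; // e.g. "(U+91E0)", redundant with `source`
    let relation = fields.next()?; // e.g. "wrong!", "sem", "simp", "old", "="
    let destination = single_char(fields.next()?)?;
    Some((source, relation, destination))
}

/// Return the only character of `s`, or `None` if `s` is not exactly one character.
fn single_char(s: &str) -> Option<char> {
    let mut chars = s.chars();
    let c = chars.next()?;
    if chars.next().is_none() {
        Some(c)
    } else {
        None
    }
}

/// Re-serialize a line in the compact `source,relation,destination` form.
fn compact_kvariant_line(line: &str) -> Option<String> {
    let (source, relation, destination) = parse_kvariant_line(line)?;
    Some(format!("{source},{relation},{destination}"))
}
```

The codepoint annotations can be dropped because they are recoverable from the characters themselves, which is where most of the size saving would come from.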

@@ -4,6 +4,9 @@ use super::CharNormalizer;
use crate::detection::{Language, Script};
use crate::normalizer::CharOrStr;
use crate::Token;
use kvariants::KVARIANTS;

mod kvariants;

/// Normalize Chinese characters by converting them into Pinyin characters.
Member

We should change this documentation

Contributor Author

Added b25c6d3

@choznerol
Contributor Author

choznerol commented Nov 19, 2022

I tried converting every codepoint in the dictionary into Pinyin to see what proportion we manage to convert. Most of the time, the Pinyin normalizer can convert the output variant, but sometimes the output has no conversion where the input does.
If we want to maximize the Pinyin conversion, then I suggest changing the normalizer to:

However, if you think that maximizing the Pinyin conversion is a bad idea, let me know.

I also tried the to_pinyin() conversion in fae1c05 for both the source ideograph and the destination ideograph. The results can be found in demo_(simplified|all)_kvariant_to_pinyin_converiton:

demo_simplified_kvariant_to_pinyin_converiton

Here is my interpretation of demo_simplified_kvariant_to_pinyin_converiton:

  • 2541 SAME_PINYINs: Most kvariant normalizations have no effect on the final pinyin.
  • 1053 SOURCE_NO_PINYINs: These simplified variants now benefit from the kvariant normalization. That's great.
  • 211 BOTH_NO_PINYINs: The kvariant normalization has no effect on the final result. There's nothing we can do here.
  • 156 DIFFERENT_PINYINs: From roughly browsing through the cases I can recognize, I can't easily tell which side consistently gives the better pinyin. Sometimes it's just due to a lack of context; for example, 辗/輾 can be either zhǎn or niǎn depending on whether it means "toss and turn" or "rolling" in the current context.
  • 8 DESTINATION_NO_PINYINs: Your snippet from point 2 of #162 (comment) fixes these! Added in e3fb3fd.

This interpretation also applies to demo_all_kvariant_to_pinyin_converiton.
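The five buckets above can be read as a classification over two optional Pinyin results. The sketch below is one plausible reading of those categories, not the code from fae1c05; `Outcome`, `classify`, and `tally` are hypothetical names, and the Pinyin values are passed in as plain optional strings.

```rust
use std::collections::HashMap;

/// The five categories named in the interpretation above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Outcome {
    SamePinyin,
    SourceNoPinyin,
    BothNoPinyin,
    DifferentPinyin,
    DestinationNoPinyin,
}

/// Classify one dictionary entry from the Pinyin of its source ideograph and
/// of its destination ideograph (`None` means "no conversion available").
fn classify(source: Option<&str>, destination: Option<&str>) -> Outcome {
    match (source, destination) {
        (None, None) => Outcome::BothNoPinyin,
        (None, Some(_)) => Outcome::SourceNoPinyin,
        (Some(_), None) => Outcome::DestinationNoPinyin,
        (Some(s), Some(d)) if s == d => Outcome::SamePinyin,
        (Some(_), Some(_)) => Outcome::DifferentPinyin,
    }
}

/// Count how many entries fall into each category.
fn tally(pairs: &[(Option<&str>, Option<&str>)]) -> HashMap<Outcome, usize> {
    let mut counts = HashMap::new();
    for &(source, destination) in pairs {
        *counts.entry(classify(source, destination)).or_insert(0) += 1;
    }
    counts
}
```

Under this reading, only `DestinationNoPinyin` entries lose information when the variant is converted unconditionally, which is why the fallback to the original character matters for those cases.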

@choznerol
Contributor Author

3. Alternatives to copying and embedding the dictionary

If there is a preferred way to improve vendoring the dictionary (e.g. create a crate for this?), I'd love to look into it, but probably in a separate follow-up PR.

Yes! I'd prefer a dictionary that keeps only the useful data, in order to reduce the crate size. So converting each line from 釠 (U+91E0) wrong! 亂 (U+4E82) to something like 釠,wrong!,亂 or an even more compressed form would be better. Your suggestion of creating a separate crate is a good idea; note that we already have several dictionaries that were manually converted and pushed raw into Charabia, like the nonspacing-marks. If you are interested in creating it, I'd just ask that you do it in a separate PR! 😬

Ah, I see, that makes sense! I am interested in addressing this. Will do in another follow-up PR.

Member

@ManyTheFish ManyTheFish left a comment

Well, what an excellent job!
Let's merge it. Then, if you want to dig deeper into creating an isolated crate for the dictionaries, I'll let you open a new PR for it, and I'll create a new crate name accordingly.

Thanks again for your work on this!

Bors merge

@bors
Contributor

bors bot commented Nov 21, 2022

Build succeeded:

@bors bors bot merged commit 514ae5c into meilisearch:main Nov 21, 2022
@choznerol choznerol deleted the 144/enhance-chinese-normalizer-by-unifying-z-simplified-and-semantic-variants branch November 22, 2022 01:28
// 㓻 (U+34FB) sem 剛 (U+525B)
// ...
//
let file = fs::File::open("dictionaries/txt/chinese/kVariants.tsv").unwrap();
Contributor Author

@choznerol choznerol Nov 22, 2022

Hi @ManyTheFish, I just realized (while surveying dictionary compression options) that I probably should have used include_bytes! instead of File::open here. I'm not sure whether this will break once the crate is packaged and released. I'm working on a follow-up PR to fix it.
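For context on the concern: `File::open` resolves its path at runtime, so the dictionary file must exist on the machine running the program, whereas `include_str!` and `include_bytes!` copy the file into the binary at compile time. The sketch below shows the embedded approach in a self-contained way; the inline constant stands in for what `include_str!` would produce, and `variants` is a hypothetical name, not the actual Charabia code.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

// Assumption: in a real crate this constant would be produced by
// `include_str!` pointing at the dictionary file, so the data ships inside
// the compiled binary. It is inlined here to keep the example self-contained.
const EMBEDDED_DICTIONARY: &str = "釠 (U+91E0) wrong! 亂 (U+4E82)\n㓻 (U+34FB) sem 剛 (U+525B)\n";

/// Lazily build the source -> destination table from the embedded text.
/// No file system access happens at runtime.
fn variants() -> &'static HashMap<char, char> {
    static TABLE: OnceLock<HashMap<char, char>> = OnceLock::new();
    TABLE.get_or_init(|| {
        EMBEDDED_DICTIONARY
            .lines()
            .filter_map(|line| {
                let mut fields = line.split_whitespace();
                // First field: the source ideograph.
                let source = fields.next()?.chars().next()?;
                // Skip the codepoint and the relation; take the destination ideograph.
                let destination = fields.nth(2)?.chars().next()?;
                Some((source, destination))
            })
            .collect()
    })
}
```

Because the table is built from a `&'static str`, the lookup behaves the same whether the code runs from the repository checkout or from a published crate.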

Contributor Author

Filed #165

Member

Hello @choznerol, nice! Thank you!

bors bot added a commit that referenced this pull request Nov 22, 2022
165: Fix incorrect File::read for kVariants.tsv r=ManyTheFish a=choznerol

# Pull Request

## Related issue

Fixes https://github.com/meilisearch/charabia/pull/162/files#r1029294766

## What does this PR do?

In #162, I used `File::open` to import `kVariants.tsv`, and I'm not sure it will work once the crate is packaged. In this PR I switch to `include_str!` instead.

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Lawrence Chou <[email protected]>
Successfully merging this pull request may close these issues.

Enhance Chinese normalizer by unifying Z, Simplified, and Semantic variants
2 participants