Add Japanese normalizer to cover Katakana to Hiragana #149

Merged: 4 commits, Oct 17, 2022

Conversation

@choznerol (Contributor) commented Oct 8, 2022

Pull Request

Related issue

Fixes #131

What does this PR do?

  • Add a new Normalizer for Japanese, which converts Katakana to Hiragana

Limitations

Converting from Kanji is not supported.

From #131:

> ... for instance, ダメ, is also spelled 駄目, or だめ
> ... wana_kana seems promising to convert everything in Hiragana

After some experiments and a look at the available conversion options, it seems that wana_kana does not support converting Kanji to Hiragana or Romaji. For example (a runnable sketch follows the list):

  • to_hiragana("ダメ駄目だめ") will be "だめ駄目だめ"
  • to_romaji("ダメ駄目だめ") will be "dame駄目dame"
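
A minimal runnable sketch of the two cases above (not part of the PR; the module-level function paths are assumed from the wana_kana 2.x layout):

    use wana_kana::to_hiragana::to_hiragana;
    use wana_kana::to_romaji::to_romaji;

    fn main() {
        // Katakana is converted, but the Kanji spelling 駄目 passes through untouched.
        assert_eq!(to_hiragana("ダメ駄目だめ"), "だめ駄目だめ");
        assert_eq!(to_romaji("ダメ駄目だめ"), "dame駄目dame");
    }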

PR checklist

Please check if your PR fulfills the following requirements:

  • Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
  • Have you read the contributing guidelines?
  • Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

@meili-bot (Contributor) commented

This message is sent automatically

Hello @choznerol,
Thank you very much for contributing to Meilisearch ❤️.
However, the team is not available on the weekend; they will be back on Monday 😊

@choznerol force-pushed the 131/japanese-normalizer branch from 9eafd31 to 26fb497 on October 9, 2022 03:13
Comment on lines 42 to 52
    debug_assert!(
        token.lemma().len() == new_lemma.len(),
        concat!(
            r#"`to_hiragana` changed the lemma len from {} to {} but the current `char_map` computation "#,
            r#"expected them to be equal. If `to_hiragana` does change len of char somehow, consider "#,
            r#"calling `to_hiragana(char)` char by char instead of only calling `to_hiragana(lemma)` once."#
        ),
        token.lemma().len(),
        new_lemma.len()
    );
    let old_new_chars = token.lemma().chars().zip(new_lemma.chars());
Contributor Author (@choznerol):

Unlike to_pinyin, which operates on each char, to_hiragana() can operate on the whole lemma at once. This provides some performance benefit, because each to_hiragana() call has some branching cost. However, calling to_hiragana(lemma) instead of to_hiragana(char) also makes the char_map implementation a bit more complicated, and I am not sure about the trade-off here.
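
For comparison, the per-char alternative mentioned above might look roughly like this (a hypothetical sketch, not the PR's code; the helper name is illustrative):

    // One `to_hiragana` call per char makes the (old_len, new_len) byte
    // pairs for a char_map explicit, at the cost of paying the per-call
    // branching overhead once per char.
    fn to_hiragana_char_by_char(lemma: &str) -> (String, Vec<(u8, u8)>) {
        let mut new_lemma = String::with_capacity(lemma.len());
        let mut char_map = Vec::new();
        for c in lemma.chars() {
            let converted = wana_kana::to_hiragana::to_hiragana(&c.to_string());
            char_map.push((c.len_utf8() as u8, converted.len() as u8));
            new_lemma.push_str(&converted);
        }
        (new_lemma, char_map)
    }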

Member (@ManyTheFish):

You seem to be 100% sure that this conversion will always keep the same length as the original character. In that case, the char_map is useless and shouldn't be created, so I suggest rewriting your normalizer without bothering with the char_map. This would simplify your implementation a lot. 😊

@choznerol choznerol marked this pull request as ready for review October 9, 2022 03:36
@curquiza curquiza requested a review from ManyTheFish October 10, 2022 08:39
@ManyTheFish (Member) left a comment

Hello @choznerol,
I requested some changes to your PR that would simplify the implementation.
Overall, though, the PR is great, thanks!

Comment on lines 21 to 31
    if is_hiragana(token.lemma()) {
        // No need to convert

        if options.create_char_map && token.char_map.is_none() {
            let mut char_map = Vec::new();
            for c in token.lemma().chars() {
                char_map.push((c.len_utf8() as u8, c.len_utf8() as u8));
            }
            token.char_map = Some(char_map);
        }
    } else {
Member (@ManyTheFish):

Having an identity char_map is the same as not having one, so don't bother creating it.

Suggested change
-    if is_hiragana(token.lemma()) {
-        // No need to convert
-        if options.create_char_map && token.char_map.is_none() {
-            let mut char_map = Vec::new();
-            for c in token.lemma().chars() {
-                char_map.push((c.len_utf8() as u8, c.len_utf8() as u8));
-            }
-            token.char_map = Some(char_map);
-        }
-    } else {
+    if !is_hiragana(token.lemma()) {

Contributor Author (@choznerol):

Oh, I originally thought we had to create one whenever create_char_map is true 😅 Since that's not the case, the implementation can be much simpler. Thanks for the suggestion!

Contributor Author (@choznerol):

Addressed in 49bc172
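
For readers following the thread, the simplified shape after this change is roughly the following (a sketch, not the actual commit; it relies on the length-preserving property discussed above, so no char_map is built):

    use wana_kana::is_hiragana::is_hiragana;
    use wana_kana::to_hiragana::to_hiragana;

    // Only non-Hiragana lemmas need converting.
    fn normalized_lemma(lemma: &str) -> Option<String> {
        if is_hiragana(lemma) {
            None // already Hiragana, nothing to do
        } else {
            Some(to_hiragana(lemma))
        }
    }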

@choznerol (Contributor Author) commented Oct 10, 2022

> you seem to be 100% sure that this conversion will always keep the same length as the original character.

I had only manually tested the mappings I found in the crate repo. To gain more confidence in the assertion, I went through the unit tests covering to_hiragana, asserted that len() is unchanged, and all the unit tests still pass 👍 PSeitz/wana_kana_rust@805c9de
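
A minimal sketch of that length-invariance check (illustrative only, not the linked commit; the sample inputs are assumptions):

    #[test]
    fn to_hiragana_preserves_byte_len() {
        // Katakana and Hiragana chars both occupy 3 bytes in UTF-8, so a
        // pure kana conversion should not change the lemma's byte length.
        for input in ["カタカナ", "ダメ", "ダメ駄目だめ"] {
            let converted = wana_kana::to_hiragana::to_hiragana(input);
            assert_eq!(input.len(), converted.len(), "byte length changed for {input}");
        }
    }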

@ManyTheFish (Member) commented

@choznerol, in this case just ignore the char_map creation in the implementation of this Normalizer 👍

@ManyTheFish (Member) left a comment

Requesting changes related to the discussion on the dedicated issue.

Thanks a lot for your investment in this work! 👍

Cargo.toml (outdated)

@@ -22,6 +22,7 @@ unicode-segmentation = "1.6.0"
 whatlang = "0.16.1"
 lindera = { version = "=0.16.0", features = ["ipadic"], optional = true }
 pinyin = { version = "0.9", default-features = false, features = ["with_tone"], optional = true }
+wana_kana = "2.1.0"
Member (@ManyTheFish):

Suggested change
-wana_kana = "2.1.0"
+wana_kana = { version = "2.1.0", optional = true }

Cargo.toml (outdated)

@@ -22,6 +22,7 @@ unicode-segmentation = "1.6.0"
 whatlang = "0.16.1"
 lindera = { version = "=0.16.0", features = ["ipadic"], optional = true }
 pinyin = { version = "0.9", default-features = false, features = ["with_tone"], optional = true }
+wana_kana = "2.1.0"

 [features]
 default = ["chinese", "hebrew", "japanese", "thai"]
Member (@ManyTheFish):

Suggested change
 default = ["chinese", "hebrew", "japanese", "thai"]
+# allow japanese character transliteration (put this under line 38)
+japanese-transliteration = ["dep:wana_kana"]

@@ -3,6 +3,8 @@ use once_cell::sync::Lazy;
 #[cfg(feature = "chinese")]
 pub use self::chinese::ChineseNormalizer;
 pub use self::control_char::ControlCharNormalizer;
+#[cfg(feature = "japanese")]
Member (@ManyTheFish):

Suggested change
-#[cfg(feature = "japanese")]
+#[cfg(feature = "japanese-transliteration")]

@@ -12,6 +14,8 @@ use crate::Token;
 #[cfg(feature = "chinese")]
 mod chinese;
 mod control_char;
+#[cfg(feature = "japanese")]
Member (@ManyTheFish):

Suggested change
-#[cfg(feature = "japanese")]
+#[cfg(feature = "japanese-transliteration")]

@@ -22,6 +26,8 @@ pub static NORMALIZERS: Lazy<Vec<Box<dyn Normalizer>>> = Lazy::new(|| {
 Box::new(LowercaseNormalizer),
 #[cfg(feature = "chinese")]
 Box::new(ChineseNormalizer),
+#[cfg(feature = "japanese")]
Member (@ManyTheFish):

Suggested change
-#[cfg(feature = "japanese")]
+#[cfg(feature = "japanese-transliteration")]
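
Taken together, the suggested gating would make the registration read roughly like this once applied (a sketch assembled from the visible diff context; the JapaneseNormalizer name and module path are assumptions):

    #[cfg(feature = "japanese-transliteration")]
    pub use self::japanese::JapaneseNormalizer;

    #[cfg(feature = "japanese-transliteration")]
    mod japanese;

    pub static NORMALIZERS: Lazy<Vec<Box<dyn Normalizer>>> = Lazy::new(|| {
        vec![
            Box::new(LowercaseNormalizer),
            #[cfg(feature = "chinese")]
            Box::new(ChineseNormalizer),
            #[cfg(feature = "japanese-transliteration")]
            Box::new(JapaneseNormalizer),
        ]
    });

With wana_kana marked optional and tied to the japanese-transliteration feature, the dependency is only compiled when a user opts in (e.g. cargo build --features japanese-transliteration).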

@ManyTheFish (Member) left a comment

Perfect!
I'll let bors run the tests, and if everything goes well, your PR will be merged automatically!
Thank you for your time!

bors merge

@bors bot (Contributor) commented Oct 17, 2022

Build succeeded.

@bors bot merged commit 0719a97 into meilisearch:main on Oct 17, 2022
@meili-bot (Contributor) commented

This message is sent automatically

Thank you for contributing to Meilisearch. If you are participating in Hacktoberfest and would like to receive a gift from Meilisearch too, please complete this form.

Successfully merging this pull request may close these issues.

Implement a Japanese specialized Normalizer