Add Japanese normalizer to cover Katakana to Hiragana #149

Merged: 4 commits, Oct 17, 2022

Conversation

@choznerol (Contributor) commented Oct 8, 2022

Pull Request

Related issue

Fixes #131

What does this PR do?

  • Add a new Normalizer for Japanese, which converts Katakana to Hiragana

Limitations

Converting from Kanji is not supported.

From #131:

> ... for instance, ダメ, is also spelled 駄目, or だめ
> ... wana_kana seems promising to convert everything in Hiragana

After some experiments and a look at the available conversion options, it seems that wana_kana does not support converting Kanji to Hiragana or Romaji. For example (a runnable sketch follows the list):

  • to_hiragana("ダメ駄目だめ") will be "だめ駄目だめ"
  • to_romaji("ダメ駄目だめ") will be "dame駄目dame"
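
A minimal runnable sketch of the two cases above (not part of the PR; the module-level function paths are assumed from the wana_kana 2.x layout):

    use wana_kana::to_hiragana::to_hiragana;
    use wana_kana::to_romaji::to_romaji;

    fn main() {
        // Katakana is converted, but the Kanji spelling 駄目 passes through untouched.
        assert_eq!(to_hiragana("ダメ駄目だめ"), "だめ駄目だめ");
        assert_eq!(to_romaji("ダメ駄目だめ"), "dame駄目dame");
    }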

PR checklist

Please check if your PR fulfills the following requirements:

  • Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
  • Have you read the contributing guidelines?
  • Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

@meili-bot (Contributor) commented

This message is sent automatically

Hello @choznerol,
Thank you very much for contributing to Meilisearch ❤️.
However, the team is not available on the weekend; they will be back on Monday 😊

@choznerol force-pushed the 131/japanese-normalizer branch from 9eafd31 to 26fb497 on October 9, 2022 03:13
Comment on lines 42 to 52
    debug_assert!(
        token.lemma().len() == new_lemma.len(),
        concat!(
            r#"`to_hiragana` changed the lemma len from {} to {} but the current `char_map` computation "#,
            r#"expected them to be equal. If `to_hiragana` does change len of char somehow, consider "#,
            r#"calling `to_hiragana(char)` char by char instead of only calling `to_hiragana(lemma)` once."#
        ),
        token.lemma().len(),
        new_lemma.len()
    );
    let old_new_chars = token.lemma().chars().zip(new_lemma.chars());
Contributor Author (@choznerol):

Unlike to_pinyin, which operates on each char, to_hiragana() can operate on the whole lemma at once. This provides some performance benefit, because each to_hiragana() call has some branching cost. However, calling to_hiragana(lemma) instead of to_hiragana(char) also makes the char_map implementation a bit more complicated, and I am not sure about the trade-off here.
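
For comparison, the per-char alternative mentioned above might look roughly like this (a hypothetical sketch, not the PR's code; the helper name is illustrative):

    // One `to_hiragana` call per char makes the (old_len, new_len) byte
    // pairs for a char_map explicit, at the cost of paying the per-call
    // branching overhead once per char.
    fn to_hiragana_char_by_char(lemma: &str) -> (String, Vec<(u8, u8)>) {
        let mut new_lemma = String::with_capacity(lemma.len());
        let mut char_map = Vec::new();
        for c in lemma.chars() {
            let converted = wana_kana::to_hiragana::to_hiragana(&c.to_string());
            char_map.push((c.len_utf8() as u8, converted.len() as u8));
            new_lemma.push_str(&converted);
        }
        (new_lemma, char_map)
    }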

Member (@ManyTheFish):

You seem to be 100% sure that this conversion will always keep the same length as the original character. In that case, the char_map is useless and shouldn't be created, so I suggest rewriting your normalizer without bothering with the char_map. This would simplify your implementation a lot. 😊

@choznerol choznerol marked this pull request as ready for review October 9, 2022 03:36
@curquiza curquiza requested a review from ManyTheFish October 10, 2022 08:39
@ManyTheFish (Member) left a comment

Hello @choznerol,
I requested some changes to your PR that would simplify the implementation.
Overall, though, the PR is great, thanks!

Comment on lines 21 to 31
    if is_hiragana(token.lemma()) {
        // No need to convert

        if options.create_char_map && token.char_map.is_none() {
            let mut char_map = Vec::new();
            for c in token.lemma().chars() {
                char_map.push((c.len_utf8() as u8, c.len_utf8() as u8));
            }
            token.char_map = Some(char_map);
        }
    } else {
Member (@ManyTheFish):

Having an identity char_map is the same as not having one, so don't bother creating it.

Suggested change
-    if is_hiragana(token.lemma()) {
-        // No need to convert
-        if options.create_char_map && token.char_map.is_none() {
-            let mut char_map = Vec::new();
-            for c in token.lemma().chars() {
-                char_map.push((c.len_utf8() as u8, c.len_utf8() as u8));
-            }
-            token.char_map = Some(char_map);
-        }
-    } else {
+    if !is_hiragana(token.lemma()) {

Contributor Author (@choznerol):

Oh, I originally thought we had to create one whenever create_char_map is true 😅 Since that's not the case, the implementation can be much simpler. Thanks for the suggestion!

Contributor Author (@choznerol):

Addressed in 49bc172
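
For readers following the thread, the simplified shape after this change is roughly the following (a sketch, not the actual commit; it relies on the length-preserving property discussed above, so no char_map is built):

    use wana_kana::is_hiragana::is_hiragana;
    use wana_kana::to_hiragana::to_hiragana;

    // Only non-Hiragana lemmas need converting.
    fn normalized_lemma(lemma: &str) -> Option<String> {
        if is_hiragana(lemma) {
            None // already Hiragana, nothing to do
        } else {
            Some(to_hiragana(lemma))
        }
    }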

@choznerol (Contributor Author) commented Oct 10, 2022

> you seem to be 100% sure that this conversion will always keep the same length as the original character.

I had only manually tested the mappings I found in the crate repo. To gain more confidence in the assertion, I went through the unit tests covering to_hiragana, asserted that len() is unchanged, and all the unit tests still pass 👍 PSeitz/wana_kana_rust@805c9de
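
A minimal sketch of that length-invariance check (illustrative only, not the linked commit; the sample inputs are assumptions):

    #[test]
    fn to_hiragana_preserves_byte_len() {
        // Katakana and Hiragana chars both occupy 3 bytes in UTF-8, so a
        // pure kana conversion should not change the lemma's byte length.
        for input in ["カタカナ", "ダメ", "ダメ駄目だめ"] {
            let converted = wana_kana::to_hiragana::to_hiragana(input);
            assert_eq!(input.len(), converted.len(), "byte length changed for {input}");
        }
    }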

@ManyTheFish (Member) commented

@choznerol, in this case just ignore the char_map creation in the implementation of this Normalizer 👍

@ManyTheFish (Member) left a comment

Requesting changes related to the discussion on the dedicated issue.

Thanks a lot for your investment in this work! 👍

Cargo.toml (outdated)

@@ -22,6 +22,7 @@ unicode-segmentation = "1.6.0"
 whatlang = "0.16.1"
 lindera = { version = "=0.16.0", features = ["ipadic"], optional = true }
 pinyin = { version = "0.9", default-features = false, features = ["with_tone"], optional = true }
+wana_kana = "2.1.0"
Member (@ManyTheFish):

Suggested change
-wana_kana = "2.1.0"
+wana_kana = { version = "2.1.0", optional = true }

Cargo.toml (outdated)

@@ -22,6 +22,7 @@ unicode-segmentation = "1.6.0"
 whatlang = "0.16.1"
 lindera = { version = "=0.16.0", features = ["ipadic"], optional = true }
 pinyin = { version = "0.9", default-features = false, features = ["with_tone"], optional = true }
+wana_kana = "2.1.0"

 [features]
 default = ["chinese", "hebrew", "japanese", "thai"]
Member (@ManyTheFish):

Suggested change
 default = ["chinese", "hebrew", "japanese", "thai"]
+# allow japanese character transliteration (put this under line 38)
+japanese-transliteration = ["dep:wana_kana"]

@@ -3,6 +3,8 @@ use once_cell::sync::Lazy;
 #[cfg(feature = "chinese")]
 pub use self::chinese::ChineseNormalizer;
 pub use self::control_char::ControlCharNormalizer;
+#[cfg(feature = "japanese")]
Member (@ManyTheFish):

Suggested change
-#[cfg(feature = "japanese")]
+#[cfg(feature = "japanese-transliteration")]

@@ -12,6 +14,8 @@ use crate::Token;
 #[cfg(feature = "chinese")]
 mod chinese;
 mod control_char;
+#[cfg(feature = "japanese")]
Member (@ManyTheFish):

Suggested change
-#[cfg(feature = "japanese")]
+#[cfg(feature = "japanese-transliteration")]

@@ -22,6 +26,8 @@ pub static NORMALIZERS: Lazy<Vec<Box<dyn Normalizer>>> = Lazy::new(|| {
 Box::new(LowercaseNormalizer),
 #[cfg(feature = "chinese")]
 Box::new(ChineseNormalizer),
+#[cfg(feature = "japanese")]
Member (@ManyTheFish):

Suggested change
-#[cfg(feature = "japanese")]
+#[cfg(feature = "japanese-transliteration")]
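
Taken together, the suggested gating would make the registration read roughly like this once applied (a sketch assembled from the visible diff context; the JapaneseNormalizer name and module path are assumptions):

    #[cfg(feature = "japanese-transliteration")]
    pub use self::japanese::JapaneseNormalizer;

    #[cfg(feature = "japanese-transliteration")]
    mod japanese;

    pub static NORMALIZERS: Lazy<Vec<Box<dyn Normalizer>>> = Lazy::new(|| {
        vec![
            Box::new(LowercaseNormalizer),
            #[cfg(feature = "chinese")]
            Box::new(ChineseNormalizer),
            #[cfg(feature = "japanese-transliteration")]
            Box::new(JapaneseNormalizer),
        ]
    });

With wana_kana marked optional and tied to the japanese-transliteration feature, the dependency is only compiled when a user opts in (e.g. cargo build --features japanese-transliteration).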

@ManyTheFish (Member) left a comment

Perfect!
I'll let bors run the tests, and if everything goes well, your PR will be merged automatically!
Thank you for your time!

bors merge

@bors bot (Contributor) commented Oct 17, 2022

Build succeeded.

@bors bot merged commit 0719a97 into meilisearch:main on Oct 17, 2022
@meili-bot (Contributor) commented

This message is sent automatically

Thank you for contributing to Meilisearch. If you are participating in Hacktoberfest and would like to receive a gift from Meilisearch too, please complete this form.

Successfully merging this pull request may close these issues.

Implement a Japanese specialized Normalizer