Add Japanese normalizer to cover Katakana to Hiragana #149
Conversation
This message is sent automatically. Hello @choznerol,

Force-pushed from 61fabac to 9eafd31 (compare)
Force-pushed from 9eafd31 to 26fb497 (compare)
src/normalizer/japanese.rs
Outdated
```rust
debug_assert!(
    token.lemma().len() == new_lemma.len(),
    concat!(
        r#"`to_hiragana` changed the lemma len from {} to {} but the current `char_map` computation "#,
        r#"expected them to be equal. If `to_hiragana` does change len of char somehow, consider "#,
        r#"calling `to_hiragana(char)` char by char instead of only calling `to_hiragana(lemma)` once."#
    ),
    token.lemma().len(),
    new_lemma.len()
);
let old_new_chars = token.lemma().chars().zip(new_lemma.chars());
```
Unlike `to_pinyin`, which operates on each char, `to_hiragana()` can operate on the whole lemma at once. This provides some performance benefit because each `to_hiragana()` call has some branching cost. However, calling `to_hiragana(lemma)` instead of `to_hiragana(char)` also makes the `char_map` implementation a bit more complicated, and I am not sure about the trade-off here.
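For illustration, the main katakana block maps to hiragana by a fixed code-point offset, so a char-by-char conversion can be sketched without any crate at all. This is a hypothetical sketch of the per-char approach, not wana_kana's actual implementation:

```rust
// Hypothetical per-char sketch (not the wana_kana API): katakana in the
// block U+30A1..=U+30F6 maps to hiragana by subtracting 0x60, so a
// char-by-char conversion is possible and keeps UTF-8 byte lengths equal.
fn katakana_char_to_hiragana(c: char) -> char {
    match c {
        '\u{30A1}'..='\u{30F6}' => {
            // Subtracting 0x60 lands inside the hiragana block, so the
            // result is always a valid char.
            char::from_u32(c as u32 - 0x60).unwrap()
        }
        _ => c, // leave hiragana, kanji, latin, etc. untouched
    }
}

fn to_hiragana_char_by_char(lemma: &str) -> String {
    lemma.chars().map(katakana_char_to_hiragana).collect()
}

fn main() {
    // Katakana converts; kanji and existing hiragana pass through.
    assert_eq!(to_hiragana_char_by_char("ダメ駄目だめ"), "だめ駄目だめ");
    println!("{}", to_hiragana_char_by_char("ダメ駄目だめ"));
}
```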
You seem to be 100% sure that this conversion will always keep the same length as the original character. In that case, the char_map is useless and shouldn't be created, so I suggest rewriting your normalizer without bothering with the char map. This would simplify your implementation a lot. 😊
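The length claim can be checked directly: kana in both the hiragana and katakana blocks encode as 3 bytes in UTF-8, so the char_map such a conversion would produce is always the identity. A small sketch (mirroring the `(u8, u8)` char_map shape used in this PR; names are illustrative):

```rust
fn main() {
    // "ダ" (U+30C0) and "だ" (U+3060) both encode as 3 bytes in UTF-8,
    // so katakana-to-hiragana conversion preserves str::len().
    let katakana = "ダメ";
    let hiragana = "だめ";
    assert_eq!(katakana.len(), hiragana.len()); // 6 bytes each

    // The char_map built from such a conversion is the identity mapping:
    let char_map: Vec<(u8, u8)> = hiragana
        .chars()
        .map(|c| (c.len_utf8() as u8, c.len_utf8() as u8))
        .collect();
    assert_eq!(char_map, vec![(3, 3), (3, 3)]);
    println!("identical byte lengths: {}", katakana.len());
}
```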
Hello @choznerol,
I requested some changes to your PR; they should simplify the implementation.
Overall the PR is great, thanks!
src/normalizer/japanese.rs
Outdated
```rust
if is_hiragana(token.lemma()) {
    // No need to convert
    if options.create_char_map && token.char_map.is_none() {
        let mut char_map = Vec::new();
        for c in token.lemma().chars() {
            char_map.push((c.len_utf8() as u8, c.len_utf8() as u8));
        }
        token.char_map = Some(char_map);
    }
} else {
```
Having an `Identity` char map is the same as not having one, so don't bother creating it.
Suggested change:
```diff
-if is_hiragana(token.lemma()) {
-    // No need to convert
-    if options.create_char_map && token.char_map.is_none() {
-        let mut char_map = Vec::new();
-        for c in token.lemma().chars() {
-            char_map.push((c.len_utf8() as u8, c.len_utf8() as u8));
-        }
-        token.char_map = Some(char_map);
-    }
-} else {
+if !is_hiragana(token.lemma()) {
```
Oh, I originally thought we must create one when `create_char_map` is true 😅 It seems this is not the case, so the implementation can be much simpler. Thanks for the suggestion!
Addressed in 49bc172
src/normalizer/japanese.rs
Outdated
I only manually tested the mappings I found in the crate repo. To be more confident about the assertion, I searched the unit tests about …

@choznerol, in this case just ignore the char_map creation for the implementation of this Normalizer 👍
Requesting changes related to the discussion on the dedicated issue.
Thanks a lot for your investment in this work! 👍
Cargo.toml
Outdated
```diff
@@ -22,6 +22,7 @@ unicode-segmentation = "1.6.0"
 whatlang = "0.16.1"
 lindera = { version = "=0.16.0", features = ["ipadic"], optional = true }
 pinyin = { version = "0.9", default-features = false, features = ["with_tone"], optional = true }
+wana_kana = "2.1.0"
```
Suggested change:
```diff
-wana_kana = "2.1.0"
+wana_kana = { version = "2.1.0", optional = true }
```
Cargo.toml
Outdated
```diff
@@ -22,6 +22,7 @@ unicode-segmentation = "1.6.0"
 whatlang = "0.16.1"
 lindera = { version = "=0.16.0", features = ["ipadic"], optional = true }
 pinyin = { version = "0.9", default-features = false, features = ["with_tone"], optional = true }
+wana_kana = "2.1.0"

 [features]
 default = ["chinese", "hebrew", "japanese", "thai"]
```
Suggested change:
```diff
 default = ["chinese", "hebrew", "japanese", "thai"]
+# allow japanese character transliteration (put this under line 38)
+japanese-transliteration = ["dep:wana_kana"]
```
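Taken together, the Cargo.toml suggestions above would leave the manifest looking roughly like this (a sketch; other entries are elided and unchanged):

```toml
[dependencies]
# ...existing dependencies...
wana_kana = { version = "2.1.0", optional = true }

[features]
default = ["chinese", "hebrew", "japanese", "thai"]
# allow japanese character transliteration
japanese-transliteration = ["dep:wana_kana"]
```

The `dep:` prefix ties the feature to the optional dependency without exposing an implicit `wana_kana` feature.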
src/normalizer/mod.rs
Outdated
```diff
@@ -3,6 +3,8 @@ use once_cell::sync::Lazy;
 #[cfg(feature = "chinese")]
 pub use self::chinese::ChineseNormalizer;
 pub use self::control_char::ControlCharNormalizer;
+#[cfg(feature = "japanese")]
```
Suggested change:
```diff
-#[cfg(feature = "japanese")]
+#[cfg(feature = "japanese-transliteration")]
```
src/normalizer/mod.rs
Outdated
```diff
@@ -12,6 +14,8 @@ use crate::Token;
 #[cfg(feature = "chinese")]
 mod chinese;
 mod control_char;
+#[cfg(feature = "japanese")]
```
Suggested change:
```diff
-#[cfg(feature = "japanese")]
+#[cfg(feature = "japanese-transliteration")]
```
src/normalizer/mod.rs
Outdated
```diff
@@ -22,6 +26,8 @@ pub static NORMALIZERS: Lazy<Vec<Box<dyn Normalizer>>> = Lazy::new(|| {
 Box::new(LowercaseNormalizer),
 #[cfg(feature = "chinese")]
 Box::new(ChineseNormalizer),
+#[cfg(feature = "japanese")]
```
Suggested change:
```diff
-#[cfg(feature = "japanese")]
+#[cfg(feature = "japanese-transliteration")]
```
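As a sanity check on the gating pattern being requested here, items under `#[cfg(feature = "...")]` are compiled out entirely when the feature is disabled. A minimal, self-contained sketch (the function names are hypothetical, not from this PR):

```rust
// Sketch of conditional compilation: when the crate is built without the
// feature, the gated function does not exist and the fallback is used.
#[cfg(feature = "japanese-transliteration")]
fn active_normalizers() -> Vec<&'static str> {
    vec!["lowercase", "japanese-transliteration"]
}

#[cfg(not(feature = "japanese-transliteration"))]
fn active_normalizers() -> Vec<&'static str> {
    vec!["lowercase"]
}

fn main() {
    // Built without the feature, only the fallback is compiled in.
    println!("{:?}", active_normalizers());
}
```

This is why the review asks for `japanese-transliteration` rather than `japanese`: users who want Japanese segmentation but not transliteration can disable the dependency entirely.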
Perfect!
I let bors run the tests, and if everything goes well, your PR will be automatically merged!
Thank you for your time!
bors merge
Build succeeded.

This message is sent automatically. Thank you for contributing to Meilisearch. If you are participating in Hacktoberfest and would like to receive a gift from Meilisearch too, please complete this form.
Pull Request
Related issue
Fixes #131
What does this PR do?
Limitations

Converting from Kanji is not supported. From #131:

After some experiments and checking convert options, it seems like wana_kana does not support converting Kanji to Hiragana or Romaji. For example:
- `to_hiragana("ダメ駄目だめ")` will be `"だめ駄目だめ"`
- `to_romaji("ダメ駄目だめ")` will be `"dame駄目dame"`
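This limitation is structural: kana-to-kana conversion is a per-character mapping within the kana blocks, while kanji live in a separate Unicode block and have context-dependent readings that no character mapper can resolve. A rough, hypothetical classifier over the relevant blocks (not part of this PR) makes the split visible:

```rust
// Classify a char by the Unicode block relevant to this normalizer.
// Ranges: hiragana U+3041..=U+3096, katakana U+30A1..=U+30FA,
// CJK Unified Ideographs (kanji) U+4E00..=U+9FFF.
fn block(c: char) -> &'static str {
    match c {
        '\u{3041}'..='\u{3096}' => "hiragana",
        '\u{30A1}'..='\u{30FA}' => "katakana",
        '\u{4E00}'..='\u{9FFF}' => "kanji",
        _ => "other",
    }
}

fn main() {
    // "駄" and "目" fall in the kanji block, so a kana converter
    // has no table entry for them and passes them through unchanged.
    for c in "ダメ駄目だめ".chars() {
        println!("{c}: {}", block(c));
    }
}
```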
PR checklist
Please check if your PR fulfills the following requirements:
Thank you so much for contributing to Meilisearch!