Improve Arabic Normalizer #204
The current normalizer only removes tatweel (kashida). This commit adds more normalization rules:
- Remove all diacritics.
- Normalize yeh.
- Normalize alef.
- Remove all tatweel.

Currently it's a draft until tests are added.
Arabic alphabet: Arabic text should be normalized by:
For normalizing Arabic:

    fn normalize_arabic_char(c: char) -> Option<CharOrStr> {
        match c {
            'ـ' => None,
            'أ' | 'إ' | 'آ' | 'ٱ' => Some('ا'.into()),
            'ى' => Some('ي'.into()),
            'َ' | 'ُ' | 'ِ' | 'ٰ' | 'ٓ' | 'ْ' | 'ۡ' | 'ً' | 'ٍ' | 'ٌ' | 'ّ' => None,
            _ => Some(c.into()),
        }
    }

To detect if a character should be normalized:

    fn is_shoud_normalize(c: char) -> bool {
        match c {
            'ـ' | 'أ' | 'إ' | 'آ' | 'ٱ' | 'ى' | 'َ' | 'ُ' | 'ِ' | 'ٰ' | 'ٓ' | 'ْ' | 'ۡ' | 'ً' | 'ٍ' | 'ٌ'
            | 'ّ' => true,
            _ => false,
        }
    }

I am not sure how to write the test function because I am not fully aware of these
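One way the two functions above could be exercised is sketched below. This is not charabia's test harness: the sketch assumes a simplified return type (`Option<char>` instead of `Option<CharOrStr>`) and a hypothetical string-level helper `normalize_arabic`, so that it compiles on its own. Characters are written as Unicode escapes, based on a reading of the literals quoted above.

```rust
// Simplified stand-in for the proposed function: returns Option<char>
// instead of charabia's Option<CharOrStr> (assumption, for self-containment).
fn normalize_arabic_char(c: char) -> Option<char> {
    match c {
        // Tatweel is dropped.
        '\u{0640}' => None,
        // Alef with hamza above/below, with madda, and alef wasla -> bare alef.
        '\u{0623}' | '\u{0625}' | '\u{0622}' | '\u{0671}' => Some('\u{0627}'),
        // Alef maksura -> yeh.
        '\u{0649}' => Some('\u{064A}'),
        // Diacritics listed in the proposal are dropped.
        '\u{064B}'..='\u{0653}' | '\u{0670}' | '\u{06E1}' => None,
        _ => Some(c),
    }
}

// Same character set as the match above (name spelled as in the thread).
fn is_shoud_normalize(c: char) -> bool {
    matches!(
        c,
        '\u{0640}' | '\u{0623}' | '\u{0625}' | '\u{0622}' | '\u{0671}' | '\u{0649}'
            | '\u{064B}'..='\u{0653}' | '\u{0670}' | '\u{06E1}'
    )
}

// Hypothetical helper: applies the per-character rule to a whole string.
fn normalize_arabic(text: &str) -> String {
    text.chars().filter_map(normalize_arabic_char).collect()
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn removes_tatweel_and_diacritics() {
        // Kaf, tatweel, teh, alef, beh -> tatweel removed.
        assert_eq!(
            normalize_arabic("\u{0643}\u{0640}\u{062A}\u{0627}\u{0628}"),
            "\u{0643}\u{062A}\u{0627}\u{0628}"
        );
        // Meem, damma, hah, fatha, meem, shadda, fatha, dal -> marks removed.
        assert_eq!(
            normalize_arabic("\u{0645}\u{064F}\u{062D}\u{064E}\u{0645}\u{0651}\u{064E}\u{062F}"),
            "\u{0645}\u{062D}\u{0645}\u{062F}"
        );
    }

    #[test]
    fn detection_agrees_with_normalization() {
        // Over the Arabic block, a character is flagged exactly when
        // normalization would change or drop it.
        for c in '\u{0600}'..='\u{06FF}' {
            assert_eq!(is_shoud_normalize(c), normalize_arabic_char(c) != Some(c));
        }
    }
}
```

The second test is a cheap guard against the two character lists drifting apart when a rule is added to one function but not the other.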
Adding support for Arabic Taa Marbuta 'ة'

To test proper normalization:
- `النهاردة` should be normalized to `النهارده`
c0fbe24
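The Taa Marbuta rule from this commit can be checked in isolation with a sketch like the following. The helper name `normalize_taa_marbuta` is hypothetical and only illustrates the mapping of 'ة' (U+0629) to 'ه' (U+0647); in the PR the rule would sit inside the per-character normalizer rather than in a separate function.

```rust
// Hypothetical standalone helper: maps Taa Marbuta (U+0629) to Heh (U+0647)
// and leaves every other character untouched.
fn normalize_taa_marbuta(text: &str) -> String {
    text.chars()
        .map(|c| if c == '\u{0629}' { '\u{0647}' } else { c })
        .collect()
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn taa_marbuta_becomes_heh() {
        // The example pair given in the commit message.
        assert_eq!(normalize_taa_marbuta("النهاردة"), "النهارده");
    }
}
```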
Hello @DrAliRagab, I feel that the diacritics are already normalized by the nonspacing-marks normalizer.
Yes, it is.
Removing the Arabic diacritics normalization rule, as diacritics are already handled by the [nonspacing-marks](https://github.com/meilisearch/charabia/blob/main/charabia/src/normalizer/nonspacing_mark.rs) normalizer.
Hello @DrAliRagab,
Co-authored-by: Many the fish <[email protected]>
As suggested by @ManyTheFish, only "wasla" needs to be added.
- Removing `Alef` tests, as Alef normalization was already removed, and implementing a new `Alef wasla` test.
- Fix match expression to comply with `cargo clippy`.

Now both `cargo test` and `cargo clippy` run without throwing any errors.
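The rule set after these review rounds might look roughly like the sketch below. It is reconstructed from the thread, not copied from the merged code: diacritics and the hamza/madda alef variants are left alone, alef wasla maps to bare alef, and the yeh and Taa Marbuta rules are assumed to remain because the thread never mentions removing them. The function name `normalize_arabic_char_reduced` and the `Option<char>` return type are hypothetical simplifications.

```rust
// Reduced per-character rule, as inferred from the review discussion
// (assumption: not the merged charabia implementation).
fn normalize_arabic_char_reduced(c: char) -> Option<char> {
    match c {
        '\u{0640}' => None,             // tatweel is dropped
        '\u{0671}' => Some('\u{0627}'), // alef wasla -> alef
        '\u{0649}' => Some('\u{064A}'), // alef maksura -> yeh (assumed kept)
        '\u{0629}' => Some('\u{0647}'), // taa marbuta -> heh (assumed kept)
        _ => Some(c),
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn alef_wasla_becomes_alef() {
        assert_eq!(normalize_arabic_char_reduced('\u{0671}'), Some('\u{0627}'));
    }

    #[test]
    fn other_alef_forms_and_diacritics_pass_through() {
        // Alef with hamza above and fatha are left for other normalizers.
        assert_eq!(normalize_arabic_char_reduced('\u{0623}'), Some('\u{0623}'));
        assert_eq!(normalize_arabic_char_reduced('\u{064E}'), Some('\u{064E}'));
    }
}
```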
I hope it's ready for merge now.
Yep, thank you!
bors merge
Build succeeded:
Pull Request
The current normalizer only removes tatweel; this commit adds more normalization rules.
Currently it's a draft until tests are added.
Related issue
Improve support for Arabic language: meilisearch/product#139