Skip to content

Improve Arabic Normalizer #204

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Apr 11, 2023
Merged

Improve Arabic Normalizer #204

merged 7 commits into from
Apr 11, 2023

Conversation

DrAliRagab
Copy link
Contributor

Pull Request

Current normalizer only remove tatweel, this commit add more normalization rules.

  • Remove all diacritics.
  • Normalize yeh.
  • Normalize alef.
  • Remove all tatweel.

Currently it's a draft until tests are added.

Related issue

Improve support for Arabic language: meilisearch/product#139

Current normalizer only remove tatweel and kashida, this commit add more normalization rules.
- Remove all diacritics.
- Normalize yeh.
- Normalize alef.
- Remove all tatweel.

Currently it's a draft until tests are added.
@DrAliRagab
Copy link
Contributor Author

Arabic alphabet:ا,ب,ت,ث,ج,ح,خ,د,ذ,ر,ز,س,ش,ص,ض,ط,ظ,ع,غ,ف,ق,ك,ل,م,ن,ه,و,ي,ء

Arabic text should be normalized by:

  • removing the arabic Tatweel (ـ) characters.
  • normalizing the arabic Alef أ,إ,آ,ٱ to ا
  • normalizing the arabic Yeh ى to ي
  • removing the arabic diacritics: Fatḥah, Damma, Kasrah, Alif Khanjariyah, Maddah, Sukun, Tanwin, Shaddah
  • Arabic diacritics: َ, ُ, ِ, ٰ, ٓ, ْ, ۡ, ً, ٍ, ٌ, ّ,
    Arabic alphabet
    Arabic diacritics
    Kashida

For normalizing Arabic char, I used match

fn normalize_arabic_char(c: char) -> Option<CharOrStr> {
    match c {
        'ـ' => None,
        'أ' | 'إ' | 'آ' | 'ٱ' => Some('ا'.into()),
        'ى' => Some('ي'.into()),
        'َ' | 'ُ' | 'ِ' | 'ٰ' | 'ٓ' | 'ْ' | 'ۡ' | 'ً' | 'ٍ' | 'ٌ' | 'ّ' => None,
        _ => Some(c.into()),
    }
}

To detect if char should be normalized, I also used match

fn is_shoud_normalize(c: char) -> bool {
    match c {
        'ـ' | 'أ' | 'إ' | 'آ' | 'ٱ' | 'ى' | 'َ' | 'ُ' | 'ِ' | 'ٰ' | 'ٓ' | 'ْ' | 'ۡ' | 'ً' | 'ٍ' | 'ٌ'
        | 'ّ' => true,
        _ => false,
    }
}

I am not sure how to do test function because I am not fully aware of these char_end, byte_end and char_map.
To test the function:

  • This: الْحُرُوف الْعَرَبِيَّة should be normalized to الحروف العربية
  • This: آليس should be normalized to اليس
  • This: إبرأهيم should be normalized to ابراهيم
  • This: يومى should be normalized to يومي

Adding support for arabic Taa Marbuta 'ة'
To test proper normalization:
- This: `النهاردة` should be normalized to `النهارده`
@DrAliRagab
Copy link
Contributor Author

c0fbe24
Adding support for arabic Taa Marbuta 'ة'
To test proper normalization:

  • This: النهاردة should be normalized to النهارده

Add tests for various Arabic normalization rules
@DrAliRagab DrAliRagab marked this pull request as ready for review April 6, 2023 12:34
@curquiza curquiza requested a review from ManyTheFish April 6, 2023 12:39
@ManyTheFish
Copy link
Member

Hello @DrAliRagab, I feel that the diacritics are already normalized by the nonspacing-marks normalizer.

@DrAliRagab
Copy link
Contributor Author

Hello @DrAliRagab, I feel that the diacritics are already normalized by the nonspacing-marks normalizer.

Yes, It is.
Using nonspacing-marks is an awesome idea. I didn't know about it.
I think I should remove the diacritics from this Arabic normalization rules and keep only Alef, Yeh, Taa Marbuta and Tatweel

Removing arabic diacritics normalization rule as it's already normalized by the [nonspacing-marks](https://github.com/meilisearch/charabia/blob/main/charabia/src/normalizer/nonspacing_mark.rs) normalizer.
@ManyTheFish
Copy link
Member

Hello @DrAliRagab,
I made a review with only a small change request, the rest is all good and ready to be merged! 😊

DrAliRagab and others added 3 commits April 6, 2023 18:03
As suggested by @ManyTheFish , Only "wasla" needs to be added
- Removing `Alef` tests as we already removed Alef normalization and implement a new `Alef wasla` test.
- fix match expression to comply with `cargo Clippy`.

Now both `cargo test` and `cargo clippy` will not through any errors
@DrAliRagab
Copy link
Contributor Author

I hope it's ready for merge now

@DrAliRagab DrAliRagab requested a review from ManyTheFish April 8, 2023 22:29
Copy link
Member

@ManyTheFish ManyTheFish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, thank you!

bors merge

@bors
Copy link
Contributor

bors bot commented Apr 11, 2023

Build succeeded:

  • tests

@bors bors bot merged commit 7349df3 into meilisearch:main Apr 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants