Description
In an Information Retrieval (IR) context, removing nonspacing marks such as diacritics is a good way to increase recall without losing much precision, for example in Latin, Arabic, or Hebrew text.
Technical Approach
Implement a new Normalizer, named `NonspacingMarkNormalizer`, that removes the nonspacing marks from a provided token (find a naive implementation with the exhaustive list in the Misc section).
Because there are many sparse character ranges to match, a big forest of `if` statements would be an inefficient way to check whether a character is a nonspacing mark.
I therefore suggest trying several alternatives to the naive implementation below in a small local project and benchmarking them against each other.
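One candidate worth benchmarking is a sorted range table queried with a binary search. The sketch below is only an illustration under stated assumptions: the table `NONSPACING_MARK_RANGES` is a small, partial subset chosen for the example (not the exhaustive list from the Misc section), and `remove_nonspacing_marks` is a hypothetical helper name, not the repository's Normalizer API.

```rust
use std::cmp::Ordering;

/// Partial, illustrative table of nonspacing-mark ranges (inclusive bounds),
/// sorted by start and non-overlapping. This is NOT the exhaustive list.
const NONSPACING_MARK_RANGES: &[(u32, u32)] = &[
    (0x0300, 0x036F), // Combining Diacritical Marks
    (0x0591, 0x05BD), // Hebrew accents and points
    (0x064B, 0x065F), // Arabic vowel marks and related signs
    (0x0670, 0x0670), // Arabic letter superscript alef
];

/// Returns true if `c` falls inside one of the ranges of the table above.
/// The lookup is a binary search, so its cost grows logarithmically
/// with the number of ranges instead of linearly.
pub fn is_nonspacing_mark(c: char) -> bool {
    let cp = c as u32;
    NONSPACING_MARK_RANGES
        .binary_search_by(|&(start, end)| {
            if end < cp {
                Ordering::Less // range lies entirely before the code point
            } else if start > cp {
                Ordering::Greater // range lies entirely after the code point
            } else {
                Ordering::Equal // code point is inside the range
            }
        })
        .is_ok()
}

/// Hypothetical helper: drops every character matched by `is_nonspacing_mark`.
pub fn remove_nonspacing_marks(token: &str) -> String {
    token.chars().filter(|c| !is_nonspacing_mark(*c)).collect()
}
```

Note that this only strips marks that appear as separate code points: a precomposed character such as U+00E9 is left untouched by the sketch.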
Interesting Rust Crates
- hyperfine: a small command-line tool to benchmark several binaries
- roaring-rs: a bitmap data structure that has an efficient `contains` method
- once_cell: a good library to create lazy statics, already used in the repository
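To give an idea of how a lazily built set with a `contains` method could look, here is a std-only sketch. It deliberately swaps in stand-ins so that it compiles without dependencies: a plain `Vec<u64>` bitset (the hypothetical `MarkSet` type) takes the place of a roaring bitmap, and `std::sync::OnceLock` takes the place of once_cell. The `RANGES` table is again a small illustrative subset, not the exhaustive list.

```rust
use std::sync::OnceLock;

/// Partial, illustrative ranges (inclusive bounds); NOT the exhaustive list.
const RANGES: &[(u32, u32)] = &[
    (0x0300, 0x036F), // Combining Diacritical Marks
    (0x0591, 0x05BD), // Hebrew accents and points
    (0x064B, 0x065F), // Arabic vowel marks and related signs
    (0x0670, 0x0670), // Arabic letter superscript alef
];

/// Hypothetical fixed-size bitset, used here as a std-only stand-in
/// for a roaring bitmap. One bit per code point up to the largest range end.
struct MarkSet {
    words: Vec<u64>,
}

impl MarkSet {
    fn from_ranges(ranges: &[(u32, u32)]) -> Self {
        let max = ranges.iter().map(|&(_, end)| end).max().unwrap_or(0);
        let mut words = vec![0u64; (max as usize / 64) + 1];
        for &(start, end) in ranges {
            for cp in start..=end {
                words[cp as usize / 64] |= 1u64 << (cp % 64);
            }
        }
        MarkSet { words }
    }

    /// Constant-time membership test; code points past the end are absent.
    fn contains(&self, cp: u32) -> bool {
        self.words
            .get(cp as usize / 64)
            .map_or(false, |&w| (w >> (cp % 64)) & 1 == 1)
    }
}

/// Builds the set on first use and reuses it afterwards (lazy static).
fn mark_set() -> &'static MarkSet {
    static SET: OnceLock<MarkSet> = OnceLock::new();
    SET.get_or_init(|| MarkSet::from_ranges(RANGES))
}

pub fn is_nonspacing_mark(c: char) -> bool {
    mark_set().contains(c as u32)
}
```

A flat bitset like this trades memory for speed; whether a compressed bitmap such as roaring-rs does better on the full list is exactly what the benchmark should tell.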
Misc
- naive implementation of `is_nonspacing_mark`
- related discussion about the Arabic Language
Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the general rules, you can find some guides to easily implement a `Segmenter` or a `Normalizer`.
Thanks a lot for your contribution! 🤝