Description
Today, Charabia automatically detects the Language of the provided text and chooses the best tokenization pipeline accordingly.
drawback
The detection is sometimes inaccurate, mainly when the provided text is short, and the user can't manually specify which Languages the provided text contains.
enhancement
Add a new setting to the TokenizerBuilder that forces the detection to choose from a subset of Languages; when that subset leaves no choice to make, skip the detection and directly pick the specialized pipeline.
Whatlang, the library used to detect the Language, provides a way to set a subset of Languages that can be detected with the Detector::with_allowlist method.
Technical approach:
- add an optional allowlist parameter to the detect method of the Detect trait in detection/mod.rs
- add a segment_with_allowlist and a segment_str_with_allowlist method with an additional allowlist parameter to the Segment trait in segmenter/mod.rs
- add an allowlist method to the TokenizerBuilder struct in tokenizer.rs
The allowlist should be a hashmap of Script -> [Languages]
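To make the intended behavior concrete, here is a minimal, standard-library-only sketch of how such an allowlist could drive the detection step. It is not Charabia's actual code: the Script and Language enums below are reduced stand-ins, and the Decision type and decide function are hypothetical names introduced only for illustration. It also assumes that "no choices" means a single allowed Language for the Script, and that a missing or empty entry means unrestricted detection.

```rust
use std::collections::HashMap;

// Hypothetical, reduced stand-ins for Charabia's Script and Language enums.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Script {
    Latin,
    Cj,
    Hebrew,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    Eng,
    Fra,
    Cmn,
    Jpn,
    Heb,
}

/// Hypothetical outcome of consulting the allowlist for one Script.
#[derive(Debug, PartialEq, Eq)]
pub enum Decision<'a> {
    /// Exactly one Language is allowed: skip detection and use it directly.
    Forced(Language),
    /// Several Languages are allowed: run detection restricted to this subset.
    DetectAmong(&'a [Language]),
    /// No restriction applies to this Script: run the regular detection.
    DetectAny,
}

/// Decide how to handle `script` given an optional Script -> [Languages] allowlist.
///
/// Assumption: a missing or empty entry is treated as "no restriction".
pub fn decide<'a>(
    script: Script,
    allowlist: Option<&'a HashMap<Script, Vec<Language>>>,
) -> Decision<'a> {
    match allowlist.and_then(|map| map.get(&script)) {
        Some(langs) if langs.len() == 1 => Decision::Forced(langs[0]),
        Some(langs) if !langs.is_empty() => Decision::DetectAmong(langs),
        _ => Decision::DetectAny,
    }
}
```

In this sketch, the DetectAmong case is where the subset would be handed to Whatlang through Detector::with_allowlist, while the Forced case never calls the detector at all.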
Files expected to be modified
Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝