Description
Currently, the Arabic Script is segmented only on whitespace and punctuation.
Drawback
Following the dedicated discussion on Arabic Language support and the linked issues, agglutinated words are not segmented. For example, as one of the comments points out:
the agglutinated word الشجرة => "The Tree" is a combination of الـ and شجرة.
الـ is equivalent to "The" and is always connected (not space-separated) to the next word.
Enhancement
We should find a specialized segmenter for the Arabic Script or, failing that, a dictionary that lets us implement our own segmenter, inspired by the Thai Segmenter.
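To make the expected behaviour concrete, here is a minimal sketch of the prefix splitting described above. It is illustrative only and is not the project's API: the names `split_definite_article`, `segment`, and the `min_stem_len` parameter are hypothetical. It also assumes that every word starting with ال carries the definite article, which is not true in general (some words simply begin with those two letters), so a real segmenter would likely need a dictionary to avoid over-segmenting.

```python
# Hypothetical sketch: split the Arabic definite article from the word it is attached to.
# Assumption: any word starting with alef + lam carries the article (naive, over-segments).

AL = "\u0627\u0644"  # alef + lam: the definite article, written without the tatweel


def split_definite_article(word: str, min_stem_len: int = 2) -> list[str]:
    """Return [article, stem] if `word` starts with the article, else [word]."""
    stem = word[len(AL):]
    # Require a minimum stem length so the bare article is not split into nothing.
    if word.startswith(AL) and len(stem) >= min_stem_len:
        return [AL, stem]
    return [word]


def segment(text: str) -> list[str]:
    """Split on whitespace first, then detach the article from each word."""
    tokens: list[str] = []
    for word in text.split():
        tokens.extend(split_definite_article(word))
    return tokens


if __name__ == "__main__":
    # The example from the issue: the agglutinated word for "The Tree".
    print(segment("الشجرة"))
```

With this sketch, the word from the example above is returned as two tokens (the article and the stem), while a word without the article is returned unchanged.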
Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝