Arabic script: Implement specialized Segmenter #133

@ManyTheFish

Currently, the Arabic script is segmented only on whitespace and punctuation.

Drawback

As noted in the dedicated discussion on Arabic Language support and the linked issues, agglutinated words are not segmented. For example, from one of the comments:

the agglutinated word الشجرة => The Tree is a combination of الـ and شجرة
الـ is equivalent to The and it's always connected (not space separated) to the next word.

Enhancement

We should find a specialized segmenter for the Arabic script or, failing that, a dictionary that lets us implement our own segmenter, inspired by the Thai Segmenter.
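
To make the dictionary idea concrete, here is a minimal sketch of what such a split could look like for the definite article alone. It is not the crate's Segmenter API: `split_article` and the `stems` dictionary are hypothetical names used only for illustration, and a real segmenter would also need to handle other attached particles. The dictionary check is there because some words, such as التهاب ("inflammation"), begin with the letters ال without carrying the article.

```rust
use std::collections::HashSet;

/// The Arabic definite article (alif + lam), written without the tatweel.
const DEFINITE_ARTICLE: &str = "ال";

/// Splits a leading definite article off `word`, but only when the remainder
/// is found in `stems` (a hypothetical dictionary of known stems).
/// Words that do not match are returned unchanged as a single segment.
fn split_article<'a>(word: &'a str, stems: &HashSet<&str>) -> Vec<&'a str> {
    match word.strip_prefix(DEFINITE_ARTICLE) {
        Some(rest) if stems.contains(rest) => vec![DEFINITE_ARTICLE, rest],
        _ => vec![word],
    }
}

fn main() {
    // Toy dictionary containing only the stem from the example above.
    let stems: HashSet<&str> = ["شجرة"].iter().copied().collect();

    for word in ["الشجرة", "التهاب"].iter() {
        println!("{} -> {:?}", word, split_article(word, &stems));
    }
}
```

With this toy dictionary, الشجرة is split into two segments, while التهاب stays whole because its remainder is not a known stem. The quality of the split depends entirely on dictionary coverage, which is why finding a good dictionary comes first.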


Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝
