Add an allowlist to the tokenizer builder #132

Today, Charabia automatically detects the Language of the provided text and chooses the best tokenization pipeline accordingly.

#### drawback
Detection is sometimes inaccurate, mainly when the provided text is short, and the user has no way to manually specify which Languages the text contains.

#### enhancement
Add a new setting to the `TokenizerBuilder` that restricts detection to a subset of Languages; when the subset leaves no choice to make (a single Language), skip detection and pick the specialized pipeline directly.
[Whatlang](https://crates.io/crates/whatlang), the library used to detect the Language, provides a way to set a subset of Languages that can be detected with the [Detector::with_allowlist](https://docs.rs/whatlang/latest/whatlang/struct.Detector.html) method.
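The selection rule described above could look roughly like the sketch below. It deliberately does not call Whatlang: `Language`, `pick_language`, and the `detect` callback are hypothetical stand-ins, and treating an empty allowlist as "no restriction" is an assumption, not something this issue specifies.

```rust
// Hypothetical stand-in for Charabia's `Language` type (illustration only).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Language {
    Eng,
    Fra,
    Jpn,
}

/// Picks a Language given an optional allowlist.
///
/// `detect` stands in for the real detector: it receives the candidate
/// Languages (or `None` when detection is unrestricted).
fn pick_language<F>(allowlist: Option<&[Language]>, detect: F) -> Option<Language>
where
    F: Fn(Option<&[Language]>) -> Option<Language>,
{
    match allowlist {
        // A single candidate: nothing to choose, so detection is skipped.
        Some([only]) => Some(*only),
        // Assumption: an empty allowlist behaves like no allowlist at all.
        Some([]) | None => detect(None),
        // Several candidates: detection runs, restricted to the subset.
        many => detect(many),
    }
}
```

With a one-element allowlist the `detect` callback is never invoked, which is the "skip the detection" case; in every other case the callback decides.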

Technical approach:
1) add an optional `allowlist` parameter to the method `detect` of the `Detect` trait in [detection/mod.rs](https://github.com/meilisearch/charabia/blob/main/src/detection/mod.rs)
2) add a `segment_with_allowlist` and a `segment_str_with_allowlist` with an additional `allowlist` parameter to the `Segment` trait in [segmenter/mod.rs](https://github.com/meilisearch/charabia/blob/main/src/segmenter/mod.rs)
3) add an `allowlist` method to the `TokenizerBuilder` struct in [tokenizer.rs](https://github.com/meilisearch/charabia/blob/main/src/tokenizer.rs)

The `allowlist` should be a hashmap of `Script` -> `[Languages]`
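As a rough illustration of that shape, here is a minimal sketch of a builder holding a `Script` -> `[Languages]` map. `TokenizerBuilderSketch`, `allowed_for`, and the reduced `Script` and `Language` enums are hypothetical stand-ins, not Charabia's actual types or signatures; whether the real builder should own or borrow the map is left open.

```rust
use std::collections::HashMap;

// Hypothetical, reduced stand-ins for Charabia's `Script` and `Language`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Script {
    Latin,
    Cj,
    Arabic,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Language {
    Eng,
    Fra,
    Jpn,
}

/// Illustrative builder; only the allowlist setting is modeled.
#[derive(Default)]
struct TokenizerBuilderSketch {
    allowlist: Option<HashMap<Script, Vec<Language>>>,
}

impl TokenizerBuilderSketch {
    fn new() -> Self {
        Self::default()
    }

    /// Restricts detection, per Script, to the given Languages.
    fn allowlist(&mut self, allowlist: HashMap<Script, Vec<Language>>) -> &mut Self {
        self.allowlist = Some(allowlist);
        self
    }

    /// Languages allowed for `script`, or `None` when that Script is unrestricted.
    fn allowed_for(&self, script: Script) -> Option<&[Language]> {
        self.allowlist.as_ref()?.get(&script).map(Vec::as_slice)
    }
}
```

Keying the map by `Script` means a Script absent from the map stays unrestricted, while a Script mapped to a single Language leaves nothing to detect.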

#### Files expected to be modified
- [tokenizer.rs](https://github.com/meilisearch/charabia/blob/main/src/tokenizer.rs)
- [segmenter/mod.rs](https://github.com/meilisearch/charabia/blob/main/src/segmenter/mod.rs)
- [detection/mod.rs](https://github.com/meilisearch/charabia/blob/main/src/detection/mod.rs)

> Hey! 👋
> Before starting any implementation, make sure that you read the [CONTRIBUTING.md](https://github.com/meilisearch/charabia/blob/main/CONTRIBUTING.md#contributing) file.
> In addition to the usual rules, you can find some guides to easily implement a `Segmenter` or a `Normalizer`.
> Thanks a lot for your contribution! 🤝
