Store detected Language per document during indexing #646
Description
⚠️ This issue is not an easy one: it requires some knowledge of Rust and more work than the other issues.
I highly encourage beginners to pick another issue.
Summary
Meilisearch automatically detects the Script and the Language during indexing and search.
Because search queries only contain short texts, it is almost impossible to reliably detect the Language used.
However, during indexing, Meilisearch receives complete documents, on which it is easier to detect the Language. So, instead of knowing the Language used in the search query, we could know the Language used in the data being searched.
related to: meilisearch/product#532 (reply in thread)
Technical approach
Create a new database
The first step is to create a new database in the Index, named script_language_docids, that stores as the key the Script concatenated to the Language, and as the value a RoaringBitmap containing all the concerned docids.
Be aware that the key needs a specialized codec.
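As a rough illustration of what such a key codec could look like, here is a minimal sketch using only the standard library. The layout (script name, a 0x00 separator, then language name) and the function names encode_key and decode_key are assumptions made for this example, not the codec that must be used. In the real index the two parts would be the tokenizer's Script and Language types, and the codec would plug into the database layer's encoding traits.

```rust
/// Hypothetical key layout: `<script name> 0x00 <language name>`.
/// Plain strings stand in for the real Script and Language types.
pub fn encode_key(script: &str, language: &str) -> Option<Vec<u8>> {
    // The separator byte must not appear inside either part,
    // otherwise the key could not be decoded unambiguously.
    if script.as_bytes().contains(&0) || language.as_bytes().contains(&0) {
        return None;
    }
    let mut bytes = Vec::with_capacity(script.len() + 1 + language.len());
    bytes.extend_from_slice(script.as_bytes());
    bytes.push(0);
    bytes.extend_from_slice(language.as_bytes());
    Some(bytes)
}

/// Splits a key on the first 0x00 byte and returns `(script, language)`.
/// Returns `None` if the separator is missing or a part is not valid UTF-8.
pub fn decode_key(bytes: &[u8]) -> Option<(&str, &str)> {
    let sep = bytes.iter().position(|b| *b == 0)?;
    let script = std::str::from_utf8(&bytes[..sep]).ok()?;
    let language = std::str::from_utf8(&bytes[sep + 1..]).ok()?;
    Some((script, language))
}
```

Putting the Script first means that all keys for one Script are contiguous in a sorted key-value store, which could make per-Script prefix iteration possible.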
related files:
Extract and index data
During word position extraction, we should store the detected Languages in a hashmap linked with the docids, in order to send the hashmap to the main thread at the end of the extraction task.
Then the main thread will have to store this data in the script_language_docids database.
Be aware that the same document can contain several Languages, and so should be indexed under several Script/Language pairs.
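The flow described above could be sketched as follows. This is only an illustration under stated assumptions: BTreeSet<u32> stands in for RoaringBitmap, a pair of strings stands in for the Script/Language pair, a HashMap stands in for the script_language_docids database, and the names record_document and merge_into are hypothetical.

```rust
use std::collections::{BTreeSet, HashMap};

/// Stand-in for the (Script, Language) pair used as the database key.
pub type ScriptLanguage = (String, String);
/// Stand-in for a RoaringBitmap of document ids.
pub type DocIds = BTreeSet<u32>;

/// Extraction side: registers one document under every Script/Language
/// pair detected in it. A document may appear under several pairs.
pub fn record_document(
    map: &mut HashMap<ScriptLanguage, DocIds>,
    docid: u32,
    detected: &[(&str, &str)],
) {
    for &(script, language) in detected {
        map.entry((script.to_owned(), language.to_owned()))
            .or_default()
            .insert(docid);
    }
}

/// Main-thread side: merges the hashmap received from an extraction task
/// into the (here in-memory) database by unioning the docid sets.
pub fn merge_into(
    database: &mut HashMap<ScriptLanguage, DocIds>,
    extracted: HashMap<ScriptLanguage, DocIds>,
) {
    for (key, docids) in extracted {
        database.entry(key).or_default().extend(docids);
    }
}
```

Merging by union rather than overwriting keeps the docids that earlier indexing batches already stored under the same pair.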
related files:
Delete data
When removing documents, we should take care to remove the corresponding docids from the script_language_docids database.
Then, when the database is cleared, the script_language_docids database should be cleared too.
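A minimal sketch of both deletion paths, again with a HashMap and BTreeSet<u32> standing in for the database and RoaringBitmap. The function names remove_docids and clear_all are hypothetical, and dropping entries whose set becomes empty is an assumption of this sketch rather than a stated requirement.

```rust
use std::collections::{BTreeSet, HashMap};

/// Removes the deleted docids from every Script/Language entry.
/// Entries left with no docids are dropped (an assumed design choice).
pub fn remove_docids(
    database: &mut HashMap<(String, String), BTreeSet<u32>>,
    to_delete: &BTreeSet<u32>,
) {
    database.retain(|_key, docids| {
        docids.retain(|docid| !to_delete.contains(docid));
        !docids.is_empty()
    });
}

/// Clearing everything: the Script/Language entries are wiped as well.
pub fn clear_all(database: &mut HashMap<(String, String), BTreeSet<u32>>) {
    database.clear();
}
```

With a real RoaringBitmap, the inner loop would more likely be a single set difference between the stored bitmap and the bitmap of deleted docids.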
related files:
Todo
- create a new database
  - implementation
- update this database during indexing
  - implementation
  - tests
- update this database during deletion
  - implementation
  - tests