This repository was archived by the owner on Apr 4, 2023. It is now read-only.

Store detected Language per document during indexing #646

@ManyTheFish

⚠️ This issue is not an easy one: it requires some knowledge of Rust and more work than the other issues.
I strongly encourage beginners to pick another issue.

Summary

Meilisearch automatically detects the Script and the Language during indexing and search.
Because search queries are usually short, it is almost impossible to reliably detect the Language they use.
However, during indexing, Meilisearch receives complete documents, on which Language detection is easier. So, instead of knowing the Language used in the search query, we could know the Languages used in the data being searched.

related to: meilisearch/product#532 (reply in thread)

Technical approach

Create a new database

The first step is to create a new database in the Index, named script_language_docids. Its key is the Script concatenated with the Language, and its value is a RoaringBitmap containing all the docids concerned. Be aware that the key needs a specialized codec.
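As a rough illustration of what such a key codec could look like, here is a minimal sketch using only the standard library. The `Script` and `Language` enums, the `encode_key`/`decode_key` functions, and the choice of a zero byte as separator are all assumptions made for this example; the real types come from the tokenizer, and the real codec would plug into the key-value store's encoding traits.

```rust
// Hypothetical stand-ins for the tokenizer's Script and Language types.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Script {
    Latin,
    Cyrillic,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Language {
    Eng,
    Fra,
    Rus,
}

impl Script {
    fn name(self) -> &'static str {
        match self {
            Script::Latin => "Latin",
            Script::Cyrillic => "Cyrillic",
        }
    }

    fn from_name(name: &str) -> Option<Self> {
        match name {
            "Latin" => Some(Script::Latin),
            "Cyrillic" => Some(Script::Cyrillic),
            _ => None,
        }
    }
}

impl Language {
    fn name(self) -> &'static str {
        match self {
            Language::Eng => "Eng",
            Language::Fra => "Fra",
            Language::Rus => "Rus",
        }
    }

    fn from_name(name: &str) -> Option<Self> {
        match name {
            "Eng" => Some(Language::Eng),
            "Fra" => Some(Language::Fra),
            "Rus" => Some(Language::Rus),
            _ => None,
        }
    }
}

/// Encodes the key as `<script name>\0<language name>`.
/// The zero-byte separator is an assumption of this sketch.
pub fn encode_key(script: Script, language: Language) -> Vec<u8> {
    let mut key = Vec::new();
    key.extend_from_slice(script.name().as_bytes());
    key.push(0);
    key.extend_from_slice(language.name().as_bytes());
    key
}

/// Decodes a key produced by `encode_key`.
/// Returns `None` if the separator is missing or a name is unknown.
pub fn decode_key(bytes: &[u8]) -> Option<(Script, Language)> {
    let sep = bytes.iter().position(|b| *b == 0)?;
    let script = std::str::from_utf8(&bytes[..sep]).ok()?;
    let language = std::str::from_utf8(&bytes[sep + 1..]).ok()?;
    Some((Script::from_name(script)?, Language::from_name(language)?))
}
```

Putting the Script first means that all Languages of a given Script share a key prefix, which could make prefix iteration by Script possible if that turns out to be useful.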

related files:

Extract and index data

During word position extraction, we should store the detected Languages in a hashmap that links each of them to the docids where it was found, so that the hashmap can be sent to the main thread at the end of the extraction task.
The main thread will then have to store this data in the script_language_docids database.
Be aware that the same document can contain several Languages, so its docid should be stored in the values of several Script/Language pairs.
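The extraction and merge steps described above could be sketched as follows. This is only an illustration under stated assumptions: `BTreeSet<u32>` stands in for RoaringBitmap, a `BTreeMap` stands in for the script_language_docids database, and the enums and the `record_document`/`merge_into_database` functions are hypothetical names, not existing code.

```rust
use std::collections::{BTreeMap, BTreeSet, HashMap};

// Hypothetical stand-ins for the tokenizer's Script and Language types.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub enum Script {
    Latin,
    Cyrillic,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub enum Language {
    Eng,
    Fra,
    Rus,
}

pub type DocumentId = u32;
// Assumption: BTreeSet stands in for RoaringBitmap in this sketch.
pub type DocIds = BTreeSet<DocumentId>;
// The hashmap built by the extractor and sent to the main thread.
pub type ScriptLanguageDocids = HashMap<(Script, Language), DocIds>;

/// Called by the extractor once per document, with every
/// Script/Language pair detected in that document.
pub fn record_document(
    extracted: &mut ScriptLanguageDocids,
    docid: DocumentId,
    detected: &[(Script, Language)],
) {
    for pair in detected {
        // A document with several Languages ends up under several keys.
        extracted.entry(*pair).or_default().insert(docid);
    }
}

/// Called on the main thread: merges the extracted hashmap into the
/// database (here a BTreeMap) by taking the union of docids per key.
pub fn merge_into_database(
    database: &mut BTreeMap<(Script, Language), DocIds>,
    extracted: ScriptLanguageDocids,
) {
    for (key, docids) in extracted {
        database.entry(key).or_default().extend(docids);
    }
}
```

With a real RoaringBitmap, the merge step would be a bitmap union rather than `extend`, but the shape of the data flow stays the same.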

related files:

Delete data

When removing documents, we should take care to remove the corresponding docids from the script_language_docids database.
Likewise, when the whole index is cleared, the script_language_docids database should be cleared too.
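A minimal sketch of both deletion paths, under the same assumptions as before: `BTreeSet<u32>` stands in for RoaringBitmap, a `BTreeMap` stands in for the database, and all names are hypothetical. Dropping entries whose docid set becomes empty is a design choice of this sketch, not something the issue specifies.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical stand-ins for the tokenizer's Script and Language types.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Script {
    Latin,
    Cyrillic,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Language {
    Eng,
    Rus,
}

// Assumption: BTreeSet stands in for RoaringBitmap, BTreeMap for the database.
pub type DocIds = BTreeSet<u32>;
pub type Database = BTreeMap<(Script, Language), DocIds>;

/// Removes the deleted docids from every Script/Language entry and
/// drops the entries that no longer reference any document.
pub fn remove_documents(database: &mut Database, deleted: &DocIds) {
    database.retain(|_, docids| {
        docids.retain(|docid| !deleted.contains(docid));
        !docids.is_empty()
    });
}

/// Clears the whole database, as should happen when the index is cleared.
pub fn clear_documents(database: &mut Database) {
    database.clear();
}
```

For example, deleting docid 1 from a database where it is the only Cyrillic/Rus document removes that key entirely, while a Latin/Eng entry holding docids 1 and 2 keeps only docid 2.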

related files:

Todo

  • create a new database
    • implementation
  • update this database during indexing
    • implementation
    • tests
  • update this database during deletion
    • implementation
    • tests
