Thai language support #133
Replies: 6 comments 6 replies
-
As I wrote on Slack, I can read/write/speak Thai to some extent (not fluently, but well enough). I'm pretty new to Rust, but I would be willing to offer some support.
-
In addition, one of my friends sent me this PyThaiNLP library, and it makes me eager to understand which features you are looking for when it comes to tokenization (I've only briefly looked at the Meilisearch docs so far, and the tokenization mechanics are still a bit unclear to me).
-
Maybe try this Thai tokenizer in Rust: https://github.com/NattapongSiri/tokenizer_rs
-
Another Rust library that provides Thai word tokenization
-
Hello 👋 We have just released v0.29.0rc1, a release candidate of v0.29.0 🔥 This version introduces Thai language support. Binaries are attached to the release, or you can use the Docker image:

```
docker run -it --rm \
  -p 7700:7700 \
  getmeili/meilisearch:v0.29.0rc1
```

Let us know about any bugs or feedback! 😄 It would be really helpful. FYI, the official v0.29.0 release will be available on 3rd October.
-
Hi everyone 👋 Sorry for the late reply; I completely forgot to make an announcement here, though you may already know: we have released v0.29, which includes Thai language support 🇹🇭. (Consider using v0.29.1, which contains an important fix to make it run on Debian 10.) Thanks to everyone who helped move this topic forward. PS: I'm leaving this conversation open in case any improvements or suggestions need to be made.
-
Is your feature request related to a problem? Please describe.
Thai is written as a long string without any spaces between words, so it needs a proper tokenizer to be searchable.
If we use a simple whitespace tokenizer, each space-delimited segment becomes a single token, and only the start of that segment can match.
For example, "เขากำลังกินข้าว" ("he is eating rice") is written like "heiseatingrice", so the words in the middle won't be found with the query "eating" or "rice".
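To make the segmentation idea concrete: Thai tokenizers are typically dictionary-based. Below is a minimal longest-matching sketch in Python; the tiny `WORDS` dictionary and the `segment` helper are hypothetical and purely illustrative (real engines such as PyThaiNLP use far larger dictionaries and smarter algorithms):

```python
# Minimal dictionary-based longest-match segmenter (illustrative only).
WORDS = {"เขา", "กำลัง", "กิน", "ข้าว"}  # tiny hypothetical dictionary
MAX_LEN = max(len(w) for w in WORDS)

def segment(text: str) -> list[str]:
    """Greedily match the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in WORDS:
                tokens.append(candidate)
                i += length
                break
        else:
            # Unknown character: emit it as a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens

print(segment("เขากำลังกินข้าว"))  # ['เขา', 'กำลัง', 'กิน', 'ข้าว']
```

With tokens like these, a query for "กิน" (eat) can match, whereas whitespace tokenization would index the whole sentence as one opaque token.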
Describe the solution you'd like
Support for Thai language with proper word segmentation.
Describe alternatives you've considered
Algolia: no support for Thai. When I search for a keyword in the middle of a sentence or a string, it won't match; it only matches the start of a space-delimited segment.
Elasticsearch: does have Thai word segmentation (the thai tokenizer and icu_tokenizer), but the complexity of maintaining the service is too high (I work alone).
I am currently using SQL LIKE queries for search, which is limited in both scaling and functionality.
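For context, the SQL LIKE workaround mentioned above can be sketched as follows (a hypothetical in-memory SQLite table, purely for illustration). It does find substrings in the middle of Thai text, but every row is scanned and there is no relevance ranking, typo tolerance, or word awareness:

```python
import sqlite3

# Hypothetical in-memory table standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO posts (body) VALUES (?)",
    [("เขากำลังกินข้าว",), ("สวัสดีครับ",)],
)

# Substring search via LIKE: matches 'กิน' (eat) inside the first row,
# but only because it happens to be an exact byte-for-byte substring.
rows = conn.execute(
    "SELECT body FROM posts WHERE body LIKE ?", ("%กิน%",)
).fetchall()
print(rows)  # [('เขากำลังกินข้าว',)]
```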
Additional context
If you need more information, please let me know.