Thai language support #133
Replies: 6 comments 6 replies
-
As I wrote on Slack, I can read/write/speak Thai to some extent (not fluently, but well enough). I'm pretty new to Rust, but I would be willing to offer some support.
-
In addition, one of my friends sent me this PyThaiNLP library, and it makes me eager to understand which features you are looking for when it comes to tokenization (I've only briefly looked at the Meilisearch docs so far, and the tokenization mechanics are still a bit unclear to me).
-
Maybe try this Thai tokenizer in Rust: https://github.com/NattapongSiri/tokenizer_rs
-
Another Rust library that provides Thai word tokenization
-
Hello 👋 We have just released v0.29.0rc1, a release candidate of v0.29.0 🔥 This version introduces Thai language support. Binaries are attached to the release, or you can use the Docker image:

```
docker run -it --rm \
  -p 7700:7700 \
  getmeili/meilisearch:v0.29.0rc1
```

Let us know about any bugs or feedback! 😄 It would be really helpful. FYI, the official v0.29.0 release will be available on 3rd October.
-
Hi everyone 👋 Sorry for the late reply; I completely forgot to make an announcement here, though you may already know: we have released v0.29, which includes Thai language support 🇹🇭. (Consider using v0.29.1, which contains an important fix to make it run on Debian 10.) Thanks to everyone who helped move this topic forward. PS: I'm leaving this conversation open in case any improvements or suggestions need to be made.
-
Is your feature request related to a problem? Please describe.
Thai is written as a long string without any spaces between words, so it needs a proper tokenizer to be searchable.
If we use a simple whitespace tokenizer, each space-delimited segment becomes a single token, and only the start of that segment can match.
For example, "เขากำลังกินข้าว" ("he is eating rice") is written like "heiseatingrice", so the words in the middle won't be found with the query "eating" or "rice".
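To make the segmentation idea concrete: Thai tokenizers are typically dictionary-based. Below is a minimal longest-matching sketch in Python; the tiny `WORDS` dictionary and the `segment` helper are hypothetical and purely illustrative (real engines such as PyThaiNLP use far larger dictionaries and smarter algorithms):

```python
# Minimal dictionary-based longest-match segmenter (illustrative only).
WORDS = {"เขา", "กำลัง", "กิน", "ข้าว"}  # tiny hypothetical dictionary
MAX_LEN = max(len(w) for w in WORDS)

def segment(text: str) -> list[str]:
    """Greedily match the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in WORDS:
                tokens.append(candidate)
                i += length
                break
        else:
            # Unknown character: emit it as a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens

print(segment("เขากำลังกินข้าว"))  # ['เขา', 'กำลัง', 'กิน', 'ข้าว']
```

With tokens like these, a query for "กิน" (eat) can match, whereas whitespace tokenization would index the whole sentence as one opaque token.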
Describe the solution you'd like
Support for Thai language with proper word segmentation.
Describe alternatives you've considered
Algolia: no support for Thai. When I search for a keyword in the middle of a sentence or a string, it won't match; it only matches the start of a space-delimited segment.
Elasticsearch: does have Thai word segmentation (the thai tokenizer and icu_tokenizer), but the complexity of maintaining the service is too high (I work alone).
I am currently using SQL LIKE queries for search, which is limited in both scaling and functionality.
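For context, the SQL LIKE workaround mentioned above can be sketched as follows (a hypothetical in-memory SQLite table, purely for illustration). It does find substrings in the middle of Thai text, but every row is scanned and there is no relevance ranking, typo tolerance, or word awareness:

```python
import sqlite3

# Hypothetical in-memory table standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO posts (body) VALUES (?)",
    [("เขากำลังกินข้าว",), ("สวัสดีครับ",)],
)

# Substring search via LIKE: matches 'กิน' (eat) inside the first row,
# but only because it happens to be an exact byte-for-byte substring.
rows = conn.execute(
    "SELECT body FROM posts WHERE body LIKE ?", ("%กิน%",)
).fetchall()
print(rows)  # [('เขากำลังกินข้าว',)]
```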
Additional context
If you need more information, please let me know.