HTML strip normalizer on the tokenizer #474

qdequele · 2022-05-20T14:01:09Z

Maintainer

I sometimes help users who index data containing HTML without realizing that it dramatically hurts search relevancy, especially when words are directly adjacent to HTML tags.

Would it be possible to implement a normalizer that removes all HTML tags from documents? Could it be activated/deactivated via settings, or enabled by default?

I found a crate to do it 🙂.

gmourier · 2022-06-02T12:19:21Z

Maintainer

Hey @qdequele 👋

This could be a first step toward having a transformation pipeline.

Note: there is a possible workaround, which is to strip the HTML on the user side before sending the documents. It is limited to people who can manipulate their documents.

It's not obvious that solving this is a priority right now, but the normalizer idea could be explored quickly.

@ManyTheFish What do you think?

3 replies

ManyTheFish Jun 2, 2022
Collaborator

Handling this in the tokenizer is possible, but it is harder than just adding a Normalizer because normalization runs after the Segmenter.
That means the text <span><a href="#">Summer</a> is nice</span> would be segmented as ["<", "span", ">", "<", "a", " ", "href", "=", "\"", "#", "\"", ">", "Summer", "<", "/", "a", ">", " ", "is", " ", "nice", "<", "/", "span", ">"] before the normalization phase.
However, it would be possible to pre-segment the text in order to isolate HTML tags and treat them as separators 🤔
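
To make the pre-segmentation idea concrete, here is a minimal sketch using only the standard library. The `Chunk` enum and the `pre_segment` function are hypothetical names made up for illustration and are not part of the tokenizer's API. The sketch assumes a tag is simply everything from a `<` to the next `>`, so it does not handle comments, scripts, or a `>` inside an attribute value.

```rust
/// A piece of pre-segmented input (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
enum Chunk<'a> {
    /// Plain text that would be handed to the regular segmenter.
    Text(&'a str),
    /// An HTML tag, to be treated as a separator.
    Tag(&'a str),
}

/// Naive pre-segmentation: everything from '<' to the next '>' is one tag.
/// Assumption: no comments, scripts, or '>' inside attribute values.
fn pre_segment(input: &str) -> Vec<Chunk<'_>> {
    let mut chunks = Vec::new();
    let mut rest = input;
    while let Some(open) = rest.find('<') {
        match rest[open..].find('>') {
            Some(close_rel) => {
                let close = open + close_rel; // byte index of '>'
                if open > 0 {
                    chunks.push(Chunk::Text(&rest[..open]));
                }
                chunks.push(Chunk::Tag(&rest[open..=close]));
                // '>' is a single byte, so close + 1 is a valid boundary.
                rest = &rest[close + 1..];
            }
            // Unmatched '<': keep the remainder as plain text.
            None => break,
        }
    }
    if !rest.is_empty() {
        chunks.push(Chunk::Text(rest));
    }
    chunks
}
```

With the example text above, this would yield the tags `<span>`, `<a href="#">`, `</a>`, and `</span>` as whole separator chunks, and only `Summer` and ` is nice` as text for the segmenter. Because every chunk is a slice of the original string, the original byte offsets stay recoverable.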

qdequele Jun 27, 2022
Maintainer Author

This crate removes all HTML tags entirely and keeps only the text.

ManyTheFish Jul 4, 2022
Collaborator

Yes @qdequele! But it will completely break highlighting if we use this library in a preprocessing step.
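
A small sketch of the likely cause, assuming highlighting relies on byte offsets into the original document: once the tags are stripped up front, match positions refer to the stripped text and no longer line up with the original. `strip_tags` below is a naive stand-in written for this example, not the crate mentioned above, and `match_offsets` is likewise a hypothetical helper.

```rust
/// Naive tag stripper (hypothetical stand-in for a real HTML-stripping crate).
/// Drops everything from '<' up to and including the next '>'.
fn strip_tags(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut in_tag = false;
    for c in input.chars() {
        match c {
            '<' => in_tag = true,
            '>' if in_tag => in_tag = false,
            _ if !in_tag => out.push(c),
            _ => {} // character inside a tag: skip it
        }
    }
    out
}

/// Returns the byte offset of `word` in the stripped text and in the
/// original text, to show that the two positions diverge.
fn match_offsets(original: &str, word: &str) -> (Option<usize>, Option<usize>) {
    let stripped = strip_tags(original);
    (stripped.find(word), original.find(word))
}
```

For `<span><a href="#">Summer</a> is nice</span>`, the word `Summer` sits at byte 0 of the stripped text but at byte 18 of the original. Applying the stripped range 0..6 to the original would highlight `<span>` instead of the word, unless a mapping back to the original offsets is kept.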
