HTML strip normalizer on the tokenizer #474
qdequele
started this conversation in
Feedback & Feature Proposal
Replies: 1 comment 3 replies
-
Hey @qdequele 👋 This could be a first step toward having a transformation pipeline. Note: there is a possible workaround, which is to strip the HTML on the user side before sending the documents (it only helps people who can manipulate the documents). It's not obvious that solving this is a priority right now, but the normalizer idea could be explored quickly. @ManyTheFish What do you think?
-
Sometimes I help users who send data containing HTML without knowing that it dramatically affects relevancy when searching, especially when words touch HTML tags.
Would it be possible to implement a normalizer that removes all HTML tags from the documents in a collection? Could it be activated/deactivated via settings, or activated by default?
I found a crate to do it 🙂.
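To make the idea concrete, here is a minimal sketch of what such a normalizer could do, using only the Rust standard library. It is not the crate mentioned above and not the tokenizer's actual normalizer API: `HtmlStripNormalizer`, its `enabled` field, and `strip_html_tags` are hypothetical names chosen for illustration. Each tag is replaced by a single space so that words touching a tag stay separate tokens.

```rust
/// Hypothetical normalizer with an on/off switch, standing in for a setting.
/// This is an illustrative type, not an existing tokenizer or engine option.
pub struct HtmlStripNormalizer {
    pub enabled: bool,
}

impl HtmlStripNormalizer {
    /// Returns the text unchanged when disabled, stripped of tags otherwise.
    pub fn normalize(&self, input: &str) -> String {
        if self.enabled {
            strip_html_tags(input)
        } else {
            input.to_string()
        }
    }
}

/// Naive tag stripper: drops every `<...>` span and leaves one space in its
/// place, so "price<strong>10</strong>" becomes "price 10".
/// A '<' only opens a tag when followed by a letter, '/' or '!', which keeps
/// plain text such as "a < b" intact.
pub fn strip_html_tags(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut in_tag = false;
    let mut chars = input.chars().peekable();

    while let Some(c) = chars.next() {
        if in_tag {
            if c == '>' {
                in_tag = false;
                // The removed tag acts as a word boundary.
                if !out.ends_with(' ') {
                    out.push(' ');
                }
            }
            continue;
        }
        let starts_tag = c == '<'
            && matches!(
                chars.peek(),
                Some(n) if n.is_ascii_alphabetic() || *n == '/' || *n == '!'
            );
        if starts_tag {
            in_tag = true;
        } else {
            out.push(c);
        }
    }

    // Remove the spaces left by leading and trailing tags.
    out.trim().to_string()
}
```

With `enabled: true`, `normalize("<p>Hello</p><b>world</b>")` yields `Hello world`; with `enabled: false` the input comes back untouched. The sketch does not decode entities such as `&amp;`, does not drop the contents of `<script>` or `<style>`, and is confused by a `>` inside an attribute value, so a dedicated HTML parsing crate would likely be a better fit for real data.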