Closed
Description
My content contains potentially dangerous and often broken HTML; I don't control the source.
On output, this breaks the page and may allow XSS injection.
Describe the solution you'd like
Either one, or more, of the following:
- Documented best practices WRT indexing HTML.
- Documented best practices WRT rendering indexed (potentially broken or dirty) HTML. Highlighting included.
- Have the `_formatted` fields strip all HTML after cropping, and before highlighting.
- Have the `_formatted` fields encode all HTML after cropping, and after highlighting, but before inserting the highlight em-tags.
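To make the last bullet concrete, here is a minimal Python sketch of the ordering I have in mind: the text is HTML-encoded piece by piece, and the em-tags are added around the matched pieces only, so they end up as the only markup in the result. The function name `highlight_escaped` and the `(start, length)` match format are assumptions for illustration, not Meilisearch's API or internals; offsets are taken to be character offsets into the raw, unescaped text.

```python
from html import escape


def highlight_escaped(text, matches, pre="<em>", post="</em>"):
    """Escape `text` and wrap each (start, length) match in pre/post tags.

    `matches` is a hypothetical list of character offsets into the raw text.
    """
    out = []
    cursor = 0
    for start, length in sorted(matches):
        if start < cursor:
            continue  # skip matches that overlap an earlier one
        # Encode the text before the match, then the match itself.
        out.append(escape(text[cursor:start]))
        out.append(pre + escape(text[start:start + length]) + post)
        cursor = start + length
    # Encode whatever follows the last match.
    out.append(escape(text[cursor:]))
    return "".join(out)


raw = "<script>alert(1)</script> shoes"
print(highlight_escaped(raw, [(26, 5)]))
# prints: &lt;script&gt;alert(1)&lt;/script&gt; <em>shoes</em>
```

Because the offsets refer to the raw text, encoding does not shift them. Doing it the other way around (encode first, then highlight using the same offsets) would put the em-tags in the wrong place.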
Describe alternatives you've considered
I see several options and would like to discuss them before writing a PR:
1. Leave it outside the responsibility of Meilisearch. In this case, a warning in the documentation about potential XSS may be prudent. The user would then need to do one of the following:
   - i. Sanitize, encode or strip HTML before indexing.
   - ii. Sanitize, encode or strip HTML client-side, before returning it to the user, as is done in e.g. html sanitize #539.
   - iii. Sanitize, encode or strip HTML in a proxy or backend. The (HTML/JS) client would send its requests to a custom backend, which searches in Meilisearch and then sanitizes the results.
2. Always encode or strip HTML before placing the fields in `_formatted`. Meilisearch would be opinionated: choose one and enforce it for all users. This is how e.g. Elasticsearch handles its result_fields.
3. Have an option to encode or strip HTML. A user would need to add an additional option, e.g. `attributesToStrip=*,overview`, alongside the crop and highlight features, which would either encode or strip the configured fields.
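As an illustration of stripping HTML before indexing (the first sub-option of leaving it to the user), here is a minimal sketch using only Python's standard-library `html.parser`, which is lenient about unclosed or mismatched tags. The names `strip_html` and `_TextExtractor` are hypothetical, and this is an assumption about what a user-side pre-processing step could look like, not something Meilisearch provides.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text content, dropping all tags and script/style bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_html(raw):
    """Return the plain text of `raw`, with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join("".join(parser.parts).split())


print(strip_html("<p>Hello <b>world</p><script>alert(1)</script>"))
# prints: Hello world
```

Note that the result is plain text, not safe HTML: character references such as `&lt;` are decoded, so the value still has to be encoded when it is rendered.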
Please see the tests I added in 6b0f4eb, which demonstrate the problems caused when cropping or highlighting. XSS itself is not covered: it is implied by those tests, but not directly related to cropping and highlighting.
Alternatives have some down- and upsides:
- 1.i: Will lose information that the tokenizer might (in future?) use to improve ranking, e.g. ranking a word in `<strong>` higher than one in `<small>`.
- 1.ii: Makes the highlighter complex and often causes it not to work. Will show garbled content to users in many cases.
- 1.iii: Requires a server-side app. Useful when one is already needed for e.g. access control, but adds complexity and performance overhead.
- 1: Any of those do allow great control over the output for several use cases. E.g. an implementor might want to keep tags like p and br to have some line support, or em and strong to have some markup, but may want to remove headers, a-tags, etc.
- 2: Is simple and straightforward for the implementor but gives little control. Also requires some thought on how to deal with cropping offsets and highlighting: should those apply to the sanitized, cropped, or stripped version, or to the original? Since both the `_formatted` and original fields are returned, users still have the ability to use the raw, original value and e.g. sanitize that themselves.
- 3: Is more complex to implement. In its simplest form, it may cover some use cases, but will always exclude others (like stripping only attributes or only certain tags, etc). Or it would quickly turn into a full-blown HTML-sanitizer feature that is controlled through search parameters, which sounds like a bad plan to me. :D
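To show the kind of per-tag control that alternative 1 leaves to the implementor, here is a rough allowlist sketch in Python, again using only the standard library. It keeps a few tags (without their attributes), drops all other tags, encodes all text, and closes anything left open. The names `AllowlistSanitizer` and `sanitize` are hypothetical; this only illustrates the idea and is not a substitute for a vetted sanitizer library.

```python
from html import escape
from html.parser import HTMLParser

VOID_TAGS = {"br"}  # tags that never get a closing tag


class AllowlistSanitizer(HTMLParser):
    """Keep allowed tags (attributes dropped), encode all text."""

    def __init__(self, allowed):
        super().__init__(convert_charrefs=True)
        self.allowed = set(allowed)
        self.out = []
        self.open_tags = []  # allowed tags currently open

    def handle_starttag(self, tag, attrs):
        if tag not in self.allowed:
            return  # drop the tag itself, keep its text content
        self.out.append(f"<{tag}>")
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray or disallowed closing tag
        # Close everything opened after `tag`, then `tag` itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        # Script/style bodies also arrive here and are shown as encoded text.
        self.out.append(escape(data))

    def result(self):
        self.close()  # flush any buffered text
        while self.open_tags:
            self.out.append(f"</{self.open_tags.pop()}>")
        return "".join(self.out)


def sanitize(raw, allowed=("p", "br", "em", "strong")):
    sanitizer = AllowlistSanitizer(allowed)
    sanitizer.feed(raw)
    return sanitizer.result()


print(sanitize('<h1>Title</h1><p>Hello <strong>world <a href="javascript:x">link</a></p>'))
# prints: Title<p>Hello <strong>world link</strong></p>
```

Even this small version needs decisions about attributes, unclosed tags and script bodies, which is the reason I think exposing such control through search parameters (alternative 3) would grow quickly.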