
How to deal with potentially dangerous or broken HTML, some options: #1409

@berkes

Description

My content contains potentially dangerous and often broken HTML: I don't control the source.
On output this breaks the page and may allow XSS injection.

Describe the solution you'd like
Either one, or more, of the following:

  • Documented best practices WRT indexing HTML.
  • Documented best practices WRT rendering indexed (potentially broken or dirty) HTML. Highlighting included.
  • Have the _formatted fields strip all HTML after cropping, and before highlighting.
  • Have the _formatted fields encode all HTML after cropping and after the highlight positions are determined, but before inserting the highlight em-tags (a rough sketch of this ordering follows the list).
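
For clarity, a minimal sketch of the ordering proposed in the last bullet, written as standalone TypeScript rather than engine code; the function names and the match representation are assumptions for illustration, not Meilisearch internals:

```typescript
// Hypothetical sketch: crop around a match, HTML-encode the cropped text,
// and only then insert the highlight tags, so they cannot be confused with
// markup coming from the document itself.
interface Match {
  start: number;  // offset of the matched term in the original field value
  length: number;
}

function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function formatField(value: string, match: Match, cropLength = 50): string {
  // 1. Crop around the match.
  const cropStart = Math.max(0, match.start - cropLength);
  const cropEnd = Math.min(value.length, match.start + match.length + cropLength);
  const before = value.slice(cropStart, match.start);
  const matched = value.slice(match.start, match.start + match.length);
  const after = value.slice(match.start + match.length, cropEnd);

  // 2. Encode whatever HTML the document contained.
  // 3. Insert the em-tags last, so they are the only markup left in _formatted.
  return `${escapeHtml(before)}<em>${escapeHtml(matched)}</em>${escapeHtml(after)}`;
}
```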

Describe alternatives you've considered
I see several options and would like to discuss those before writing a PR:

  1. Leave it outside the responsibility of Meilisearch. In that case a warning in the documentation about potential XSS is probably prudent, and the user would need to:
    1. Sanitize, encode or strip HTML before indexing.
    2. Sanitize, encode or strip HTML client-side, before returning it to the user, as is done in e.g. html sanitize #539 (see the sketch after this list).
    3. Sanitize, encode or strip HTML in a proxy or backend. The (HTML/JS) client would request from a custom backend which searches in Meilisearch and then sanitizes the results.
  2. Always encode or strip HTML before placing the fields in _formatted. Meilisearch would be opinionated, choose one approach and enforce that for all users. This is how e.g. Elasticsearch does its result_fields.
  3. Have an option to encode or strip HTML. A user would add an additional option, e.g. attributesToStrip=*,overview, alongside the crop and highlight features, which would either encode or strip the configured fields.
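
To make the client-side option (1.ii) more concrete, here is a rough sketch assuming the meilisearch JavaScript client and the DOMPurify sanitizer; the articles index, the overview field and the #results element are invented for the example:

```typescript
import { MeiliSearch } from "meilisearch";
import DOMPurify from "dompurify";

const client = new MeiliSearch({ host: "http://localhost:7700" });

async function renderResults(query: string) {
  const { hits } = await client.index("articles").search(query, {
    attributesToHighlight: ["overview"],
  });

  const list = document.querySelector("#results")!;
  for (const hit of hits) {
    // Allow only the <em> highlight tags; any other markup the indexed
    // document contained is stripped before it reaches innerHTML.
    const safe = DOMPurify.sanitize(hit._formatted?.overview ?? "", {
      ALLOWED_TAGS: ["em"],
    });
    list.insertAdjacentHTML("beforeend", `<li>${safe}</li>`);
  }
}
```

The downside noted under 1.ii below still applies: when a highlight tag lands inside broken markup from the document, sanitizing after the fact tends to garble or drop it.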

Please check the tests I added in 6b0f4eb, which demonstrate the problems caused by cropping and highlighting. The XSS itself is not covered there, as it is implied by those tests but not directly related to cropping and highlighting.
The alternatives have their downsides and upsides:

  • 1.i: Will lose information that the tokenizer might (in future?) employ to improve ranking, e.g. ranking a word in <strong> higher than one in <small>.
  • 1.ii: Makes the highlighter complex and often non-functional; in many cases it will show garbled content to users.
  • 1.iii: Requires a server-side app. Useful when one is already needed for e.g. access control, but adds complexity and performance overhead.
  • 1: Any of these do allow great control over the output for several use-cases. E.g. an implementor might want to keep tags like p and br to have some line support, or em and strong to have some markup, but may want to remove headers, a-tags etc. (a sketch of this follows the list).
  • 2: Is simple and straightforward for the implementor but gives little control. It also requires some thinking on how to deal with cropping offsets and highlighting: should those apply to the sanitized, cropped, or stripped version, or to the original? Since both the _formatted and original fields are returned, users still have the ability to use the raw, original value and e.g. sanitize that themselves.
  • 3: Is more complex to implement. In the simplest form it may cover some use-cases, but it will always exclude some (like stripping only attributes or only certain tags, etc.). Or it would quickly turn into a full-blown HTML-sanitizer feature that is controlled through search parameters, which sounds like a bad plan to me. :D
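
As a counterpart to the tradeoff under alternative 1 above, here is a sketch of sanitizing before indexing that keeps the tags mentioned there; it assumes the meilisearch JavaScript client and the sanitize-html package, with invented index and field names:

```typescript
import { MeiliSearch } from "meilisearch";
import sanitizeHtml from "sanitize-html";

const client = new MeiliSearch({ host: "http://localhost:7700" });

interface Article {
  id: number;
  overview: string; // may contain broken or dangerous HTML from the source
}

async function indexArticles(articles: Article[]) {
  const cleaned = articles.map((article) => ({
    ...article,
    // Keep some structure (paragraphs, line breaks, emphasis) but drop
    // headers, links, scripts and all attributes before indexing.
    overview: sanitizeHtml(article.overview, {
      allowedTags: ["p", "br", "em", "strong"],
      allowedAttributes: {},
    }),
  }));

  await client.index("articles").addDocuments(cleaned);
}
```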
