
How to deal with potentially dangerous or broken HTML, some options: #1409

@berkes

Description

My content contains potentially dangerous and often broken HTML: I don't control the source.
On output this breaks the page and may allow XSS injection.

Describe the solution you'd like
Either one, or more, of the following:

  • Documented best practices WRT indexing HTML.
  • Documented best practices WRT rendering indexed (potentially broken or dirty) HTML. Highlighting included.
  • Have the _formatted fields strip all HTML after cropping, and before highlighting.
  • Have the _formatted fields encode all HTML after cropping and after the highlight positions are determined, but before inserting the highlight em-tags (a rough sketch of this ordering follows the list).
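
For clarity, a minimal sketch of the ordering proposed in the last bullet, written as standalone TypeScript rather than engine code; the function names and the match representation are assumptions for illustration, not Meilisearch internals:

```typescript
// Hypothetical sketch: crop around a match, HTML-encode the cropped text,
// and only then insert the highlight tags, so they cannot be confused with
// markup coming from the document itself.
interface Match {
  start: number;  // offset of the matched term in the original field value
  length: number;
}

function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function formatField(value: string, match: Match, cropLength = 50): string {
  // 1. Crop around the match.
  const cropStart = Math.max(0, match.start - cropLength);
  const cropEnd = Math.min(value.length, match.start + match.length + cropLength);
  const before = value.slice(cropStart, match.start);
  const matched = value.slice(match.start, match.start + match.length);
  const after = value.slice(match.start + match.length, cropEnd);

  // 2. Encode whatever HTML the document contained.
  // 3. Insert the em-tags last, so they are the only markup left in _formatted.
  return `${escapeHtml(before)}<em>${escapeHtml(matched)}</em>${escapeHtml(after)}`;
}
```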

Describe alternatives you've considered
I see several options and would like to discuss those before writing a PR:

  1. Leave it outside the responsibility of Meilisearch. In that case a warning in the documentation about potential XSS is probably prudent, and the user would need to:
    1. Sanitize, encode or strip HTML before indexing.
    2. Sanitize, encode or strip HTML client-side, before returning it to the user, as is done in e.g. html sanitize #539 (see the sketch after this list).
    3. Sanitize, encode or strip HTML in a proxy or backend. The (HTML/JS) client would request from a custom backend which searches in Meilisearch and then sanitizes the results.
  2. Always encode or strip HTML before placing the fields in _formatted. Meilisearch would be opinionated, choose one approach and enforce that for all users. This is how e.g. Elasticsearch does its result_fields.
  3. Have an option to encode or strip HTML. A user would add an additional option, e.g. attributesToStrip=*,overview, alongside the crop and highlight features, which would either encode or strip the configured fields.
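
To make the client-side option (1.ii) more concrete, here is a rough sketch assuming the meilisearch JavaScript client and the DOMPurify sanitizer; the articles index, the overview field and the #results element are invented for the example:

```typescript
import { MeiliSearch } from "meilisearch";
import DOMPurify from "dompurify";

const client = new MeiliSearch({ host: "http://localhost:7700" });

async function renderResults(query: string) {
  const { hits } = await client.index("articles").search(query, {
    attributesToHighlight: ["overview"],
  });

  const list = document.querySelector("#results")!;
  for (const hit of hits) {
    // Allow only the <em> highlight tags; any other markup the indexed
    // document contained is stripped before it reaches innerHTML.
    const safe = DOMPurify.sanitize(hit._formatted?.overview ?? "", {
      ALLOWED_TAGS: ["em"],
    });
    list.insertAdjacentHTML("beforeend", `<li>${safe}</li>`);
  }
}
```

The downside noted under 1.ii below still applies: when a highlight tag lands inside broken markup from the document, sanitizing after the fact tends to garble or drop it.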

Please check the tests I added in 6b0f4eb, which demonstrate the problems caused by cropping and highlighting. The XSS itself is not covered there, as it is implied by those tests but not directly related to cropping and highlighting.
The alternatives have their downsides and upsides:

  • 1.i: Will lose information that the tokenizer might (in future?) employ to improve ranking, e.g. ranking a word in <strong> higher than one in <small>.
  • 1.ii: Makes the highlighter complex and often non-functional; in many cases it will show garbled content to users.
  • 1.iii: Requires a server-side app. Useful when one is already needed for e.g. access control, but adds complexity and performance overhead.
  • 1: Any of these do allow great control over the output for several use-cases. E.g. an implementor might want to keep tags like p and br to have some line support, or em and strong to have some markup, but may want to remove headers, a-tags etc. (a sketch of this follows the list).
  • 2: Is simple and straightforward for the implementor but gives little control. It also requires some thinking on how to deal with cropping offsets and highlighting: should those apply to the sanitized, cropped, or stripped version, or to the original? Since both the _formatted and original fields are returned, users still have the ability to use the raw, original value and e.g. sanitize that themselves.
  • 3: Is more complex to implement. In the simplest form it may cover some use-cases, but it will always exclude some (like stripping only attributes or only certain tags, etc.). Or it would quickly turn into a full-blown HTML-sanitizer feature that is controlled through search parameters, which sounds like a bad plan to me. :D
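
As a counterpart to the tradeoff under alternative 1 above, here is a sketch of sanitizing before indexing that keeps the tags mentioned there; it assumes the meilisearch JavaScript client and the sanitize-html package, with invented index and field names:

```typescript
import { MeiliSearch } from "meilisearch";
import sanitizeHtml from "sanitize-html";

const client = new MeiliSearch({ host: "http://localhost:7700" });

interface Article {
  id: number;
  overview: string; // may contain broken or dangerous HTML from the source
}

async function indexArticles(articles: Article[]) {
  const cleaned = articles.map((article) => ({
    ...article,
    // Keep some structure (paragraphs, line breaks, emphasis) but drop
    // headers, links, scripts and all attributes before indexing.
    overview: sanitizeHtml(article.overview, {
      allowedTags: ["p", "br", "em", "strong"],
      allowedAttributes: {},
    }),
  }));

  await client.index("articles").addDocuments(cleaned);
}
```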
