Contributor

@noeddl noeddl commented Sep 19, 2025

Fixes #821
Depends on okfde/froide#1122 (both PRs should be reviewed together)

This PR improves the handling of compound words and reduces false positives in search.

Changes

  • Compound word handling

    • Preprocesses search queries so that subtokens of compound words are connected with AND instead of OR (as described in this article). A document then matches only if it contains all subtokens.
    • Uses the output of the decompounder token filter in search queries only if the subtokens actually form the full compound word.
    • Prevents false positives caused by very short subtokens and nested subtokens.
  • General search improvements

    • Removes stop words from non-exact queries.
    • Ensures quoted phrases only match exact results.
    • Adds tests to cover and document the current search behavior.
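
A minimal sketch of the query preprocessing described above, not the code from this PR. The stop-word set, the length threshold, the field name, and the `decompound` callable are all hypothetical stand-ins for whatever the real implementation uses, and `match_phrase` is only an approximation of exact matching.

```python
import re

# Hypothetical values: the PR's actual stop-word list and threshold are not shown here.
STOP_WORDS = {"der", "die", "das", "und"}
MIN_SUBTOKEN_LEN = 4


def subtokens_form_word(word, subtokens):
    """Return True if the subtokens, joined in order, spell the whole word."""
    return len(subtokens) > 1 and "".join(subtokens) == word


def term_clause(word, subtokens, field="content"):
    """Match the word itself, or all of its subtokens together (AND)."""
    whole = {"match": {field: word}}
    usable = subtokens_form_word(word, subtokens) and all(
        len(s) >= MIN_SUBTOKEN_LEN for s in subtokens
    )
    if not usable:
        # Ignore decompounder output that is too short or does not cover the word.
        return whole
    parts = [{"match": {field: s}} for s in subtokens]
    return {
        "bool": {
            "should": [whole, {"bool": {"must": parts}}],
            "minimum_should_match": 1,
        }
    }


def build_query(text, decompound, field="content"):
    """Quoted phrases become phrase clauses; other terms lose stop words."""
    phrases = re.findall(r'"([^"]+)"', text)
    rest = re.sub(r'"[^"]*"', " ", text)
    must = [{"match_phrase": {field: p}} for p in phrases]
    for word in rest.lower().split():
        if word in STOP_WORDS:
            continue
        must.append(term_clause(word, decompound(word), field))
    # Assumption: top-level terms are combined with AND as well.
    return {"bool": {"must": must}}
```

With a toy `decompound` that maps `aktenordner` to `["akten", "ordner"]`, the query for `aktenordner` requires either the full word or both `akten` and `ordner`, while a split containing a three-letter subtoken such as `tür` is ignored.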

Notes

  • Search is still far from perfect. The decompounder algorithm itself could be improved, but this likely requires significant effort (e.g. a custom Elasticsearch/Lucene plugin).
  • A lighter alternative for improvement could be extending the decompounder dictionary.
  • Additional improvements could be achieved by using a more sophisticated stemmer or lemmatizer.
  • Longer-term, we may want to consider switching to more advanced approaches such as semantic search.

This is needed to use the recently introduced parameters `no_sub_matches`
and `no_overlapping_matches` of the hyphenation decompounder token filter.
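
For illustration, a sketch of how these two parameters might be set on a `hyphenation_decompounder` filter in the index settings. The file paths, filter and analyzer names, and the `min_subword_size` value are placeholders, not the configuration used in this PR.

```python
# Sketch only: paths, names, and sizes below are placeholders.
DECOMPOUNDER_FILTER = {
    "type": "hyphenation_decompounder",
    "hyphenation_patterns_path": "analysis/de_DR.xml",  # placeholder path
    "word_list_path": "analysis/dictionary-de.txt",  # placeholder path
    "min_subword_size": 4,  # placeholder value
    # The two parameters named in the commit message:
    "no_sub_matches": True,
    "no_overlapping_matches": True,
}

INDEX_SETTINGS = {
    "analysis": {
        "filter": {"german_decompounder": DECOMPOUNDER_FILTER},
        "analyzer": {
            "german_compound": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "german_decompounder"],
            }
        },
    }
}
```

As far as the names suggest, `no_sub_matches` is meant to suppress subtokens inside a token that is itself in the word list, and `no_overlapping_matches` to suppress subtokens that overlap each other; the Elasticsearch reference for the filter is the authority on the exact behavior.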
@noeddl noeddl force-pushed the noeddl/improve-search branch 11 times, most recently from eaa85dc to f906a95 Compare November 7, 2025 13:45
@noeddl noeddl force-pushed the noeddl/improve-search branch from f906a95 to 42bf970 Compare November 7, 2025 13:49