Add tests for split_text_by_words() covering special characters and multilingual edge cases #2

@chigwell

Description

@chigwell

User Story
As a software developer,
I want comprehensive test coverage for split_text_by_words() in test_eknowledge.py
so that edge cases with special characters and non-ASCII text are handled reliably.

Background
The current implementation of split_text_by_words() (in eknowledge/main.py) splits text on whitespace, but it has no test coverage for punctuation-heavy or multilingual inputs. For example:

  • Emojis or hyphenated words (e.g., "state-of-the-art") may be split incorrectly.
  • Non-ASCII characters (e.g., "café", "日本語") could cause unexpected behavior.
  • Punctuation attached to words (e.g., "Hello!World") isn’t tested, risking silent failures.

This gap increases the risk of downstream errors in knowledge graph generation, as malformed chunks may produce invalid relationships.
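For concreteness, the whitespace-based splitting described above can be sketched as follows. This is a hypothetical stand-in, not the actual eknowledge/main.py code, and the `chunk_size` parameter is an assumption:

```python
# Hypothetical sketch of a whitespace-based splitter (an assumption, not the
# real eknowledge implementation). str.split() with no arguments splits on any
# run of whitespace and keeps punctuation and non-ASCII characters attached
# to their word.
def split_text_by_words(text, chunk_size=100):
    words = text.split()  # spaces, tabs, and newlines all act as separators
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]
```

Under this sketch, "state-of-the-art" and "Hello!World" each survive as a single word; the tests requested in this issue should pin that behavior down explicitly rather than leave it implied.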

Acceptance Criteria

  • Modify tests/test_eknowledge.py to add test cases for split_text_by_words():
    • Test strings with emojis (e.g., "🚀 rocket → moon 🌕").
    • Test non-ASCII characters (e.g., "München 99€, 東京").
    • Test punctuation-heavy text (e.g., "Hello!World?This,is;a:test").
    • Test mixed cases (e.g., "COVID-19 vs. SARS-CoV-2").
  • Validate that:
    • Chunks respect word boundaries (punctuation treated as part of words unless separated by whitespace).
    • Whitespace normalization handles tabs/newlines.
    • Tests fail if the current implementation splits "don’t" into ["don", "t"] or "café" into ["caf", "é"].
  • Run pytest to confirm new tests pass/fail as expected.
  • Ensure test coverage for split_text_by_words() increases by ≥15% (measured via coverage.py).
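The acceptance criteria above could translate into tests along these lines. The test names, the inline stand-in for split_text_by_words(), and the rejoin-based invariant are all assumptions for illustration; in tests/test_eknowledge.py the function would be imported from eknowledge.main instead:

```python
# Hedged sketch of the proposed additions to tests/test_eknowledge.py.
# The stand-in below mimics an assumed whitespace-based splitter; the real
# function should be imported from eknowledge.main.
def split_text_by_words(text, chunk_size=100):  # stand-in (assumption)
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

SPECIAL_CASES = [
    "🚀 rocket → moon 🌕",
    "München 99€, 東京",
    "Hello!World?This,is;a:test",
    "COVID-19 vs. SARS-CoV-2",
]

def test_chunks_preserve_words():
    # Rejoining the chunks must reproduce the whitespace-normalized input,
    # i.e. no word may be split apart or dropped.
    for text in SPECIAL_CASES:
        chunks = split_text_by_words(text, chunk_size=2)
        assert " ".join(chunks) == " ".join(text.split())

def test_unicode_words_stay_whole():
    # Guard against "don’t" → ["don", "t"] or "café" → ["caf", "é"],
    # and check that tabs/newlines are normalized as separators.
    chunks = split_text_by_words("don’t stop at the café\tnow")
    joined = " ".join(chunks)
    assert "don’t" in joined and "café" in joined
```

Plain `test_*` functions like these are discovered by pytest as-is; coverage.py (`coverage run -m pytest`) can then confirm the ≥15% coverage increase.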
