User Story
As a software developer,
I want comprehensive test coverage for split_text_by_words() in test_eknowledge.py
so that edge cases with special characters and non-ASCII text are handled reliably.
Background
The current implementation of split_text_by_words() (in eknowledge/main.py) splits text by whitespace but lacks validation for punctuation-heavy or multilingual inputs. For example:
- Emojis or hyphenated words (e.g., "state-of-the-art") may be split incorrectly.
- Non-ASCII characters (e.g., "café", "日本語") could cause unexpected behavior.
- Punctuation attached to words (e.g., "Hello!World") isn’t tested, risking silent failures.
This gap increases the risk of downstream errors in knowledge graph generation, as malformed chunks may produce invalid relationships.
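To make the edge cases concrete, here is a minimal sketch of what a whitespace-based splitter like the one described above might look like. The function body and the max_words parameter are assumptions for illustration; the real signature lives in eknowledge/main.py and may differ.

```python
# Hypothetical sketch of a whitespace-based splitter; the real
# implementation in eknowledge/main.py may differ.
def split_text_by_words(text, max_words=100):
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()  # splits on any whitespace; punctuation stays attached
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

# Hyphenated and non-ASCII tokens survive as single words:
print(split_text_by_words("state-of-the-art café 日本語", max_words=2))
# → ['state-of-the-art café', '日本語']

# But punctuation without surrounding whitespace is never separated,
# so "Hello!World" stays one token:
print(split_text_by_words("Hello!World", max_words=1))
# → ['Hello!World']
```

Under this sketch the risky behaviors are visible: str.split() treats "Hello!World" as a single word, which is exactly the kind of silent outcome the new tests should pin down.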
Acceptance Criteria
- Modify tests/test_eknowledge.py to add test cases for split_text_by_words() covering hyphenated words, attached punctuation, and non-ASCII input.
- Run pytest to confirm the new tests pass and fail as expected.
- Line coverage of split_text_by_words() increases by ≥15% (measured via coverage.py).
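The criteria above could be satisfied with a parametrized pytest case along these lines. The import fallback and the max_words argument are assumptions so the sketch is self-contained; the real tests would import directly from eknowledge.main.

```python
import pytest

try:
    from eknowledge.main import split_text_by_words  # real implementation
except ImportError:
    # Self-contained fallback sketch for illustration only.
    def split_text_by_words(text, max_words=100):
        words = text.split()
        return [" ".join(words[i:i + max_words])
                for i in range(0, len(words), max_words)]

@pytest.mark.parametrize("text, expected_chunks", [
    ("state-of-the-art design", 2),  # hyphenated word stays one token
    ("café 日本語", 2),               # non-ASCII input survives intact
    ("Hello!World", 1),              # attached punctuation is not split
    ("", 0),                         # empty input yields no chunks
])
def test_split_text_by_words_edge_cases(text, expected_chunks):
    chunks = split_text_by_words(text, max_words=1)
    assert len(chunks) == expected_chunks
    # No words are lost or mangled by chunking:
    assert " ".join(chunks) == " ".join(text.split())
```

The coverage criterion could then be checked with the pytest-cov plugin, e.g. pytest tests/test_eknowledge.py --cov=eknowledge --cov-report=term-missing, comparing the reported percentage before and after the new tests.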