Add tests for split_text_by_words() covering special characters and multilingual edge cases #2

@chigwell

Description

@chigwell

User Story
As a software developer,
I want comprehensive test coverage for split_text_by_words() in test_eknowledge.py
so that edge cases with special characters and non-ASCII text are handled reliably.

Background
The current implementation of split_text_by_words() (in eknowledge/main.py) splits text on whitespace, but it has no test coverage for punctuation-heavy or multilingual inputs. For example:

  • Emojis or hyphenated words (e.g., "state-of-the-art") may be split incorrectly.
  • Non-ASCII characters (e.g., "café", "日本語") could cause unexpected behavior.
  • Punctuation attached to words (e.g., "Hello!World") isn’t tested, risking silent failures.

This gap increases the risk of downstream errors in knowledge graph generation, as malformed chunks may produce invalid relationships.
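For concreteness, the whitespace-based splitting described above can be sketched as follows. This is a hypothetical stand-in, not the actual eknowledge/main.py code, and the `chunk_size` parameter is an assumption:

```python
# Hypothetical sketch of a whitespace-based splitter (an assumption, not the
# real eknowledge implementation). str.split() with no arguments splits on any
# run of whitespace and keeps punctuation and non-ASCII characters attached
# to their word.
def split_text_by_words(text, chunk_size=100):
    words = text.split()  # spaces, tabs, and newlines all act as separators
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]
```

Under this sketch, "state-of-the-art" and "Hello!World" each survive as a single word; the tests requested in this issue should pin that behavior down explicitly rather than leave it implied.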

Acceptance Criteria

  • Modify tests/test_eknowledge.py to add test cases for split_text_by_words():
    • Test strings with emojis (e.g., "🚀 rocket → moon 🌕").
    • Test non-ASCII characters (e.g., "München 99€, 東京").
    • Test punctuation-heavy text (e.g., "Hello!World?This,is;a:test").
    • Test mixed cases (e.g., "COVID-19 vs. SARS-CoV-2").
  • Validate that:
    • Chunks respect word boundaries (punctuation treated as part of words unless separated by whitespace).
    • Whitespace normalization handles tabs/newlines.
    • Tests fail if the current implementation splits "don’t" into ["don", "t"] or "café" into ["caf", "é"].
  • Run pytest to confirm new tests pass/fail as expected.
  • Ensure test coverage for split_text_by_words() increases by ≥15% (measured via coverage.py).
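The acceptance criteria above could translate into tests along these lines. The test names, the inline stand-in for split_text_by_words(), and the rejoin-based invariant are all assumptions for illustration; in tests/test_eknowledge.py the function would be imported from eknowledge.main instead:

```python
# Hedged sketch of the proposed additions to tests/test_eknowledge.py.
# The stand-in below mimics an assumed whitespace-based splitter; the real
# function should be imported from eknowledge.main.
def split_text_by_words(text, chunk_size=100):  # stand-in (assumption)
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

SPECIAL_CASES = [
    "🚀 rocket → moon 🌕",
    "München 99€, 東京",
    "Hello!World?This,is;a:test",
    "COVID-19 vs. SARS-CoV-2",
]

def test_chunks_preserve_words():
    # Rejoining the chunks must reproduce the whitespace-normalized input,
    # i.e. no word may be split apart or dropped.
    for text in SPECIAL_CASES:
        chunks = split_text_by_words(text, chunk_size=2)
        assert " ".join(chunks) == " ".join(text.split())

def test_unicode_words_stay_whole():
    # Guard against "don’t" → ["don", "t"] or "café" → ["caf", "é"],
    # and check that tabs/newlines are normalized as separators.
    chunks = split_text_by_words("don’t stop at the café\tnow")
    joined = " ".join(chunks)
    assert "don’t" in joined and "café" in joined
```

Plain `test_*` functions like these are discovered by pytest as-is; coverage.py (`coverage run -m pytest`) can then confirm the ≥15% coverage increase.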
