Fix #2982: Improve _HTMLWordTruncator #3002
I was thinking about this recently. I don't like the maintainability of continually adding alphabet blocks. It is probably not true in general, but can we assume that anything outside DBC is Latin-like (words separated by spaces)? That should simplify the regex while keeping it reasonably general. Something like this, special-casing DBC but reverting to the original implementation as a fallback:

import re

_word_regex = re.compile(r"{DBC}|(\w[\w'-]*)".format(
    # DBC means CJK-like characters. A single character can stand for a word.
    DBC=("([\u4E00-\u9FFF])|"  # CJK Unified Ideographs
         "([\u3400-\u4DBF])|"  # CJK Unified Ideographs Extension A
         "([\uF900-\uFAFF])|"  # CJK Compatibility Ideographs
         "([\U00020000-\U0002A6DF])|"  # CJK Unified Ideographs Extension B
         "([\U0002F800-\U0002FA1F])|"  # CJK Compatibility Ideographs Supplement
         "([\u3040-\u30FF])|"  # Hiragana and Katakana
         "([\u1100-\u11FF])|"  # Hangul Jamo
         "([\uAC00-\uD7FF])|"  # Hangul Syllables and Hangul Jamo Extended-B
         "([\u3130-\u318F])"  # Hangul Compatibility Jamo
         )), re.UNICODE)

If I remember correctly, this produced correct results, but I didn't test it extensively, mainly because I don't speak Vietnamese and wasn't sure my tests were proper :).
105e02d to 747fec5
@avaris
Many thanks to @manhhomienbienthuy for the enhancement and to @avaris for reviewing. 🏅
Pull Request Checklist
Resolves: #2982