Description
I noticed some inconsistencies in how whitespace and punctuation are treated, and they make it hard to correlate the output precisely with the original sentence. For example, a Japanese comma (、) is converted to a standard comma followed by a space (, ), and the combination 。」 is converted to . " (period, space, quote). I'm wondering if there is a reason why punctuation is converted and why spaces are added, and also whether there is a way to preserve this information so that I could correlate each token index exactly with the original sentence.
My use case is a fairly common one: generating the furigana for a sentence. For that, I need to know the precise index of each token in the original sentence. Another thing that would help here is to include index locations for everything in the output.
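To make the request concrete, here is a minimal sketch of the kind of alignment I am currently forced to do on my side. It does not use this library's API at all: it assumes the tokens are available as a plain list of strings, and that the only differences from the original are converted punctuation and inserted spaces. The names `PUNCT_MAP`, `canon`, and `align` are hypothetical, and the mapping table only covers the conversions mentioned above plus a few guesses.

```python
import unicodedata

# Assumed conversions (Japanese punctuation -> ASCII). Only the comma,
# period, and closing quote come from the behavior described above;
# the other entries are guesses and may need adjusting.
PUNCT_MAP = {"、": ",", "。": ".", "「": '"', "」": '"', "『": '"', "』": '"'}


def canon(ch):
    """Return a comparison form of one character: mapped, then NFKC-normalized."""
    return unicodedata.normalize("NFKC", PUNCT_MAP.get(ch, ch))


def align(original, tokens):
    """Map each token to a (token, start, end) span in `original` (end exclusive).

    Whitespace is ignored on both sides, so spaces inserted by the
    tokenizer do not shift the offsets.
    """
    spans = []
    pos = 0
    for tok in tokens:
        want = "".join(canon(c) for c in tok if not c.isspace())
        if not want:
            continue  # token consisted only of whitespace
        # Skip whitespace in the original before the token starts.
        while pos < len(original) and original[pos].isspace():
            pos += 1
        start = pos
        got = ""
        # Consume original characters until enough content has been matched.
        while pos < len(original) and len(got) < len(want):
            if not original[pos].isspace():
                got += canon(original[pos])
            pos += 1
        if got != want:
            raise ValueError(f"cannot align token {tok!r} at offset {start}")
        spans.append((tok, start, pos))
    return spans


original = "はい、そうです。」"
tokens = ["はい", ", ", "そう", "です", '. "']  # as emitted after conversion
for tok, start, end in align(original, tokens):
    print(repr(tok), start, end, repr(original[start:end]))
```

This works for simple cases, but it is guesswork: it breaks as soon as a conversion is not in the table or a token's text differs from the original in some other way. Offsets reported directly in the output would make it unnecessary.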