Whitespace/punctuation inconsistency

I noticed some inconsistencies in how whitespace and punctuation are treated, and they make it hard to correlate the output precisely with the original sentence. For example, a Japanese comma `、` is converted to a standard comma plus a space `, `, and the combination `。」` is converted to `. " ` (period, space, quote). I'm wondering if there is a reason why punctuation is converted and why spaces are added, and also whether there is a way to preserve this information so that I can map each token index exactly onto the original sentence.

My use case is pretty common: generating the furigana for a sentence. However, I need to know the precise index of each token in the original sentence. Another thing that might help here is to include index locations (offsets into the original sentence) for everything in the output.
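To make the request concrete, here is a minimal sketch of the kind of workaround I would otherwise have to write myself. It assumes the output is a flat list of token strings and that the only normalizations are the ones reported above; the `NORMALIZED` table, the `align_tokens` helper, and the sample tokens are all hypothetical and not part of this library.

```python
from typing import List, Optional, Tuple

# Assumed mapping, based only on the conversions reported in this issue;
# the real tokenizer may normalize more characters than these.
NORMALIZED = {"、": ",", "。": ".", "「": '"', "」": '"'}


def align_tokens(original: str, tokens: List[str]) -> List[Optional[Tuple[int, int]]]:
    """Return a (start, end) span in `original` for each token, or None if unmatched."""
    spans: List[Optional[Tuple[int, int]]] = []
    cursor = 0  # position in `original` just past the last matched token
    for token in tokens:
        # Ignore whitespace in the token, since it may have been added.
        chars = [c for c in token if not c.isspace()]
        if not chars:
            spans.append(None)
            continue
        pos, start, matched = cursor, None, True
        for c in chars:
            # Skip whitespace in the original, since it may have been dropped.
            while pos < len(original) and original[pos].isspace():
                pos += 1
            if pos >= len(original) or NORMALIZED.get(original[pos], original[pos]) != c:
                matched = False
                break
            if start is None:
                start = pos
            pos += 1
        if matched:
            spans.append((start, pos))
            cursor = pos
        else:
            spans.append(None)  # leave the cursor where it was and move on
    return spans


original = "「はい、そうです。」"
tokens = ['"', "はい", ", ", "そう", "です", ". ", '" ']  # hypothetical output
spans = align_tokens(original, tokens)
```

This only works as long as the normalization table is complete and tokens come back in order, which is why having the library report offsets directly would be more reliable.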

Whitespace/punctuation inconsistency #19
