Skip to content

Whitespace/punctuation inconsistency #19

@tslater

Description

@tslater

I noticed that there are some inconsistencies with how whitespace and punctuation are treated, and it causes some precision issues when trying to correlate with the original sentence. For example, a Japanese comma: is converted to a standard comma + space , or this combination: 。」 is converted to: . " (period space quote). I'm wondering fi there is a reason why punctuation is converted, and why spaces are added...and also if there is a way to preserve the information so that I could correlate perfectly each token index with the original sentence.

My use case is pretty common, generating the furigana for a sentence, but I want to know precisely the index in the original sentence. Another thing that might help this case is to include index locations for everything.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions