encoding/csv: non-UTF8 sequences are (unnecessarily) mangled #19410

Closed
@aktau

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

Go 1.8

What operating system and processor architecture are you using (go env)?

Linux and OSX (not that it matters).

What did you do?

What did you expect to see?

What did you see instead?

I'll paste my post from #16791 instead of answering the above questions separately:

The problem is that in some situations I encounter, the columns are just byte strings, and they may contain byte sequences that are not valid UTF-8. Go, when reading (and probably also writing) them, mangles these:

Input: <some-unprintable-byte-sequence-that-doesn't-contain-quotes-nor-commas>,aktau
As hex: 0941 b41c 2c61 6b74 6175 0a

Note that the 3rd byte of the first column, 0xb4, lies above the single-byte pass-through range of UTF-8 (it is >= 0x80, whereas pass-through bytes are 0x00 to 0x7F), as detailed here: https://research.swtch.com/utf8. When passed through Go's CSV reader, the following is returned:

Input (hex):      0941 b41c
Go mangled (hex): 0941 efbf bd1c

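I didn't understand what was happening at first, until I looked at Go's CSV reading/writing source and saw that it used the *Rune functions, even though none of the special characters in (traditional) CSV are multi-byte. The bytes ef bf bd in the mangled output are the UTF-8 encoding of U+FFFD, the Unicode replacement character.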

I expect it not to mangle my bytes if they don't contain special characters, namely the delimiter (usually a comma), a newline (or carriage return), and double quotes. In my specific case, those characters are not present in the field that gets altered.
