Description
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (go version)?
Go 1.8
What operating system and processor architecture are you using (go env)?
Linux and OSX (not that it matters).
What did you do?
What did you expect to see?
What did you see instead?
I'll paste my post from #16791 instead of answering the above questions separately:
The problem is that in some situations I encounter, the columns are just byte strings, and they may contain byte sequences that are not valid UTF-8. Go, when reading (and probably also writing) them, mangles these:
Input: <some-unprintable-byte-sequence-that-doesn't-contain-quotes-nor-commas>,aktau
As hex: 0941 b41c 2c61 6b74 6175 0a
Note that the 3rd byte, 0xb4, in the first column lies outside the pass-through (single-byte ASCII) range of UTF-8 (it is >= 0x80), as detailed here: https://research.swtch.com/utf8. When passed through Go's CSV reader, the following is returned:
Input (hex): 0941 b41c
Go mangled (hex): 0941 efbf bd1c
I didn't understand what was happening at first, until I looked at Go's CSV reading/writing source and saw that it used the *Rune functions, even though none of the special characters in (traditional) CSV are multi-byte.
I expect it to not mangle my bytes if they don't contain special characters, namely the delimiter (usually a comma), a newline (or carriage return), and double quotes. In my specific case, none of those characters are present in the field that gets altered.