encoding/csv: non-UTF8 sequences are (unnecessarily) mangled #19410
I looked into this, but I'm not sure how the csv package should behave here by default. Currently the input is always read as UTF-8, but (as noted) the byte sequences mentioned in this issue are not valid UTF-8. A simple solution would be to try reading everything as UTF-8 runes and fall back to reading raw bytes only for invalid runes (utf8.RuneError). Doing this by default could break existing uses of the csv package that depend on the current behaviour. Unless the old behaviour is considered a bug, this behaviour could instead be opt-in via a new field on Reader. Alternatively, if a new field is added to Reader, instead of just falling back to reading bytes, the new field could change the Reader to always read bytes and only decode runes when needed (basically to check for r.Comma or r.Comment if either of those is set to a multi-byte rune; if not, the rune would be unread).
Thanks for looking into it @nussjustin. A couple of things:
As for the proposed solutions:
I don't see how this is better than not attempting to interpret the bytestream as UTF-8 at all. This proposal would do effectively that: if a sequence of bytes is valid UTF-8, Go won't do anything to it (it doesn't now either); if a sequence of bytes is not valid UTF-8, it falls back to byte reading (so it won't do anything to them either). This is the same as not doing any UTF-8 processing. When avoiding UTF-8 […]

Pinging @pborman, the original author of the package; it was introduced with UTF-8 validation included in commit 00f7cd4.
r.Comment doesn't need much special handling, as it's only checked once per line, at the beginning. But reading everything byte-by-byte would also need to correctly handle spaces when TrimLeadingSpace == true; currently parseField uses unicode.IsSpace to check whether a rune is a space. I have played with refactoring the Reader logic to work with bytes in most cases and only use runes when either r.Comma is a multi-byte rune, or TrimLeadingSpace is true and the runes are at the beginning of a field. It seems to work, but I'm still not sure how this issue should be handled. My first suggestion was meant to be a simple, low-overhead change that wouldn't affect most of the code and would only need to touch readRune, but rereading the code I'm not sure anymore that it would fix the problem completely/correctly.
RFC 4180 declares that TEXTDATA is:

TEXTDATA = %x20-21 / %x23-2B / %x2D-7E

That is, it is only ASCII. There is no expectation in the RFC that non-ASCII will work, so the current implementation satisfies RFC 4180 and, in addition, handles UTF-8 strings as well. I don't think there should be any expectation that a random string of bytes that happens not to contain 0x12, 0x15, 0x20, 0x22, or 0x2c bytes […]

What is the actual use case that caused confusion?

As for performance: yes, I am sure this can be improved. Performance was not the main consideration when it was written. Handling of UTF-8, IIRC, was requested (required?) for the csv package to make it into the standard library, as was pointing to an RFC (there are many CSV formats).
Why, you're absolutely right (technically correct, the best kind of correct).
This is where I see it differently:
Said another way: what's the reason for turning a non-UTF-8 byte sequence into […]?

The following can also be found in the spec:
@pborman, I forgot to answer your other question:
Please see the original issue report I made. There's not really confusion, just that it was unexpected (to me) that […]
CL 72150 switches to a line-by-line parsing style, which is agnostic to the exact UTF-8 encoding of lines. I haven't tested it, but it should fix this.
Change https://golang.org/cl/72150 mentions this issue.
I noted the issue is closed, and what I am about to say is completely unrelated to CSV handling. I am trying to compile the Go 1.4 branch using the Microsoft compiler, and spotted lines in the function […] The lines that attracted my attention read:

Sure, it would be technically correct to say 1.4 is unsupported (I know :)), but what really made me interested is passing a pointer to Rune to a Windows API call. Maybe the code even encodes Rune in a way Windows is aware of. Hope you will smile with me. Hope, on the scale of netizens, I earn no worse than "troll the terrible" with this passing note :)
I assume you are talking about the cmd/dist/windows.c file. The cmd/dist command was used to bootstrap building of all other commands and libraries, so it was very limited in what it did, and no other code depended on it. Russ wrote the code, and he called the uint16 characters that Windows uses to encode UTF-16:

```c
/*
 * Windows uses 16-bit rune strings in the APIs.
 ...
 */
...
typedef unsigned short Rune;
```

I don't see any problem with calling Windows UTF-16 uint16s Rune in this small program (it is just one file). What else would you call them?

I will smile with you if you tell me what is funny here.

Alex
> I will smile with you if you tell me what is funny here. Alex

The smile is not about some sharp words that bring "trouble" or are "funny". Just a smile. For what it's worth, to declare an entire type only to later say "I don't see any problem" deserves a reminiscence of something from childhood that will make you smile. Childhood is probably the time when people try to give everything a new and beautiful name, to make the world look new and beautiful. If some of this "hard work" does not make you smile later, it means you forgot how it was :)
@MironAtHome you lost me with your explanations. But I will smile with you anyway. No harm in smiling. :-)

Alex
Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

Go 1.8

What operating system and processor architecture are you using (go env)?

Linux and OSX (not that it matters).

What did you do? What did you expect to see? What did you see instead?
I'll paste my post from #16791 instead of answering the above questions separately:
The problem is that in some situations I encounter, the columns are just byte strings, and they may contain byte sequences that are not valid UTF-8. Go, when reading (and probably also writing) them, mangles these:
Note that the 3rd byte, 0xb4, in the first column is above the pass-through bytes of UTF-8 (>= 0x7F), as detailed here: https://research.swtch.com/utf8. When passed through Go's CSV reader, the following is returned: […]

I didn't understand what was happening at first, until I looked at Go's CSV reading/writing source and saw that it used the *Rune functions, even though none of the special characters in (traditional) CSV are multi-byte.

I expect it to not mangle my bytes if they don't contain special characters, namely the delimiter (usually a comma), a newline (or carriage return), and double quotes. In my specific case, those characters are not present in the field that gets altered.