encoding/csv: non-UTF8 sequences are (unnecessarily) mangled #19410

Closed
aktau opened this issue Mar 5, 2017 · 12 comments
Labels
FrozenDueToAge, help wanted, NeedsFix (the path to resolution is known, but the work has not been done)

aktau commented Mar 5, 2017

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

Go 1.8

What operating system and processor architecture are you using (go env)?

Linux and OSX (not that it matters).

What did you do?

What did you expect to see?

What did you see instead?

I'll paste my post from #16791 instead of answering the above questions separately:

The problem is that in some situations I encounter, the columns are just byte strings, and they may contain byte sequences that are not valid UTF-8. Go, when reading (and probably also writing) them, mangles these:

Input: <some-unprintable-byte-sequence-that-doesn't-contain-quotes-nor-commas>,aktau
As hex: 0941 b41c 2c61 6b74 6175 0a

Note that the third byte, 0xb4, in the first column is outside the single-byte (ASCII, < 0x80) pass-through range of UTF-8, as detailed at https://research.swtch.com/utf8. When passed through Go's CSV reader, the following is returned:

Input (hex):      0941 b41c
Go mangled (hex): 0941 efbf bd1c

I didn't understand what was happening at first, until I looked at Go's CSV reading/writing source and saw that it used the *Rune functions, even though none of the special characters in (traditional) CSV are multi-byte.

I expect it not to mangle my bytes if they don't contain special characters, namely the delimiter (usually a comma), newline, carriage return, and double quotes. In my specific case, those characters are not present in the field that gets altered.

@bradfitz bradfitz added this to the Go1.9Maybe milestone Mar 21, 2017
@bradfitz bradfitz added help wanted NeedsFix The path to resolution is known, but the work has not been done. labels Mar 21, 2017
@nussjustin (Contributor)

I looked into this, but I'm not sure how the csv package should behave here by default. Currently the input is always read as UTF-8, but (as noted) the byte sequences mentioned in this issue are not valid UTF-8.

A simple solution would be to try reading everything as UTF-8 runes and only fall back to reading raw bytes for invalid runes (utf8.RuneError). Doing this by default could break existing uses of the csv package that depend on the current behaviour. Unless the old behaviour is considered a bug, the new behaviour could instead be opt-in via a new field on Reader.

Alternatively, if a new field is added to Reader, it could, instead of just falling back to bytes, change the Reader to always read bytes and only read runes when needed (basically, to check for r.Comma or r.Comment when either is set to a multi-byte rune; if there is no match, the rune would be unread).


aktau commented Apr 27, 2017

Thanks for looking into it @nussjustin. A couple of things:

  • Nowhere do the encoding/csv package docs say that the bytes read/written are (or need to be) valid UTF-8: https://godoc.org/encoding/csv. So this behaviour surprised me. I'm not sure how the Go team historically deals with these things: change the implementation, or document the package more precisely to state what it actually does?
  • Nowhere in RFC 4180, which the package claims to implement, is it said that the input stream needs to be valid UTF-8.

As for the proposed solutions:

A simple solution would be to try reading everything as UTF-8 runes and only falling back to reading raw bytes for invalid runes (utf8.RuneError).

I don't see how this is better than not attempting to interpret the byte stream as UTF-8 at all. This proposal would do effectively that: if a sequence of bytes is valid UTF-8, Go won't do anything to it (it doesn't now either); if it is not valid UTF-8, the reader falls back to bytes (so it won't do anything to them either). This is the same as not doing any UTF-8 processing at all.

When avoiding UTF-8 (rune) functions, the only trickier case in the code would be if r.Comma or r.Comment are multibyte, but that can be solved by doing a byte-by-byte match for the cases that need it (it makes memory efficient slice reading a bit more cumbersome, but hey). You already indicate as much in the last comment.
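A hedged sketch of such a byte-by-byte match for a hypothetical multi-byte delimiter (this is not the csv package's code; note that a naive byte scan can still false-match if arbitrary non-UTF-8 data happens to contain the delimiter's byte sequence):

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// splitFields splits a record on an arbitrary (possibly multi-byte) rune
// delimiter without decoding the data as UTF-8: the delimiter is encoded
// once and matched byte-for-byte, so non-UTF-8 field bytes pass through
// untouched. Illustrative only.
func splitFields(line []byte, comma rune) [][]byte {
	var sep [utf8.UTFMax]byte
	n := utf8.EncodeRune(sep[:], comma)

	var fields [][]byte
	start := 0
	for i := 0; i <= len(line)-n; {
		if bytes.HasPrefix(line[i:], sep[:n]) {
			fields = append(fields, line[start:i])
			i += n
			start = i
		} else {
			i++
		}
	}
	return append(fields, line[start:])
}

func main() {
	// '¦' (U+00A6) encodes as the two bytes C2 A6; the first field
	// contains the invalid byte 0xb4 from the report.
	line := []byte("\x09\x41\xb4\x1c\u00a6aktau")
	for _, f := range splitFields(line, '¦') {
		fmt.Printf("% x\n", f)
	}
}
```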

Pinging @pborman, the original author of the package, it was introduced with UTF-8 validation included in commit 00f7cd4.


nussjustin commented Apr 28, 2017

r.Comment doesn't need much special handling, as it's only called once per line at the beginning.

But reading everything byte-by-byte would also need to correctly handle spaces when TrimLeadingSpace == true. Currently parseField uses unicode.IsSpace to check whether a rune is a space.
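One way that could look (illustrative only): decode runes just at the start of the field, stop at the first non-space or invalid byte, and never interpret the rest of the field as UTF-8.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// trimLeadingSpace drops leading Unicode spaces from a field. Only the
// leading bytes are decoded as runes; decoding stops at the first
// non-space (or invalid) rune, so the remainder of the field is never
// interpreted as UTF-8.
func trimLeadingSpace(field []byte) []byte {
	for len(field) > 0 {
		r, size := utf8.DecodeRune(field)
		if r == utf8.RuneError && size == 1 {
			break // invalid byte: not a space, leave it alone
		}
		if !unicode.IsSpace(r) {
			break
		}
		field = field[size:]
	}
	return field
}

func main() {
	// U+00A0 (no-break space) is a two-byte rune that unicode.IsSpace
	// treats as a space; the invalid byte 0xb4 must survive untouched.
	fmt.Printf("% x\n", trimLeadingSpace([]byte("\t \u00a0\xb4abc")))
}
```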

I have played with refactoring the Reader logic to work with bytes in most cases and only use runes when either r.Comma is a multi-byte rune, or TrimLeadingSpace is true and the runes are at the beginning of a field.

It seems to work, but I'm still not sure how this issue should be handled.

My first suggestion was meant as a simple, low-overhead change that wouldn't affect most of the code and would only need to touch readRune. But rereading the code, I'm no longer sure it would fix the problem completely/correctly.


pborman commented Apr 28, 2017

RFC 4180 declares that TEXTDATA is:

TEXTDATA = %x20-21 / %x23-2B / %x2D-7E

That is, it is only ASCII. There is no expectation in the RFC that non-ASCII will work, so the current implementation satisfies RFC 4180 and, in addition, handles UTF-8 strings as well. I don't think there should be any expectation that a random string of bytes that happens not to contain the 0x0a, 0x0d, 0x20, 0x22, or 0x2c bytes (\n, \r, space, ", and ,) should be passed through more or less unaltered.

What is the actual use case that caused confusion?

As for performance: yes, I am sure it can be improved. Performance was not the main consideration when the package was written. Handling of UTF-8, IIRC, was requested (required?) for the csv package to make it into the standard library, as was pointing to an RFC (there are many CSV formats).

@ALTree ALTree changed the title encoding/csv: non-UTF8 sequences are (unnecesarily) mangled encoding/csv: non-UTF8 sequences are (unnecessarily) mangled Apr 28, 2017

aktau commented Apr 28, 2017

TEXTDATA = %x20-21 / %x23-2B / %x2D-7E

That is, it is only ASCII. There is no expectation in the RFC that non-ASCII will work, so the current implementation satisfies RFC 4180

Why, you're absolutely right (technically correct, the best kind of correct).

, and in addition, handles UTF-8 strings as well. I don't think there should be any expectation that a random string of bytes that happens not to contain the 0x0a, 0x0d, 0x20, 0x22, or 0x2c bytes (\n, \r, space, ", and ,) should be passed through more or less unaltered.

This is where I see it differently:

  1. The spec says it has to be ASCII, so the handling of UTF-8 would seem to be out-of-spec (not entirely true; see the end of this post).
  2. The spec does not say what has to be done in non-ASCII cases. It could be left alone, or mangled (as is currently the case).
  3. (unrelated) The spec also does not allow configuring the separator, but the code allows this.

Said another way: what's the reason for turning a non-UTF8 byte sequence into utf8.RuneError? What use could the user have for this? If there was a positive effect somewhere, this would be easier to swallow.

The following can also be found in the spec:

Optional parameters: charset, header

 Common usage of CSV is US-ASCII, but other character sets defined
 by IANA for the "text" tree may be used in conjunction with the
 "charset" parameter.

 The "header" parameter indicates the presence or absence of the
 header line.  Valid values are "present" or "absent".
 Implementors choosing not to use this parameter must make their
 own decisions as to whether the header line is present or absent.


aktau commented Apr 28, 2017

@pborman, I forgot to answer your other question:

What is the actual use case that caused confusion?

Please see the original issue report I made. There's not really confusion, just that it was unexpected (to me) that csv.Reader would turn invalid UTF-8 sequences into utf8.RuneError.

@bradfitz bradfitz modified the milestones: Go1.9Maybe, Go1.10 Jul 20, 2017

dsnet commented Oct 20, 2017

CL 72150 switches to use a line-by-line parsing style, which is agnostic to the exact UTF-8 encoding of lines. I haven't tested it, but it should fix this.

@dsnet dsnet self-assigned this Oct 20, 2017
@gopherbot

Change https://golang.org/cl/72150 mentions this issue: encoding/csv: simplify and optimize Reader

@MironAtHome

I noted the issue is closed, and really, what I am about to say is completely unrelated to CSV handling.

I am trying to compile the go 1.4 branch using the Microsoft compiler, and I spotted some lines in the function

bool
isdir(char *p)

in windows.c. The lines that attracted my attention read:

DWORD attr;
Rune *r;

torune(&r, p);
attr = GetFileAttributesW(r);

Sure, it would be technically correct to say 1.4 is unsupported (I know :)), but what really made me interested is passing a pointer to Rune to a Windows API call. Might it be that the code even encodes Rune in a way Windows is aware of? Hope you will smile with me. Hope, on the scale of netizens, I earn no worse than "troll the terrible" with this passing note :)

@alexbrainman (Member)

but what really made me interested is passing pointer to Rune to Windows API call. Might it be the code even encodes Rune in a way Windows is aware of.

I assume you are talking about the cmd/dist/windows.c file. The cmd/dist command was used to bootstrap the building of all other commands and libraries, so it was very limited in what it did, and no other code depended on it. Russ wrote the code, and he called the uint16 characters that Windows uses to encode UTF-16 "Rune":

/*
 * Windows uses 16-bit rune strings in the APIs.
 ...
 */
...
typedef unsigned short Rune;

I don't see any problem with calling Windows' UTF-16 uint16s "Rune" in this small program (it is just one file). What else would you call them?

Hope you will smile with me.

I will smile with you if you tell me what is funny here.

Alex

@MironAtHome

I will smile with you if you tell me what is funny here.

Alex


The smile is not about sharp words that bring "trouble" or "funny". Just a smile. For what it's worth, declaring an entire type only to later say "I don't see any problem" deserves a reminiscence of something from childhood that will make you smile. Childhood is probably the time when people try to give everything a new and beautiful name, to make the world look new and beautiful. If some of this "hard work" does not make you smile later, it means you forgot how it was :)

@alexbrainman (Member)

@MironAtHome you lost me with your explanations. But I will smile with you anyways. No harm in smiling. :-)

Alex

@golang golang locked and limited conversation to collaborators Mar 18, 2019
@rsc rsc unassigned dsnet Jun 23, 2022