encoding/csv: unhelpful error location with badly quoted fields #19019

josharian · 2017-02-09T23:51:25Z

I'm working with hundreds of 100,000+ line tsv files. They contain quotes in inappropriate places, so I am using LazyQuotes. However, somewhere in the guts of a few of the files, there are quotes at the beginning of a field, which breaks csv.Reader. That is #6287. When that happens, and LazyQuotes is on, the reported error location is at the very end of the file. See https://play.golang.org/p/y0jOObyBTV for an illustration of this.

However, if I turn LazyQuotes off, I get an error location on some other quotes that do no harm (with LazyQuotes on). I am left with an unreadable file that I cannot even fix manually, since there are some errors somewhere in hundreds of thousands of lines, but encoding/csv does not tell me where.

The error reported when LazyQuotes is on should probably at the beginning of the field in question if EOF is reached.

Also related: #3150.

I have some workarounds available. Just adding this to the list of encoding/csv quote issues. :)

nussjustin · 2017-04-28T19:59:01Z

For ErrFieldCount (which seems to be the problem child here) the column is always reported as 0 to make it clear that the whole record is the problem. I think the bigger problem here is the reported line. In the code from the playground link above, the Reader reads until EOF which is in line 5, but the record starts in line 3 and the other lines shouldn't even be part of the line. This could be solved by tracking the line a record started on and using this for errors instead or adding it as additional field to ParseError.

nussjustin · 2017-07-31T17:59:11Z

/cc @bradfitz @pborman

gopherbot · 2017-08-02T17:56:24Z

Change https://golang.org/cl/52830 mentions this issue: encoding/csv: report line start line in errors

nussjustin · 2017-08-02T17:59:10Z

I send a change which improves the issue described here by reporting the line where a record started, as proposed in my first comment. With this the Reader still reports column 0 for ErrFieldCount, but it should be much easier to find the problems in the files.

Errors returned by Reader contain the line where the Reader originally encountered the error. This can be suboptimal since that line does not always correspond with the line the current record/field started at. This can easily happen with LazyQuotes as seen in #19019, but also happens for example when a quoted fields has no closing quote and the parser hits EOF before it finds another quote. When this happens finding the erroneous field can be somewhat complicated and time consuming, and in most cases it would be better to report the line where the record started. This change updates Reader to keep track of the line on which a record begins and uses it for errors instead of the current line, making it easier to find errors. Although a user-visible change, this should have no impact on existing code, since most users don't explicitly work with the line in the error and probably already expect the new behaviour. Updates #19019 Change-Id: Ic9bc70fad2651c69435d614d537e7a9266819b05 Reviewed-on: https://go-review.googlesource.com/52830 Reviewed-by: Ian Lance Taylor <[email protected]> Run-TryBot: Ian Lance Taylor <[email protected]> TryBot-Result: Gobot Gobot <[email protected]>

nussjustin · 2017-08-14T06:13:47Z

5d14ac7 was just merged, so with tip the code from the play ground example now reports

line 3, column 4: extraneous " in field
line 3, column 0: wrong number of fields in line

This seems much better to me, since it now reports the correct line. The column is still 0, but this is an explicit design choice and I think that in this case (ErrFieldCount) there isn't much value to the exact column when we already have the line.

josharian · 2017-08-14T14:57:48Z

@nussjustin seems good enough to me. Thanks!

gopherbot · 2017-10-20T08:23:16Z

Change https://golang.org/cl/72050 mentions this issue: encoding/csv: deprecate ParseError.Column

gopherbot · 2017-10-20T19:32:50Z

Change https://golang.org/cl/72150 mentions this issue: encoding/csv: simplify and optimize Reader

The Reader implementation is slow because it operates on a rune-by-rune basis via bufio.Reader.ReadRune. We speed this up by operating on entire lines that we read from bufio.Reader.ReadSlice. In order to ensure that we read the full line, we augment ReadSlice in our Reader.readLine method to automatically expand the slice if bufio.ErrBufferFull is every hit. This change happens to fix #19410 because it no longer relies on rune-by-rune parsing and only searches for the relevant delimiter rune. In order to keep column accounting simple and consistent, this change reverts parts of CL 52830. This CL is an alternative to CL 36270 and builds on some of the ideas from that change by Diogo Pinela. name old time/op new time/op delta Read-8 3.12µs ± 1% 2.54µs ± 2% -18.76% (p=0.000 n=10+9) ReadWithFieldsPerRecord-8 3.12µs ± 1% 2.53µs ± 1% -18.91% (p=0.000 n=9+9) ReadWithoutFieldsPerRecord-8 3.13µs ± 0% 2.57µs ± 3% -18.07% (p=0.000 n=10+10) ReadLargeFields-8 52.3µs ± 1% 5.3µs ± 2% -89.93% (p=0.000 n=10+9) ReadReuseRecord-8 2.05µs ± 1% 1.40µs ± 1% -31.48% (p=0.000 n=10+9) ReadReuseRecordWithFieldsPerRecord-8 2.05µs ± 1% 1.41µs ± 0% -31.03% (p=0.000 n=10+9) ReadReuseRecordWithoutFieldsPerRecord-8 2.06µs ± 1% 1.40µs ± 1% -31.70% (p=0.000 n=9+10) ReadReuseRecordLargeFields-8 50.9µs ± 0% 4.1µs ± 3% -92.01% (p=0.000 n=10+10) name old alloc/op new alloc/op Read-8 664B ± 0% 664B ± 0% ReadWithFieldsPerRecord-8 664B ± 0% 664B ± 0% ReadWithoutFieldsPerRecord-8 664B ± 0% 664B ± 0% ReadLargeFields-8 3.94kB ± 0% 3.94kB ± 0% ReadReuseRecord-8 24.0B ± 0% 24.0B ± 0% ReadReuseRecordWithFieldsPerRecord-8 24.0B ± 0% 24.0B ± 0% ReadReuseRecordWithoutFieldsPerRecord-8 24.0B ± 0% 24.0B ± 0% ReadReuseRecordLargeFields-8 2.98kB ± 0% 2.98kB ± 0% name old allocs/op new allocs/op Read-8 18.0 ± 0% 18.0 ± 0% ReadWithFieldsPerRecord-8 18.0 ± 0% 18.0 ± 0% ReadWithoutFieldsPerRecord-8 18.0 ± 0% 18.0 ± 0% ReadLargeFields-8 24.0 ± 0% 24.0 ± 0% ReadReuseRecord-8 8.00 ± 0% 8.00 ± 0% ReadReuseRecordWithFieldsPerRecord-8 8.00 ± 0% 8.00 ± 0% ReadReuseRecordWithoutFieldsPerRecord-8 8.00 ± 0% 8.00 ± 0% ReadReuseRecordLargeFields-8 12.0 ± 0% 12.0 ± 0% Updates #22352 Updates #19019 Fixes #16791 Fixes #19410 Change-Id: I31c27cfcc56880e6abac262f36c947179b550bbf Reviewed-on: https://go-review.googlesource.com/72150 Reviewed-by: Ian Lance Taylor <[email protected]> Run-TryBot: Joe Tsai <[email protected]> TryBot-Result: Gobot Gobot <[email protected]>

gopherbot · 2017-10-21T00:08:48Z

Change https://golang.org/cl/72310 mentions this issue: encoding/csv: add ParseError.RecordLine

CL 72150 fixes #22352 by reverting the problematic parts of that CL where the line number and column number were inconsistent with each other. This CL adds back functionality to address the issue that CL 72150 was trying to solve in the first place. That is, it reports the starting line of the record, so that users have a frame of reference to start with when debugging what went wrong. In the event of gnarly CSV files with multiline quoted strings, a parse failure likely occurs somewhere between the start of the record and the point where the parser finally detected an error. Since ParserError.{Line,Column} reports where the *error* occurs, we add a RecordLine field to report where the record starts. Also take this time to cleanup and modernize TestRead. Fixes #19019 Fixes #22352 Change-Id: I16cebf0b81922c35f75804c7073e9cddbfd11a04 Reviewed-on: https://go-review.googlesource.com/72310 Reviewed-by: Ian Lance Taylor <[email protected]>

josharian added this to the Unplanned milestone Feb 9, 2017

josharian mentioned this issue Apr 28, 2017

proposal: encoding/csv: add a way to limit bytes read per field/record #20169

Open

pborman mentioned this issue Aug 1, 2017

encoding/csv: parser does not preserve \r\n line ending #21201

Closed

josharian closed this as completed Aug 14, 2017

dsnet mentioned this issue Oct 20, 2017

encoding/csv: incorrect line:column for error #22352

Closed

ALTree mentioned this issue Mar 16, 2018

encoding/csv: leading " in non-quoted field leads to error #24422

Closed

golang locked and limited conversation to collaborators Oct 21, 2018

gopherbot added the FrozenDueToAge label Oct 21, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

encoding/csv: unhelpful error location with badly quoted fields #19019

encoding/csv: unhelpful error location with badly quoted fields #19019

josharian commented Feb 9, 2017

nussjustin commented Apr 28, 2017 •

edited

Loading

nussjustin commented Jul 31, 2017

gopherbot commented Aug 2, 2017

nussjustin commented Aug 2, 2017

nussjustin commented Aug 14, 2017

josharian commented Aug 14, 2017

gopherbot commented Oct 20, 2017

gopherbot commented Oct 20, 2017

gopherbot commented Oct 21, 2017

encoding/csv: unhelpful error location with badly quoted fields #19019

encoding/csv: unhelpful error location with badly quoted fields #19019

Comments

josharian commented Feb 9, 2017

nussjustin commented Apr 28, 2017 • edited Loading

nussjustin commented Jul 31, 2017

gopherbot commented Aug 2, 2017

nussjustin commented Aug 2, 2017

nussjustin commented Aug 14, 2017

josharian commented Aug 14, 2017

gopherbot commented Oct 20, 2017

gopherbot commented Oct 20, 2017

gopherbot commented Oct 21, 2017

nussjustin commented Apr 28, 2017 •

edited

Loading