-
Notifications
You must be signed in to change notification settings - Fork 18k
encoding/csv: unhelpful error location with badly quoted fields #19019
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
For ErrFieldCount (which seems to be the problem child here) the column is always reported as 0 to make it clear that the whole record is the problem. I think the bigger problem here is the reported line. In the code from the playground link above, the Reader reads until EOF which is in line 5, but the record starts in line 3 and the other lines shouldn't even be part of the line. This could be solved by tracking the line a record started on and using this for errors instead or adding it as additional field to ParseError. |
Change https://golang.org/cl/52830 mentions this issue: |
I send a change which improves the issue described here by reporting the line where a record started, as proposed in my first comment. With this the Reader still reports column 0 for ErrFieldCount, but it should be much easier to find the problems in the files. |
Errors returned by Reader contain the line where the Reader originally encountered the error. This can be suboptimal since that line does not always correspond with the line the current record/field started at. This can easily happen with LazyQuotes as seen in #19019, but also happens for example when a quoted fields has no closing quote and the parser hits EOF before it finds another quote. When this happens finding the erroneous field can be somewhat complicated and time consuming, and in most cases it would be better to report the line where the record started. This change updates Reader to keep track of the line on which a record begins and uses it for errors instead of the current line, making it easier to find errors. Although a user-visible change, this should have no impact on existing code, since most users don't explicitly work with the line in the error and probably already expect the new behaviour. Updates #19019 Change-Id: Ic9bc70fad2651c69435d614d537e7a9266819b05 Reviewed-on: https://go-review.googlesource.com/52830 Reviewed-by: Ian Lance Taylor <[email protected]> Run-TryBot: Ian Lance Taylor <[email protected]> TryBot-Result: Gobot Gobot <[email protected]>
5d14ac7 was just merged, so with tip the code from the play ground example now reports
This seems much better to me, since it now reports the correct line. The column is still 0, but this is an explicit design choice and I think that in this case (ErrFieldCount) there isn't much value to the exact column when we already have the line. |
@nussjustin seems good enough to me. Thanks! |
Change https://golang.org/cl/72050 mentions this issue: |
Change https://golang.org/cl/72150 mentions this issue: |
The Reader implementation is slow because it operates on a rune-by-rune basis via bufio.Reader.ReadRune. We speed this up by operating on entire lines that we read from bufio.Reader.ReadSlice. In order to ensure that we read the full line, we augment ReadSlice in our Reader.readLine method to automatically expand the slice if bufio.ErrBufferFull is every hit. This change happens to fix #19410 because it no longer relies on rune-by-rune parsing and only searches for the relevant delimiter rune. In order to keep column accounting simple and consistent, this change reverts parts of CL 52830. This CL is an alternative to CL 36270 and builds on some of the ideas from that change by Diogo Pinela. name old time/op new time/op delta Read-8 3.12µs ± 1% 2.54µs ± 2% -18.76% (p=0.000 n=10+9) ReadWithFieldsPerRecord-8 3.12µs ± 1% 2.53µs ± 1% -18.91% (p=0.000 n=9+9) ReadWithoutFieldsPerRecord-8 3.13µs ± 0% 2.57µs ± 3% -18.07% (p=0.000 n=10+10) ReadLargeFields-8 52.3µs ± 1% 5.3µs ± 2% -89.93% (p=0.000 n=10+9) ReadReuseRecord-8 2.05µs ± 1% 1.40µs ± 1% -31.48% (p=0.000 n=10+9) ReadReuseRecordWithFieldsPerRecord-8 2.05µs ± 1% 1.41µs ± 0% -31.03% (p=0.000 n=10+9) ReadReuseRecordWithoutFieldsPerRecord-8 2.06µs ± 1% 1.40µs ± 1% -31.70% (p=0.000 n=9+10) ReadReuseRecordLargeFields-8 50.9µs ± 0% 4.1µs ± 3% -92.01% (p=0.000 n=10+10) name old alloc/op new alloc/op Read-8 664B ± 0% 664B ± 0% ReadWithFieldsPerRecord-8 664B ± 0% 664B ± 0% ReadWithoutFieldsPerRecord-8 664B ± 0% 664B ± 0% ReadLargeFields-8 3.94kB ± 0% 3.94kB ± 0% ReadReuseRecord-8 24.0B ± 0% 24.0B ± 0% ReadReuseRecordWithFieldsPerRecord-8 24.0B ± 0% 24.0B ± 0% ReadReuseRecordWithoutFieldsPerRecord-8 24.0B ± 0% 24.0B ± 0% ReadReuseRecordLargeFields-8 2.98kB ± 0% 2.98kB ± 0% name old allocs/op new allocs/op Read-8 18.0 ± 0% 18.0 ± 0% ReadWithFieldsPerRecord-8 18.0 ± 0% 18.0 ± 0% ReadWithoutFieldsPerRecord-8 18.0 ± 0% 18.0 ± 0% ReadLargeFields-8 24.0 ± 0% 24.0 ± 0% ReadReuseRecord-8 8.00 ± 0% 8.00 ± 0% ReadReuseRecordWithFieldsPerRecord-8 8.00 ± 0% 8.00 ± 0% ReadReuseRecordWithoutFieldsPerRecord-8 8.00 ± 0% 8.00 ± 0% ReadReuseRecordLargeFields-8 12.0 ± 0% 12.0 ± 0% Updates #22352 Updates #19019 Fixes #16791 Fixes #19410 Change-Id: I31c27cfcc56880e6abac262f36c947179b550bbf Reviewed-on: https://go-review.googlesource.com/72150 Reviewed-by: Ian Lance Taylor <[email protected]> Run-TryBot: Joe Tsai <[email protected]> TryBot-Result: Gobot Gobot <[email protected]>
Change https://golang.org/cl/72310 mentions this issue: |
CL 72150 fixes #22352 by reverting the problematic parts of that CL where the line number and column number were inconsistent with each other. This CL adds back functionality to address the issue that CL 72150 was trying to solve in the first place. That is, it reports the starting line of the record, so that users have a frame of reference to start with when debugging what went wrong. In the event of gnarly CSV files with multiline quoted strings, a parse failure likely occurs somewhere between the start of the record and the point where the parser finally detected an error. Since ParserError.{Line,Column} reports where the *error* occurs, we add a RecordLine field to report where the record starts. Also take this time to cleanup and modernize TestRead. Fixes #19019 Fixes #22352 Change-Id: I16cebf0b81922c35f75804c7073e9cddbfd11a04 Reviewed-on: https://go-review.googlesource.com/72310 Reviewed-by: Ian Lance Taylor <[email protected]>
I'm working with hundreds of 100,000+ line tsv files. They contain quotes in inappropriate places, so I am using LazyQuotes. However, somewhere in the guts of a few of the files, there are quotes at the beginning of a field, which breaks csv.Reader. That is #6287. When that happens, and LazyQuotes is on, the reported error location is at the very end of the file. See https://play.golang.org/p/y0jOObyBTV for an illustration of this.
However, if I turn LazyQuotes off, I get an error location on some other quotes that do no harm (with LazyQuotes on). I am left with an unreadable file that I cannot even fix manually, since there are some errors somewhere in hundreds of thousands of lines, but encoding/csv does not tell me where.
The error reported when LazyQuotes is on should probably at the beginning of the field in question if EOF is reached.
Also related: #3150.
I have some workarounds available. Just adding this to the list of encoding/csv quote issues. :)
The text was updated successfully, but these errors were encountered: