encoding/csv: Reading is slow #16791

Closed
@ALTree

$ go version
go version go1.7 linux/amd64

Out of the box, reading CSV files with encoding/csv is quite slow (tl;dr: 3x slower than a simple Java program, 1.5x slower than the obvious Python code). A typical example:

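package main

import (
    "encoding/csv"
    "fmt"
    "io"
    "log"
    "os"
)

func main() {
    f, err := os.Open("mock_data.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    r := csv.NewReader(f)
    for {
        line, err := r.Read()
        if err == io.EOF {
            break
        }
        if err != nil {
            // Malformed record or I/O error: line is nil here, so stop.
            log.Fatal(err)
        }
        if line[0] == "42" {
            fmt.Println(line)
        }
    }
}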

Python3 equivalent:

import csv
with open('mock_data.csv') as f:
    r = csv.reader(f)
    for row in r:
        if row[0] == "42":
            print(row)

Equivalent Java code [EDIT: not actually equivalent, please see pauldraper's comment below for a better test]

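import java.io.BufferedReader;
import java.io.FileReader;

public class ReadCsv {
    public static void main(String[] args) {
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(new FileReader("mock_data.csv"))) {
            String line;
            while ((line = br.readLine()) != null) {
                // Naive split: quoted fields are not handled
                String[] data = line.split(",");
                if (data[0].equals("42")) {
                    System.out.println(line);
                }
            }
        } catch (Exception e) {
            // Report the failure instead of silently swallowing it
            System.err.println("read failed: " + e);
        }
    }
}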

Tested on a 50MB CSV file with 1,000,002 lines, generated as:

data = ",Carl,Gauss,[email protected],Male,30.4.17.77\n"
with open("mock_data.csv", "w") as f:
    f.write("id,first_name,last_name,email,gender,ip_address\n")
    f.write(("1"+data)*int(1e6))
    f.write("42"+data)

Results:

Go:       avg 1.489 secs
Python:   avg 0.933 secs  (1.5x faster)
Java:     avg 0.493 secs  (3.0x faster)

Go's error reporting is obviously better than what you get with that Java code, and I'm not sure about Python's, but people have been complaining about encoding/csv slowness, so it's probably worth investigating whether the csv package can be made faster.
