Switch to pyarrow engine when reading CSV files #382

Closed
@hagenw

As discussed in https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i, pyarrow can be used both to back the data types inside pandas and to read CSV files.

I did a small test with a 304 MB CSV file containing 1,461,090 rows:

  • reading with pandas: 3.99 s
  • reading with pyarrow: 0.64 s

The most obvious difference is that the resulting data types are then called int64[pyarrow], string[pyarrow], and so on. There might be other differences as well: the article states that not all operations are supported yet by the pyarrow data types. But maybe we are lucky and can already use it for the cases we have in audformat.

It might also align well with #321 and #376.

Benchmark code
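import time

import pandas as pd

path = 'db.csv'

# Default C engine with NumPy-backed dtypes
start = time.perf_counter()
df = pd.read_csv(path)
end = time.perf_counter()
print(f'{end - start:.2f} s')

# pyarrow engine with Arrow-backed dtypes (dtype_backend requires pandas >= 2.0)
start = time.perf_counter()
df = pd.read_csv(path, engine='pyarrow', dtype_backend='pyarrow')
end = time.perf_counter()
print(f'{end - start:.2f} s')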
