Closed
Description
As discussed in https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i, pyarrow can also be used to handle data types inside pandas and to read CSV files.
I did a small test with a 304 MB CSV file containing 1,461,090 rows:
- reading with pandas: 3.99 s
- reading with pyarrow: 0.64 s
The most obvious difference is that the resulting data types are called int64[pyarrow], string[pyarrow], and so on. There might be other differences as well, since the article states that not all operations are supported yet by the pyarrow data types, but maybe we are lucky and can already use it for the cases we have in audformat.
It might also align well with #321 and #376.
Benchmark code
import time

import pandas as pd

path = 'db.csv'

# Default engine with NumPy-backed dtypes
start = time.time()
df = pd.read_csv(path)
end = time.time()
print(f'{end - start:.2f} s')

# pyarrow engine with pyarrow-backed dtypes
start = time.time()
df = pd.read_csv(path, engine='pyarrow', dtype_backend='pyarrow')
end = time.time()
print(f'{end - start:.2f} s')
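A single timing run can be skewed by file system caching, since the second read may benefit from the first. The sketch below shows one way to make the comparison more robust by repeating each measurement and keeping the fastest run. `best_of` is a hypothetical helper, not part of pandas or audformat, and the workload is a stand-in so the block runs without a CSV file.

```python
import time


def best_of(func, repeats=3):
    """Return the shortest wall-clock time in seconds over several calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload; for the benchmark above this could be
# lambda: pd.read_csv(path)
elapsed = best_of(lambda: sum(range(100_000)))
print(f"{elapsed:.4f} s")
```

`time.perf_counter()` is used here because it is a monotonic, high-resolution clock intended for measuring short durations.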