Read from multiple csv files #2451

Closed
@lucasrodes

It would be useful to add the ability to read from multiple files at once, much as dask does (I am not sure about ray).

Assume I have several CSV files named 'data-001.csv', 'data-002.csv', ... In dask I would load them all into a single dataframe using:

import dask.dataframe as dd
ddf = dd.read_csv(
    'data-*.csv'
)
df = ddf.compute()

I understand that this is specific to dask and is not part of the general pandas API, but perhaps it would be worth adding this functionality?

Currently I am doing the following to rapidly load the dataset and still work with modin:

import dask.dataframe as dd
import modin.pandas as pd
ddf = dd.read_csv(
    'data-*.csv'
)
df = ddf.compute()
df = pd.DataFrame(df)

This can be extremely useful when working, for instance, with Google Cloud Platform. I ran into this while working with a BigQuery table, which I exported to a bucket as multiple CSVs (this is the way Google allows me to do it via their GUI). Once downloaded, it would be nice to load them all at once and avoid calling pd.concat, as it requires a lot of memory.
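To make the request concrete, here is a rough sketch of the call shape I have in mind. The names `matching_files` and `read_csv_glob` and the `backend` parameter are hypothetical, not part of modin or pandas; `backend` is assumed to be any pandas-compatible module exposing `read_csv` and `concat`. Note that this sketch still concatenates in memory, so it only illustrates the interface and does not solve the memory cost mentioned above.

```python
import glob


def matching_files(pattern):
    """Return the files matching a glob pattern, sorted by name."""
    # glob.glob returns matches in arbitrary order, so sort them
    # to get a stable, predictable row order in the result.
    return sorted(glob.glob(pattern))


def read_csv_glob(pattern, backend, **read_csv_kwargs):
    """Read every CSV matching `pattern` into one dataframe.

    `backend` is a pandas-compatible module (assumption: `pandas`
    or `modin.pandas`); extra keyword arguments go to `read_csv`.
    """
    paths = matching_files(pattern)
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    # Read each file separately, then concatenate once at the end.
    frames = [backend.read_csv(path, **read_csv_kwargs) for path in paths]
    return backend.concat(frames, ignore_index=True)
```

With modin this would be called as `read_csv_glob('data-*.csv', modin.pandas)`. Passing the module in as an argument keeps the helper usable with plain pandas too.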
