Larger-than-memory datasets with Zarr #132

Closed
Andrewq11 opened this issue Jul 17, 2024 · 0 comments · Fixed by #186

Andrewq11 (Contributor) commented Jul 17, 2024

Context

Polaris would like to support larger-than-memory datasets. The current data model uses a Pandas DataFrame at its core. Since the entire DataFrame is always loaded into memory, this is a bottleneck for supporting large datasets: such datasets may not fit in memory at all, or they impose unrealistic hardware constraints on the devices people can use to work with them.

With pointer columns we increased the effective dataset size we can store in a DataFrame, but we still run into limits when the number of datapoints is large. For example, for a dataset with 1 billion rows, even just storing a single column of np.float64 would take 8GB. To support larger-than-memory datasets, this is the main bottleneck we will need to address.
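The arithmetic behind that 8 GB figure is straightforward:

```python
# Back-of-the-envelope: a single np.float64 column with 1 billion rows.
n_rows = 1_000_000_000
bytes_per_float64 = 8
total_gb = n_rows * bytes_per_float64 / 1e9
print(total_gb)  # 8.0
```

And that is just one column; real datasets have many, so memory use scales accordingly.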

Description

After considering multiple alternatives such as Polars, Dask and Zarr, we concluded that:

  • The main use case for Polaris datasets is to train or evaluate models. This implies that indexing the data in a data-loader-like pattern is the most important operation we need to officially support.
  • Considering this scope, Zarr is the solution we would like to pursue to implement larger-than-memory datasets. Alternatives such as Polars and Dask are designed to efficiently perform more complicated operations (e.g. join, merge, transforms) on DataFrame-like objects which are out-of-scope for now, whereas Zarr is a file storage format.

We implemented a proof-of-concept here.

Acceptance Criteria

  • A larger-than-memory dataset can be randomly accessed through Polaris.
  • Backwards compatibility is maintained by implementing this as a new Dataset type.
  • The new Dataset class implements the same interface to be compatible with the Subset class.

Links

@Andrewq11 Andrewq11 added the feature Annotates any PR that adds new features; Used in the release process label Jul 17, 2024
@cwognum cwognum added this to the XL Datasets milestone Aug 13, 2024
@cwognum cwognum changed the title Support downloading larger-than-memory datasets Larger-than-memory datasets with Zarr Aug 13, 2024
@cwognum cwognum self-assigned this Aug 14, 2024