
Larger-than-memory datasets with Zarr #132

Closed
@Andrewq11


Context

Polaris would like to support larger-than-memory datasets. The current data model uses a Pandas DataFrame at its core, and the entire DataFrame is always loaded into memory, which is a bottleneck for large datasets. Such datasets may not fit in memory at all, or they impose unrealistic requirements on the hardware people need to work with them.

With pointer columns we increased the effective dataset size we can store in a DataFrame, but we still run into limits when the number of datapoints is large. For example, for a dataset with 1 billion rows, storing even a single np.float64 column would take 8 GB. This row-count limit is the main bottleneck we need to address to support larger-than-memory datasets.
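
The arithmetic behind that figure can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: `column_bytes` is a hypothetical helper, not part of Polaris, and it ignores the DataFrame index and any per-object overhead.

```python
def column_bytes(n_rows: int, itemsize: int = 8) -> int:
    """Raw size in bytes of one fixed-width column.

    itemsize defaults to 8 because an np.float64 value occupies 8 bytes.
    """
    return n_rows * itemsize


n_rows = 1_000_000_000  # 1 billion datapoints, as in the example above
size = column_bytes(n_rows)

print(size)                    # raw size in bytes
print(f"{size / 1e9:.2f} GB")    # decimal gigabytes
print(f"{size / 2**30:.2f} GiB")  # binary gibibytes
```

Every additional fixed-width column adds the same amount again, so the in-memory footprint grows linearly with both the number of rows and the number of columns.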

Description

After considering multiple alternatives such as Polars, Dask and Zarr, we concluded that:

  • The main use case for Polaris datasets is to train or evaluate models. This implies that indexing the data in a data-loader-like pattern is the most important operation we need to officially support.
  • Given this scope, Zarr is the solution we would like to pursue to implement larger-than-memory datasets. Alternatives such as Polars and Dask are designed to efficiently perform more complicated operations (e.g. join, merge, transforms) on DataFrame-like objects, which are out of scope for now, whereas Zarr is a storage format for chunked arrays that can be read piece by piece.

We implemented a proof-of-concept here.

Acceptance Criteria

  • A larger-than-memory dataset can be randomly accessed through Polaris.
  • Backwards compatibility is maintained by implementing this as a new Dataset type.
  • The new Dataset class implements the same interface to be compatible with the Subset class.
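
One way to read the last two criteria is that the subset logic should depend only on a small indexing interface, so that an in-memory and a lazily loaded dataset are interchangeable. The sketch below illustrates that idea with hypothetical stand-in classes (`InMemoryDataset`, `LazyDataset`, `SubsetSketch`); it does not reproduce the actual Polaris `Dataset` or `Subset` API.

```python
from typing import Callable, Protocol, Sequence


class RowIndexable(Protocol):
    """The minimal interface the subset sketch relies on (an assumption)."""

    def __len__(self) -> int: ...

    def __getitem__(self, index: int) -> dict: ...


class InMemoryDataset:
    """Stand-in for the existing dataset type: all columns live in memory."""

    def __init__(self, columns: dict):
        self._columns = columns

    def __len__(self) -> int:
        return len(next(iter(self._columns.values())))

    def __getitem__(self, index: int) -> dict:
        return {name: values[index] for name, values in self._columns.items()}


class LazyDataset:
    """Stand-in for a new dataset type: rows are loaded only on access."""

    def __init__(self, n_rows: int, load_row: Callable[[int], dict]):
        self._n_rows = n_rows
        self._load_row = load_row  # e.g. a function that reads from disk

    def __len__(self) -> int:
        return self._n_rows

    def __getitem__(self, index: int) -> dict:
        if not 0 <= index < self._n_rows:
            raise IndexError(index)
        return self._load_row(index)


class SubsetSketch:
    """Selects rows of any RowIndexable dataset by position."""

    def __init__(self, dataset: RowIndexable, indices: Sequence[int]):
        self.dataset = dataset
        self.indices = list(indices)

    def __len__(self) -> int:
        return len(self.indices)

    def __getitem__(self, i: int) -> dict:
        return self.dataset[self.indices[i]]


in_memory = InMemoryDataset({"x": [10, 20, 30], "y": [1, 0, 1]})
lazy = LazyDataset(3, lambda i: {"x": (i + 1) * 10, "y": (i + 1) % 2})

for dataset in (in_memory, lazy):
    subset = SubsetSketch(dataset, [2, 0])
    print([subset[i] for i in range(len(subset))])
```

Because `SubsetSketch` never touches the backing storage directly, adding a new dataset type does not require changing it, which is the backwards-compatibility property the criteria ask for.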

Labels

feature: Annotates any PR that adds new features; used in the release process
