Larger-than-memory datasets with Zarr #132

Closed
Andrewq11 opened this issue Jul 17, 2024 · 0 comments · Fixed by #186

Andrewq11 (Contributor) commented Jul 17, 2024

Context

Polaris would like to support larger-than-memory datasets. The current data model uses a Pandas DataFrame at its core. Since the entire DataFrame is always loaded into memory, this is a bottleneck for supporting large datasets: such datasets may not fit in memory at all, or they impose unrealistic hardware constraints on the devices people can use to work with them.

With pointer columns we increased the effective dataset size we can store in a DataFrame, but we still run into limits when the number of datapoints is large. For example, for a dataset with 1 billion rows, even just storing a single column of np.float64 would take 8GB. To support larger-than-memory datasets, this is the main bottleneck we will need to address.
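The arithmetic behind that 8 GB figure is straightforward:

```python
# Back-of-the-envelope: a single np.float64 column with 1 billion rows.
n_rows = 1_000_000_000
bytes_per_float64 = 8
total_gb = n_rows * bytes_per_float64 / 1e9
print(total_gb)  # 8.0
```

And that is just one column; real datasets have many, so memory use scales accordingly.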

Description

After considering multiple alternatives such as Polars, Dask and Zarr, we concluded that:

  • The main use case for Polaris datasets is to train or evaluate models. This implies that indexing the data in a data-loader-like pattern is the most important operation we need to officially support.
  • Considering this scope, Zarr is the solution we would like to pursue to implement larger-than-memory datasets. Alternatives such as Polars and Dask are designed to efficiently perform more complicated operations (e.g. join, merge, transforms) on DataFrame-like objects which are out-of-scope for now, whereas Zarr is a file storage format.

We implemented a proof-of-concept here.

Acceptance Criteria

  • A larger-than-memory dataset can be randomly accessed through Polaris.
  • Backwards compatibility is maintained by implementing this as a new Dataset type.
  • The new Dataset class implements the same interface to be compatible with the Subset class.

Links

@Andrewq11 Andrewq11 added the feature Annotates any PR that adds new features; Used in the release process label Jul 17, 2024
@cwognum cwognum added this to the XL Datasets milestone Aug 13, 2024
@cwognum cwognum changed the title Support downloading larger-than-memory datasets Larger-than-memory datasets with Zarr Aug 13, 2024
@cwognum cwognum self-assigned this Aug 14, 2024