Context
Polaris would like to support larger-than-memory datasets. The current data model uses a Pandas DataFrame at its core. Since the entire DataFrame is always loaded into memory, this is a bottleneck for supporting large datasets: a dataset may not fit in memory at all, or it may impose unrealistic hardware requirements on the devices people need to work with it.
With pointer columns we increased the effective dataset size we can store in a DataFrame, but we still run into limits when the number of datapoints is large. For example, for a dataset with 1 billion rows, even storing a single np.float64 column would take 8 GB. This is the main bottleneck we need to address to support larger-than-memory datasets.
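For reference, the back-of-the-envelope arithmetic behind that number (a minimal sketch; the variable names are purely illustrative):

```python
import numpy as np

# One float64 column with 1 billion rows: 8 bytes per value.
n_rows = 1_000_000_000
bytes_per_value = np.dtype(np.float64).itemsize  # 8 bytes

print(f"{n_rows * bytes_per_value / 1e9:.0f} GB")  # 8 GB for a single column
```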
Description
After considering multiple alternatives such as Polars, Dask and Zarr, we concluded that:
- The main use case for Polaris datasets is training or evaluating models. This implies that indexing the data in a data-loader-like pattern is the most important operation we need to officially support.
- Considering this scope, Zarr is the solution we would like to pursue to implement larger-than-memory datasets. Alternatives such as Polars and Dask are designed to efficiently perform more complicated operations (e.g. join, merge, transform) on DataFrame-like objects, which are out of scope for now, whereas Zarr is a storage format for chunked, N-dimensional arrays.
We implemented a proof-of-concept here.
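To make the intended access pattern concrete, below is a minimal sketch (not the proof of concept; the ZarrBackedDataset class, the column layout, and the example store are all hypothetical) of how a Zarr-backed dataset could be indexed lazily, row by row, the way a data loader would:

```python
import numpy as np
import zarr


class ZarrBackedDataset:
    """Sketch of a dataset whose columns live in a chunked Zarr store.

    Only the chunks containing the requested rows are read from disk,
    so the full dataset never has to fit in memory.
    """

    def __init__(self, store_path: str):
        # Opening the group reads metadata only, not the array data.
        self._root = zarr.open_group(store_path, mode="r")
        self._columns = list(self._root.array_keys())

    def __len__(self) -> int:
        # Assumes all columns have the same number of rows.
        return self._root[self._columns[0]].shape[0]

    def __getitem__(self, idx: int) -> dict:
        # Indexing a Zarr array loads only the chunk(s) that contain `idx`.
        return {name: self._root[name][idx] for name in self._columns}


# Build a tiny on-disk example store with two "columns" of 10 rows each.
zarr.save_group(
    "example.zarr",
    x=np.arange(10, dtype=np.float64),
    y=np.arange(10, dtype=np.int64) % 2,
)

dataset = ZarrBackedDataset("example.zarr")
print(len(dataset))  # 10
print(dataset[3])    # the fourth row as a dict of column values
```

Because each column is chunked, random access by row only touches the chunks that contain that row, which matches the data-loader-like indexing described above.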
Acceptance Criteria
Larger-than-memory datasets can be indexed through the Subset class.

Links