Serverless parallelization of reference generation #123

@TomNicholas

Description

Finding byte ranges in every file in an archival dataset is an embarrassingly parallel problem, which might be a good fit for serverless execution.

This step is analogous to the parallel=True option of xr.open_mfdataset, which wraps xr.open_dataset in dask.delayed for each file to parallelize the opening step (entirely separate from any dask.array tree reduction after the arrays have been created):

https://github.com/pydata/xarray/blob/12123be8608f3a90c6a6d8a1cdc4845126492364/xarray/backends/api.py#L1046
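For context, a minimal sketch of that pattern (the file paths here are hypothetical; the real implementation is at the permalink above):

```python
import dask
import xarray as xr

paths = ["file1.nc", "file2.nc"]  # hypothetical input files

# Wrapping the open call in dask.delayed parallelizes the per-file
# opening step itself, before any array computation happens.
open_ = dask.delayed(xr.open_dataset)
delayed_datasets = [open_(p) for p in paths]
datasets = dask.compute(*delayed_datasets)  # parallel open, then gather

combined = xr.concat(datasets, dim="time")
```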

With that motivation, it's been suggested that we add a delayed-like function to Cubed (cubed-dev/cubed#311), which could in theory plug into xarray.open_mfdataset.

A simpler way to test this idea would be to skip the Cubed layer entirely and use a lithops map-gather (or whatever the correct primitive is) to try out serverless generation of references; a rough sketch follows below. I think for this to work the resulting virtual datasets need to be small enough to be gathered onto one node (thus avoiding a tree-reduce), but the discussion in https://github.com/TomNicholas/VirtualiZarr/issues/104 indicates that this should be okay (at least after #107 is complete).
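Roughly, it could look like this (assuming lithops' FunctionExecutor.map / get_result as the map-gather primitive and virtualizarr.open_virtual_dataset as the per-file reference generator; the URLs and combine arguments are illustrative, not a settled design):

```python
import lithops
import xarray as xr
from virtualizarr import open_virtual_dataset

def generate_refs(url: str) -> xr.Dataset:
    # Runs inside one serverless worker: scan a single archival file for
    # byte ranges and return a small virtual dataset of ManifestArrays.
    return open_virtual_dataset(url, indexes={})

urls = ["s3://bucket/file1.nc", "s3://bucket/file2.nc"]  # hypothetical inputs

fexec = lithops.FunctionExecutor()
futures = fexec.map(generate_refs, urls)

# Gather all results onto one node: each virtual dataset holds only
# reference metadata, so no tree-reduce should be needed.
virtual_datasets = fexec.get_result(futures)

combined = xr.concat(
    virtual_datasets, dim="time", coords="minimal", compat="override"
)
```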

xref https://github.com/TomNicholas/VirtualiZarr/issues/95

cc @tomwhite @rabernat
