Serverless parallelization of reference generation #123

@TomNicholas

Description

Finding byte ranges in every file in an archival dataset is an embarrassingly parallel problem, which might be a good fit for serverless execution.

This step is analogous to the parallel=True option of xr.open_mfdataset, which wraps xr.open_dataset in dask.delayed for each file to parallelize the opening step (entirely separate from any dask.array tree reduction after the arrays have been created):

https://github.com/pydata/xarray/blob/12123be8608f3a90c6a6d8a1cdc4845126492364/xarray/backends/api.py#L1046
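For context, a minimal sketch of that pattern (the file paths here are hypothetical; the real implementation is at the permalink above):

```python
import dask
import xarray as xr

paths = ["file1.nc", "file2.nc"]  # hypothetical input files

# Wrapping the open call in dask.delayed parallelizes the per-file
# opening step itself, before any array computation happens.
open_ = dask.delayed(xr.open_dataset)
delayed_datasets = [open_(p) for p in paths]
datasets = dask.compute(*delayed_datasets)  # parallel open, then gather

combined = xr.concat(datasets, dim="time")
```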

With that motivation, it's been suggested that we add a delayed-like function to Cubed (cubed-dev/cubed#311), which could in theory plug into xarray.open_mfdataset.

A simpler way to test this idea would be to skip the Cubed layer entirely and use a lithops map-gather (or whatever the correct primitive is) to try out serverless generation of references; a rough sketch follows below. I think for this to work the resulting virtual datasets need to be small enough to be gathered onto one node (thus avoiding a tree-reduce), but the discussion in https://github.com/TomNicholas/VirtualiZarr/issues/104 indicates that this should be okay (at least after #107 is complete).
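Roughly, it could look like this (assuming lithops' FunctionExecutor.map / get_result as the map-gather primitive and virtualizarr.open_virtual_dataset as the per-file reference generator; the URLs and combine arguments are illustrative, not a settled design):

```python
import lithops
import xarray as xr
from virtualizarr import open_virtual_dataset

def generate_refs(url: str) -> xr.Dataset:
    # Runs inside one serverless worker: scan a single archival file for
    # byte ranges and return a small virtual dataset of ManifestArrays.
    return open_virtual_dataset(url, indexes={})

urls = ["s3://bucket/file1.nc", "s3://bucket/file2.nc"]  # hypothetical inputs

fexec = lithops.FunctionExecutor()
futures = fexec.map(generate_refs, urls)

# Gather all results onto one node: each virtual dataset holds only
# reference metadata, so no tree-reduce should be needed.
virtual_datasets = fexec.get_result(futures)

combined = xr.concat(
    virtual_datasets, dim="time", coords="minimal", compat="override"
)
```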

xref https://github.com/TomNicholas/VirtualiZarr/issues/95

cc @tomwhite @rabernat
