Description
Finding byte ranges in every file in an archival dataset is an embarrassingly-parallel problem, which might be a good fit for serverless.
This step is analogous to the `parallel=True` option to `xr.open_mfdataset`, which wraps `xr.open_dataset` in `@dask.delayed` for each file to parallelize the opening step (totally separate from any `dask.array` tree reduction after the arrays have been created).
With that motivation, it's been suggested that we add a delayed-like function to Cubed (cubed-dev/cubed#311), which could in theory plug into `xarray.open_mfdataset`.
A simpler way to test this idea would be to skip the Cubed layer entirely and use a lithops map-gather (or whatever the correct primitive is) to try out serverless generation of references. I think for this to work the resulting virtual datasets need to be small enough to be gathered onto one node (thus avoiding a tree-reduce), but the discussion in https://github.com/TomNicholas/VirtualiZarr/issues/104 indicates that this should be okay (at least after #107 is complete).
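
To make the map-gather idea concrete, here is a rough sketch rather than a tested implementation. The helper names (`local_map_gather`, `lithops_map_gather`, `generate_virtual_datasets`) are hypothetical and not part of any library. The local version uses a thread pool purely as a stand-in so the pattern can be tried without a serverless backend. The lithops version assumes a configured backend and that each per-file result is small enough to serialize and gather onto the client.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List, Sequence


def local_map_gather(func: Callable[[str], Any], items: Sequence[str]) -> List[Any]:
    """Local stand-in for a serverless map-gather (hypothetical helper).

    Runs func on every item in a thread pool and returns the results
    in input order, all gathered into one process.
    """
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(func, items))


def lithops_map_gather(func: Callable[[str], Any], items: Sequence[str]) -> List[Any]:
    """Same pattern using lithops (hypothetical helper, needs a configured backend)."""
    import lithops  # imported lazily so the local path has no lithops dependency

    fexec = lithops.FunctionExecutor()
    futures = fexec.map(func, list(items))  # one function invocation per file
    return fexec.get_result(futures)  # gather all results onto the client


def generate_virtual_datasets(
    paths: Sequence[str],
    open_func: Callable[[str], Any],
    map_gather: Callable[..., List[Any]] = local_map_gather,
) -> List[Any]:
    """Open every file independently, then gather the per-file results.

    Assumption: with VirtualiZarr, open_func would be something like
    virtualizarr.open_virtual_dataset, and the gathered list would then be
    combined on the client (e.g. with xr.combine_nested).
    """
    return map_gather(open_func, paths)
```

Swapping `map_gather=lithops_map_gather` would move the per-file work to serverless workers while keeping the final combine step on a single node, which is exactly the part that relies on the virtual datasets being small.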
xref https://github.com/TomNicholas/VirtualiZarr/issues/95