Skip to content

Parallelization via dask #7

@TomNicholas

Description

@TomNicholas

There are two places we could use xarray's machinery for parallelization to potentially speed up the generation of references.

  1. Using parallel=True in xr.open_mfdataset, which would then use dask.delayed to parallelize the generation of the byte ranges from each file. This could be a big speedup, as it would parallelize the opening of the legacy files.

  2. In theory we could also wrap the ManifestArray objects with dask.Array, then use dask's tree-reduce to do the concatenation. I think this is roughly what kerchunk.combine.auto_dask is approximating. However I'm not totally confident that (a) this is set up to work right now in dask.array or (b) this actually is a performance bottleneck in practice.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions