Preprocess function for save_mfdataset #4475
Comments
you could use […]. I think this will work, but I've never used […].
Unfortunately that doesn't work: […]
You could write to netCDF in […]. I guess this is a good argument for adding a […].
I think we could support delayed objects in […].
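For reference, `save_mfdataset` already supports a delayed write via `compute=False`; here is a minimal sketch of that workaround, assuming (as in the feature request below) that `ds` is dask-backed, is split with `groupby`, and `.mean()` stands in for the extra per-group logic:

```python
import dask
import xarray as xr

# Split into per-year datasets and apply the extra logic lazily.
years, datasets = zip(*ds.groupby("time.year"))
processed = [d.mean() for d in datasets]  # stand-in for ds.foo()
paths = [f"{y}.nc" for y in years]

# compute=False returns a dask.delayed.Delayed instead of writing eagerly,
# so the whole pipeline executes as one parallel graph.
delayed = xr.save_mfdataset(processed, paths, compute=False)
dask.compute(delayed)
```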
Thank you, this works for me. However, it's quite slow and seems to scale worse than linearly as the length of `datasets` grows. Could it be connected to #2912 (comment), where they suggest using […]? Appreciate the help!
Are you using multiple threads or multiple processes? IIUC you should be using multiple processes for max writing efficiency.
Multiple threads (the default), because the dask docs recommend them "for numeric code that releases the GIL (like NumPy, Pandas, Scikit-Learn, Numba, …)". I guess I could use multiple threads for the compute part (everything up to the definition of `datasets`) and multiple processes for the writing part?
I think so. I would try multiple processes and see if that is fast enough for what you want to do. Or else, write to zarr: that will be parallelized and is a lot easier than dealing with HDF5.
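A sketch of both suggestions, with illustrative worker counts and an assumed output path (`dask.distributed.Client` and `to_zarr` are real APIs; `ds` is the dataset from the thread):

```python
from dask.distributed import Client
import xarray as xr

# Process-based workers sidestep the GIL during serialization-heavy netCDF writes.
client = Client(n_workers=4, threads_per_worker=1)  # counts are illustrative

# Alternatively, zarr writes parallelize cleanly without HDF5's locking headaches.
ds.to_zarr("output.zarr", mode="w")
```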
Sounds good, I'll do this in the meantime. Still quite interested in […].
Is your feature request related to a problem? Please describe.
I would like to supply a `preprocess` argument to `save_mfdataset` that gets applied to each dataset before it is written to disk, similar to the option `open_mfdataset` gives you. Specifically, I have a dataset that I want to split by unique values along a dimension, apply some further logic to each sub-dataset, and then save each sub-dataset to a different file. Currently I'm able to split and save using the code provided in the API docs, sketched below.
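The API-docs pattern referenced above looks roughly like this (splitting by year along a `time` dimension):

```python
import xarray as xr

# One sub-dataset per unique year, each written to its own netCDF file.
years, datasets = zip(*ds.groupby("time.year"))
paths = [f"{y}.nc" for y in years]
xr.save_mfdataset(datasets, paths)
```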
What's missing is the ability to insert further logic into each of the sub-datasets given by the groupby object. If I try iterating through `datasets` here and chaining further operations onto each element, the calculations begin to execute serially even though `ds` is a dask array:

```python
save_mfdataset([ds.foo() for ds in datasets], paths)
```
Describe the solution you'd like
Instead, I'd like the ability to do:
xr.save_mfdataset(datasets, paths, preprocess=lambda ds: ds.foo())
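A hypothetical user-side wrapper illustrating the requested behavior (the function name is made up for this sketch and is not an xarray API):

```python
import xarray as xr

def save_mfdataset_with_preprocess(datasets, paths, preprocess=None, **kwargs):
    """Hypothetical sketch: apply `preprocess` to each dataset lazily,
    then hand everything to xarray's parallel writer."""
    if preprocess is not None:
        datasets = [preprocess(ds) for ds in datasets]
    return xr.save_mfdataset(datasets, paths, **kwargs)

# Usage matching the proposed API:
# save_mfdataset_with_preprocess(datasets, paths, preprocess=lambda ds: ds.foo())
```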
Describe alternatives you've considered
Not sure.