Support parallel writes to regions of zarr stores #4035
Conversation
Stephan, this seems like a great addition. Thanks for getting it started! I'm curious how this interacts with dimension coordinates. Your example bypasses this. But what if dimension coordinates are present? How do we handle alignment issues? For example, what if I call
👍 I think only advanced users will want to use this feature.
It's entirely unsafe. Currently the coordinates would be overridden with the new values, which is consistent with how `to_netcdf()` with `mode='a'` works. This is probably another good reason for requiring users to explicitly drop variables that don't include a dimension in the selected region, because at least in that case there can be no user expectations about alignment with coordinates that don't exist. In the long term, it might make sense to make both `to_netcdf` and `to_zarr` check coordinates for alignment by default, but we wouldn't want that in all cases, because sometimes users really do want to update variables.
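A small sketch of the behavior described here (the store path, names, and sizes are hypothetical, not taken from this PR): the coordinate values inside the region are simply replaced by whatever the writer supplies, with no alignment check.

```python
import numpy as np
import xarray as xr

# Create a store with coordinate x = 0..9.
xr.Dataset(
    {"v": ("x", np.zeros(10))}, coords={"x": np.arange(10)}
).to_zarr("demo.zarr", mode="w")

# Write a region whose "x" values disagree with what is already stored.
# No alignment check is performed: the stored coordinate values are
# silently overridden, mirroring to_netcdf(..., mode="a").
part = xr.Dataset(
    {"v": ("x", np.ones(10))}, coords={"x": np.arange(100, 110)}
)
part.to_zarr("demo.zarr", region={"x": slice(0, 10)})

print(xr.open_zarr("demo.zarr").x.values)  # now 100..109, not 0..9
```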
@rabernat pointed this PR out to me, and this is great progress towards allowing more database-like CRUD operations on zarr datasets. A similar neat feature would be to read xarray datasets from regions of zarr groups w/o dask arrays.
@nbren12 - this has always been supported. Just call
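Presumably something along these lines (the exact call is elided above; the store path and dimension name below are only illustrative):

```python
import xarray as xr

# Open the store lazily without creating dask arrays (chunks=None),
# then pull just the region of interest into memory.
ds = xr.open_zarr("store.zarr", chunks=None)
subset = ds.isel(time=slice(0, 50)).load()
```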
@rabernat I learn something new every day. Sorry for cluttering up this PR with my ignorance, haha.
I've added error checking, tests, and documentation, so this is ready for review now! Take a look here for a rendered version of the new docs section:
This looks nice. Has there been any thought about whether this would work with functions as a service (GCP Cloud Functions, AWS Lambda, etc.) for supporting parallel transformation from netCDF to Zarr?
I haven't used functions as a service before, but yes, I imagine this might be useful for that sort of thing. As long as you can figure out the structure of the overall Zarr datasets ahead of time, you could use `region` to fill out different parts entirely independently.
Zac, you may be interested in this thread:
https://discourse.pangeo.io/t/best-practices-to-go-from-1000s-of-netcdf-files-to-analyses-on-a-hpc-cluster/588/32
Tom White managed to integrate dask with pywren via a dask executor. This allows you to read/write zarr with Lambda.
This is a very desirable feature for us. We have been using this branch in development, and it is working great for our use case. We are reluctant to put it into production until it is merged and released. Is there any expected timeline for that to occur?
I just fixed a race condition with writing attributes. Let me spend a little bit of time responding to Ryan's review, and then I think we can submit it.
But yes, we've also been successfully using this for parallel writes for a few months now (aside from the race condition).
OK, I think this is ready for a final review.
Anyone else want to take a look at this?
I only looked at the docs, and found some minor things.
Co-authored-by: keewis <[email protected]>
If there are no additional reviews or objections, I will merge this tomorrow.
@shoyer thanks for implementing this; it is going to be very useful. I am trying to write this dataset below: dsregion:
As a region of this other dataset: dset:
Using the following call:
But I got stuck on the conditional below within
Apparently because
Should this checking be performed for all variables, or only for data_variables?
I agree that this requirement is a little surprising. The error exists because otherwise you might be surprised that the array values for "latitude" and "longitude" get overridden, rather than being checked for consistency. At least if you have to explicitly drop these variables (with the suggested call to
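For a dataset along these lines, the suggested pattern would look roughly like the sketch below (dimension names, sizes, chunking, and the store path are assumptions for illustration, not taken from the comment above):

```python
import dask.array
import numpy as np
import xarray as xr

# Hypothetical target: 20 time steps, with fixed 2D latitude/longitude
# coordinates that do not have a "time" dimension.
ny, nx = 4, 5
full = xr.Dataset(
    {
        "precip": (
            ("time", "y", "x"),
            dask.array.zeros((20, ny, nx), chunks=(10, ny, nx)),
        )
    },
    coords={
        "latitude": (("y", "x"), np.random.rand(ny, nx)),
        "longitude": (("y", "x"), np.random.rand(ny, nx)),
    },
)
# Writes the structure plus the in-memory 2D coordinates, but no data values.
full.to_zarr("target.zarr", compute=False)

# A worker writes only its time slab: the coordinates without a "time"
# dimension are dropped explicitly, so the region write cannot touch them.
slab = full.isel(time=slice(0, 10)).drop_vars(["latitude", "longitude"])
slab.to_zarr("target.zarr", region={"time": slice(0, 10)})
```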
This PR adds support for a `region` keyword argument to `to_zarr()`, to support parallel writes to different parts of arrays in a Zarr store, e.g., `ds.to_zarr(..., region={'x': slice(1000, 2000)})` to write a dataset over the range `1000:2000` along the `x` dimension.

This is useful for creating large Zarr datasets without requiring dask. For example, the separate workers in a simulation job might each write a single non-overlapping chunk of a Zarr file. The standard way to handle such datasets today is to first write netCDF files in each process, and then consolidate them afterwards with dask (see #3096).
Creating empty Zarr stores
In order to do so, the Zarr store must already exist with the desired variables in the right shapes/chunks. It is desirable to be able to create such stores without actually writing data, because datasets that we want to write in parallel may be very large.
In the example below, I achieve this by filling a `Dataset` with dask arrays and passing `compute=False` to `to_zarr()`. This works, but it relies on an undocumented implementation detail of the `compute` argument. We should either: (1) document that the `compute` argument only controls writing array values, not metadata (at least for zarr), or (2) add a dedicated option for writing only metadata, e.g., `write_values=False`.

I think (1) is maybe the cleanest option (no extra API endpoints).
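A minimal sketch of that approach (variable names, shapes, and the store path are hypothetical): fill a `Dataset` with lazy dask arrays and pass `compute=False`, so only the metadata and in-memory coordinates are written.

```python
import dask.array
import numpy as np
import xarray as xr

# The data variable is a lazy dask array; nothing is computed here.
ds = xr.Dataset(
    {
        "temperature": (
            ("time", "x"),
            dask.array.zeros((1000, 500), chunks=(100, 500)),
        )
    },
    coords={"time": np.arange(1000), "x": np.arange(500)},
)
# With compute=False, the store's metadata (dimensions, dtypes, chunk layout)
# and the in-memory coordinates are written, but no dask array values are.
ds.to_zarr("empty.zarr", compute=False)
```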
Unchunked variables
One potential gotcha concerns coordinate arrays that are not chunked, e.g., consider parallel writing of a dataset divided along time with 2D `latitude` and `longitude` arrays that are fixed over all chunks. With the current PR, such coordinate arrays would get rewritten by each separate writer.

If a Zarr store does not have atomic writes, then conceivably this could result in corrupted data. The default DirectoryStore has atomic writes and cloud-based object stores should also be atomic, so perhaps this doesn't matter in practice, but at the very least it's inefficient and could cause issues for large-scale jobs due to resource contention.
Options include:

1. Variables that don't overlap with `region` are written by `to_zarr()`. This is likely the most intuitive behavior for writing from a single process at a time.
2. Exclude variables that don't overlap with `region` from being written. This is likely the most convenient behavior for writing from multiple processes at once.
3. …
4. Require users to explicitly drop variables that don't overlap with `region`, e.g., with `.drop()`. This is probably the safest behavior.

I think (4) would be my preferred option. Some users would undoubtedly find this annoying, but the power-users for whom we are adding this feature would likely appreciate it.
Usage example
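A sketch of the intended workflow (names, sizes, and the store path are illustrative, not from the original example): first write the structure of the full store, then have each worker fill in its own region independently.

```python
import dask.array
import numpy as np
import xarray as xr

# Step 1 (once, up front): write the structure of the full store without
# computing any array values. The in-memory "x" coordinate is written in full.
template = xr.Dataset(
    {"data": (("x",), dask.array.zeros(4000, chunks=1000))},
    coords={"x": np.arange(4000)},
)
template.to_zarr("example.zarr", compute=False)

# Step 2 (in each worker, independently): write one non-overlapping region.
# Only "data" is written here; the coordinate was already stored in step 1.
part = xr.Dataset({"data": (("x",), np.ones(1000))})
part.to_zarr("example.zarr", region={"x": slice(1000, 2000)})
```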
- Passes `isort -rc . && black . && mypy . && flake8`
- Fully documented, including `whats-new.rst` for all changes and `api.rst` for new API