Support parallel writes to regions of zarr stores #4035


Merged (28 commits) on Nov 4, 2020

Conversation

@shoyer (Member) commented May 6, 2020

This PR adds support for a region keyword argument to to_zarr(), to support parallel writes to different regions of arrays in a Zarr store, e.g., ds.to_zarr(..., region={'x': slice(1000, 2000)}) to write a dataset over the range 1000:2000 along the x dimension.

This is useful for creating large Zarr datasets without requiring dask. For example, the separate workers in a simulation job might each write a single non-overlapping chunk of a Zarr file. The standard way to handle such datasets today is to first write netCDF files in each process, and then consolidate them afterwards with dask (see #3096).

Creating empty Zarr stores

For this to work, the Zarr store must already exist with the desired variables in the right shapes and chunks. It is desirable to be able to create such stores without actually writing data, because the datasets we want to write in parallel may be very large.

In the example below, I achieve this by filling a Dataset with dask arrays and passing compute=False to to_zarr(). This works, but it relies on an undocumented implementation detail of the compute argument. We should either:

  1. Officially document that the compute argument only controls writing array values, not metadata (at least for zarr).
  2. Add a new keyword argument or an entirely new method for creating an unfilled Zarr store, e.g., write_values=False.

I think (1) is maybe the cleanest option (no extra API endpoints).

Unchunked variables

One potential gotcha concerns coordinate arrays that are not chunked, e.g., consider parallel writing of a dataset divided along time with 2D latitude and longitude arrays that are fixed over all chunks. With the current PR, such coordinate arrays would get rewritten by each separate writer.

If a Zarr store does not have atomic writes, then conceivably this could result in corrupted data. The default DirectoryStore has atomic writes and cloud-based object stores should also be atomic, so perhaps this doesn't matter in practice, but at the very least it's inefficient and could cause issues for large-scale jobs due to resource contention.

Options include:

  1. Current behavior. Variables whose dimensions do not overlap with region are written by to_zarr(). This is likely the most intuitive behavior for writing from a single process at a time.
  2. Exclude variables whose dimensions do not overlap with region from being written. This is likely the most convenient behavior for writing from multiple processes at once.
  3. Like (2), but issue a warning if any such variables exist instead of silently dropping them.
  4. Like (2), but raise an error instead of a warning. Require the user to explicitly drop them with .drop(). This is probably the safest behavior.

I think (4) would be my preferred option. Some users would undoubtedly find this annoying, but the power-users for whom we are adding this feature would likely appreciate it.
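
For concreteness, here is a minimal sketch of what option (4) would look like from the user's side. The dataset layout, variable names ('lat', 'lon'), and store path are illustrative assumptions, not part of this PR:

import dask.array as da
import numpy as np
import xarray

# Hypothetical dataset: 'u' is chunked along 'time'; the 2D 'lat'/'lon'
# coordinates have no 'time' dimension, so they fall outside any
# region={'time': ...} selection.
ds = xarray.Dataset(
    {'u': (('time', 'y', 'x'), da.zeros((100, 4, 5), chunks=(50, 4, 5)))},
    coords={
        'lat': (('y', 'x'), np.zeros((4, 5))),
        'lon': (('y', 'x'), np.zeros((4, 5))),
    },
)

# Create the store; metadata and the non-dask lat/lon values are written once here.
path = 'example.zarr'
ds.to_zarr(path, compute=False)

# Under option (4), each writer explicitly drops the variables that do not
# overlap the region before writing its slab (drop_vars() is the
# non-deprecated spelling of .drop()).
region = {'time': slice(0, 50)}
ds.isel(region).drop_vars(['lat', 'lon']).to_zarr(path, region=region)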

Usage example

import xarray
import dask.array as da

ds = xarray.Dataset({'u': (('x',), da.arange(1000, chunks=100))})

# create the new zarr store, but don't write data
path = 'my-data.zarr'
ds.to_zarr(path, compute=False)

# look at the unwritten data
ds_opened = xarray.open_zarr(path)
print('Data before writing:', ds_opened.u.data[::100].compute())
# Data before writing: [  1 100   1 100 100   1   1   1   1   1]

# write out each slice (could be in separate processes)
for start in range(0, 1000, 100):
  selection = {'x': slice(start, start + 100)}
  ds.isel(selection).to_zarr(path, region=selection)

print('Data after writing:', ds_opened.u.data[::100].compute())
# Data after writing: [  0 100 200 300 400 500 600 700 800 900]
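
The loop above writes the regions one after another from a single process. Below is a hedged sketch of fanning the same writes out to separate processes, reusing ds and path from the example above; the worker function, pool size, and use of concurrent.futures are illustrative assumptions, not part of this PR:

from concurrent.futures import ProcessPoolExecutor

def write_block(start):
    # each worker writes one non-overlapping 100-element slice along x
    selection = {'x': slice(start, start + 100)}
    ds.isel(selection).to_zarr(path, region=selection)

if __name__ == '__main__':
    # on spawn-based platforms the store-creation code above should also
    # live under this guard; in a real job each worker would typically
    # build or load only its own slice of the data
    with ProcessPoolExecutor(max_workers=4) as pool:
        list(pool.map(write_block, range(0, 1000, 100)))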

@shoyer requested review from rabernat and jhamman on May 6, 2020
@rabernat (Contributor) commented May 8, 2020

Stephan, this seems like a great addition. Thanks for getting it started!

I'm curious how this interacts with dimension coordinates. Your example bypasses this. But what if dimension coordinates are present? How do we handle alignment issues? For example, what if I call ds.to_zarr(path, region=selection), but the dimension coordinates of ds don't align with the dimension coordinates of the store at path?

  1. Officially document that the compute argument only controls writing array values, not metadata (at least for zarr).

👍

4. Like (2), but raise an error instead of a warning. Require the user to explicitly drop them with .drop(). This is probably the safest behavior.

👍

I think only advanced users will want to use this feature.

@shoyer (Member, Author) commented May 9, 2020

I'm curious how this interacts with dimension coordinates. Your example bypasses this. But what if dimension coordinates are present? How do we handle alignment issues? For example, what if I call ds.to_zarr(path, region=selection), but the dimension coordinates of ds don't align with the dimension coordinates of the store at path?

It’s entirely unsafe. Currently the coordinates would be overridden with the new values, which is consistent with how to_netcdf() with mode='a' works.

This is probably another good reason for requiring users to explicitly drop variables that don’t include a dimension in the selected region, because at least in that case there can be no user expectations about alignment with coordinates that don’t exist.

In the long term, it might make sense to make both to_netcdf and to_zarr check coordinates for alignment by default, but we wouldn't want that in all cases, because sometimes users really do want to update variables.
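
As a sketch of the kind of manual safety check this implies (the helper name and the comparison against the on-disk coordinates are illustrative, not something this PR adds):

import numpy as np
import xarray

def check_region_alignment(ds_slice, path, region):
    # Compare the dimension coordinates of the slice being written against
    # the corresponding values already stored on disk.
    on_disk = xarray.open_zarr(path)
    for dim, idx in region.items():
        if dim in on_disk.coords and dim in ds_slice.coords:
            stored = on_disk[dim].isel({dim: idx}).values
            if not np.array_equal(stored, ds_slice[dim].values):
                raise ValueError(f"coordinate {dim!r} does not align with the store")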

@nbren12 (Contributor) commented May 12, 2020

@rabernat pointed this PR out to me, and this is great progress towards allowing more database-like CRUD operations on zarr datasets. A similar neat feature would be to read xarray datasets from regions of zarr groups w/o dask arrays.

@rabernat (Contributor) commented May 12, 2020

A similar neat feature would be to read xarray datasets from regions of zarr groups w/o dask arrays.

@nbren12 - this has always been supported. Just call open_zarr(..., chunks=False) and then subset using sel / isel.
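
A minimal sketch of that suggestion, reusing the store from the usage example above (chunks=None is used here, which in current xarray loads variables lazily as plain arrays rather than dask arrays; the slice is illustrative):

import xarray

# open without dask, then pull out just the region of interest
ds = xarray.open_zarr('my-data.zarr', chunks=None)
subset = ds.isel(x=slice(100, 200))
print(subset['u'].values)  # data for this region is only read here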

@nbren12 (Contributor) commented May 13, 2020

@rabernat I learn something new every day. Sorry for cluttering up this PR with my ignorance, haha.

@shoyer marked this pull request as ready for review on June 18, 2020
@shoyer (Member, Author) commented Jun 18, 2020

I've added error checking, tests, and documentation, so this is ready for review now!

Take a look here for a rendered version of the new docs section:
https://xray--4035.org.readthedocs.build/en/4035/io.html#appending-to-existing-zarr-stores

@rabernat added the topic-zarr (Related to zarr storage library) label on Jun 30, 2020
@zflamig commented Jul 9, 2020

This looks nice. Is there any thought on whether this would work with functions as a service (GCP Cloud Functions, AWS Lambda, etc.) for supporting parallel transformation from netCDF to Zarr?

@shoyer (Member, Author) commented Jul 9, 2020

This looks nice. Is there any thought on whether this would work with functions as a service (GCP Cloud Functions, AWS Lambda, etc.) for supporting parallel transformation from netCDF to Zarr?

I haven't used functions as a service before, but yes, I imagine this might be useful for that sort of thing. As long as you can figure out the structure of the overall Zarr dataset ahead of time, you could use region to fill out different parts entirely independently.
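
A hedged sketch of what such a serverless worker could look like, where each invocation converts one netCDF file into one region of a pre-created Zarr store. The event fields, file naming, chunk size, and the assumption that every variable in the inputs has a time dimension are all illustrative:

import xarray

CHUNK = 100  # records per input file, fixed when the target store is created

def handler(event, context=None):
    index = event['index']  # which input file this invocation owns
    ds = xarray.open_dataset(f'input-{index}.nc')
    # the target store must already exist with the full shape
    # (see the "Creating empty Zarr stores" discussion above)
    region = {'time': slice(index * CHUNK, (index + 1) * CHUNK)}
    ds.to_zarr('target.zarr', region=region)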

@rabernat (Contributor) commented Jul 10, 2020 via email

@tomdurrant commented

This is a very desirable feature for us. We have been using this branch in development, and it is working great for our use case. We are reluctant to put it into production until it is merged and released - is there any expected timeline for that to occur?

@shoyer (Member, Author) commented Oct 19, 2020

I just fixed a race condition with writing attributes. Let me spend a little bit of time responding to Ryan's review, and then I think we can submit it.

@shoyer (Member, Author) commented Oct 19, 2020

But yes, we've also been successfully using this for parallel writes for a few months now (aside from the race condition).

@shoyer (Member, Author) commented Oct 20, 2020

OK, I think this is ready for a final review.

@shoyer (Member, Author) commented Oct 24, 2020

Anyone else want to take a look at this?

@dcherian (Contributor) left a comment


I only looked at the docs, and found some minor things.

@shoyer (Member, Author) commented Oct 29, 2020

If there are no additional reviews or objections, I will merge this tomorrow.

@rafa-guedes (Contributor) commented

@shoyer thanks for implementing this, it is going to be very useful. I am trying to write the dataset below:

dsregion:

<xarray.Dataset>
Dimensions:    (latitude: 2041, longitude: 4320, time: 31)
Coordinates:
  * latitude   (latitude) float32 -80.0 -79.916664 -79.833336 ... 89.916664 90.0
  * time       (time) datetime64[ns] 2008-10-01T12:00:00 ... 2008-10-31T12:00:00
  * longitude  (longitude) float32 -180.0 -179.91667 ... 179.83333 179.91667
Data variables:
    vo         (time, latitude, longitude) float32 dask.array<chunksize=(30, 510, 1080), meta=np.ndarray>
    uo         (time, latitude, longitude) float32 dask.array<chunksize=(30, 510, 1080), meta=np.ndarray>
    sst        (time, latitude, longitude) float32 dask.array<chunksize=(30, 510, 1080), meta=np.ndarray>
    ssh        (time, latitude, longitude) float32 dask.array<chunksize=(30, 510, 1080), meta=np.ndarray>

As a region of this other dataset:

dset:

<xarray.Dataset>
Dimensions:    (latitude: 2041, longitude: 4320, time: 9490)
Coordinates:
  * latitude   (latitude) float32 -80.0 -79.916664 -79.833336 ... 89.916664 90.0
  * longitude  (longitude) float32 -180.0 -179.91667 ... 179.83333 179.91667
  * time       (time) datetime64[ns] 1993-01-01T12:00:00 ... 2018-12-25T12:00:00
Data variables:
    ssh        (time, latitude, longitude) float64 dask.array<chunksize=(30, 510, 1080), meta=np.ndarray>
    sst        (time, latitude, longitude) float64 dask.array<chunksize=(30, 510, 1080), meta=np.ndarray>
    uo         (time, latitude, longitude) float64 dask.array<chunksize=(30, 510, 1080), meta=np.ndarray>
    vo         (time, latitude, longitude) float64 dask.array<chunksize=(30, 510, 1080), meta=np.ndarray>

Using the following call:

dsregion.to_zarr(dset_url, region={"time": slice(5752, 5783)})

But I got stuck on the conditional below within xarray/backends/api.py:

   1347         non_matching_vars = [
   1348             k
   1349             for k, v in ds_to_append.variables.items()
   1350             if not set(region).intersection(v.dims)
   1351         ]
   1352         import ipdb; ipdb.set_trace()
-> 1353         if non_matching_vars:
   1354             raise ValueError(
   1355                 f"when setting `region` explicitly in to_zarr(), all "
   1356                 f"variables in the dataset to write must have at least "
   1357                 f"one dimension in common with the region's dimensions "
   1358                 f"{list(region.keys())}, but that is not "
   1359                 f"the case for some variables here. To drop these variables "
   1360                 f"from this dataset before exporting to zarr, write: "
   1361                 f".drop({non_matching_vars!r})"
   1362             )

Apparently because time is not a dimension of the coordinate variables ["longitude", "latitude"]:

ipdb> p non_matching_vars                                
['latitude', 'longitude']
ipdb> p set(region)                                      
{'time'}

Should this checking be performed for all variables, or only for data_variables?

@shoyer (Member, Author) commented Nov 4, 2020

Should this checking be performed for all variables, or only for data_variables?

I agree that this requirement is a little surprising. The error exists because otherwise you might be surprised that the array values for "latitude" and "longitude" get overridden, rather than being checked for consistency. At least if you have to explicitly drop these variables (with the suggested call to .drop()), it is clear that they will neither be checked nor overridden in the Zarr store.
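
Concretely, following the error message's suggestion, the call in the report above would become (dsregion and dset_url as defined there; drop_vars() is the non-deprecated spelling of .drop()):

# drop the coordinate variables that have no dimension in common with
# the region, then write the slab as before
dsregion.drop_vars(['latitude', 'longitude']).to_zarr(
    dset_url, region={'time': slice(5752, 5783)}
)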

Labels: topic-zarr (Related to zarr storage library)

8 participants