@@ -833,7 +833,9 @@ N-dimensional arrays.
Zarr has the ability to store arrays in a range of ways, including in memory,
in files, and in cloud-based object storage such as `Amazon S3`_ and
`Google Cloud Storage`_.
- Xarray's Zarr backend allows xarray to leverage these capabilities.
+ Xarray's Zarr backend allows xarray to leverage these capabilities, including
+ the ability to store and analyze datasets far too large to fit onto disk
+ (particularly :ref:`in combination with dask <dask>`).

.. warning::
@@ -845,7 +847,8 @@ metadata (attributes) describing the dataset dimensions and coordinates.
At this time, xarray can only open zarr datasets that have been written by
xarray. For implementation details, see :ref:`zarr_encoding`.

- To write a dataset with zarr, we use the :py:attr:`Dataset.to_zarr` method.
+ To write a dataset with zarr, we use the :py:meth:`Dataset.to_zarr` method.
+

To write to a local directory, we pass a path to a directory:

.. ipython:: python
@@ -869,39 +872,10 @@ To write to a local directory, we pass a path to a directory:
there.) If the directory does not exist, it will be created. If a zarr
store is already present at that path, an error will be raised, preventing it
from being overwritten. To override this behavior and overwrite an existing
- store, add ``mode='w'`` when invoking ``to_zarr``.
-
- It is also possible to append to an existing store. For that, set
- ``append_dim`` to the name of the dimension along which to append. ``mode``
- can be omitted as it will internally be set to ``'a'``.
-
- .. ipython:: python
-     :suppress:
-
-     !rm -rf path/to/directory.zarr
-
- .. ipython:: python
-
-     ds1 = xr.Dataset(
-         {"foo": (("x", "y", "t"), np.random.rand(4, 5, 2))},
-         coords={
-             "x": [10, 20, 30, 40],
-             "y": [1, 2, 3, 4, 5],
-             "t": pd.date_range("2001-01-01", periods=2),
-         },
-     )
-     ds1.to_zarr("path/to/directory.zarr")
-     ds2 = xr.Dataset(
-         {"foo": (("x", "y", "t"), np.random.rand(4, 5, 2))},
-         coords={
-             "x": [10, 20, 30, 40],
-             "y": [1, 2, 3, 4, 5],
-             "t": pd.date_range("2001-01-03", periods=2),
-         },
-     )
-     ds2.to_zarr("path/to/directory.zarr", append_dim="t")
+ store, add ``mode='w'`` when invoking :py:meth:`~Dataset.to_zarr`.
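+
+ For example, a minimal sketch (not executed here), reusing the ``ds`` written
+ above, that replaces the existing store in place:
+
+ .. ipython:: python
+     :verbatim:
+
+     ds.to_zarr("path/to/directory.zarr", mode="w")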
- To store variable length strings use ``dtype=object``.
+ To store variable length strings, convert them to object arrays first with
+ ``dtype=object``.
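+
+ For example, a sketch (not executed) with a hypothetical string variable and
+ an illustrative store path:
+
+ .. ipython:: python
+     :verbatim:
+
+     strings = np.array(["short", "a much longer string"], dtype=object)
+     ds_str = xr.Dataset({"s": ("x", strings)})
+     ds_str.to_zarr("path/to/strings.zarr", mode="w")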

To read back a zarr dataset that has been created this way, we use the
:py:func:`open_zarr` method:
@@ -987,6 +961,109 @@ Xarray can't perform consolidation on pre-existing zarr datasets. This should
be done directly from zarr, as described in the
`zarr docs <https://zarr.readthedocs.io/en/latest/tutorial.html#consolidating-metadata>`_.
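+
+ For example, a minimal sketch (not executed) of consolidating directly with
+ zarr, assuming a store already exists at the path shown:
+
+ .. ipython:: python
+     :verbatim:
+
+     import zarr
+
+     zarr.consolidate_metadata("path/to/directory.zarr")
+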
+ .. _io.zarr.appending:
+
+ Appending to existing Zarr stores
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ Xarray supports several ways of incrementally writing variables to a Zarr
+ store. These options are useful for scenarios when it is infeasible or
+ undesirable to write your entire dataset at once.
+
+ .. tip::
+
+     If you can load all of your data into a single ``Dataset`` using dask, a
+     single call to ``to_zarr()`` will write all of your data in parallel.
+
+ .. warning::
+
+     Alignment of coordinates is currently not checked when modifying an
+     existing Zarr store. It is up to the user to ensure that coordinates are
+     consistent.
+
+ To add or overwrite entire variables, simply call :py:meth:`~Dataset.to_zarr`
+ with ``mode='a'`` on a Dataset containing the new variables, passing in an
+ existing Zarr store or path to a Zarr store.
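+
+ For example, a minimal sketch (not executed) that adds a hypothetical new
+ variable ``bar`` to an existing store at an illustrative path:
+
+ .. ipython:: python
+     :verbatim:
+
+     ds_new = xr.Dataset({"bar": ("x", np.zeros(4))})
+     ds_new.to_zarr("path/to/directory.zarr", mode="a")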
+
+ To resize and then append values along an existing dimension in a store, set
+ ``append_dim``. This is a good option if data always arrives in a particular
+ order, e.g., for time-stepping a simulation:
+
+ .. ipython:: python
+     :suppress:
+
+     !rm -rf path/to/directory.zarr
+
+ .. ipython:: python
+
+     ds1 = xr.Dataset(
+         {"foo": (("x", "y", "t"), np.random.rand(4, 5, 2))},
+         coords={
+             "x": [10, 20, 30, 40],
+             "y": [1, 2, 3, 4, 5],
+             "t": pd.date_range("2001-01-01", periods=2),
+         },
+     )
+     ds1.to_zarr("path/to/directory.zarr")
+     ds2 = xr.Dataset(
+         {"foo": (("x", "y", "t"), np.random.rand(4, 5, 2))},
+         coords={
+             "x": [10, 20, 30, 40],
+             "y": [1, 2, 3, 4, 5],
+             "t": pd.date_range("2001-01-03", periods=2),
+         },
+     )
+     ds2.to_zarr("path/to/directory.zarr", append_dim="t")
+
+ Finally, you can use ``region`` to write to limited regions of existing arrays
+ in an existing Zarr store. This is a good option for writing data in parallel
+ from independent processes.
+
+ To scale this up to writing large datasets, the first step is creating an
+ initial Zarr store without writing all of its array data. This can be done by
+ first creating a ``Dataset`` with dummy values stored in :ref:`dask <dask>`,
+ and then calling ``to_zarr`` with ``compute=False`` to write only metadata
+ (including ``attrs``) to Zarr:
+
+ .. ipython:: python
+     :suppress:
+
+     !rm -rf path/to/directory.zarr
+
+ .. ipython:: python
+
+     import dask.array
+
+     # The values of this dask array are entirely irrelevant; only the dtype,
+     # shape and chunks are used
+     dummies = dask.array.zeros(30, chunks=10)
+     ds = xr.Dataset({"foo": ("x", dummies)})
+     path = "path/to/directory.zarr"
+     # Now we write the metadata without computing any array values
+     ds.to_zarr(path, compute=False, consolidated=True)
+
+ Now, a Zarr store with the correct variable shapes and attributes exists that
+ can be filled out by subsequent calls to ``to_zarr``. The ``region`` keyword
+ provides a mapping from dimension names to Python ``slice`` objects indicating
+ where the data should be written (in index space, not coordinate space), e.g.,
+
+ .. ipython:: python
+
+     # For convenience, we'll slice a single dataset, but in the real use-case
+     # we would create them separately, possibly even from separate processes.
+     ds = xr.Dataset({"foo": ("x", np.arange(30))})
+     ds.isel(x=slice(0, 10)).to_zarr(path, region={"x": slice(0, 10)})
+     ds.isel(x=slice(10, 20)).to_zarr(path, region={"x": slice(10, 20)})
+     ds.isel(x=slice(20, 30)).to_zarr(path, region={"x": slice(20, 30)})
+
+ Concurrent writes with ``region`` are safe as long as they modify distinct
+ chunks in the underlying Zarr arrays (or use an appropriate ``lock``). In the
+ example above, each ``slice`` covers exactly one of the 10-element chunks
+ chosen when the store was initialized, so the three writes touch disjoint
+ chunks.
+
+ As a safety check to make it harder to inadvertently override existing values,
+ if you set ``region`` then *all* variables included in a Dataset must have
+ dimensions included in ``region``. Other variables (typically coordinates)
+ need to be explicitly dropped and/or written in separate calls to ``to_zarr``
+ with ``mode='a'``.
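+
+ For example, a sketch (not executed) that drops a hypothetical coordinate
+ ``y`` lacking the ``x`` dimension before writing a region:
+
+ .. ipython:: python
+     :verbatim:
+
+     ds.drop_vars("y").to_zarr(path, region={"x": slice(20, 30)})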
+

.. _io.cfgrib:

.. ipython:: python