Commit b112aa2

Add support in the "zarr" backend for reading NCZarr data (#6420)
* add support for NCZarr
* restore original format
* add test_nczarr
* better comment
* test reading with zarr
* decode zarray
* use public store and test nczarr only
* restore tests
* install netcdf-c fixing bug
* add env
* fix ci
* try build netcdf-c on windows
* fix typo
* install netcdf-c first
* install netcdf-c dep with conda
* fix ci
* try win env again
* fix Nan in tests
* edit zarray
* loop over all variables
* edit Nan in zattrs and zarray
* check path exists
* must use netcdf-c>=4.8.1
* skip 4.8.1 and Windows
* revisions
* better testing
* revisions
* add what's new
* update docs
* [skip ci] Mention netCDF and GDAL in user-guide
* [skip ci] reword
1 parent b4c943e commit b112aa2

5 files changed: +110 -29 lines changed

doc/internals/zarr-encoding-spec.rst

Lines changed: 4 additions & 2 deletions
@@ -32,9 +32,11 @@ the variable dimension names and then removed from the attributes dictionary
 returned to the user.
 
 Because of these choices, Xarray cannot read arbitrary array data, but only
-Zarr data with valid ``_ARRAY_DIMENSIONS`` attributes on each array.
+Zarr data with valid ``_ARRAY_DIMENSIONS`` or
+`NCZarr <https://docs.unidata.ucar.edu/nug/current/nczarr_head.html>`_ attributes
+on each array (NCZarr dimension names are defined in the ``.zarray`` file).
 
-After decoding the ``_ARRAY_DIMENSIONS`` attribute and assigning the variable
+After decoding the ``_ARRAY_DIMENSIONS`` or NCZarr attribute and assigning the variable
 dimensions, Xarray proceeds to [optionally] decode each variable using its
 standard CF decoding machinery used for NetCDF data (see :py:func:`decode_cf`).
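
The spec change above says NCZarr dimension names live in the ``.zarray`` file. A minimal sketch of what that lookup could look like, using only the standard library: the helper name `nczarr_dim_names` and the trimmed-down metadata are illustrative assumptions, while the `_NCZARR_ARRAY` and `dimrefs` keys are the ones read by this commit.

```python
import json
import posixpath


def nczarr_dim_names(zarray_text):
    """Return dimension names from the text of an NCZarr ``.zarray`` file."""
    zarray = json.loads(zarray_text)
    # NCZarr stores fully qualified names such as "/time"; keep the last part.
    return [posixpath.basename(ref) for ref in zarray["_NCZARR_ARRAY"]["dimrefs"]]


# Hypothetical, trimmed-down .zarray content for a 2-D variable.
example = json.dumps(
    {
        "shape": [4, 3],
        "chunks": [4, 3],
        "_NCZARR_ARRAY": {"dimrefs": ["/time", "/x"]},
    }
)
print(nczarr_dim_names(example))  # ['time', 'x']
```

The commit itself uses `os.path.basename`; `posixpath` is used in this sketch only because the names are always slash-separated.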

doc/user-guide/io.rst

Lines changed: 10 additions & 2 deletions
@@ -518,8 +518,11 @@ the ability to store and analyze datasets far too large fit onto disk
 
 Xarray can't open just any zarr dataset, because xarray requires special
 metadata (attributes) describing the dataset dimensions and coordinates.
-At this time, xarray can only open zarr datasets that have been written by
-xarray. For implementation details, see :ref:`zarr_encoding`.
+At this time, xarray can only open zarr datasets with these special attributes,
+such as zarr datasets written by xarray,
+`netCDF <https://docs.unidata.ucar.edu/nug/current/nczarr_head.html>`_,
+or `GDAL <https://gdal.org/drivers/raster/zarr.html>`_.
+For implementation details, see :ref:`zarr_encoding`.
 
 To write a dataset with zarr, we use the :py:meth:`Dataset.to_zarr` method.
 
@@ -548,6 +551,11 @@ store is already present at that path, an error will be raised, preventing it
 from being overwritten. To override this behavior and overwrite an existing
 store, add ``mode='w'`` when invoking :py:meth:`~Dataset.to_zarr`.
 
+.. note::
+
+    xarray does not write NCZarr attributes. Therefore, NCZarr data must be
+    opened in read-only mode.
+
 To store variable length strings, convert them to object arrays first with
 ``dtype=object``.

doc/whats-new.rst

Lines changed: 2 additions & 0 deletions
@@ -22,6 +22,8 @@ v2022.03.1 (unreleased)
 New Features
 ~~~~~~~~~~~~
 
+- The `zarr` backend is now able to read NCZarr.
+  By `Mattia Almansi <https://github.com/malmans2>`_.
 - Add a weighted ``quantile`` method to :py:class:`~core.weighted.DatasetWeighted` and
   :py:class:`~core.weighted.DataArrayWeighted` (:pull:`6059`). By
   `Christian Jauvin <https://github.com/cjauvin>`_ and `David Huard <https://github.com/huard>`_.

xarray/backends/zarr.py

Lines changed: 45 additions & 25 deletions
@@ -1,3 +1,4 @@
+import json
 import os
 import warnings

@@ -178,19 +179,37 @@ def _determine_zarr_chunks(enc_chunks, var_chunks, ndim, name, safe_chunks):
     raise AssertionError("We should never get here. Function logic must be wrong.")
 
 
-def _get_zarr_dims_and_attrs(zarr_obj, dimension_key):
+def _get_zarr_dims_and_attrs(zarr_obj, dimension_key, try_nczarr):
     # Zarr arrays do not have dimensions. To get around this problem, we add
     # an attribute that specifies the dimension. We have to hide this attribute
     # when we send the attributes to the user.
     # zarr_obj can be either a zarr group or zarr array
     try:
+        # Xarray-Zarr
         dimensions = zarr_obj.attrs[dimension_key]
-    except KeyError:
-        raise KeyError(
-            f"Zarr object is missing the attribute `{dimension_key}`, which is "
-            "required for xarray to determine variable dimensions."
-        )
-    attributes = HiddenKeyDict(zarr_obj.attrs, [dimension_key])
+    except KeyError as e:
+        if not try_nczarr:
+            raise KeyError(
+                f"Zarr object is missing the attribute `{dimension_key}`, which is "
+                "required for xarray to determine variable dimensions."
+            ) from e
+
+        # NCZarr defines dimensions through metadata in .zarray
+        zarray_path = os.path.join(zarr_obj.path, ".zarray")
+        zarray = json.loads(zarr_obj.store[zarray_path])
+        try:
+            # NCZarr uses Fully Qualified Names
+            dimensions = [
+                os.path.basename(dim) for dim in zarray["_NCZARR_ARRAY"]["dimrefs"]
+            ]
+        except KeyError as e:
+            raise KeyError(
+                f"Zarr object is missing the attribute `{dimension_key}` and the NCZarr metadata, "
+                "which are required for xarray to determine variable dimensions."
+            ) from e
+
+    nc_attrs = [attr for attr in zarr_obj.attrs if attr.startswith("_NC")]
+    attributes = HiddenKeyDict(zarr_obj.attrs, [dimension_key] + nc_attrs)
     return dimensions, attributes
 
 
@@ -409,7 +428,10 @@ def ds(self):
 
     def open_store_variable(self, name, zarr_array):
         data = indexing.LazilyIndexedArray(ZarrArrayWrapper(name, self))
-        dimensions, attributes = _get_zarr_dims_and_attrs(zarr_array, DIMENSION_KEY)
+        try_nczarr = self._mode == "r"
+        dimensions, attributes = _get_zarr_dims_and_attrs(
+            zarr_array, DIMENSION_KEY, try_nczarr
+        )
         attributes = dict(attributes)
         encoding = {
             "chunks": zarr_array.chunks,
@@ -430,26 +452,24 @@ def get_variables(self):
         )
 
     def get_attrs(self):
-        return dict(self.zarr_group.attrs.asdict())
+        return {
+            k: v
+            for k, v in self.zarr_group.attrs.asdict().items()
+            if not k.startswith("_NC")
+        }
 
     def get_dimensions(self):
+        try_nczarr = self._mode == "r"
         dimensions = {}
         for k, v in self.zarr_group.arrays():
-            try:
-                for d, s in zip(v.attrs[DIMENSION_KEY], v.shape):
-                    if d in dimensions and dimensions[d] != s:
-                        raise ValueError(
-                            f"found conflicting lengths for dimension {d} "
-                            f"({s} != {dimensions[d]})"
-                        )
-                    dimensions[d] = s
-
-            except KeyError:
-                raise KeyError(
-                    f"Zarr object is missing the attribute `{DIMENSION_KEY}`, "
-                    "which is required for xarray to determine "
-                    "variable dimensions."
-                )
+            dim_names, _ = _get_zarr_dims_and_attrs(v, DIMENSION_KEY, try_nczarr)
+            for d, s in zip(dim_names, v.shape):
+                if d in dimensions and dimensions[d] != s:
+                    raise ValueError(
+                        f"found conflicting lengths for dimension {d} "
+                        f"({s} != {dimensions[d]})"
+                    )
+                dimensions[d] = s
         return dimensions
 
     def set_dimensions(self, variables, unlimited_dims=None):
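
The refactored `get_dimensions` keeps the same consistency check while delegating the name lookup. A standalone sketch of that check, assuming the per-array dimension names have already been resolved: `collect_dimensions` and the `{name: (dim_names, shape)}` input layout are hypothetical, chosen only so the loop can run without zarr.

```python
def collect_dimensions(arrays):
    """Map dimension name -> length, given {name: (dim_names, shape)}."""
    dimensions = {}
    for dim_names, shape in arrays.values():
        for d, s in zip(dim_names, shape):
            # A dimension shared by several arrays must have one length.
            if d in dimensions and dimensions[d] != s:
                raise ValueError(
                    f"found conflicting lengths for dimension {d} "
                    f"({s} != {dimensions[d]})"
                )
            dimensions[d] = s
    return dimensions


# Illustrative arrays: "temp" is 2-D, "time" is its 1-D coordinate.
arrays = {"temp": (["time", "x"], (4, 3)), "time": (["time"], (4,))}
print(collect_dimensions(arrays))  # {'time': 4, 'x': 3}
```
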
@@ -645,7 +665,7 @@ def open_zarr(
 
     The `store` object should be a valid store for a Zarr group. `store`
     variables must contain dimension metadata encoded in the
-    `_ARRAY_DIMENSIONS` attribute.
+    `_ARRAY_DIMENSIONS` attribute or must have NCZarr format.
 
     Parameters
     ----------

xarray/tests/test_backends.py

Lines changed: 49 additions & 0 deletions
@@ -4,6 +4,7 @@
 import math
 import os.path
 import pickle
+import platform
 import re
 import shutil
 import sys
@@ -5434,3 +5435,51 @@ def test_write_file_from_np_str(str_type, tmpdir) -> None:
     txr = tdf.to_xarray()
 
     txr.to_netcdf(tmpdir.join("test.nc"))
+
+
+@requires_zarr
+@requires_netCDF4
+class TestNCZarr:
+    @staticmethod
+    def _create_nczarr(filename):
+        netcdfc_version = Version(nc4.getlibversion().split()[0])
+        if netcdfc_version < Version("4.8.1"):
+            pytest.skip("requires netcdf-c>=4.8.1")
+        if (platform.system() == "Windows") and (netcdfc_version == Version("4.8.1")):
+            # Bug in netcdf-c==4.8.1 (typo: Nan instead of NaN)
+            # https://github.com/Unidata/netcdf-c/issues/2265
+            pytest.skip("netcdf-c==4.8.1 has issues on Windows")
+
+        ds = create_test_data()
+        # Drop dim3: netcdf-c does not support dtype='<U1'
+        # https://github.com/Unidata/netcdf-c/issues/2259
+        ds = ds.drop_vars("dim3")
+
+        # netcdf-c>4.8.1 will add _ARRAY_DIMENSIONS by default
+        mode = "nczarr" if netcdfc_version == Version("4.8.1") else "nczarr,noxarray"
+        ds.to_netcdf(f"file://{filename}#mode={mode}")
+        return ds
+
+    def test_open_nczarr(self):
+        with create_tmp_file(suffix=".zarr") as tmp:
+            expected = self._create_nczarr(tmp)
+            actual = xr.open_zarr(tmp, consolidated=False)
+            assert_identical(expected, actual)
+
+    def test_overwriting_nczarr(self):
+        with create_tmp_file(suffix=".zarr") as tmp:
+            ds = self._create_nczarr(tmp)
+            expected = ds[["var1"]]
+            expected.to_zarr(tmp, mode="w")
+            actual = xr.open_zarr(tmp, consolidated=False)
+            assert_identical(expected, actual)
+
+    @pytest.mark.parametrize("mode", ["a", "r+"])
+    @pytest.mark.filterwarnings("ignore:.*non-consolidated metadata.*")
+    def test_raise_writing_to_nczarr(self, mode):
+        with create_tmp_file(suffix=".zarr") as tmp:
+            ds = self._create_nczarr(tmp)
+            with pytest.raises(
+                KeyError, match="missing the attribute `_ARRAY_DIMENSIONS`,"
+            ):
+                ds.to_zarr(tmp, mode=mode)
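
The test helper picks a netCDF-C URL mode so that the store is written without `_ARRAY_DIMENSIONS`, which forces the new NCZarr code path. A small sketch of that selection, with the version passed as a plain tuple rather than the `Version` object the test uses: `nczarr_url` is a hypothetical helper, and the behaviour of each mode is taken from the comment in the test, not verified here.

```python
def nczarr_url(path, netcdfc_version):
    """Build a netCDF-C URL that requests NCZarr output without xarray attributes."""
    # Per the test comment: netcdf-c releases after 4.8.1 add _ARRAY_DIMENSIONS
    # by default, so "noxarray" is requested for them.
    mode = "nczarr" if netcdfc_version == (4, 8, 1) else "nczarr,noxarray"
    return f"file://{path}#mode={mode}"


print(nczarr_url("/tmp/example.zarr", (4, 8, 1)))
print(nczarr_url("/tmp/example.zarr", (4, 9, 0)))
```
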
