Commit c7cb9f7

Anu-Ra-g, pre-commit-ci[bot], keewis, and dcherian authored
added kerchunk as backend documentation (#9163)
* added kerchunk as backend documentation
* Update io.rst
* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci
* updated the io.rst file
* updated io.rst
* modified the combined.json file
* Apply suggestions from code review
* added new references
* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci
* fixed some typos

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Justus Magin <[email protected]>
Co-authored-by: Deepak Cherian <[email protected]>
1 parent 9426095 commit c7cb9f7

3 files changed: +84 -0 lines changed

ci/requirements/doc.yml

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ dependencies:
   - bottleneck
   - cartopy
   - cfgrib
+  - kerchunk
   - dask-core>=2022.1
   - dask-expr
   - hypothesis>=6.75.8

doc/combined.json

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+{
+    "version": 1,
+    "refs": {
+        ".zgroup": "{\"zarr_format\":2}",
+        "foo/.zarray": "{\"chunks\":[4,5],\"compressor\":null,\"dtype\":\"<f8\",\"fill_value\":\"NaN\",\"filters\":null,\"order\":\"C\",\"shape\":[4,5],\"zarr_format\":2}",
+        "foo/.zattrs": "{\"_ARRAY_DIMENSIONS\":[\"x\",\"y\"],\"coordinates\":\"z\"}",
+        "foo/0.0": [
+            "saved_on_disk.h5",
+            8192,
+            160
+        ],
+        "x/.zarray": "{\"chunks\":[4],\"compressor\":null,\"dtype\":\"<i8\",\"fill_value\":null,\"filters\":null,\"order\":\"C\",\"shape\":[4],\"zarr_format\":2}",
+        "x/.zattrs": "{\"_ARRAY_DIMENSIONS\":[\"x\"]}",
+        "x/0": [
+            "saved_on_disk.h5",
+            8352,
+            32
+        ],
+        "y/.zarray": "{\"chunks\":[5],\"compressor\":null,\"dtype\":\"<i8\",\"fill_value\":null,\"filters\":null,\"order\":\"C\",\"shape\":[5],\"zarr_format\":2}",
+        "y/.zattrs": "{\"_ARRAY_DIMENSIONS\":[\"y\"],\"calendar\":\"proleptic_gregorian\",\"units\":\"days since 2000-01-01 00:00:00\"}",
+        "y/0": [
+            "saved_on_disk.h5",
+            8384,
+            40
+        ],
+        "z/.zarray": "{\"chunks\":[4],\"compressor\":null,\"dtype\":\"|O\",\"fill_value\":null,\"filters\":[{\"allow_nan\":true,\"check_circular\":true,\"encoding\":\"utf-8\",\"ensure_ascii\":true,\"id\":\"json2\",\"indent\":null,\"separators\":[\",\",\":\"],\"skipkeys\":false,\"sort_keys\":true,\"strict\":true}],\"order\":\"C\",\"shape\":[4],\"zarr_format\":2}",
+        "z/0": "[\"a\",\"b\",\"c\",\"d\",\"|O\",[4]]",
+        "z/.zattrs": "{\"_ARRAY_DIMENSIONS\":[\"x\"]}"
+    }
+}
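The entries in ``combined.json`` come in two shapes: a string holds data or metadata inline, while a ``[path, offset, length]`` list points at a byte range in the original file. The sketch below is not part of the commit; it uses only the standard library and a trimmed copy of the references above. The helper names ``classify`` and ``expected_chunk_nbytes`` are hypothetical. It labels each entry and checks that the recorded length matches the uncompressed chunk size, a check that only holds here because ``compressor`` and ``filters`` are ``null`` for these arrays.

```python
import json
from math import prod

# Trimmed copy of the references in doc/combined.json (attribute entries omitted).
REFS = {
    "foo/.zarray": json.dumps({"chunks": [4, 5], "compressor": None, "dtype": "<f8", "shape": [4, 5]}),
    "foo/0.0": ["saved_on_disk.h5", 8192, 160],
    "x/.zarray": json.dumps({"chunks": [4], "compressor": None, "dtype": "<i8", "shape": [4]}),
    "x/0": ["saved_on_disk.h5", 8352, 32],
    "z/0": '["a","b","c","d","|O",[4]]',
}


def classify(value):
    """Label a reference entry (hypothetical helper, for illustration only)."""
    if isinstance(value, str):
        return "inline"  # data or metadata stored directly in the JSON
    if len(value) == 1:
        return "whole-file"  # [path]: the chunk is the entire file
    return "byte-range"  # [path, offset, length]


def expected_chunk_nbytes(zarray_json):
    """Uncompressed size of one chunk, for simple dtypes such as '<f8' or '<i8'."""
    meta = json.loads(zarray_json)
    itemsize = int(meta["dtype"][2:])  # '<f8' -> 8 bytes per element
    return prod(meta["chunks"]) * itemsize


kinds = {key: classify(value) for key, value in REFS.items()}
foo_nbytes = expected_chunk_nbytes(REFS["foo/.zarray"])
x_nbytes = expected_chunk_nbytes(REFS["x/.zarray"])
```

For ``foo`` the chunk holds 4 × 5 float64 values, which agrees with the length of 160 bytes recorded in ``foo/0.0``; ``x`` holds four int64 values and a length of 32.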

doc/user-guide/io.rst

Lines changed: 53 additions & 0 deletions
@@ -1060,6 +1060,59 @@ reads. Because this fall-back option is so much slower, xarray issues a
 instead of falling back to try reading non-consolidated metadata.


+.. _io.kerchunk:
+
+Kerchunk
+--------
+
+`Kerchunk <https://fsspec.github.io/kerchunk/index.html>`_ is a Python library
+that allows you to access chunked and compressed data formats (such as NetCDF3, NetCDF4, HDF5, GRIB2, TIFF & FITS),
+many of which are the primary formats of large data archives, by viewing the
+whole archive as an ephemeral `Zarr`_ dataset, which allows for parallel, chunk-specific access.
+
+Instead of creating a new copy of the dataset in the Zarr spec/format or
+downloading the files locally, Kerchunk reads through the data archive, extracts the
+byte range and compression information of each chunk, and saves them as a ``reference``.
+These references are then stored as ``json`` or ``parquet`` files (the latter being more efficient)
+for later use. You can view some of these stored in the `references`
+directory `here <https://github.com/pydata/xarray-data>`_.
+
+
+.. note::
+    These references follow this `specification <https://fsspec.github.io/kerchunk/spec.html>`_.
+    Packages like `kerchunk`_ and `virtualizarr <https://github.com/zarr-developers/VirtualiZarr>`_
+    help in creating and reading these references.
+
+
+Reading these data archives becomes really easy with ``kerchunk`` in combination
+with ``xarray``, especially when these archives are large in size. A single combined
+reference can refer to thousands of the original data files present in these archives.
+You can view the whole dataset from this `combined reference` using the above packages.
+
+The following example shows how to open a combined reference generated from a ``.hdf`` file stored locally.
+
+.. ipython:: python
+
+    storage_options = {
+        "target_protocol": "file",
+    }
+
+    # add the `remote_protocol` key in `storage_options` if you're accessing a file remotely
+
+    ds1 = xr.open_dataset(
+        "./combined.json",
+        engine="kerchunk",
+        storage_options=storage_options,
+    )
+
+    ds1
+
+.. note::
+
+    You can refer to the `project pythia kerchunk cookbook <https://projectpythia.org/kerchunk-cookbook/README.html>`_
+    and the `pangeo guide on kerchunk <https://guide.cloudnativegeo.org/kerchunk/intro.html>`_ for more information.
+
+
 .. _io.iris:

 Iris
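The Kerchunk section added in this diff describes a reference as the byte range of a chunk in the original file. As a rough illustration only, resolving a ``[path, offset, length]`` entry can be sketched with the standard library: seek to the offset, read the given number of bytes, and decode them. The file ``demo.bin`` and the helper ``read_reference`` are hypothetical stand-ins. The real machinery behind the ``kerchunk`` engine also handles remote range requests, compression, and filters, none of which are modelled here.

```python
import os
import struct
import tempfile


def read_reference(ref, base_dir="."):
    """Return the raw bytes that a ``[path, offset, length]`` reference points to."""
    path, offset, length = ref
    with open(os.path.join(base_dir, path), "rb") as f:
        f.seek(offset)
        return f.read(length)


# Build a stand-in "archive": 16 header bytes followed by four little-endian int64 values.
values = (10, 20, 30, 40)
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "demo.bin"), "wb") as f:
    f.write(b"\x00" * 16)
    f.write(struct.pack("<4q", *values))

# Same shape as the "x/0" entry in combined.json: four int64 values, 32 bytes.
ref = ["demo.bin", 16, 32]
raw = read_reference(ref, base_dir=tmpdir)
decoded = struct.unpack("<4q", raw)
print(decoded)
```

Because the reference stores only a location, the original file is never copied or rewritten; the reader fetches just the bytes a chunk needs.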
