Commit 599b70a (2 parents: d0797f6 + 02e9661)

Merge branch 'master' into map_blocks_2

* master:
  Fix whats-new date :/
  Revert to dev version
  Release v0.13.0
  auto_combine deprecation to 0.14 (pydata#3314)
  Deprecation: groupby, resample default dim. (pydata#3313)
  Raise error if cmap is list of colors (pydata#3310)
  Refactor concat to use merge for non-concatenated variables (pydata#3239)
  Honor `keep_attrs` in DataArray.quantile (pydata#3305)
  Fix DataArray api doc (pydata#3309)
  Accept int value in head, thin and tail (pydata#3298)
  ignore h5py 2.10.0 warnings and fix invalid_netcdf warning test. (pydata#3301)
  Update why-xarray.rst with clearer expression (pydata#3307)
  Compat and encoding deprecation to 0.14 (pydata#3294)
  Remove deprecated concat kwargs. (pydata#3288)
  allow np-array levels and colors in 2D plots (pydata#3295)
  Remove some deprecations (pydata#3292)
  Make argmin/max work lazy with dask (pydata#3244)
  Add head, tail and thin methods (pydata#3278)
  Updater to testing environment name (pydata#3253)

28 files changed: +928, -537 lines

doc/api.rst

Lines changed: 7 additions & 0 deletions
@@ -118,6 +118,9 @@ Indexing
    Dataset.loc
    Dataset.isel
    Dataset.sel
+   Dataset.head
+   Dataset.tail
+   Dataset.thin
    Dataset.squeeze
    Dataset.interp
    Dataset.interp_like
@@ -280,6 +283,9 @@ Indexing
    DataArray.loc
    DataArray.isel
    DataArray.sel
+   DataArray.head
+   DataArray.tail
+   DataArray.thin
    DataArray.squeeze
    DataArray.interp
    DataArray.interp_like
@@ -605,6 +611,7 @@ Plotting
 
    Dataset.plot
    DataArray.plot
+   Dataset.plot.scatter
    plot.plot
    plot.contourf
    plot.contour
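
The ``head``, ``tail`` and ``thin`` methods added to the API reference above select values from the start, from the end, or at regular intervals along a named dimension. A minimal sketch of how they might be used (the variable and dimension names here are illustrative, not taken from the diff)::

    import numpy as np
    import xarray as xr

    ds = xr.Dataset({"a": ("time", np.arange(100))})
    ds.head(time=5)    # first 5 values along "time"
    ds.tail(time=5)    # last 5 values along "time"
    ds.thin(time=10)   # every 10th value along "time"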

doc/dask.rst

Lines changed: 4 additions & 3 deletions
@@ -75,13 +75,14 @@ entirely equivalent to opening a dataset using ``open_dataset`` and then
 chunking the data using the ``chunk`` method, e.g.,
 ``xr.open_dataset('example-data.nc').chunk({'time': 10})``.
 
-To open multiple files simultaneously, use :py:func:`~xarray.open_mfdataset`::
+To open multiple files simultaneously in parallel using Dask delayed,
+use :py:func:`~xarray.open_mfdataset`::
 
-    xr.open_mfdataset('my/files/*.nc')
+    xr.open_mfdataset('my/files/*.nc', parallel=True)
 
 This function will automatically concatenate and merge dataset into one in
 the simple cases that it understands (see :py:func:`~xarray.auto_combine`
-for the full disclaimer). By default, ``open_mfdataset`` will chunk each
+for the full disclaimer). By default, :py:func:`~xarray.open_mfdataset` will chunk each
 netCDF file into a single Dask array; again, supply the ``chunks`` argument to
 control the size of the resulting Dask arrays. In more complex cases, you can
 open each file individually using ``open_dataset`` and merge the result, as
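
As a rough sketch of the change above, parallel opening combined with explicit chunking might look like the following (the paths and chunk sizes are hypothetical)::

    import xarray as xr

    # parallel=True opens the component files in parallel via dask.delayed
    ds = xr.open_mfdataset('my/files/*.nc', parallel=True, chunks={'time': 10})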

doc/io.rst

Lines changed: 147 additions & 97 deletions
@@ -99,7 +99,9 @@ netCDF
 The recommended way to store xarray data structures is `netCDF`__, which
 is a binary file format for self-described datasets that originated
 in the geosciences. xarray is based on the netCDF data model, so netCDF files
-on disk directly correspond to :py:class:`~xarray.Dataset` objects.
+on disk directly correspond to :py:class:`~xarray.Dataset` objects (more accurately,
+a group in a netCDF file directly corresponds to a :py:class:`~xarray.Dataset` object.
+See :ref:`io.netcdf_groups` for more.)
 
 NetCDF is supported on almost all platforms, and parsers exist
 for the vast majority of scientific programming languages. Recent versions of
@@ -121,7 +123,7 @@ read/write netCDF V4 files and use the compression options described below).
 __ https://github.com/Unidata/netcdf4-python
 
 We can save a Dataset to disk using the
-:py:attr:`Dataset.to_netcdf <xarray.Dataset.to_netcdf>` method:
+:py:meth:`~Dataset.to_netcdf` method:
 
 .. ipython:: python
@@ -147,19 +149,6 @@ convert the ``DataArray`` to a ``Dataset`` before saving, and then convert back
 when loading, ensuring that the ``DataArray`` that is loaded is always exactly
 the same as the one that was saved.
 
-NetCDF groups are not supported as part of the
-:py:class:`~xarray.Dataset` data model. Instead, groups can be loaded
-individually as Dataset objects.
-To do so, pass a ``group`` keyword argument to the
-``open_dataset`` function. The group can be specified as a path-like
-string, e.g., to access subgroup 'bar' within group 'foo' pass
-'/foo/bar' as the ``group`` argument.
-In a similar way, the ``group`` keyword argument can be given to the
-:py:meth:`~xarray.Dataset.to_netcdf` method to write to a group
-in a netCDF file.
-When writing multiple groups in one file, pass ``mode='a'`` to ``to_netcdf``
-to ensure that each call does not delete the file.
-
 Data is always loaded lazily from netCDF files. You can manipulate, slice and subset
 Dataset and DataArray objects, and no array values are loaded into memory until
 you try to perform some sort of actual computation. For an example of how these
@@ -195,6 +184,24 @@ It is possible to append or overwrite netCDF variables using the ``mode='a'``
 argument. When using this option, all variables in the dataset will be written
 to the original netCDF file, regardless if they exist in the original dataset.
 
+
+.. _io.netcdf_groups:
+
+Groups
+~~~~~~
+
+NetCDF groups are not supported as part of the :py:class:`~xarray.Dataset` data model.
+Instead, groups can be loaded individually as Dataset objects.
+To do so, pass a ``group`` keyword argument to the
+:py:func:`~xarray.open_dataset` function. The group can be specified as a path-like
+string, e.g., to access subgroup ``'bar'`` within group ``'foo'`` pass
+``'/foo/bar'`` as the ``group`` argument.
+In a similar way, the ``group`` keyword argument can be given to the
+:py:meth:`~xarray.Dataset.to_netcdf` method to write to a group
+in a netCDF file.
+When writing multiple groups in one file, pass ``mode='a'`` to
+:py:meth:`~xarray.Dataset.to_netcdf` to ensure that each call does not delete the file.
+
 .. _io.encoding:
 
 Reading encoded data
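
A minimal sketch of the ``group`` keyword described in the new section above (the datasets, file name and group paths here are hypothetical)::

    import xarray as xr

    ds_bar = xr.Dataset({"a": ("x", [1, 2, 3])})   # hypothetical datasets
    ds_baz = xr.Dataset({"b": ("x", [4, 5, 6])})

    # write two groups into the same file; mode='a' keeps earlier groups intact
    ds_bar.to_netcdf('example.nc', group='/foo/bar')
    ds_baz.to_netcdf('example.nc', group='/foo/baz', mode='a')

    # read a single group back as a Dataset
    bar = xr.open_dataset('example.nc', group='/foo/bar')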
@@ -203,7 +210,7 @@ Reading encoded data
 NetCDF files follow some conventions for encoding datetime arrays (as numbers
 with a "units" attribute) and for packing and unpacking data (as
 described by the "scale_factor" and "add_offset" attributes). If the argument
-``decode_cf=True`` (default) is given to ``open_dataset``, xarray will attempt
+``decode_cf=True`` (default) is given to :py:func:`~xarray.open_dataset`, xarray will attempt
 to automatically decode the values in the netCDF objects according to
 `CF conventions`_. Sometimes this will fail, for example, if a variable
 has an invalid "units" or "calendar" attribute. For these cases, you can
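
When the automatic decoding described in this hunk fails, one common workaround is to open the file with decoding disabled, repair the offending attributes, and then decode manually; a sketch (the file name follows the example used elsewhere in this document)::

    import xarray as xr

    ds_raw = xr.open_dataset('saved_on_disk.nc', decode_cf=False)
    # ... repair any invalid "units" or "calendar" attributes here ...
    ds = xr.decode_cf(ds_raw)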
@@ -247,6 +254,130 @@ will remove encoding information.
     import os
     os.remove('saved_on_disk.nc')
 
+
+.. _combining multiple files:
+
+Reading multi-file datasets
+...........................
+
+NetCDF files are often encountered in collections, e.g., with different files
+corresponding to different model runs or one file per timestamp.
+xarray can straightforwardly combine such files into a single Dataset by making use of
+:py:func:`~xarray.concat`, :py:func:`~xarray.merge`, :py:func:`~xarray.combine_nested` and
+:py:func:`~xarray.combine_by_coords`. For details on the difference between these
+functions see :ref:`combining data`.
+
+Xarray includes support for manipulating datasets that don't fit into memory
+with dask_. If you have dask installed, you can open multiple files
+simultaneously in parallel using :py:func:`~xarray.open_mfdataset`::
+
+    xr.open_mfdataset('my/files/*.nc', parallel=True)
+
+This function automatically concatenates and merges multiple files into a
+single xarray dataset.
+It is the recommended way to open multiple files with xarray.
+For more details on parallel reading, see :ref:`combining.multi`, :ref:`dask.io` and a
+`blog post`_ by Stephan Hoyer.
+:py:func:`~xarray.open_mfdataset` takes many kwargs that allow you to
+control its behaviour (e.g. ``parallel``, ``combine``, ``compat``, ``join``, ``concat_dim``).
+See its docstring for more details.
+
+
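
When more control is needed than :py:func:`~xarray.open_mfdataset` provides, the combining functions named at the start of the new section can be applied to datasets opened individually; a small sketch, assuming the files share dimension coordinates::

    from glob import glob
    import xarray as xr

    # open each file separately, then combine them by inspecting coordinates
    datasets = [xr.open_dataset(p) for p in sorted(glob('my/files/*.nc'))]
    combined = xr.combine_by_coords(datasets)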
+.. note::
+
+    A common use-case involves a dataset distributed across a large number of files with
+    each file containing a large number of variables. Commonly a few of these variables
+    need to be concatenated along a dimension (say ``"time"``), while the rest are equal
+    across the datasets (ignoring floating point differences). The following command
+    with suitable modifications (such as ``parallel=True``) works well with such datasets::
+
+        xr.open_mfdataset('my/files/*.nc', concat_dim="time",
+                          data_vars='minimal', coords='minimal', compat='override')
+
+    This command concatenates variables along the ``"time"`` dimension, but only those that
+    already contain the ``"time"`` dimension (``data_vars='minimal', coords='minimal'``).
+    Variables that lack the ``"time"`` dimension are taken from the first dataset
+    (``compat='override'``).
+
+
+.. _dask: http://dask.pydata.org
+.. _blog post: http://stephanhoyer.com/2015/06/11/xray-dask-out-of-core-labeled-arrays/
+
+Sometimes multi-file datasets are not conveniently organized for easy use of :py:func:`~xarray.open_mfdataset`.
+One can use the ``preprocess`` argument to provide a function that takes a dataset
+and returns a modified Dataset.
+:py:func:`~xarray.open_mfdataset` will call ``preprocess`` on every dataset
+(corresponding to each file) prior to combining them.
+
+
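
A brief sketch of the ``preprocess`` hook just described; the subsetting done in ``select_region`` is purely illustrative::

    import xarray as xr

    def select_region(ds):
        # called on each per-file dataset before the datasets are combined
        return ds.sel(lat=slice(30, 60))

    ds = xr.open_mfdataset('my/files/*.nc', preprocess=select_region, parallel=True)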
+If :py:func:`~xarray.open_mfdataset` does not meet your needs, other approaches are possible.
+The general pattern for parallel reading of multiple files
+using dask, modifying those datasets and then combining into a single ``Dataset`` is::
+
+    def modify(ds):
+        # modify ds here
+        return ds
+
+
+    # this is basically what open_mfdataset does
+    open_kwargs = dict(decode_cf=True, decode_times=False)
+    open_tasks = [dask.delayed(xr.open_dataset)(f, **open_kwargs) for f in file_names]
+    tasks = [dask.delayed(modify)(task) for task in open_tasks]
+    datasets = dask.compute(tasks)[0]  # get a list of xarray.Datasets
+    combined = xr.combine_nested(datasets)  # or some combination of concat, merge
+
+
+As an example, here's how we could approximate ``MFDataset`` from the netCDF4
+library::
+
+    from glob import glob
+    import xarray as xr
+
+    def read_netcdfs(files, dim):
+        # glob expands paths with * to a list of files, like the unix shell
+        paths = sorted(glob(files))
+        datasets = [xr.open_dataset(p) for p in paths]
+        combined = xr.concat(datasets, dim)
+        return combined
+
+    combined = read_netcdfs('/all/my/files/*.nc', dim='time')
+
+This function will work in many cases, but it's not very robust. First, it
+never closes files, which means it will fail once you need to load more than
+a few thousand files. Second, it assumes that you want all the data from each
+file and that it can all fit into memory. In many situations, you only need
+a small subset or an aggregated summary of the data from each file.
+
+Here's a slightly more sophisticated example of how to remedy these
+deficiencies::
+
+    def read_netcdfs(files, dim, transform_func=None):
+        def process_one_path(path):
+            # use a context manager, to ensure the file gets closed after use
+            with xr.open_dataset(path) as ds:
+                # transform_func should do some sort of selection or
+                # aggregation
+                if transform_func is not None:
+                    ds = transform_func(ds)
+                # load all data from the transformed dataset, to ensure we can
+                # use it after closing each original file
+                ds.load()
+                return ds
+
+        paths = sorted(glob(files))
+        datasets = [process_one_path(p) for p in paths]
+        combined = xr.concat(datasets, dim)
+        return combined
+
+    # here we suppose we only care about the combined mean of each file;
+    # you might also use indexing operations like .sel to subset datasets
+    combined = read_netcdfs('/all/my/files/*.nc', dim='time',
+                            transform_func=lambda ds: ds.mean())
+
+This pattern works well and is very robust. We've used similar code to process
+tens of thousands of files constituting 100s of GB of data.
+
+
 .. _io.netcdf.writing_encoded:
 
 Writing encoded data
@@ -817,84 +948,3 @@ For CSV files, one might also consider `xarray_extras`_.
 .. _xarray_extras: https://xarray-extras.readthedocs.io/en/latest/api/csv.html
 
 .. _IO tools: http://pandas.pydata.org/pandas-docs/stable/io.html
-
-
-.. _combining multiple files:
-
-
-Combining multiple files
-------------------------
-
-NetCDF files are often encountered in collections, e.g., with different files
-corresponding to different model runs. xarray can straightforwardly combine such
-files into a single Dataset by making use of :py:func:`~xarray.concat`,
-:py:func:`~xarray.merge`, :py:func:`~xarray.combine_nested` and
-:py:func:`~xarray.combine_by_coords`. For details on the difference between these
-functions see :ref:`combining data`.
-
-.. note::
-
-    Xarray includes support for manipulating datasets that don't fit into memory
-    with dask_. If you have dask installed, you can open multiple files
-    simultaneously using :py:func:`~xarray.open_mfdataset`::
-
-        xr.open_mfdataset('my/files/*.nc')
-
-    This function automatically concatenates and merges multiple files into a
-    single xarray dataset.
-    It is the recommended way to open multiple files with xarray.
-    For more details, see :ref:`combining.multi`, :ref:`dask.io` and a
-    `blog post`_ by Stephan Hoyer.
-
-.. _dask: http://dask.pydata.org
-.. _blog post: http://stephanhoyer.com/2015/06/11/xray-dask-out-of-core-labeled-arrays/
-
-For example, here's how we could approximate ``MFDataset`` from the netCDF4
-library::
-
-    from glob import glob
-    import xarray as xr
-
-    def read_netcdfs(files, dim):
-        # glob expands paths with * to a list of files, like the unix shell
-        paths = sorted(glob(files))
-        datasets = [xr.open_dataset(p) for p in paths]
-        combined = xr.concat(dataset, dim)
-        return combined
-
-    combined = read_netcdfs('/all/my/files/*.nc', dim='time')
-
-This function will work in many cases, but it's not very robust. First, it
-never closes files, which means it will fail one you need to load more than
-a few thousands file. Second, it assumes that you want all the data from each
-file and that it can all fit into memory. In many situations, you only need
-a small subset or an aggregated summary of the data from each file.
-
-Here's a slightly more sophisticated example of how to remedy these
-deficiencies::
-
-    def read_netcdfs(files, dim, transform_func=None):
-        def process_one_path(path):
-            # use a context manager, to ensure the file gets closed after use
-            with xr.open_dataset(path) as ds:
-                # transform_func should do some sort of selection or
-                # aggregation
-                if transform_func is not None:
-                    ds = transform_func(ds)
-                # load all data from the transformed dataset, to ensure we can
-                # use it after closing each original file
-                ds.load()
-                return ds
-
-        paths = sorted(glob(files))
-        datasets = [process_one_path(p) for p in paths]
-        combined = xr.concat(datasets, dim)
-        return combined
-
-    # here we suppose we only care about the combined mean of each file;
-    # you might also use indexing operations like .sel to subset datasets
-    combined = read_netcdfs('/all/my/files/*.nc', dim='time',
-                            transform_func=lambda ds: ds.mean())
-
-This pattern works well and is very robust. We've used similar code to process
-tens of thousands of files constituting 100s of GB of data.
