Commit 3bce6a8 (1 parent: 6a0d515)

Commit message: some dask stuff.

File tree: 2 files changed (+16, -4 lines)


doc/dask.rst

Lines changed: 15 additions & 3 deletions
@@ -5,13 +5,13 @@ Parallel computing with Dask
 
 xarray integrates with `Dask <http://dask.pydata.org/>`__ to support parallel
 computations and streaming computation on datasets that don't fit into memory.
-
 Currently, Dask is an entirely optional feature for xarray. However, the
 benefits of using Dask are sufficiently strong that Dask may become a required
 dependency in a future version of xarray.
 
 For a full example of how to use xarray's Dask integration, read the
-`blog post introducing xarray and Dask`_.
+`blog post introducing xarray and Dask`_. More up-to-date examples
+may be found at the `Pangeo project's use-cases <http://pangeo.io/use_cases/index.html>`_.
 
 .. _blog post introducing xarray and Dask: http://stephanhoyer.com/2015/06/11/xray-dask-out-of-core-labeled-arrays/
 
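To make the integration concrete, here is a minimal sketch of a chunked, lazy computation with xarray and Dask. It assumes xarray and dask are installed; the dataset, dimension names, and chunk size are hypothetical stand-ins for data you would normally open from disk with ``chunks={...}``.

```python
import numpy as np
import xarray as xr

# Build a small in-memory dataset standing in for a file you would
# normally open with xr.open_dataset(..., chunks={...}).
ds = xr.Dataset(
    {"temperature": (("time", "space"), np.random.rand(100, 10))}
)
ds = ds.chunk({"time": 25})  # convert to dask-backed arrays

# Operations are lazy: this builds a task graph, loading nothing yet.
mean = ds["temperature"].mean(dim="time")

# .compute() triggers the (potentially parallel) evaluation.
result = mean.compute()
print(result.shape)  # (10,)
```

Until ``.compute()`` is called, only the task graph exists, which is what lets xarray process datasets larger than memory one chunk at a time.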

@@ -396,4 +396,16 @@ With analysis pipelines involving both spatial subsetting and temporal resampling
 
 2. Save intermediate results to disk as netCDF files (using ``to_netcdf()``) and then load them again with ``open_dataset()`` for further computations. For example, if subtracting the temporal mean from a dataset, save the temporal mean to disk before subtracting. Again, in theory, Dask should be able to do the computation in a streaming fashion, but in practice this is a fail case for the Dask scheduler, because it tries to keep every chunk of an array that it computes in memory. (See `Dask issue #874 <https://github.com/dask/dask/issues/874>`_.)
 
-3. Specify smaller chunks across space when using ``open_mfdataset()`` (e.g., ``chunks={'latitude': 10, 'longitude': 10}``). This makes spatial subsetting easier, because there's no risk you will load chunks of data referring to different chunks (probably not necessary if you follow suggestion 1).
+3. Specify smaller chunks across space when using :py:meth:`~xarray.open_mfdataset` (e.g., ``chunks={'latitude': 10, 'longitude': 10}``). This makes spatial subsetting easier, because there's no risk you will load chunks of data referring to different chunks (probably not necessary if you follow suggestion 1).
+
+4. Using the h5netcdf package, by passing ``engine='h5netcdf'`` to :py:meth:`~xarray.open_mfdataset`,
+   can be quicker than the default ``engine='netcdf4'``, which uses the netCDF4 package.
+
+5. Some dask-specific tips may be found `here <https://docs.dask.org/en/latest/array-best-practices.html>`_.
+
+6. The dask `diagnostics <https://docs.dask.org/en/latest/understanding-performance.html>`_ can be
+   useful in identifying performance bottlenecks.
+
+7. Installing the optional `bottleneck <https://github.com/kwgoodman/bottleneck>`_ library
+   will result in greatly reduced memory usage when using :py:meth:`~xarray.Dataset.rolling`
+   on dask arrays.
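Tip 2 above (persisting intermediate results) can be sketched as follows. This is a minimal illustration assuming xarray, dask, and a netCDF backend (netCDF4 or scipy) are installed; the variable name, dimensions, and file path are hypothetical.

```python
import os
import tempfile

import numpy as np
import xarray as xr

# A small dask-backed dataset standing in for a large on-disk one.
ds = xr.Dataset(
    {"t2m": (("time", "lat"), np.random.rand(8, 4))}
).chunk({"time": 4})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "t2m_mean.nc")
    # Compute the temporal mean and write it to disk...
    ds["t2m"].mean("time").to_dataset(name="t2m_mean").to_netcdf(path)
    # ...then reopen it, so the subtraction below reads a small file
    # instead of re-deriving the mean inside one large task graph.
    with xr.open_dataset(path) as mean_ds:
        anomaly = (ds["t2m"] - mean_ds["t2m_mean"]).compute()

print(anomaly.shape)  # (8, 4)
```

Breaking the pipeline at the on-disk intermediate keeps the scheduler from holding every computed chunk of the mean in memory while the subtraction runs.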
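For tip 6, the simplest of dask's diagnostics is ``ProgressBar``, which reports task progress during ``.compute()``; ``Profiler`` and ``ResourceProfiler`` from the same module give more detail. A minimal sketch, assuming dask is installed (array shape and chunking are arbitrary):

```python
import dask.array as da
from dask.diagnostics import ProgressBar

# A chunked random array; nothing is computed yet.
x = da.random.random((1000, 1000), chunks=(250, 250))

# The context manager prints a progress bar while tasks execute.
with ProgressBar():
    total = x.sum().compute()

print(total > 0)  # True
```

Watching which phase of the bar stalls is often enough to spot whether I/O, chunk size, or a particular operation is the bottleneck.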

doc/io.rst

Lines changed: 1 addition & 1 deletion
@@ -87,7 +87,7 @@ string, e.g., to access subgroup 'bar' within group 'foo' pass
 pass ``mode='a'`` to ``to_netcdf`` to ensure that each call does not delete the
 file.
 
-Data is always loaded lazily from netCDF files. You can manipulate, slice and subset
+Data is *always* loaded lazily from netCDF files. You can manipulate, slice and subset
 Dataset and DataArray objects, and no array values are loaded into memory until
 you try to perform some sort of actual computation. For an example of how these
 lazy arrays work, see the OPeNDAP section below.
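The lazy-loading behavior described in this hunk can be demonstrated with a small round trip. This is an illustrative sketch assuming xarray and a netCDF backend are installed; the file path and variable name are hypothetical.

```python
import os
import tempfile

import numpy as np
import xarray as xr

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.nc")
    xr.Dataset({"z": (("x",), np.arange(10.0))}).to_netcdf(path)

    with xr.open_dataset(path) as ds:
        # Slicing is lazy: no array values have been read yet.
        subset = ds["z"][2:5]
        # An explicit load (or any computation) pulls data into memory.
        values = subset.load()

print(list(values.values))  # [2.0, 3.0, 4.0]
```

Because only the requested slice is read, this pattern stays cheap even when the underlying file is far larger than memory.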
