Commit 283b4fe

Docs/more fixes (#2934)
* Move netcdf to beginning of io.rst
* Better indexing example.
* Start de-emphasizing pandas
* misc.
* compute, load, persist docstrings + text.
* split-apply-combine.
* np.newaxis.
* misc.
* some dask stuff.
* Little more dask.
* undo index.rst changes.
* link to dask docs on chunks
* Fix io.rst.
* small changes.
* rollingupdate.
* joe's review
1 parent dfdeef7 commit 283b4fe

File tree

8 files changed: +212, -135 lines changed


doc/computation.rst

Lines changed: 3 additions & 1 deletion
@@ -179,7 +179,9 @@ a value when aggregating:
     r = arr.rolling(y=3, center=True, min_periods=2)
     r.mean()
 
-Note that rolling window aggregations are faster when bottleneck_ is installed.
+.. tip::
+
+   Note that rolling window aggregations are faster and use less memory when bottleneck_ is installed. This only applies to numpy-backed xarray objects.
 
 .. _bottleneck: https://github.com/kwgoodman/bottleneck/
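
As a quick illustration of the rolling-window behaviour documented in the hunk above (a minimal sketch; the array construction is invented, not part of the diff)::

    import numpy as np
    import xarray as xr

    arr = xr.DataArray(np.arange(12.0).reshape(3, 4), dims=('x', 'y'))

    # Centered window of 3 along "y"; min_periods=2 still emits a value at
    # the edges, where only 2 of the 3 window slots are populated.
    r = arr.rolling(y=3, center=True, min_periods=2)
    r.mean()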

doc/dask.rst

Lines changed: 48 additions & 17 deletions
@@ -5,13 +5,14 @@ Parallel computing with Dask
 
 xarray integrates with `Dask <http://dask.pydata.org/>`__ to support parallel
 computations and streaming computation on datasets that don't fit into memory.
-
 Currently, Dask is an entirely optional feature for xarray. However, the
 benefits of using Dask are sufficiently strong that Dask may become a required
 dependency in a future version of xarray.
 
 For a full example of how to use xarray's Dask integration, read the
-`blog post introducing xarray and Dask`_.
+`blog post introducing xarray and Dask`_. More up-to-date examples
+may be found at the `Pangeo project's use-cases <http://pangeo.io/use_cases/index.html>`_
+and at the `Dask examples website <https://examples.dask.org/xarray.html>`_.
 
 .. _blog post introducing xarray and Dask: http://stephanhoyer.com/2015/06/11/xray-dask-out-of-core-labeled-arrays/

@@ -37,13 +38,14 @@ which allows Dask to take full advantage of multiple processors available on
 most modern computers.
 
 For more details on Dask, read `its documentation <http://dask.pydata.org/>`__.
+Note that xarray only makes use of ``dask.array`` and ``dask.delayed``.
 
 .. _dask.io:
 
 Reading and writing data
 ------------------------
 
-The usual way to create a dataset filled with Dask arrays is to load the
+The usual way to create a ``Dataset`` filled with Dask arrays is to load the
 data from a netCDF file or files. You can do this by supplying a ``chunks``
 argument to :py:func:`~xarray.open_dataset` or using the
 :py:func:`~xarray.open_mfdataset` function.
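
For concreteness, a minimal sketch of the ``chunks`` argument described above (the file name follows the docs' running example)::

    import xarray as xr

    # One Dask chunk per 10 time steps; dimensions not named in `chunks`
    # (e.g. latitude/longitude) each end up as a single chunk.
    ds = xr.open_dataset('example-data.nc', chunks={'time': 10})
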
@@ -71,22 +73,23 @@ argument to :py:func:`~xarray.open_dataset` or using the
 
 In this example ``latitude`` and ``longitude`` do not appear in the ``chunks``
 dict, so only one chunk will be used along those dimensions. It is also
-entirely equivalent to opening a dataset using ``open_dataset`` and then
-chunking the data using the ``chunk`` method, e.g.,
+entirely equivalent to opening a dataset using :py:meth:`~xarray.open_dataset`
+and then chunking the data using the ``chunk`` method, e.g.,
 ``xr.open_dataset('example-data.nc').chunk({'time': 10})``.
 
 To open multiple files simultaneously in parallel using Dask delayed,
 use :py:func:`~xarray.open_mfdataset`::
 
     xr.open_mfdataset('my/files/*.nc', parallel=True)
 
-This function will automatically concatenate and merge dataset into one in
+This function will automatically concatenate and merge datasets into one in
 the simple cases that it understands (see :py:func:`~xarray.auto_combine`
-for the full disclaimer). By default, :py:func:`~xarray.open_mfdataset` will chunk each
+for the full disclaimer). By default, :py:meth:`~xarray.open_mfdataset` will chunk each
 netCDF file into a single Dask array; again, supply the ``chunks`` argument to
 control the size of the resulting Dask arrays. In more complex cases, you can
-open each file individually using ``open_dataset`` and merge the result, as
-described in :ref:`combining data`.
+open each file individually using :py:meth:`~xarray.open_dataset` and merge the result, as
+described in :ref:`combining data`. Passing the keyword argument ``parallel=True`` to :py:meth:`~xarray.open_mfdataset` will speed up the reading of large multi-file datasets by
+executing those read tasks in parallel using ``dask.delayed``.
 
 You'll notice that printing a dataset still shows a preview of array values,
 even if they are actually Dask arrays. We can do this quickly with Dask because
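
A sketch of the "more complex cases" route mentioned above, opening files individually and combining them by hand (the paths and the concatenation dimension are assumptions for illustration)::

    import glob
    import xarray as xr

    # Open each file as its own chunked Dataset, then combine explicitly
    # instead of relying on open_mfdataset's automatic merging.
    paths = sorted(glob.glob('my/files/*.nc'))
    datasets = [xr.open_dataset(p, chunks={'time': 10}) for p in paths]
    combined = xr.concat(datasets, dim='time')
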
@@ -106,7 +109,7 @@ usual way.
     ds.to_netcdf('manipulated-example-data.nc')
 
 By setting the ``compute`` argument to ``False``, :py:meth:`~xarray.Dataset.to_netcdf`
-will return a Dask delayed object that can be computed later.
+will return a ``dask.delayed`` object that can be computed later.
 
 .. ipython:: python
 
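A minimal sketch of the delayed write enabled by ``compute=False``, assuming ``ds`` is a Dask-backed Dataset as in the surrounding docs::

    from dask.diagnostics import ProgressBar

    # Builds the write as a dask.delayed object without executing it...
    delayed_write = ds.to_netcdf('manipulated-example-data.nc', compute=False)

    # ...then triggers it later, here wrapped in a progress bar.
    with ProgressBar():
        delayed_write.compute()
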
@@ -153,8 +156,14 @@ explicit conversion step. One notable exception is indexing operations: to
 enable label based indexing, xarray will automatically load coordinate labels
 into memory.
 
+.. tip::
+
+   By default, dask uses its multi-threaded scheduler, which distributes work across
+   multiple cores and allows for processing some datasets that do not fit into memory.
+   For running across a cluster, `setup the distributed scheduler <https://docs.dask.org/en/latest/setup.html>`_.
+
 The easiest way to convert an xarray data structure from lazy Dask arrays into
-eager, in-memory NumPy arrays is to use the :py:meth:`~xarray.Dataset.load` method:
+*eager*, in-memory NumPy arrays is to use the :py:meth:`~xarray.Dataset.load` method:
 
 .. ipython:: python
 
@@ -191,11 +200,20 @@ Dask arrays using the :py:meth:`~xarray.Dataset.persist` method:
 
     ds = ds.persist()
 
-This is particularly useful when using a distributed cluster because the data
-will be loaded into distributed memory across your machines and be much faster
-to use than reading repeatedly from disk. Warning that on a single machine
-this operation will try to load all of your data into memory. You should make
-sure that your dataset is not larger than available memory.
+:py:meth:`~xarray.Dataset.persist` is particularly useful when using a
+distributed cluster because the data will be loaded into distributed memory
+across your machines and be much faster to use than reading repeatedly from
+disk.
+
+.. warning::
+
+   On a single machine :py:meth:`~xarray.Dataset.persist` will try to load all of
+   your data into memory. You should make sure that your dataset is not larger than
+   available memory.
+
+.. note::
+   For more on the differences between :py:meth:`~xarray.Dataset.persist` and
+   :py:meth:`~xarray.Dataset.compute` see this `Stack Overflow answer <https://stackoverflow.com/questions/41806850/dask-difference-between-client-persist-and-client-compute>`_ and the `Dask documentation <https://distributed.readthedocs.io/en/latest/manage-computation.html#dask-collections-to-futures>`_.
 
 For performance you may wish to consider chunk sizes. The correct choice of
 chunk size depends both on your data and on the operations you want to perform.
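
To make the load/compute/persist distinction documented above concrete, a small self-contained sketch (the toy Dataset is invented)::

    import numpy as np
    import xarray as xr

    ds = xr.Dataset({'temp': (('time',), np.random.rand(1000))}).chunk({'time': 100})

    loaded = ds.compute()  # evaluates the Dask graph, returns a new in-memory Dataset
    ds.load()              # same evaluation, but modifies ds in place
    ds = ds.persist()      # kicks off computation and keeps Dask arrays pointing at
                           # the results; on a cluster the data stays in worker memory
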
@@ -381,6 +399,11 @@ one million elements (e.g., a 1000x1000 matrix). With large arrays (10+ GB), the
 cost of queueing up Dask operations can be noticeable, and you may need even
 larger chunksizes.
 
+.. tip::
+
+   Check out the dask documentation on `chunks <https://docs.dask.org/en/latest/array-chunks.html>`_.
+
+
 Optimization Tips
 -----------------
 
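A sketch of checking and adjusting chunk sizes against the rule of thumb above (the array shape is invented for illustration)::

    import numpy as np
    import xarray as xr

    arr = xr.DataArray(np.zeros((4000, 4000)), dims=('x', 'y')).chunk({'x': 1000, 'y': 1000})
    arr.data.chunksize      # (1000, 1000) -> 1e6 elements per chunk

    # Rechunk coarser if graph-construction overhead dominates.
    arr = arr.chunk({'x': 2000, 'y': 2000})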

@@ -390,4 +413,12 @@ With analysis pipelines involving both spatial subsetting and temporal resamplin
 
 2. Save intermediate results to disk as a netCDF files (using ``to_netcdf()``) and then load them again with ``open_dataset()`` for further computations. For example, if subtracting temporal mean from a dataset, save the temporal mean to disk before subtracting. Again, in theory, Dask should be able to do the computation in a streaming fashion, but in practice this is a fail case for the Dask scheduler, because it tries to keep every chunk of an array that it computes in memory. (See `Dask issue #874 <https://github.com/dask/dask/issues/874>`_)
 
-3. Specify smaller chunks across space when using ``open_mfdataset()`` (e.g., ``chunks={'latitude': 10, 'longitude': 10}``). This makes spatial subsetting easier, because there's no risk you will load chunks of data referring to different chunks (probably not necessary if you follow suggestion 1).
+3. Specify smaller chunks across space when using :py:meth:`~xarray.open_mfdataset` (e.g., ``chunks={'latitude': 10, 'longitude': 10}``). This makes spatial subsetting easier, because there's no risk you will load chunks of data referring to different chunks (probably not necessary if you follow suggestion 1).
+
+4. Using the h5netcdf package by passing ``engine='h5netcdf'`` to :py:meth:`~xarray.open_mfdataset`
+   can be quicker than the default ``engine='netcdf4'`` that uses the netCDF4 package.
+
+5. Some dask-specific tips may be found `here <https://docs.dask.org/en/latest/array-best-practices.html>`_.
+
+6. The dask `diagnostics <https://docs.dask.org/en/latest/understanding-performance.html>`_ can be
+   useful in identifying performance bottlenecks.
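
A sketch combining several of these tips (the file names, variable choice, and ``engine`` swap are illustrative assumptions)::

    import xarray as xr

    # Tips 3 and 4: smaller spatial chunks, h5netcdf engine, parallel reads.
    ds = xr.open_mfdataset('my/files/*.nc', engine='h5netcdf', parallel=True,
                           chunks={'latitude': 10, 'longitude': 10})

    # Tip 2: materialize the temporal mean to disk, reload it, then subtract,
    # rather than carrying the whole graph through the subtraction.
    ds.mean('time').to_netcdf('time-mean.nc')
    time_mean = xr.open_dataset('time-mean.nc')
    anomaly = ds - time_mean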

doc/faq.rst

Lines changed: 32 additions & 14 deletions
@@ -11,6 +11,38 @@ Frequently Asked Questions
     import xarray as xr
     np.random.seed(123456)
 
+
+Your documentation keeps mentioning pandas. What is pandas?
+-----------------------------------------------------------
+
+pandas_ is a very popular data analysis package in Python
+with wide usage in many fields. Our API is heavily inspired by pandas —
+this is why there are so many references to pandas.
+
+.. _pandas: https://pandas.pydata.org
+
+
+Do I need to know pandas to use xarray?
+---------------------------------------
+
+No! Our API is heavily inspired by pandas so while knowing pandas will let you
+become productive more quickly, knowledge of pandas is not necessary to use xarray.
+
+
+Should I use xarray instead of pandas?
+--------------------------------------
+
+It's not an either/or choice! xarray provides robust support for converting
+back and forth between the tabular data-structures of pandas and its own
+multi-dimensional data-structures.
+
+That said, you should only bother with xarray if some aspect of data is
+fundamentally multi-dimensional. If your data is unstructured or
+one-dimensional, pandas is usually the right choice: it has better performance
+for common operations such as ``groupby`` and you'll find far more usage
+examples online.
+
+
 Why is pandas not enough?
 -------------------------
 
@@ -56,20 +88,6 @@ of the "time" dimension. You never need to reshape arrays (e.g., with
 ``np.newaxis``) to align them for arithmetic operations in xarray.
 
 
-Should I use xarray instead of pandas?
---------------------------------------
-
-It's not an either/or choice! xarray provides robust support for converting
-back and forth between the tabular data-structures of pandas and its own
-multi-dimensional data-structures.
-
-That said, you should only bother with xarray if some aspect of data is
-fundamentally multi-dimensional. If your data is unstructured or
-one-dimensional, pandas is usually the right choice: it has better performance
-for common operations such as ``groupby`` and you'll find far more usage
-examples online.
-
-
 Why don't aggregations return Python scalars?
 ---------------------------------------------
 
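A sketch of the pandas round-trip referred to in the FAQ entry above (the toy array is invented)::

    import numpy as np
    import xarray as xr

    arr = xr.DataArray(np.random.rand(3, 4), dims=('x', 'y'), name='values')

    df = arr.to_dataframe()        # tabular view, indexed by an (x, y) MultiIndex
    back = xr.DataArray.from_series(df['values'])  # back to multi-dimensional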

doc/index.rst

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ intuitive, more concise, and less error-prone developer experience.
 The package includes a large and growing library of domain-agnostic functions
 for advanced analytics and visualization with these data structures.
 
-Xarray was inspired by and borrows heavily from pandas_, the popular data
+Xarray is inspired by and borrows heavily from pandas_, the popular data
 analysis package focused on labelled tabular data.
 It is particularly tailored to working with netCDF_ files, which were the
 source of xarray's data model, and integrates tightly with dask_ for parallel