To open multiple files in parallel using Dask delayed,
use :py:func:`~xarray.open_mfdataset`::

    xr.open_mfdataset('my/files/*.nc', parallel=True)

This function will automatically concatenate and merge datasets into one in
the simple cases that it understands (see :py:func:`~xarray.auto_combine`
for the full disclaimer). By default, :py:func:`~xarray.open_mfdataset` will chunk each
netCDF file into a single Dask array; again, supply the ``chunks`` argument to
control the size of the resulting Dask arrays. In more complex cases, you can
open each file individually using :py:func:`~xarray.open_dataset` and merge the result, as
described in :ref:`combining data`. Passing the keyword argument ``parallel=True`` to
:py:func:`~xarray.open_mfdataset` will speed up the reading of large multi-file datasets by
executing those read tasks in parallel using ``dask.delayed``.
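
For illustration, here is a minimal sketch of both approaches; the extra file
paths, the ``time`` dimension name, and the chunk sizes are made-up placeholders
rather than values from this guide::

    import xarray as xr

    # Open many files at once, controlling the chunking of each file explicitly
    # instead of accepting one Dask chunk per file.
    ds = xr.open_mfdataset('my/files/*.nc', parallel=True, chunks={'time': 100})

    # The more manual alternative: open each file yourself, then combine the results.
    paths = ['my/files/a.nc', 'my/files/b.nc']
    datasets = [xr.open_dataset(p, chunks={'time': 100}) for p in paths]
    combined = xr.concat(datasets, dim='time')
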
You'll notice that printing a dataset still shows a preview of array values,
even if they are actually Dask arrays. We can do this quickly with Dask because
we only need to compute the first few values (typically from the first block).

.. ipython:: python

    ds.to_netcdf('manipulated-example-data.nc')

By setting the ``compute`` argument to ``False``, :py:meth:`~xarray.Dataset.to_netcdf`
will return a ``dask.delayed`` object that can be computed later.

.. ipython:: python
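
    # A minimal illustrative sketch (not taken from the original example): with
    # ``compute=False`` the write is deferred and a dask.delayed object is returned;
    # calling ``.compute()`` on that object performs the actual write.
    delayed_write = ds.to_netcdf('manipulated-example-data.nc', compute=False)
    delayed_write.compute()
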

One notable exception is indexing operations: to
enable label based indexing, xarray will automatically load coordinate labels
into memory.

.. tip::

    By default, Dask uses its multi-threaded scheduler, which distributes work across
    multiple cores and allows for processing some datasets that do not fit into memory.
    For running across a cluster, `set up the distributed scheduler <https://docs.dask.org/en/latest/setup.html>`_.
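
As a minimal sketch, connecting to the distributed scheduler can be as simple as
the following (this example is illustrative; with no arguments ``Client`` starts a
local cluster of worker processes, while a real deployment would pass the
scheduler's address)::

    from dask.distributed import Client

    # Creating a Client makes it the default scheduler, so subsequent
    # xarray/Dask computations run on its workers.
    client = Client()
    print(client)
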
The easiest way to convert an xarray data structure from lazy Dask arrays into
*eager*, in-memory NumPy arrays is to use the :py:meth:`~xarray.Dataset.load` method:

.. ipython:: python
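
    # Illustrative only: load every lazy Dask array in ``ds`` into memory as
    # NumPy arrays (make sure the dataset fits into memory first).
    ds.load()
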

.. ipython:: python

    ds = ds.persist()

:py:meth:`~xarray.Dataset.persist` is particularly useful when using a
distributed cluster because the data will be loaded into distributed memory
across your machines and be much faster to use than reading repeatedly from
disk.

.. warning::

    On a single machine, :py:meth:`~xarray.Dataset.persist` will try to load all of
    your data into memory. You should make sure that your dataset is not larger than
    available memory.

.. note::

    For more on the differences between :py:meth:`~xarray.Dataset.persist` and
    :py:meth:`~xarray.Dataset.compute` see this `Stack Overflow answer <https://stackoverflow.com/questions/41806850/dask-difference-between-client-persist-and-client-compute>`_ and the `Dask documentation <https://distributed.readthedocs.io/en/latest/manage-computation.html#dask-collections-to-futures>`_.
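
As a rough sketch of the difference (the variable names here are illustrative)::

    computed = ds.compute()   # a new dataset backed by in-memory NumPy arrays
    persisted = ds.persist()  # a new dataset still backed by Dask arrays,
                              # but with its chunks computed and held in memory
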
For performance you may wish to consider chunk sizes. The correct choice of
chunk size depends both on your data and on the operations you want to perform.
With large arrays (10+ GB), the
cost of queueing up Dask operations can be noticeable, and you may need even
larger chunksizes.
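
For instance, an already-open dataset can be re-chunked to larger chunks; the
dimension name and chunk size below are hypothetical::

    ds = ds.chunk({'time': 1000})
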
.. tip::

    Check out the dask documentation on `chunks <https://docs.dask.org/en/latest/array-chunks.html>`_.

Optimization Tips
-----------------

2. Save intermediate results to disk as netCDF files (using ``to_netcdf()``) and then load them again with ``open_dataset()`` for further computations. For example, if subtracting the temporal mean from a dataset, save the temporal mean to disk before subtracting (see the sketch after this list). Again, in theory, Dask should be able to do the computation in a streaming fashion, but in practice this is a fail case for the Dask scheduler, because it tries to keep every chunk of an array that it computes in memory. (See `Dask issue #874 <https://github.com/dask/dask/issues/874>`_)

3. Specify smaller chunks across space when using :py:func:`~xarray.open_mfdataset` (e.g., ``chunks={'latitude': 10, 'longitude': 10}``). This makes spatial subsetting easier, because there's no risk that a spatial subset will need data from many different chunks (probably not necessary if you follow suggestion 1).

4. Using the h5netcdf package by passing ``engine='h5netcdf'`` to :py:func:`~xarray.open_mfdataset`
   can be quicker than the default ``engine='netcdf4'`` that uses the netCDF4 package.

5. Some dask-specific tips may be found `here <https://docs.dask.org/en/latest/array-best-practices.html>`_.

6. The dask `diagnostics <https://docs.dask.org/en/latest/understanding-performance.html>`_ can be
   useful in identifying performance bottlenecks.
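
As a sketch of suggestion 2 above (the variable and file names are made up for
illustration), writing the temporal mean to disk splits the work into two smaller,
independent computations::

    # Compute and store the temporal mean, then reload it and subtract it from
    # the original dataset to get anomalies.
    ds.mean('time').to_netcdf('temporal-mean.nc')
    ds_mean = xr.open_dataset('temporal-mean.nc')
    anomaly = ds - ds_mean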