The recommended way to store xarray data structures is `netCDF`__, which
is a binary file format for self-described datasets that originated
in the geosciences. xarray is based on the netCDF data model, so netCDF files
on disk directly correspond to :py:class:`~xarray.Dataset` objects (more accurately,
a group in a netCDF file directly corresponds to a :py:class:`~xarray.Dataset`
object; see :ref:`io.netcdf_groups` for more).

NetCDF is supported on almost all platforms, and parsers exist
for the vast majority of scientific programming languages. Recent versions of
netCDF are based on the even more widely used HDF5 file-format.

__ http://www.unidata.ucar.edu/software/netcdf/

Reading and writing netCDF files with xarray requires scipy or the
`netCDF4-Python`__ library to be installed (the latter is required to
read/write netCDF V4 files and use the compression options described below).

__ https://github.com/Unidata/netcdf4-python

We can save a Dataset to disk using the
:py:meth:`~Dataset.to_netcdf` method:

.. ipython:: python
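
    # an illustrative example: build a small Dataset and write it to disk
    import numpy as np
    import pandas as pd
    import xarray as xr

    ds = xr.Dataset(
        {'foo': (('x', 'y'), np.random.rand(4, 5))},
        coords={'x': [10, 20, 30, 40], 'y': pd.date_range('2000-01-01', periods=5)},
    )
    ds.to_netcdf('saved_on_disk.nc')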

Similarly, a ``DataArray`` can be saved to disk using the
:py:meth:`DataArray.to_netcdf <xarray.DataArray.to_netcdf>` method, and loaded
from disk using the :py:func:`~xarray.open_dataarray` function. As netCDF files
correspond to :py:class:`~xarray.Dataset` objects, these functions internally
convert the ``DataArray`` to a ``Dataset`` before saving, and then convert back
when loading, ensuring that the ``DataArray`` that is loaded is always exactly
the same as the one that was saved.

Data is always loaded lazily from netCDF files. You can manipulate, slice and subset
Dataset and DataArray objects, and no array values are loaded into memory until
you try to perform some sort of actual computation. For an example of how these
lazy arrays work, see the OPeNDAP section below.
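
As a small illustration (a sketch reusing the file and variable written above),
no values are read from disk until we explicitly ask for them::

    ds_disk = xr.open_dataset('saved_on_disk.nc')
    subset = ds_disk['foo'].sel(x=10)  # still lazy, nothing read from disk yet
    subset.load()                      # now the selected values are in memory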

It is possible to append or overwrite netCDF variables using the ``mode='a'``
argument. When using this option, all variables in the dataset will be written
to the original netCDF file, regardless of whether they already exist in the
original dataset.
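
For instance, a minimal sketch (the extra variable ``bar`` is hypothetical) of
appending a new variable to the file written above::

    ds_extra = xr.Dataset({'bar': ('x', [1, 2, 3, 4])}, coords={'x': [10, 20, 30, 40]})
    ds_extra.to_netcdf('saved_on_disk.nc', mode='a')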

.. _io.netcdf_groups:

Groups
~~~~~~

NetCDF groups are not supported as part of the :py:class:`~xarray.Dataset` data model.
Instead, groups can be loaded individually as Dataset objects.
To do so, pass a ``group`` keyword argument to the
:py:func:`~xarray.open_dataset` function. The group can be specified as a path-like
string, e.g., to access subgroup ``'bar'`` within group ``'foo'`` pass
``'/foo/bar'`` as the ``group`` argument.
In a similar way, the ``group`` keyword argument can be given to the
:py:meth:`~xarray.Dataset.to_netcdf` method to write to a group
in a netCDF file.
When writing multiple groups in one file, pass ``mode='a'`` to
:py:meth:`~xarray.Dataset.to_netcdf` to ensure that each call does not delete the file.
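
As a sketch (the file name, group names, and the ``ds_foo``/``ds_bar`` datasets
are illustrative), writing two groups into one file and reading one back::

    ds_foo.to_netcdf('example-groups.nc', group='foo')
    ds_bar.to_netcdf('example-groups.nc', group='foo/bar', mode='a')
    ds_bar_roundtrip = xr.open_dataset('example-groups.nc', group='foo/bar')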

.. _io.encoding:

Reading encoded data
~~~~~~~~~~~~~~~~~~~~

NetCDF files follow some conventions for encoding datetime arrays (as numbers
with a "units" attribute) and for packing and unpacking data (as
described by the "scale_factor" and "add_offset" attributes). If the argument
``decode_cf=True`` (default) is given to :py:func:`~xarray.open_dataset`, xarray will attempt
to automatically decode the values in the netCDF objects according to
`CF conventions`_. Sometimes this will fail, for example, if a variable
has an invalid "units" or "calendar" attribute. For these cases, you can
turn this decoding off manually.
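
For instance, a short sketch (reusing the file written above) of opening a file
without CF decoding and decoding it explicitly afterwards::

    ds_raw = xr.open_dataset('saved_on_disk.nc', decode_cf=False)
    ds_decoded = xr.decode_cf(ds_raw)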

Note that all operations that manipulate variables other than indexing
will remove encoding information.

.. ipython:: python
    :suppress:

    import os
    os.remove('saved_on_disk.nc')

.. _combining multiple files:

Reading multi-file datasets
...........................

NetCDF files are often encountered in collections, e.g., with different files
corresponding to different model runs or one file per timestamp.
xarray can straightforwardly combine such files into a single Dataset by making use of
:py:func:`~xarray.concat`, :py:func:`~xarray.merge`, :py:func:`~xarray.combine_nested` and
:py:func:`~xarray.combine_by_coords`. For details on the difference between these
functions see :ref:`combining data`.
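
For instance, a minimal sketch (the file names are illustrative) of combining
two runs by hand with :py:func:`~xarray.combine_by_coords`::

    ds_run1 = xr.open_dataset('run1.nc')
    ds_run2 = xr.open_dataset('run2.nc')
    combined = xr.combine_by_coords([ds_run1, ds_run2])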

Xarray includes support for manipulating datasets that don't fit into memory
with dask_. If you have dask installed, you can open multiple files
simultaneously in parallel using :py:func:`~xarray.open_mfdataset`::

    xr.open_mfdataset('my/files/*.nc', parallel=True)

This function automatically concatenates and merges multiple files into a
single xarray dataset.
It is the recommended way to open multiple files with xarray.
For more details on parallel reading, see :ref:`combining.multi`, :ref:`dask.io` and a
`blog post`_ by Stephan Hoyer.
:py:func:`~xarray.open_mfdataset` takes many kwargs that allow you to
control its behaviour (e.g. ``parallel``, ``combine``, ``compat``, ``join``, ``concat_dim``).
See its docstring for more details.
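
For example, a small sketch (the file pattern and dimension name are
illustrative) of combining files explicitly along a named dimension::

    xr.open_mfdataset('my/files/*.nc', combine='nested', concat_dim='time')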

.. note::

    A common use-case involves a dataset distributed across a large number of files with
    each file containing a large number of variables. Commonly a few of these variables
    need to be concatenated along a dimension (say ``"time"``), while the rest are equal
    across the datasets (ignoring floating point differences). The following command
    with suitable modifications (such as ``parallel=True``) works well with such datasets::

        xr.open_mfdataset('my/files/*.nc', concat_dim="time",
                          data_vars='minimal', coords='minimal', compat='override')

    This command concatenates variables along the ``"time"`` dimension, but only those that
    already contain the ``"time"`` dimension (``data_vars='minimal', coords='minimal'``).
    Variables that lack the ``"time"`` dimension are taken from the first dataset
    (``compat='override'``).

.. _dask: http://dask.pydata.org
.. _blog post: http://stephanhoyer.com/2015/06/11/xray-dask-out-of-core-labeled-arrays/
306
+ Sometimes multi-file datasets are not conveniently organized for easy use of :py:func: `~xarray.open_mfdataset `.
307
+ One can use the ``preprocess `` argument to provide a function that takes a dataset
308
+ and returns a modified Dataset.
309
+ :py:func: `~xarray.open_mfdataset ` will call ``preprocess `` on every dataset
310
+ (corresponding to each file) prior to combining them.
311
+
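
For example, a minimal sketch (the variable name and file pattern are
illustrative) that keeps only a single variable from each file before the
files are combined::

    def keep_precip(ds):
        # drop everything except the one variable we actually need
        return ds[['precip']]

    ds = xr.open_mfdataset('my/files/*.nc', preprocess=keep_precip)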

If :py:func:`~xarray.open_mfdataset` does not meet your needs, other approaches are possible.
The general pattern for parallel reading of multiple files
using dask, modifying those datasets and then combining into a single ``Dataset`` is::

    import dask
    import xarray as xr

    def modify(ds):
        # modify ds here
        return ds


    # this is basically what open_mfdataset does
    # (file_names is a list of paths to the files to open)
    open_kwargs = dict(decode_cf=True, decode_times=False)
    open_tasks = [dask.delayed(xr.open_dataset)(f, **open_kwargs) for f in file_names]
    tasks = [dask.delayed(modify)(task) for task in open_tasks]
    datasets = dask.compute(*tasks)  # get a tuple of xarray.Datasets
    combined = xr.combine_nested(datasets, concat_dim='time')  # or some combination of concat, merge

As an example, here's how we could approximate ``MFDataset`` from the netCDF4
library::

    from glob import glob
    import xarray as xr

    def read_netcdfs(files, dim):
        # glob expands paths with * to a list of files, like the unix shell
        paths = sorted(glob(files))
        datasets = [xr.open_dataset(p) for p in paths]
        combined = xr.concat(datasets, dim)
        return combined

    combined = read_netcdfs('/all/my/files/*.nc', dim='time')

This function will work in many cases, but it's not very robust. First, it
never closes files, which means it will fail if you need to load more than
a few thousand files. Second, it assumes that you want all the data from each
file and that it can all fit into memory. In many situations, you only need
a small subset or an aggregated summary of the data from each file.

Here's a slightly more sophisticated example of how to remedy these
deficiencies::

    def read_netcdfs(files, dim, transform_func=None):
        def process_one_path(path):
            # use a context manager, to ensure the file gets closed after use
            with xr.open_dataset(path) as ds:
                # transform_func should do some sort of selection or
                # aggregation
                if transform_func is not None:
                    ds = transform_func(ds)
                # load all data from the transformed dataset, to ensure we can
                # use it after closing each original file
                ds.load()
                return ds

        paths = sorted(glob(files))
        datasets = [process_one_path(p) for p in paths]
        combined = xr.concat(datasets, dim)
        return combined

    # here we suppose we only care about the combined mean of each file;
    # you might also use indexing operations like .sel to subset datasets
    combined = read_netcdfs('/all/my/files/*.nc', dim='time',
                            transform_func=lambda ds: ds.mean())

This pattern works well and is very robust. We've used similar code to process
tens of thousands of files constituting 100s of GB of data.

.. _io.netcdf.writing_encoded:

Writing encoded data
~~~~~~~~~~~~~~~~~~~~

For CSV files, one might also consider `xarray_extras`_.

.. _xarray_extras: https://xarray-extras.readthedocs.io/en/latest/api/csv.html

.. _IO tools: http://pandas.pydata.org/pandas-docs/stable/io.html