xarray.backends refactor #2261

Merged
merged 45 commits into from
Oct 9, 2018
45 commits
4faaf3a
WIP: xarray.backends.file_manager for managing file objects.
shoyer Jul 1, 2018
c82a38c
Switch rasterio to use FileManager
shoyer Jul 1, 2018
7a55a30
lint fixes
shoyer Jul 4, 2018
51463dd
WIP: rewrite FileManager to always use an LRUCache
shoyer Jul 9, 2018
23e132f
Test coverage
shoyer Jul 10, 2018
8fc8183
Don't use move_to_end
shoyer Jul 10, 2018
422944b
minor clarification
shoyer Jul 10, 2018
aea0a1a
Switch FileManager.acquire() to a method
shoyer Jul 11, 2018
4366c0b
Python 2 compat
shoyer Jul 11, 2018
f35b7e7
Update xarray.set_options() to add file_cache_maxsize and validation
shoyer Jul 11, 2018
057cad2
Add assert for FILE_CACHE.maxsize
shoyer Jul 11, 2018
0f3e656
More docstring for FileManager
shoyer Jul 11, 2018
1a0cc10
Add accidentally omitted tests for LRUCache
shoyer Jul 11, 2018
8784e6b
Merge branch 'master' into file-manager
shoyer Jul 28, 2018
83d9b10
Adapt scipy backend to use FileManager
shoyer Jul 28, 2018
a0074ff
Stickler fix
shoyer Jul 28, 2018
062ba96
Fix failure on Python 2.7
shoyer Jul 29, 2018
2d41b29
Finish adjusting backends to use FileManager
shoyer Jul 29, 2018
2adf486
Fix bad import
shoyer Jul 30, 2018
76f151c
WIP on distributed
shoyer Aug 1, 2018
769f079
More WIP
shoyer Aug 6, 2018
3e97264
Merge branch 'master' into file-manager
shoyer Aug 17, 2018
5e67efe
Fix distributed write tests
shoyer Aug 19, 2018
8dc77c4
Merge branch 'master' into file-manager
shoyer Aug 19, 2018
1d38335
Fixes
shoyer Aug 19, 2018
6350ca6
Minor fixup
shoyer Aug 20, 2018
4aa0df7
whats new
shoyer Aug 30, 2018
67377c7
More refactoring: remove state from backends entirely
shoyer Aug 31, 2018
8c00f44
Merge branch 'master' into file-manager
shoyer Sep 6, 2018
2a5d1f0
Cleanup
shoyer Sep 6, 2018
a6c170b
Fix failing in-memory datastore tests
shoyer Sep 6, 2018
009e30d
Fix inaccessible datastore
shoyer Sep 6, 2018
14118ea
fix autoclose warnings
shoyer Sep 6, 2018
c778488
Fix PyNIO failures
shoyer Sep 6, 2018
fe14ebf
No longer disable HDF5 file locking
shoyer Sep 7, 2018
f1026ce
whats new and default file cache size
shoyer Sep 7, 2018
e13406b
Whats new tweak
shoyer Sep 7, 2018
465dfae
Refactor default lock logic to backend classes
shoyer Sep 10, 2018
55d35c8
Rename get_resource_lock -> get_write_lock
shoyer Sep 10, 2018
c8fbadc
Don't acquire unnecessary locks in __getitem__
shoyer Sep 10, 2018
ede8ef0
Merge branch 'master' into file-manager
shoyer Sep 26, 2018
220c302
Merge branch 'master' into file-manager
shoyer Oct 8, 2018
36f1156
Fix bad merge
shoyer Oct 9, 2018
c6f43dd
Fix import
shoyer Oct 9, 2018
8916bc7
Remove unreachable code
shoyer Oct 9, 2018
1 change: 1 addition & 0 deletions asv_bench/asv.conf.json
@@ -64,6 +64,7 @@
        "scipy": [""],
        "bottleneck": ["", null],
        "dask": [""],
        "distributed": [""],
    },


41 changes: 41 additions & 0 deletions asv_bench/benchmarks/dataset_io.py
@@ -1,5 +1,7 @@
from __future__ import absolute_import, division, print_function

import os

import numpy as np
import pandas as pd

@@ -14,6 +16,9 @@
    pass


os.environ['HDF5_USE_FILE_LOCKING'] = 'FALSE'


class IOSingleNetCDF(object):
    """
    A few examples that benchmark reading/writing a single netCDF file with
@@ -405,3 +410,39 @@ def time_open_dataset_scipy_with_time_chunks(self):
        with dask.set_options(get=dask.multiprocessing.get):
            xr.open_mfdataset(self.filenames_list, engine='scipy',
                              chunks=self.time_chunks)


def create_delayed_write():
    import dask.array as da
    vals = da.random.random(300, chunks=(1,))
    ds = xr.Dataset({'vals': (['a'], vals)})
    return ds.to_netcdf('file.nc', engine='netcdf4', compute=False)


class IOWriteNetCDFDask(object):
    timeout = 60
    repeat = 1
    number = 5

    def setup(self):
        requires_dask()
        self.write = create_delayed_write()

    def time_write(self):
        self.write.compute()


class IOWriteNetCDFDaskDistributed(object):
    def setup(self):
        try:
            import distributed
        except ImportError:
            raise NotImplementedError
        self.client = distributed.Client()
        self.write = create_delayed_write()

    def cleanup(self):
        self.client.shutdown()

    def time_write(self):
        self.write.compute()
3 changes: 3 additions & 0 deletions doc/api.rst
@@ -624,3 +624,6 @@ arguments for the ``from_store`` and ``dump_to_store`` Dataset methods:
   backends.H5NetCDFStore
   backends.PydapDataStore
   backends.ScipyDataStore
   backends.FileManager
   backends.CachingFileManager
   backends.DummyFileManager
19 changes: 16 additions & 3 deletions doc/whats-new.rst
@@ -33,14 +33,27 @@ v0.11.0 (unreleased)
Breaking changes
~~~~~~~~~~~~~~~~

- Xarray's storage backends now automatically open and close files when
  necessary, rather than requiring opening a file with ``autoclose=True``. A
  global least-recently-used cache is used to store open files; the default
  limit of 128 open files should suffice in most cases, but can be adjusted if
  necessary with
  ``xarray.set_options(file_cache_maxsize=...)``. The ``autoclose`` argument
  to ``open_dataset`` and related functions has been deprecated and is now a
  no-op.

  This change, along with an internal refactor of xarray's storage backends,
  should significantly improve performance when reading and writing
  netCDF files with Dask, especially when working with many files or using
  Dask Distributed. By `Stephan Hoyer <https://github.com/shoyer>`_

Documentation
~~~~~~~~~~~~~
- Reduction of :py:meth:`DataArray.groupby` and :py:meth:`DataArray.resample`
  without dimension argument will change in the next release.
  Now we warn a FutureWarning.
  By `Keisuke Fujii <https://github.com/fujiisoup>`_.

Documentation
~~~~~~~~~~~~~

Enhancements
~~~~~~~~~~~~

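The whats-new entry in this diff describes a global least-recently-used cache of open files with a configurable size limit. The following is a minimal standard-library sketch of that idea, not xarray's actual `CachingFileManager` or `LRUCache` code. The class name `LRUFileCache` and its methods are hypothetical, it handles read-only files only, and it assumes that silently closing an evicted file and reopening it on the next request is acceptable.

```python
import collections
import os
import tempfile
import threading


class LRUFileCache:
    """Hypothetical sketch: keep at most ``maxsize`` files open for reading."""

    def __init__(self, maxsize):
        if maxsize < 1:
            raise ValueError("maxsize must be at least 1")
        self.maxsize = maxsize
        self._files = collections.OrderedDict()  # path -> open file object
        self._lock = threading.RLock()

    def acquire(self, path):
        """Return an open file for ``path``, reopening it if it was evicted."""
        with self._lock:
            if path in self._files:
                # Cache hit: mark this file as the most recently used.
                self._files.move_to_end(path)
                return self._files[path]
            f = open(path, "r")
            self._files[path] = f
            # Evict and close least recently used files above the limit.
            while len(self._files) > self.maxsize:
                _, oldest = self._files.popitem(last=False)
                oldest.close()
            return f

    def close_all(self):
        """Close every cached file and empty the cache."""
        with self._lock:
            while self._files:
                _, f = self._files.popitem()
                f.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        paths = []
        for i in range(3):
            path = os.path.join(tmpdir, "file%d.txt" % i)
            with open(path, "w") as out:
                out.write("data %d" % i)
            paths.append(path)

        cache = LRUFileCache(maxsize=2)
        handles = [cache.acquire(p) for p in paths]
        # With room for two files, opening the third evicts the first.
        print([h.closed for h in handles])
        cache.close_all()
```

Callers in this sketch must call `acquire` again each time they need the file rather than holding on to a handle, because any handle may be closed once enough other files have been opened.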
4 changes: 4 additions & 0 deletions xarray/backends/__init__.py
@@ -4,6 +4,7 @@
formats. They should not be used directly, but rather through Dataset objects.
"""
from .common import AbstractDataStore
from .file_manager import FileManager, CachingFileManager, DummyFileManager
from .memory import InMemoryDataStore
from .netCDF4_ import NetCDF4DataStore
from .pydap_ import PydapDataStore
@@ -15,6 +16,9 @@

__all__ = [
    'AbstractDataStore',
    'FileManager',
    'CachingFileManager',
    'DummyFileManager',
    'InMemoryDataStore',
    'NetCDF4DataStore',
    'PydapDataStore',