Pickle and .value vs. dask backend #902

Closed
crusaderky opened this issue Jul 19, 2016 · 6 comments

Comments

@crusaderky
Contributor

Pickling an xarray.DataArray with a dask backend causes it to resolve .data to a numpy array.
This is not desirable, as there are legitimate use cases where you may want to, e.g., save a computation for later, or send it somewhere across the network.

Analogously, auto-converting a dask-backed xarray object to a numpy-backed one as soon as you invoke the .values property is probably nice when you are working in a Jupyter terminal, but not in general-purpose code, particularly when xarray sits at the foundation of a very complex framework. Most of my headaches so far have come from trying to figure out when, where and why the dask backend was replaced with numpy.

IMHO a module-wide switch to disable implicit dask->numpy conversion would be a nice solution.
A new method, compute(), could explicitly convert in place from dask to numpy.
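
To illustrate, here is a minimal sketch of the current behaviour and the proposed explicit conversion (the compute() call at the end is the method proposed above, not an existing API):

```python
import pickle

import numpy as np
import xarray as xr

# a dask-backed DataArray (requires dask to be installed)
arr = xr.DataArray(np.arange(1000), dims='x').chunk(100)

# today: both of these silently resolve the dask graph to numpy
vals = arr.values        # computes the whole array and caches it on arr
buf = pickle.dumps(arr)  # the pickle contains the computed numpy data

# proposed: keep the above lazy, and convert only on explicit request
# arr = arr.compute()    # hypothetical new method: dask -> numpy
```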

@shoyer
Member

shoyer commented Jul 19, 2016

I agree about loading data into memory automatically -- this behavior made sense before we used dask in xarray, but it doesn't really anymore.

We actually already have a .load() method for explicitly loading data into memory, though it might make sense to add .compute() as an alias, possibly one that returns a new object without modifying the original dataset in place.
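
Roughly, the distinction would look like this (a sketch, assuming ds is a dask-backed Dataset; compute() is the proposed addition, not a current method):

```python
# existing: .load() evaluates the dask graphs, replaces the data on
# ds's variables with numpy arrays in place, and returns ds itself
ds.load()

# proposed: evaluate the same graphs but return a new, numpy-backed
# object, leaving ds lazy
# result = ds.compute()  # hypothetical
```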

I'm a little less certain about how to handle pickling, because if we stop loading data automatically, anything you open from disk using open_dataset isn't going to pickle. But on the other hand, it's also not hard to explicitly write .load() or .compute() before using pickle or invoking multiprocessing.
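
The explicit workaround would be short in any case -- a sketch, assuming a netCDF file at a hypothetical path 'data.nc':

```python
import pickle

import xarray as xr

ds = xr.open_dataset('data.nc')  # hypothetical file; holds an open file handle

# without automatic loading, pickle.dumps(ds) here would hit the
# unpicklable file handle, so pull everything into memory first:
ds.load()
buf = pickle.dumps(ds)  # safe: all variables are now in-memory numpy arrays
```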

@crusaderky
Contributor Author

I'm happy to look into this - could you point me in the right direction?

@shoyer
Member

shoyer commented Aug 15, 2016

This is where you can find the core caching logic on Variable objects:

```python
@data.setter
def data(self, data):
    data = as_compatible_data(data)
    if data.shape != self.shape:
        raise ValueError(
            "replacement data must match the Variable's shape")
    self._data = data

def _data_cached(self):
    if not isinstance(self._data, (np.ndarray, PandasIndexAdapter)):
        self._data = np.asarray(self._data)
    return self._data

@property
def _indexable_data(self):
    return orthogonally_indexable(self._data)

def load(self):
    """Manually trigger loading of this variable's data from disk or a
    remote source into memory and return this variable.

    Normally, it should not be necessary to call this method in user code,
    because all xarray functions should either work on deferred data or
    load data automatically.
    """
    self._data_cached()
    return self

def load_data(self):  # pragma: no cover
    warnings.warn('the Variable method `load_data` has been deprecated; '
                  'use `load` instead',
                  FutureWarning, stacklevel=2)
    return self.load()

def __getstate__(self):
    """Always cache data as an in-memory array before pickling"""
    self._data_cached()
    # self.__dict__ is the default pickle object, we don't need to
    # implement our own __setstate__ method to make pickle work
    return self.__dict__

@property
def values(self):
    """The variable's data as a numpy.ndarray"""
    return _as_array_or_item(self._data_cached())

@values.setter
def values(self, values):
    self.data = values
```

Here's where we define load on Dataset and DataArray:

```python
def load(self):
    """Manually trigger loading of this dataset's data from disk or a
    remote source into memory and return this dataset.

    Normally, it should not be necessary to call this method in user code,
    because all xarray functions should either work on deferred data or
    load data automatically. However, this method can be necessary when
    working with many file objects on disk.
    """
    # access .data to coerce everything to numpy or dask arrays
    all_data = dict((k, v.data) for k, v in self.variables.items())
    lazy_data = dict((k, v) for k, v in all_data.items()
                     if isinstance(v, dask_array_type))
    if lazy_data:
        import dask.array as da

        # evaluate all the dask arrays simultaneously
        evaluated_data = da.compute(*lazy_data.values())
        for k, data in zip(lazy_data, evaluated_data):
            self.variables[k].data = data
    return self
```

```python
def load(self):
    """Manually trigger loading of this array's data from disk or a
    remote source into memory and return this array.

    Normally, it should not be necessary to call this method in user code,
    because all xarray functions should either work on deferred data or
    load data automatically. However, this method can be necessary when
    working with many file objects on disk.
    """
    ds = self._to_temp_dataset().load()
    new = self._from_temp_dataset(ds)
    self._variable = new._variable
    self._coords = new._coords
    return self
```

As I mentioned before, let's add .compute() to evaluate and return a new object, and use it for .values instead of caching. .load() can remain unchanged for when users actually want to cache data. And we can definitely disable automatically loading data in pickles.
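
A minimal sketch of what that compute() could look like, reusing the load() shown above (hypothetical, untested):

```python
def compute(self):
    """Evaluate any dask arrays and return a new, numpy-backed object,
    leaving this dataset untouched."""
    # a shallow copy is enough here: load() reassigns .data on the
    # copied Variable objects rather than mutating the shared dask arrays
    new = self.copy(deep=False)
    return new.load()
```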

@crusaderky
Contributor Author

Working on it now.
One thing I didn't understand: do you want to disable caching for all backends (NetCDF etc.), or only for dask?
The dask-only change is very straightforward; doing it for all backends is much less so...

@shoyer
Member

shoyer commented Sep 25, 2016

@crusaderky Let's just disable caching for dask.

@crusaderky
Contributor Author

I'm done... I think. The result is less clean than I would have hoped - suggestions are welcome.
#1018
