Pickle and .value vs. dask backend #902

Closed
crusaderky opened this issue Jul 19, 2016 · 6 comments

Comments

@crusaderky
Contributor

Pickling an xarray.DataArray with a dask backend causes it to resolve .data to a numpy array.
This is not desirable, as there are legitimate use cases where you may want to, e.g., save a computation for later, or send it somewhere across the network.

Analogously, auto-converting a dask-backed xarray object to a numpy-backed one as soon as you invoke the .values property is probably nice when you are working in a Jupyter terminal, but not in general-purpose code, particularly when xarray sits at the foundation of a very complex framework. Most of my headaches so far have come from trying to figure out when, where and why the dask backend was replaced with numpy.

IMHO a module-wide switch to disable implicit dask->numpy conversion would be a nice solution.
A new method, compute(), could explicitly convert in place from dask to numpy.
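
To illustrate, here is a minimal sketch of the current behaviour and the proposed explicit conversion (the compute() call at the end is the method proposed above, not an existing API):

```python
import pickle

import numpy as np
import xarray as xr

# a dask-backed DataArray (requires dask to be installed)
arr = xr.DataArray(np.arange(1000), dims='x').chunk(100)

# today: both of these silently resolve the dask graph to numpy
vals = arr.values        # computes the whole array and caches it on arr
buf = pickle.dumps(arr)  # the pickle contains the computed numpy data

# proposed: keep the above lazy, and convert only on explicit request
# arr = arr.compute()    # hypothetical new method: dask -> numpy
```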

@shoyer
Member

shoyer commented Jul 19, 2016

I agree about loading data into memory automatically -- this behavior made sense before we used dask in xarray, but it doesn't really anymore.

We actually already have a .load() method for explicitly loading data into memory, though it might make sense to add .compute() as an alias, possibly one that returns a new object without modifying the original dataset in place.
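
Roughly, the distinction would look like this (a sketch, assuming ds is a dask-backed Dataset; compute() is the proposed addition, not a current method):

```python
# existing: .load() evaluates the dask graphs, replaces the data on
# ds's variables with numpy arrays in place, and returns ds itself
ds.load()

# proposed: evaluate the same graphs but return a new, numpy-backed
# object, leaving ds lazy
# result = ds.compute()  # hypothetical
```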

I'm a little less certain about how to handle pickling, because if we stop loading data automatically, anything you open from disk using open_dataset isn't going to pickle. But on the other hand, it's also not hard to explicitly write .load() or .compute() before using pickle or invoking multiprocessing.
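
The explicit workaround would be short in any case -- a sketch, assuming a netCDF file at a hypothetical path 'data.nc':

```python
import pickle

import xarray as xr

ds = xr.open_dataset('data.nc')  # hypothetical file; holds an open file handle

# without automatic loading, pickle.dumps(ds) here would hit the
# unpicklable file handle, so pull everything into memory first:
ds.load()
buf = pickle.dumps(ds)  # safe: all variables are now in-memory numpy arrays
```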

@crusaderky
Contributor Author

I'm happy to look into this - could you point me in the right direction?

@shoyer
Member

shoyer commented Aug 15, 2016

This is where you can find the core caching logic on Variable objects:

```python
@data.setter
def data(self, data):
    data = as_compatible_data(data)
    if data.shape != self.shape:
        raise ValueError(
            "replacement data must match the Variable's shape")
    self._data = data

def _data_cached(self):
    if not isinstance(self._data, (np.ndarray, PandasIndexAdapter)):
        self._data = np.asarray(self._data)
    return self._data

@property
def _indexable_data(self):
    return orthogonally_indexable(self._data)

def load(self):
    """Manually trigger loading of this variable's data from disk or a
    remote source into memory and return this variable.

    Normally, it should not be necessary to call this method in user code,
    because all xarray functions should either work on deferred data or
    load data automatically.
    """
    self._data_cached()
    return self

def load_data(self):  # pragma: no cover
    warnings.warn('the Variable method `load_data` has been deprecated; '
                  'use `load` instead',
                  FutureWarning, stacklevel=2)
    return self.load()

def __getstate__(self):
    """Always cache data as an in-memory array before pickling"""
    self._data_cached()
    # self.__dict__ is the default pickle object, we don't need to
    # implement our own __setstate__ method to make pickle work
    return self.__dict__

@property
def values(self):
    """The variable's data as a numpy.ndarray"""
    return _as_array_or_item(self._data_cached())

@values.setter
def values(self, values):
    self.data = values
```

Here's where we define load on Dataset and DataArray:

```python
def load(self):
    """Manually trigger loading of this dataset's data from disk or a
    remote source into memory and return this dataset.

    Normally, it should not be necessary to call this method in user code,
    because all xarray functions should either work on deferred data or
    load data automatically. However, this method can be necessary when
    working with many file objects on disk.
    """
    # access .data to coerce everything to numpy or dask arrays
    all_data = dict((k, v.data) for k, v in self.variables.items())
    lazy_data = dict((k, v) for k, v in all_data.items()
                     if isinstance(v, dask_array_type))
    if lazy_data:
        import dask.array as da

        # evaluate all the dask arrays simultaneously
        evaluated_data = da.compute(*lazy_data.values())
        for k, data in zip(lazy_data, evaluated_data):
            self.variables[k].data = data
    return self
```

```python
def load(self):
    """Manually trigger loading of this array's data from disk or a
    remote source into memory and return this array.

    Normally, it should not be necessary to call this method in user code,
    because all xarray functions should either work on deferred data or
    load data automatically. However, this method can be necessary when
    working with many file objects on disk.
    """
    ds = self._to_temp_dataset().load()
    new = self._from_temp_dataset(ds)
    self._variable = new._variable
    self._coords = new._coords
    return self
```

As I mentioned before, let's add .compute() to evaluate and return a new object, and use it for .values instead of caching. .load() can remain unchanged for when users actually want to cache data. And we can definitely disable automatically loading data in pickles.
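
A minimal sketch of what that compute() could look like, reusing the load() shown above (hypothetical, untested):

```python
def compute(self):
    """Evaluate any dask arrays and return a new, numpy-backed object,
    leaving this dataset untouched."""
    # a shallow copy is enough here: load() reassigns .data on the
    # copied Variable objects rather than mutating the shared dask arrays
    new = self.copy(deep=False)
    return new.load()
```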

@crusaderky
Contributor Author

Working on it now.
One thing I didn't understand: do you want to disable caching for all backends (NetCDF etc.), or only for dask?
The dask-only change is very straightforward; doing it for all backends is much less so...

@shoyer
Member

shoyer commented Sep 25, 2016

@crusaderky Let's just disable caching for dask.

@crusaderky
Contributor Author

I'm done... I think. The result is less clean than I would have hoped - suggestions are welcome.
#1018
