pass dask compute/persist args through from load/compute/perist#1543
pass dask compute/persist args through from load/compute/perist#1543jhamman merged 10 commits intopydata:masterfrom jhamman:feature/dask_compute_args
Conversation
| if not isinstance(self._data, np.ndarray): | ||
| if isinstance(self._data, dask_array_type): | ||
| self._data = np.asarray(self._data.compute(**kwargs)) | ||
| elif not isinstance(self._data, np.ndarray): |
There was a problem hiding this comment.
This clause should be removed as it causes inconsistent behaviour with numpy scalar types. I cannot think of any other use case where data is neither a dask ARRAY nor a numpy ndarray?
There was a problem hiding this comment.
I think this allows for on-disk type arrays. @shoyer, any thoughts on calling np.asarray here and in the line above?
There was a problem hiding this comment.
Yes, we need this to support on-disk arrays that aren't backed by dask. (I'd love to get rid of this in favor of always using dask, but dask has some limitations that make this tricky.)
xarray/core/variable.py
Outdated
| """ | ||
| if not isinstance(self._data, np.ndarray): | ||
| if isinstance(self._data, dask_array_type): | ||
| self._data = np.asarray(self._data.compute(**kwargs)) |
There was a problem hiding this comment.
You don't want to invoke asarray if dask returns a scalar numpy type
There was a problem hiding this comment.
To be honest, we don't define what can go in _data as carefully as we ought to. I guess there are two ways to define it:
- anything "array like" that defines at least
shape,dtypeand__getitem__ - what comes out of
xarray.core.variable.as_compatible_data
Numpy scalars do actually pass through here (since they define all those attributes!)... but then would get converted into an array when calling .values anyways:
xarray/xarray/core/variable.py
Lines 334 to 337 in 78ca20a
So I guess I agree, but on the other hand I'm also a little nervous that a dask routine might return a non-numpy scalar, which would definitely break if we don't wrap it in asarray. The safe thing to do is to leave this as is or call as_compatible_data on it.
xarray/tests/test_dask.py
Outdated
| b4 = b.persist(get=dask.multiprocessing.get, num_workers=4) | ||
| assert b4._in_memory | ||
| assert_equal(b4, expected) | ||
|
|
There was a problem hiding this comment.
Redundant with the test below?
There was a problem hiding this comment.
fixed, these should have been Variables.
|
|
||
| def build_dask_array(): | ||
| global kernel_call_count | ||
| kernel_call_count = 0 |
There was a problem hiding this comment.
Missing tests for Variable
|
@jhamman happy to go with your version |
| if not isinstance(self._data, np.ndarray): | ||
| if isinstance(self._data, dask_array_type): | ||
| self._data = np.asarray(self._data.compute(**kwargs)) | ||
| elif not isinstance(self._data, np.ndarray): |
There was a problem hiding this comment.
Yes, we need this to support on-disk arrays that aren't backed by dask. (I'd love to get rid of this in favor of always using dask, but dask has some limitations that make this tricky.)
xarray/core/variable.py
Outdated
| """ | ||
| if not isinstance(self._data, np.ndarray): | ||
| if isinstance(self._data, dask_array_type): | ||
| self._data = np.asarray(self._data.compute(**kwargs)) |
There was a problem hiding this comment.
To be honest, we don't define what can go in _data as carefully as we ought to. I guess there are two ways to define it:
- anything "array like" that defines at least
shape,dtypeand__getitem__ - what comes out of
xarray.core.variable.as_compatible_data
Numpy scalars do actually pass through here (since they define all those attributes!)... but then would get converted into an array when calling .values anyways:
xarray/xarray/core/variable.py
Lines 334 to 337 in 78ca20a
So I guess I agree, but on the other hand I'm also a little nervous that a dask routine might return a non-numpy scalar, which would definitely break if we don't wrap it in asarray. The safe thing to do is to leave this as is or call as_compatible_data on it.
xarray/tests/test_dask.py
Outdated
| expected = Variable('x', [1, 4]) | ||
| b = a * a | ||
| # compute | ||
| b1 = b.compute(get=dask.multiprocessing.get) |
There was a problem hiding this comment.
The clean way to test this is probably with mock, e.g.,
In [63]: import numpy as np
In [64]: import dask.array as da
In [65]: from unittest import mock
In [66]: x = da.from_array(np.arange(3), chunks=(2,))
In [67]: with mock.patch.object(da.Array, 'compute', return_value=np.arange(3)) as mock_compute:
...: x.compute(foo='bar')
...:
In [68]: mock_compute.assert_called_with(foo='bar')
In [69]: mock_compute.assert_called_with(bar='foo')
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-69-22937cf26ca1> in <module>()
----> 1 mock_compute.assert_called_with(bar='foo')
~/conda/envs/xarray-dev/lib/python3.5/unittest/mock.py in assert_called_with(_mock_self, *args, **kwargs)
792 if expected != actual:
793 cause = expected if isinstance(expected, Exception) else None
--> 794 raise AssertionError(_error_message()) from cause
795
796
AssertionError: Expected call: compute(bar='foo')
Actual call: compute(foo='bar')
unittest.mock is part of Python 3's standard library, but there's also a widely used Python 2 backport on pypi. I think it would be perfectly fine to add it as a dependency for our test suite.
There was a problem hiding this comment.
Thanks @shoyer, I'll add mock as part of this PR.
shoyer
left a comment
There was a problem hiding this comment.
I think the docs also need an update to mention that mock is a test-time requirement on Python 2.
xarray/tests/test_dask.py
Outdated
| with mock.patch.object(Variable, method, | ||
| return_value=np.arange(3)) as mock_method: | ||
| getattr(x, method)(foo='bar') | ||
| mock_method.assert_called_with(foo='bar') |
There was a problem hiding this comment.
I think you can skip this first test. This verifies that Variable.compute() was called if you directly call Variable.compute().
xarray/tests/test_dask.py
Outdated
| import numpy as np | ||
| import pandas as pd | ||
| import pytest | ||
| import mock |
There was a problem hiding this comment.
On Python 3, use from unittest import mock to avoid adding the new dependency.
|
Can you also update Line 39 in 9a8e2c5 |
xarray/tests/test_dask.py
Outdated
| with suppress(ImportError): | ||
| import dask | ||
| import dask.array as da | ||
| import dask.multiprocessing |
git diff upstream/master | flake8 --diffwhats-new.rstfor all changes andapi.rstfor new APIcc @crusaderky - putting this up in case its useful to you.
I'm frankly not sure if my tests are doing what they need to but everything seems to be playing nicely.