Slow performance of DataArray.unstack() from checking variable.data #5902
Oh, hmm... So perhaps a slight adjustment along these lines:

```diff
diff --git a/xarray/core/dataset.py b/xarray/core/dataset.py
index 550c3587..16637574 100644
--- a/xarray/core/dataset.py
+++ b/xarray/core/dataset.py
@@ -4159,14 +4159,14 @@ class Dataset(DataWithCoords, DatasetArithmetic, Mapping):
             # Dask arrays don't support assignment by index, which the fast unstack
             # function requires.
             # https://github.com/pydata/xarray/pull/4746#issuecomment-753282125
-            any(is_duck_dask_array(v.data) for v in self.variables.values())
+            any(is_duck_dask_array(v) for v in self.variables.values())
             # Sparse doesn't currently support (though we could special-case
             # it)
             # https://github.com/pydata/sparse/issues/422
-            or any(
-                isinstance(v.data, sparse_array_type)
-                for v in self.variables.values()
-            )
+            # or any(
+            #     isinstance(v.data, sparse_array_type)
+            #     for v in self.variables.values()
+            # )
             or sparse
             # Until https://github.com/pydata/xarray/pull/4751 is resolved,
             # we check explicitly whether it's a numpy array. Once that is
@@ -4177,9 +4177,9 @@ class Dataset(DataWithCoords, DatasetArithmetic, Mapping):
             # # or any(
             # #     isinstance(v.data, pint_array_type) for v in self.variables.values()
             # # )
-            or any(
-                not isinstance(v.data, np.ndarray) for v in self.variables.values()
-            )
+            # or any(
+            #     not isinstance(v.data, np.ndarray) for v in self.variables.values()
+            # )
         ):
             result = result._unstack_full_reindex(dim, fill_value, sparse)
         else:
diff --git a/xarray/core/pycompat.py b/xarray/core/pycompat.py
index d1649235..e9669105 100644
--- a/xarray/core/pycompat.py
+++ b/xarray/core/pycompat.py
@@ -44,6 +44,12 @@ class DuckArrayModule:
 def is_duck_dask_array(x):
+    from xarray.core.variable import IndexVariable, Variable
+    if isinstance(x, IndexVariable):
+        return False
+    elif isinstance(x, Variable):
+        x = x.data
+
     if DuckArrayModule("dask").available:
         from dask.base import is_dask_collection
```

That's completely ignoring the accesses to `v.data` elsewhere in the condition, though.
(warning: untested code)

Instead of looking at all of `self.variables`, we could exclude the indexes up front:

```python
nonindexes = set(self.variables) - set(self.indexes)
# or alternatively make a list of multiindex variable names and exclude those
# then the condition becomes
any(is_duck_dask_array(self.variables[v].data) for v in nonindexes)
```
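The set difference above is cheap because it only touches variable *names*, never `.data`. A rough illustration with toy dicts standing in for `Dataset.variables` and `Dataset.indexes` (both are mappings, so `set()` over them yields their keys):

```python
# Toy mappings standing in for Dataset.variables / Dataset.indexes.
# Values are placeholders; only the keys matter for the set difference.
variables = {"temp": "dask-array", "pixel": "multiindex", "x": "level", "y": "level"}
indexes = {"pixel": None, "x": None, "y": None}

# Index variables (including MultiIndex levels) are excluded up front,
# so the expensive .data access never happens for them.
nonindexes = set(variables) - set(indexes)
print(sorted(nonindexes))  # ['temp']
```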
PS: It doesn't seem like the bottleneck in your case, but #5582 has an alternative proposal for unstacking dask arrays.
What happened:

Calling `DataArray.unstack()` spends time allocating an object-dtype NumPy array from the values of the pandas MultiIndex.

What you expected to happen:

Faster unstack.

Minimal Complete Verifiable Example:

Anything else we need to know?:

For this example, >99% of the time is spent on this line:

xarray/xarray/core/dataset.py, line 4162 in df76461

That's the `v.data` for the `pixel` array, which is a pandas MultiIndex. Just going by the comments, it does seem like accessing `v.data` is necessary to perform the check. I wonder if we could make `is_duck_dask_array` a bit smarter, to avoid unnecessarily allocating data?

Alternatively, if that's too difficult, perhaps we could add a flag to `unstack` to disable those checks and just take the "slow" path. In my actual use case, the slow `_unstack_full_reindex` is necessary since I have large Dask arrays. But even then, the unstack completes in less than 3 s, while I was getting OOM killed on the `v.data` checks.

Environment:
Output of xr.show_versions()
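The allocation the report describes can be seen directly in pandas: converting a MultiIndex to a NumPy array produces an object-dtype array of Python tuples, built eagerly and scaling with the index length. A tiny sketch (toy sizes here; the report involves a far larger index):

```python
import numpy as np
import pandas as pd

# A small MultiIndex; in the report this has many more entries.
idx = pd.MultiIndex.from_product([range(3), ["a", "b"]], names=["row", "col"])

# This is essentially what materializing the index as an array triggers:
# an object-dtype array holding one Python tuple per element.
arr = np.asarray(idx)
print(arr.dtype)  # object
print(arr[0])     # (0, 'a')
```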