Slow performance with groupby using a custom DataArray grouper #8377
Comments
Good find. I think we are open to this change. However, we test for this behavior, so a test will need updating. Are you interested in opening a PR?
Hmmm... this is some "optimization" I tried to add, and given all this additional complexity perhaps we can just delete it. For reference, it avoids changing the dask graph.
Great find @alessioarena, impressive work! Can I ask — are these already-sorted groupers common? If not, I agree with removing it; I wouldn't think it's that common. (If it is created by some internal function — possibly we could pass a flag from there instead?)
Yes, when you sort a sorted array. But we don't want to do this for numpy. I think this was a bad addition; let's remove it. There will be a failing test that can be deleted.
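(A minimal illustration of the dask-graph point above, assuming dask.array: indexing with even a trivial np.arange creates a new graph layer and a new array name, which is the churn the removed check was avoiding.)

```python
import dask.array as da
import numpy as np

x = da.ones((4,), chunks=2)
idx = np.arange(4)  # an identity indexer, i.e. "sorting a sorted array"

# Fancy-indexing still rewrites the dask graph: the result gets a new
# name (token), even though the values are unchanged.
assert x[idx].name != x.name
```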
Thanks all for jumping on this so quickly. I'm happy to do a PR if that is the preference, or to leave it to @dcherian to revert the addition. Thanks heaps for all the amazing work you are doing! I'm quite a heavy and happy user of xarray/dask.
Hi @alessioarena — a PR would be great if you'd be up for it — thank you!
What is your issue?
I have code that calculates a per-pixel nearest-neighbor match between two datasets, which is then used for a groupby + aggregation.
The calculation is generally performed lazily using dask.
I recently noticed slow groupby performance in this setup, with the lazy computation taking in excess of 10 minutes for an index of approximately 4000 by 4000.
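For reference, a minimal sketch of that pattern (names and sizes here are illustrative only; the real arrays are roughly 4000 by 4000 and the grouper holds nearest-neighbor match indices rather than random labels):

```python
import numpy as np
import xarray as xr

# Illustrative stand-in for the workload: a custom integer DataArray
# used as the groupby key. In the real code this holds per-pixel
# nearest-neighbor match indices computed lazily with dask.
data = xr.DataArray(np.random.rand(100, 100), dims=("y", "x"))
matches = xr.DataArray(
    np.random.randint(0, 50, size=(100, 100)), dims=("y", "x"), name="match"
)

result = data.groupby(matches).mean()
```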
I did a bit of digging around and noticed that the slow line is this test:

```python
duck_array_ops.array_equiv(k, np.arange(self.array.shape[idim]))
```

It is repeated multiple times, and despite each call being decently fast, together they amount to a lot of time. That could potentially be minimized by introducing a prior test of equal length, like the check sketched below. This would work better because, despite array_equiv performing that comparison itself, the array to test against is always created first using np.arange, and that is ultimately the bottleneck.
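A minimal sketch of the proposed short-circuit (the helper name is hypothetical; k, self.array, and idim follow the snippet above):

```python
import numpy as np
from xarray.core import duck_array_ops

# Hypothetical helper showing the idea: compare lengths first (O(1)), and
# only pay for the np.arange allocation plus element-wise comparison when
# the lengths already match.
def is_trivial_index(k, size):
    if k.shape[0] != size:
        return False
    return duck_array_ops.array_equiv(k, np.arange(size))

# e.g. is_trivial_index(k, self.array.shape[idim]) in place of the bare
# array_equiv call above.
```

Since the arange allocation dominates the cost, skipping it whenever the lengths differ removes most of the repeated overhead.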