Fuse slices works with alias in graph #2364
In dask#2080 a bug was introduced that prevented fusing slices across aliases in the graph. These would show up when several dask arrays were concatenated together. This fixes that bug, and adds a test that fusing works across aliases.
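For readers unfamiliar with the structure involved, here is a minimal sketch of a graph containing an alias and what fusing slices across it produces. The keys and graph are hypothetical illustrations, not dask's actual internals:

```python
from operator import getitem

# Hypothetical dask-style task graph: 'alias' is a key that simply
# points at another key, as appears when arrays are concatenated.
dsk = {
    'x': list(range(10)),
    'alias': 'x',                          # alias task: key -> key
    'y': (getitem, 'alias', slice(2, 8)),
    'z': (getitem, 'y', slice(1, 3)),
}

# With fusion working across the alias, the optimizer can rewrite the
# chain of slices into a single getitem on the original data:
fused = {
    'x': list(range(10)),
    'z': (getitem, 'x', slice(3, 5)),      # slice(2, 8) then slice(1, 3)
}
```

The bug meant the optimizer stopped at the `'alias'` task instead of following it through to `'x'`, leaving the two `getitem` tasks unfused.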
This looks like a great fix! Thank you so much! I just checked out your branch and gave it a try. It definitely works in some cases. However, unfortunately the default dask arrays created by xarray's `open_mfdataset` do not seem to benefit. Let me provide a quick self-contained example. This creates a test dataset:

```python
import numpy as np
import xarray as xr

nfiles = 10
nt = 12
all_files = []
for n in range(nfiles):
    data = np.random.rand(nt, 1000, 10000)
    time = (n * nt) + np.arange(nt)
    da = xr.DataArray(data, dims=['time', 'y', 'x'],
                      coords={'time': time})
    fname = 'test_data.%03d.nc' % n
    da.to_dataset(name='data').to_netcdf(fname)
    all_files.append(fname)
```

Here is what does work: manually concatenating the datasets.

```python
all_dsets = [xr.open_dataset(fname).chunk() for fname in all_files]
ds_concat = xr.concat(all_dsets, dim='time')
%%timeit ts = ds_concat.data[:, 0, 0].load()
```

On my system, this completes quickly. Here is what does not work: using xarray's `open_mfdataset`.

```python
ds = xr.open_mfdataset(all_files, decode_cf=False)
%%timeit ts = ds.data[:, 0, 0].load()
```

This is far slower. I cannot find any differences in the dask graphs of these two different xarray datasets. Nevertheless, there is nearly a factor of 1000 difference in performance, indicating that your fix is not being applied in the `open_mfdataset` case.

Is there any further info I can provide that can help debug what might be going on? cc to @shoyer, who might have some insight into the xarray side of things.
Ah, that's because the …

Once we get this fixed, we should think about how we could add an integration test for this behavior in xarray (since it has major performance implications).

@rabernat, this should be fixed with the recent commit.

A different fix would have changed all the … I think that removing the …

I can confirm that my original issue (pydata/xarray#1396) looks to be fixed by this PR. 😄 Unfortunately I can't add much insight to the dask design questions. @shoyer: 👍 to the xarray integration test.
It seems like I'm hitting a similar bottleneck when extracting values from my dask DataArray (xarray) after loading my data through `open_mfdataset`. I have a 40x12 DataArray. However, any attempt to extract the values for plotting, or to save the 40x12 array to a netCDF file, takes 3 minutes. (Example shows …)

I'm eager to see this fix merged as soon as possible to speed up my interactive analysis. Thanks all.
The choice to accept custom getitem functions was intended to be a release valve for advanced users, and I think there is some value to this. These users are typically able to handle writing their own optimization functions if necessary.
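As an illustration of why custom getters interact with optimization (a hypothetical graph and getter, not code from dask's codebase), a user-supplied getter is opaque to a fusion pass that only recognizes known getter functions:

```python
from operator import getitem

def my_getter(coll, idx):
    # User-supplied getter. A slice-fusion pass that only recognizes
    # operator.getitem (and dask's own getters) must leave this task
    # alone: it cannot know the call is equivalent to plain indexing.
    return coll[idx]

# Hypothetical graph mixing a custom getter with a plain getitem:
dsk = {
    'x': list(range(100)),
    'y': (my_getter, 'x', slice(10, 50)),  # opaque: blocks fusion here
    'z': (getitem, 'y', slice(0, 5)),      # fusable only with getitem chains
}
```

This is why such users are expected to write their own optimization functions if they want fusion to see through their getters.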
From the timeline, I assume this was added to support your work with `sparse`?
This was originally motivated by private users with custom use cases that were more advanced than sparse.

Ok. I'll merge this as is then.
Fixes #2355.
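The core rewrite that slice fusion performs, collapsing two successive slices into one, can be sketched as follows. This is a simplified illustration assuming positive steps and slice-only indexing; dask's actual slice-fusion logic handles many more cases (negative steps, integer indices, tuples of slices):

```python
def compose_slices(first, second, length):
    """Return one slice equivalent to applying `first` then `second`
    to a sequence of the given length.

    Hypothetical helper for illustration: assumes positive steps and
    slice-only indexing.
    """
    start1, stop1, step1 = first.indices(length)
    # Number of elements the first slice yields (positive step only).
    n = max(0, (stop1 - start1 + step1 - 1) // step1)
    start2, stop2, step2 = second.indices(n)
    return slice(start1 + start2 * step1,
                 start1 + stop2 * step1,
                 step1 * step2)
```

For example, `compose_slices(slice(2, 8), slice(1, 3), 10)` yields `slice(3, 5, 1)`, so `x[2:8][1:3]` becomes the single task `x[3:5]`.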