
Fuse slices works with aliases in the graph #2364


Merged: 2 commits merged on May 30, 2017

Conversation

@jcrist (Member) commented May 19, 2017

In #2080 a bug was introduced that prevented fusing slices across
aliases in the graph. These aliases show up when several dask arrays are
concatenated together. This fixes that bug and adds a test that fusing
works across aliases.

Fixes #2355.
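As a user-level illustration of the scenario (a minimal sketch assuming current dask behavior, not code from this PR): concatenating dask arrays introduces alias tasks, and a thin slice through the result should optimize down to one small fused read per input chunk.

import numpy as np
import dask
import dask.array as da

# Concatenation introduces alias keys: tasks that simply point at other keys.
parts = [da.from_array(np.random.rand(12, 100, 100), chunks=(12, 100, 100))
         for _ in range(10)]
x = da.concatenate(parts, axis=0)

# A thin slice through every part. With fusion working across aliases, this
# should optimize to one small getitem per input chunk rather than a
# "load whole chunk, then slice it" pair per chunk.
ts = x[:, 0, 0]
(opt,) = dask.optimize(ts)
print(len(dict(opt.__dask_graph__())))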

@rabernat (Contributor) commented

This looks like a great fix! Thank you so much!

I just checked out your branch and gave it a try. It definitely works in some cases. Unfortunately, however, the default dask arrays created by xarray.open_mfdataset are still not benefiting from the new optimization. This is probably a downstream issue with xarray, but while I have your attention, it would be great to have a little help debugging.

Let me provide a quick, self-contained example. This creates a test dataset:

import numpy as np
import xarray as xr

nfiles = 10
nt = 12
all_files = []
for n in range(nfiles):
    data = np.random.rand(nt, 1000, 10000)
    time = (n * nt) + np.arange(nt)
    da = xr.DataArray(data, dims=['time', 'y', 'x'],
                      coords={'time': time})
    fname = 'test_data.%03d.nc' % n
    da.to_dataset(name='data').to_netcdf(fname)
    all_files.append(fname)

Here is what does work: manually concatenating the datasets:

all_dsets = [xr.open_dataset(fname).chunk() for fname in all_files]
ds_concat = xr.concat(all_dsets, dim='time')
%timeit ts = ds_concat.data[:, 0, 0].load()

On my system this gives 10 loops, best of 3: 13.5 ms per loop, i.e. very fast, which shows that the optimization has worked.

Here is what does not work: using xarray's open_mfdataset function, which is by far the most common way users load data:

ds = xr.open_mfdataset(all_files, decode_cf=False)
%timeit ts = ds.data[:, 0, 0].load()

This gives 1 loop, best of 3: 7.35 s per loop, which is the same slow speed I was getting before your PR.

I cannot find any differences in the dask graphs of these two xarray datasets. Nevertheless, there is roughly a factor of 500 difference in performance, indicating that your fix is being applied to ds_concat but not to ds.
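For reference, a minimal sketch of one way to compare the two graphs after optimization (assuming the ds_concat and ds objects created above; ds['data'].data is the dask array underneath the xarray variable):

import dask

fast = ds_concat['data'].data[:, 0, 0]   # from the manual concat above
slow = ds['data'].data[:, 0, 0]          # from open_mfdataset

(fast_opt,) = dask.optimize(fast)
(slow_opt,) = dask.optimize(slow)
print(len(dict(fast_opt.__dask_graph__())),
      len(dict(slow_opt.__dask_graph__())))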

Is there any further info I can provide that can help debug what might be going on?

cc to @shoyer, who might have some insight into the xarray side of things.

@jcrist (Member, Author) commented May 20, 2017

Ah, that's because open_mfdataset adds a lock to the reads, and our optimizations don't handle that case (but they should). Will fix.
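For context, a minimal sketch (assuming from_array's lock argument, not code from this PR) of the locked-read scenario at the dask level:

import threading

import numpy as np
import dask
import dask.array as da

# Passing a lock changes the leaf "read" tasks in the graph; an optimization
# that only recognizes the lock-free form skips them, so slices never fuse in.
lock = threading.Lock()
x = da.from_array(np.random.rand(120, 100, 100), chunks=(12, 100, 100),
                  lock=lock)
ts = x[:, 0, 0]

(opt,) = dask.optimize(ts)
print(len(dict(opt.__dask_graph__())))  # small once fusion handles locked reads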

@shoyer (Member) commented May 20, 2017

Once we get this fixed, we should think about how we could add an integration test for this behavior in xarray (since it has major performance implications).

@jcrist (Member, Author) commented May 22, 2017

@rabernat, this should be fixed with the recent commit.

@jcrist (Member, Author) commented May 22, 2017

A different fix would have been to change all the get* functions to also take a lock. I went down this path (and still have the commit saved), but it changed many, many lines, and this fix was simpler. However, there is an issue with the current approach: a user can provide a custom getitem function to from_array that doesn't accept a lock, and also provide a lock, which will result in faulty behavior.

The getitem keyword was added in #2272 to support custom getitems. However, IIUC the intent of this keyword was to avoid calling np.asarray on chunks; an equivalent but simpler fix would have been to add an asarray keyword that defaults to True. Then we could standardize on all get* functions taking a lock without the possibility of user error. Another downside of the current code is that slice fusing doesn't work if there's a custom get* function, as the optimization doesn't know about it (which would be fixed if we had our own getitem_with_lock function).
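To make that concrete, a rough sketch (not dask's actual implementation) of a single getter that takes both a lock and an asarray flag:

import numpy as np

def getter(arr, index, asarray=True, lock=None):
    # Read arr[index], optionally under a lock, optionally coercing the
    # result with np.asarray (the behavior a custom getitem opts out of).
    if lock is not None:
        lock.acquire()
    try:
        out = arr[index]
    finally:
        if lock is not None:
            lock.release()
    if asarray:
        out = np.asarray(out)
    return out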

I think that removing the getitem keyword and standardizing on all get* functions taking a lock would be the better and more robust fix in the long run, but the current fix is simpler. Ping @mrocklin for thoughts.

@rabernat (Contributor) commented

I can confirm that my original issue (pydata/xarray#1396) looks to be fixed by this PR. 😄 Unfortunately I can't add much insight to the dask design questions.

@shoyer: 👍 to the xarray integration test.

@bradyrx commented May 27, 2017

It seems like I'm hitting a similar bottleneck when extracting values from my dask-backed xarray DataArray after loading my data through open_mfdataset.

I have a 40x12 DataArray, calMean, that was filtered out of a 34x384x320x1032 xarray Dataset; the filtering itself is fast (on the order of tens of milliseconds).

However, any attempt to extract the values for plotting, or to save the 40x12 array to a netCDF file, takes about 3 minutes. (The example shows np.asarray(), but .load(), .values, and .to_netcdf() had similar timings.)

%time data = np.asarray(calMean)
CPU times: user 2min 24s, sys: 43.3 s, total: 3min 7s
Wall time: 2min 49s

I'm eager to see this fix merged as soon as possible to speed up my interactive analysis. Thanks all.

@mrocklin (Member) commented

> I think that removing the getitem keyword and standardizing on all get* functions taking a lock would be the better and more robust fix in the long run, but the current fix is simpler. Ping @mrocklin for thoughts.

The choice to accept custom getitem functions was intended to be a release valve for advanced users. I think that there is some value to this. These users are typically able to handle writing their own optimization functions if necessary.

@jcrist (Member, Author) commented May 30, 2017

> The choice to accept custom getitem functions was intended to be a release valve for advanced users.

From the timeline, I assume this was added to support your work with sparse, where the custom getitem was used to avoid calling asarray? What other things do you see being done with a custom getitem besides avoiding that asarray call? I can't think of any. If that's it, I think dropping this functionality and replacing it with a boolean keyword (asarray=True? Or maybe subok, to follow numpy?) would be simpler and more robust.

@mrocklin (Member) commented

This was originally motivated by private users with custom use cases that were more advanced than sparse.

@jcrist (Member, Author) commented May 30, 2017

Ok. I'll merge this as is then.

Linked issue that may be closed by merging this pull request: poor optimization of slicing operations on netCDF-backed xarray datasets