
Fuse slices works with aliases in the graph #2364


Merged: 2 commits merged on May 30, 2017

Conversation

@jcrist (Member) commented May 19, 2017

In #2080 a bug was introduced that prevented fusing slices across
aliases in the graph. These aliases show up when several dask arrays are
concatenated together. This fixes that bug and adds a test that fusing
works across aliases.

Fixes #2355.
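As a user-level illustration of the scenario (a minimal sketch assuming current dask behavior, not code from this PR): concatenating dask arrays introduces alias tasks, and a thin slice through the result should optimize down to one small fused read per input chunk.

import numpy as np
import dask
import dask.array as da

# Concatenation introduces alias keys: tasks that simply point at other keys.
parts = [da.from_array(np.random.rand(12, 100, 100), chunks=(12, 100, 100))
         for _ in range(10)]
x = da.concatenate(parts, axis=0)

# A thin slice through every part. With fusion working across aliases, this
# should optimize to one small getitem per input chunk rather than a
# "load whole chunk, then slice it" pair per chunk.
ts = x[:, 0, 0]
(opt,) = dask.optimize(ts)
print(len(dict(opt.__dask_graph__())))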

@rabernat (Contributor) commented

This looks like a great fix! Thank you so much!

I just checked out your branch and gave it a try. It definitely works in some cases. Unfortunately, however, the default dask arrays created by xarray.open_mfdataset are still not benefiting from the new optimization. This is probably a downstream issue with xarray, but while I have your attention, it would be great to have a little help debugging.

Let me provide a quick, self-contained example. This creates a test dataset:

import numpy as np
import xarray as xr

nfiles = 10
nt = 12
all_files = []
for n in range(nfiles):
    data = np.random.rand(nt, 1000, 10000)
    time = (n * nt) + np.arange(nt)
    da = xr.DataArray(data, dims=['time', 'y', 'x'],
                      coords={'time': time})
    fname = 'test_data.%03d.nc' % n
    da.to_dataset(name='data').to_netcdf(fname)
    all_files.append(fname)

Here is what does work: manually concatenating the datasets:

all_dsets = [xr.open_dataset(fname).chunk() for fname in all_files]
ds_concat = xr.concat(all_dsets, dim='time')
%timeit ts = ds_concat.data[:, 0, 0].load()

On my system this gives 10 loops, best of 3: 13.5 ms per loop, i.e. very fast, which shows that the optimization has worked.

Here is what does not work: using xarray's open_mfdataset function, which is by far the most common way users load data:

ds = xr.open_mfdataset(all_files, decode_cf=False)
%timeit ts = ds.data[:, 0, 0].load()

This gives 1 loop, best of 3: 7.35 s per loop, which is the same slow speed I was getting before your PR.

I cannot find any differences in the dask graphs of these two xarray datasets. Nevertheless, there is roughly a factor of 500 difference in performance, indicating that your fix is being applied to ds_concat but not to ds.
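For reference, a minimal sketch of one way to compare the two graphs after optimization (assuming the ds_concat and ds objects created above; ds['data'].data is the dask array underneath the xarray variable):

import dask

fast = ds_concat['data'].data[:, 0, 0]   # from the manual concat above
slow = ds['data'].data[:, 0, 0]          # from open_mfdataset

(fast_opt,) = dask.optimize(fast)
(slow_opt,) = dask.optimize(slow)
print(len(dict(fast_opt.__dask_graph__())),
      len(dict(slow_opt.__dask_graph__())))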

Is there any further info I can provide that can help debug what might be going on?

cc to @shoyer, who might have some insight into the xarray side of things.

@jcrist (Member, Author) commented May 20, 2017

Ah, that's because open_mfdataset adds a lock to the reads, and our optimizations don't handle that case (but they should). Will fix.
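For context, a minimal sketch (assuming from_array's lock argument, not code from this PR) of the locked-read scenario at the dask level:

import threading

import numpy as np
import dask
import dask.array as da

# Passing a lock changes the leaf "read" tasks in the graph; an optimization
# that only recognizes the lock-free form skips them, so slices never fuse in.
lock = threading.Lock()
x = da.from_array(np.random.rand(120, 100, 100), chunks=(12, 100, 100),
                  lock=lock)
ts = x[:, 0, 0]

(opt,) = dask.optimize(ts)
print(len(dict(opt.__dask_graph__())))  # small once fusion handles locked reads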

@shoyer (Member) commented May 20, 2017

Once we get this fixed, we should think about how we could add an integration test for this behavior in xarray (since it has major performance implications).

@jcrist (Member, Author) commented May 22, 2017

@rabernat, this should be fixed with the recent commit.

@jcrist (Member, Author) commented May 22, 2017

A different fix would have been to change all the get* functions to also take a lock. I went down this path (and still have the commit saved), but it changed many, many lines, and this fix was simpler. However, there is an issue with the current approach: a user can provide a custom getitem function to from_array that doesn't accept a lock, and also provide a lock, which will result in faulty behavior.

The getitem keyword was added in #2272 to support custom getitems. However, IIUC the intent of this keyword was to avoid calling np.asarray on chunks; an equivalent but simpler fix would have been to add an asarray keyword that defaults to True. Then we could standardize on all get* functions taking a lock without the possibility of user error. Another downside of the current code is that slice fusing doesn't work if there's a custom get* function, as the optimization doesn't know about it (which would be fixed if we had our own getitem_with_lock function).
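To make that concrete, a rough sketch (not dask's actual implementation) of a single getter that takes both a lock and an asarray flag:

import numpy as np

def getter(arr, index, asarray=True, lock=None):
    # Read arr[index], optionally under a lock, optionally coercing the
    # result with np.asarray (the behavior a custom getitem opts out of).
    if lock is not None:
        lock.acquire()
    try:
        out = arr[index]
    finally:
        if lock is not None:
            lock.release()
    if asarray:
        out = np.asarray(out)
    return out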

I think that removing the getitem keyword and standardizing on all get* functions taking a lock would be the better and more robust fix in the long run, but the current fix is simpler. Ping @mrocklin for thoughts.

@rabernat (Contributor) commented

I can confirm that my original issue (pydata/xarray#1396) looks to be fixed by this PR. 😄 Unfortunately I can't add much insight to the dask design questions.

@shoyer: 👍 to the xarray integration test.

@bradyrx commented May 27, 2017

It seems like I'm hitting a similar bottleneck when extracting values from my dask-backed xarray DataArray after loading my data through open_mfdataset.

I have a 40x12 DataArray, calMean, that was filtered out of a 34x384x320x1032 xarray Dataset; the filtering itself is fast (on the order of tens of milliseconds).

However, any attempt to extract the values for plotting, or to save the 40x12 array to a netCDF file, takes about 3 minutes. (The example shows np.asarray(), but .load(), .values, and .to_netcdf() had similar timings.)

%time data = np.asarray(calMean)
CPU times: user 2min 24s, sys: 43.3 s, total: 3min 7s
Wall time: 2min 49s

I'm eager to see this fix merged as soon as possible to speed up my interactive analysis. Thanks all.

@mrocklin (Member) commented

> I think that removing the getitem keyword and standardizing on all get* functions taking a lock would be the better and more robust fix in the long run, but the current fix is simpler. Ping @mrocklin for thoughts.

The choice to accept custom getitem functions was intended to be a release valve for advanced users. I think that there is some value to this. These users are typically able to handle writing their own optimization functions if necessary.

@jcrist (Member, Author) commented May 30, 2017

> The choice to accept custom getitem functions was intended to be a release valve for advanced users.

From the timeline, I assume this was added to support your work with sparse, where the custom getitem was used to avoid calling asarray? What other things do you see being done with a custom getitem besides avoiding that asarray call? I can't think of any. If that's it, I think dropping this functionality and replacing it with a boolean keyword (asarray=True? Or maybe subok, to follow numpy?) would be simpler and more robust.

@mrocklin (Member) commented

This was originally motivated by private users with custom use cases that were more advanced than sparse.

@jcrist (Member, Author) commented May 30, 2017

Ok. I'll merge this as is then.

Linked issue that may be closed by merging this pull request: poor optimization of slicing operations on netCDF-backed xarray datasets