Determine why so many dask graphs are executed during QC filtering #299

eric-czech · 2020-10-05T09:37:07Z

There are 38 computations run when evaluating this short pipeline so we need to better understand what Xarray is telling dask to do here:

def qc(ds):    
    return (
        ds
        .pipe(sg.variant_stats)
        .assign(variant_af=lambda ds: ds.variant_allele_frequency.min(dim="alleles"))
        .assign(sample_call_rate=lambda ds: call_rate(ds, dim="variants").sample_call_rate)
        .pipe(lambda ds: ds.isel(variants=ds.variant_call_rate > .98))
        .pipe(lambda ds: ds.isel(samples=ds.sample_call_rate > .98))
        .pipe(lambda ds: ds.isel(variants=ds.variant_af > .05))
    )

ds = sg.simulate_genotype_call_dataset(100000, 100)
del ds.attrs['contigs']
ds.to_zarr('/tmp/ds.zarr', mode='w')
ds = xr.open_zarr('/tmp/ds.zarr', concat_characters=False)

%%time
with ProgressBar():
    ds_qc = qc(ds)
[########################################] | 100% Completed |  0.1s
[########################################] | 100% Completed |  0.1s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
CPU times: user 16.9 s, sys: 1.67 s, total: 18.6 s
Wall time: 8.17 s

tomwhite · 2020-11-23T17:10:45Z

I started looking into this, and it seems that ds.isel evaluates immediately, and does so for every variable in the dataset with that dimension.

So ds.isel(variants=ds.variant_call_rate > .98)) will cause 14 evaluations (at least) if there are 14 variables with a variants dimension. This is very inefficient since the same thing is being re-evaluated each time.

The slightly altered form ds.isel(variants=(ds.variant_call_rate > 0.98).values) which forces the evaluation before the indexing causes just a single evaluation. This might be a good workaround for the moment.

Ideally, Xarray would be able to defer evaluation, but I'm not sure how easy that would be to achieve, since it ultimately depends on what Dask can support. Dask can handle unknown chunk sizes, but with limited effect: you can't do another filtering operation, for example. You also can't do a rechunk operation - which we often want to do before saving the output of a filtering step (to Zarr) - this won't work since Dask can't rechunk arrays with unknown chunk sizes. (BTW Dask's suggested workaround in this situation is to call compute_chunk_sizes() on the array. This forces an immediate evaluation of the array to get concrete sizes, but it's not clear if this helps at all in the context of Xarray datasets where there can be multiple variables that need the same index.)

So perhaps we should see if there's a way to fix Xarray so it doesn't do the multiple re-evaluations?

eric-czech · 2020-12-08T19:18:07Z

FYI I thought this was definitely worth trying to get an explanation for, so I filed pydata/xarray#4663.

eric-czech mentioned this issue Oct 5, 2020

Debug "Slicing is producing a large chunk" warning #300

Open

tomwhite mentioned this issue Nov 5, 2020

Genome-wide selection scans #232

Closed

3 tasks

eric-czech mentioned this issue Dec 8, 2020

Fancy indexing a Dataset with dask DataArray triggers multiple computes pydata/xarray#4663

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Determine why so many dask graphs are executed during QC filtering #299

Determine why so many dask graphs are executed during QC filtering #299

eric-czech commented Oct 5, 2020

tomwhite commented Nov 23, 2020

eric-czech commented Dec 8, 2020

Determine why so many dask graphs are executed during QC filtering #299

Determine why so many dask graphs are executed during QC filtering #299

Comments

eric-czech commented Oct 5, 2020

tomwhite commented Nov 23, 2020

eric-czech commented Dec 8, 2020