Skip to content

Determine why so many dask graphs are executed during QC filtering #299

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
eric-czech opened this issue Oct 5, 2020 · 2 comments
Open

Comments

@eric-czech
Copy link
Collaborator

There are 38 computations run when evaluating this short pipeline so we need to better understand what Xarray is telling dask to do here:

def qc(ds):    
    return (
        ds
        .pipe(sg.variant_stats)
        .assign(variant_af=lambda ds: ds.variant_allele_frequency.min(dim="alleles"))
        .assign(sample_call_rate=lambda ds: call_rate(ds, dim="variants").sample_call_rate)
        .pipe(lambda ds: ds.isel(variants=ds.variant_call_rate > .98))
        .pipe(lambda ds: ds.isel(samples=ds.sample_call_rate > .98))
        .pipe(lambda ds: ds.isel(variants=ds.variant_af > .05))
    )

ds = sg.simulate_genotype_call_dataset(100000, 100)
del ds.attrs['contigs']
ds.to_zarr('/tmp/ds.zarr', mode='w')
ds = xr.open_zarr('/tmp/ds.zarr', concat_characters=False)

%%time
with ProgressBar():
    ds_qc = qc(ds)
[########################################] | 100% Completed |  0.1s
[########################################] | 100% Completed |  0.1s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
[########################################] | 100% Completed |  0.2s
CPU times: user 16.9 s, sys: 1.67 s, total: 18.6 s
Wall time: 8.17 s
@tomwhite
Copy link
Collaborator

I started looking into this, and it seems that ds.isel evaluates immediately, and does so for every variable in the dataset with that dimension.

So ds.isel(variants=ds.variant_call_rate > .98)) will cause 14 evaluations (at least) if there are 14 variables with a variants dimension. This is very inefficient since the same thing is being re-evaluated each time.

The slightly altered form ds.isel(variants=(ds.variant_call_rate > 0.98).values) which forces the evaluation before the indexing causes just a single evaluation. This might be a good workaround for the moment.

Ideally, Xarray would be able to defer evaluation, but I'm not sure how easy that would be to achieve, since it ultimately depends on what Dask can support. Dask can handle unknown chunk sizes, but with limited effect: you can't do another filtering operation, for example. You also can't do a rechunk operation - which we often want to do before saving the output of a filtering step (to Zarr) - this won't work since Dask can't rechunk arrays with unknown chunk sizes. (BTW Dask's suggested workaround in this situation is to call compute_chunk_sizes() on the array. This forces an immediate evaluation of the array to get concrete sizes, but it's not clear if this helps at all in the context of Xarray datasets where there can be multiple variables that need the same index.)

So perhaps we should see if there's a way to fix Xarray so it doesn't do the multiple re-evaluations?

@eric-czech
Copy link
Collaborator Author

FYI I thought this was definitely worth trying to get an explanation for, so I filed pydata/xarray#4663.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants