Track and improve the performance of allele counting method #49

eric-czech · 2020-07-16T17:02:06Z

The solution to https://github.com/pystatgen/sgkit/issues/3 in https://github.com/pystatgen/sgkit/pull/36 is naive and may be unacceptably slow. It will be slow if Dask does not optimize the loop over allele indexes into a single pass over the genotypes array (which it probably won't).
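For readers without the PR open, here is a minimal NumPy sketch of what "a loop over allele indexes" looks like. It is only an illustration, not the code from the PR: the helper name `count_alleles_naive`, the `(variants, samples, ploidy)` layout, and the use of negative values for missing calls are all assumptions.

```python
import numpy as np

def count_alleles_naive(gt: np.ndarray, n_alleles: int) -> np.ndarray:
    """Count alleles per variant with one full pass over `gt` per allele.

    Hypothetical helper. Assumes `gt` has shape (variants, samples, ploidy)
    and that negative values mark missing calls.
    """
    # Each comparison scans the whole genotype array again, so the cost
    # grows with the number of alleles unless the passes are fused.
    counts = [(gt == a).sum(axis=(1, 2)) for a in range(n_alleles)]
    return np.stack(counts, axis=-1)  # shape: (variants, alleles)

gt = np.array([
    [[0, 0], [0, 1], [1, 1]],
    [[0, 2], [-1, -1], [2, 2]],
])
ac = count_alleles_naive(gt, n_alleles=3)
print(ac)
```

With a lazy array library the same pattern builds one comparison-and-sum per allele, which is where the concern about multiple passes comes from.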

The extension to this proposed in https://github.com/pystatgen/sgkit/pull/36#issuecomment-656611356 would definitely solve the problem in a single pass if Dask supported counting rows like numpy does, but it currently doesn't.

There may be some other efficient ways to do it without dropping down to writing custom kernels but in any case, we should track the performance of this implementation (and others) as part of a benchmark suite like @alimanfoo mentioned in https://github.com/pystatgen/sgkit/pull/36#issuecomment-658893949 so we can measure the impact of future iterations more passively and prevent regressions.

alimanfoo · 2020-07-16T17:12:12Z

Thanks for picking this up @eric-czech, I think I started writing something but then got distracted, and you summarised it much better than I would've.

hammer · 2020-07-17T15:56:44Z

> if Dask supported counting rows like numpy does, but it currently doesn't.

Is there an upstream issue we can track?

eric-czech · 2020-07-17T16:47:14Z

I added one: dask/dask#6423

tomwhite · 2020-07-27T08:39:02Z

Can this be closed now?

eric-czech · 2020-07-27T12:07:32Z

I added https://github.com/pystatgen/sgkit/issues/68 to track the benchmarking discussion so I think this can be closed now.

eric-czech · 2020-08-16T18:35:14Z

Reopening largely to revisit numba, re: https://github.com/pystatgen/sgkit/pull/114.

I was generally hoping to avoid dropping down to numba as much as possible because it means users have no freedom to choose the array backend the computations run on. I think this case, counting alleles for variants, is a bit of a grey area for that since it is supported somewhat well by any implementation of bincount (as in https://github.com/pystatgen/sgkit/pull/36) applied to larger arrays.
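As an illustration of bincount applied to a whole array at once, one common trick is to shift each variant's allele indexes into its own range and call bincount a single time. This is a sketch under assumptions (the helper name `count_alleles_bincount`, the array layout, and the negative-means-missing convention are mine), and it is not necessarily how the PR does it.

```python
import numpy as np

def count_alleles_bincount(gt: np.ndarray, n_alleles: int) -> np.ndarray:
    """Per-variant allele counts from a single np.bincount call.

    Hypothetical helper. Assumes `gt` has shape (variants, samples, ploidy)
    and that negative values mark missing calls.
    """
    n_variants = gt.shape[0]
    flat = gt.reshape(n_variants, -1)
    # Give every variant its own block of `n_alleles` bins.
    offsets = np.arange(n_variants)[:, None] * n_alleles
    keys = flat + offsets
    # Drop missing calls and any allele index outside the expected range.
    valid = (flat >= 0) & (flat < n_alleles)
    counts = np.bincount(keys[valid], minlength=n_variants * n_alleles)
    return counts.reshape(n_variants, n_alleles)

gt = np.array([
    [[0, 0], [0, 1], [1, 1]],
    [[0, 2], [-1, -1], [2, 2]],
])
ac_bincount = count_alleles_bincount(gt, n_alleles=3)
print(ac_bincount)
```

The offset trick works because bincount only counts over a flat array. It gets less attractive when the output needs a samples dimension as well.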

Counting alleles for individual calls (https://github.com/pystatgen/sgkit/issues/85) is different though, and afaik there really is no way to use bincount even remotely efficiently. All the vectorization would be in Python, since bincount doesn't support reduction across an axis. If numba is going to be necessary for that (which I think it is), then it may make sense to start thinking of counting alleles for variants as a sum of the counts for individual calls, i.e.:

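import xarray as xr
from xarray import DataArray

def count_call_alleles(ds) -> DataArray:
    # Per-call allele counts; the actual computation is elided here
    return xr.DataArray(..., dims=('variants', 'samples', 'alleles'))

def count_variant_alleles(ds) -> DataArray:
    # Variant-level counts are the per-call counts summed over samples
    return count_call_alleles(ds).sum(dim='samples')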

Benchmarking would still be helpful in making these decisions, but I'm fairly certain that the same approach for counting variant alleles will be unacceptably slow for counting call alleles. So I think this should be reopened since they're related.
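To make the "variant counts as a sum of call counts" idea concrete, here is a small NumPy sketch. It uses broadcasting rather than numba, so it allocates a `(variants, samples, ploidy, alleles)` intermediate that a compiled kernel would avoid. The `_np` function names, the array-based signatures, and the negative-means-missing convention are assumptions for illustration only.

```python
import numpy as np

def count_call_alleles_np(gt: np.ndarray, n_alleles: int) -> np.ndarray:
    """Allele counts per call, shape (variants, samples, alleles).

    Hypothetical array-based stand-in for the dataset-level function.
    Assumes `gt` has shape (variants, samples, ploidy); negative values
    (missing calls) match no allele and so contribute nothing.
    """
    # Compare every allele slot against every allele index, then sum
    # over the ploidy axis.
    one_hot = gt[..., None] == np.arange(n_alleles)
    return one_hot.sum(axis=-2)

def count_variant_alleles_np(gt: np.ndarray, n_alleles: int) -> np.ndarray:
    """Variant-level counts as the sum of call-level counts over samples."""
    return count_call_alleles_np(gt, n_alleles).sum(axis=1)

gt = np.array([
    [[0, 0], [0, 1], [1, 1]],
    [[0, 2], [-1, -1], [2, 2]],
])
call_counts = count_call_alleles_np(gt, n_alleles=3)
variant_counts = count_variant_alleles_np(gt, n_alleles=3)
print(call_counts.shape)
print(variant_counts)
```

The summed result matches what a direct per-variant count gives for the same genotypes, which is the relationship the two function stubs above rely on.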

tomwhite · 2020-10-26T17:54:26Z

Related to #348

hammer added the "performance" and "core operations" labels Jul 16, 2020

eric-czech closed this as completed Jul 27, 2020

eric-czech mentioned this issue Aug 16, 2020 in "[WIP] count_allele_calls #114" (merged)

eric-czech reopened this Aug 16, 2020
