Skip to content

Track and improve the performance of allele counting method #49

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
eric-czech opened this issue Jul 16, 2020 · 7 comments
Open

Track and improve the performance of allele counting method #49

eric-czech opened this issue Jul 16, 2020 · 7 comments
Labels
core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc. performance

Comments

@eric-czech
Copy link
Collaborator

The solution to https://github.com/pystatgen/sgkit/issues/3 in https://github.com/pystatgen/sgkit/pull/36 is naive and possibly unacceptably slow. This will be true if Dask does not optimize the loop over allele indexes to a single pass on the genotypes array (which it probably won't).

The extension to this proposed in https://github.com/pystatgen/sgkit/pull/36#issuecomment-656611356 would definitely solve the problem in a single pass if Dask supported counting rows like numpy does, but it currently doesn't.

There may be some other efficient ways to do it without dropping down to writing custom kernels but in any case, we should track the performance of this implementation (and others) as part of a benchmark suite like @alimanfoo mentioned in https://github.com/pystatgen/sgkit/pull/36#issuecomment-658893949 so we can measure the impact of future iterations more passively and prevent regressions.

@alimanfoo
Copy link
Collaborator

Thanks for picking this up @eric-czech, I think I started writing something but then got distracted, and you summarised it much better than I would've.

@hammer hammer added performance core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc. labels Jul 16, 2020
@hammer
Copy link
Contributor

hammer commented Jul 17, 2020

if Dask supported counting rows like numpy does, but it currently doesn't.

Is there an upstream issue we can track?

@eric-czech
Copy link
Collaborator Author

I added one: dask/dask#6423

@tomwhite
Copy link
Collaborator

Can this be closed now?

@eric-czech
Copy link
Collaborator Author

I added https://github.com/pystatgen/sgkit/issues/68 to track the benchmarking discussion so I think this can be closed now.

@eric-czech
Copy link
Collaborator Author

eric-czech commented Aug 16, 2020

Reopening largely to revisit numba, re: https://github.com/pystatgen/sgkit/pull/114.

I was generally hoping to avoid dropping down to numba as much as possible because it then means users have no freedom to choose array backends that the computations run on. I think this case, counting alleles for variants, is a bit of a grey area for that since it is supported somewhat well with any implementations of bincount (as in https://github.com/pystatgen/sgkit/pull/36) applied to larger arrays.

Counting alleles for individual calls (https://github.com/pystatgen/sgkit/issues/85) is different though and afaik, there really is no way to utilize bincount even remotely efficiently. All the vectorization would be in python since it doesn't support reduction across an axis. If numba is going to be necessary for that (which I think it is), then it may make sense to start thinking of counting alleles for variants as a sum of the counts for individual calls, i.e.:

def count_call_alleles(ds) -> DataArray:
    return xr.DataArray(..., dims=('variants', 'samples', 'alleles'))

def count_variant_alleles(ds) -> DataArray:
    count_call_alleles(ds).sum(dim='samples')

Benchmarking would still be helpful in making these decisions, but I'm fairly certain that the same approach for counting variant alleles will be unacceptably slow for counting call alleles. So I think this should be reopened since they're related.

@tomwhite
Copy link
Collaborator

Related to #348

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc. performance
Projects
None yet
Development

No branches or pull requests

4 participants