-
Notifications
You must be signed in to change notification settings - Fork 35
Track and improve the performance of allele counting method #49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Thanks for picking this up @eric-czech, I think I started writing something but then got distracted, and you summarised it much better than I would've. |
Is there an upstream issue we can track? |
I added one: dask/dask#6423 |
Can this be closed now? |
I added https://github.com/pystatgen/sgkit/issues/68 to track the benchmarking discussion so I think this can be closed now. |
Reopening largely to revisit numba, re: https://github.com/pystatgen/sgkit/pull/114. I was generally hoping to avoid dropping down to numba as much as possible because it then means users have no freedom to choose array backends that the computations run on. I think this case, counting alleles for variants, is a bit of a grey area for that since it is supported somewhat well with any implementations of Counting alleles for individual calls (https://github.com/pystatgen/sgkit/issues/85) is different though and afaik, there really is no way to utilize def count_call_alleles(ds) -> DataArray:
return xr.DataArray(..., dims=('variants', 'samples', 'alleles'))
def count_variant_alleles(ds) -> DataArray:
count_call_alleles(ds).sum(dim='samples') Benchmarking would still be helpful in making these decisions, but I'm fairly certain that the same approach for counting variant alleles will be unacceptably slow for counting call alleles. So I think this should be reopened since they're related. |
Related to #348 |
The solution to https://github.com/pystatgen/sgkit/issues/3 in https://github.com/pystatgen/sgkit/pull/36 is naive and possibly unacceptably slow. This will be true if Dask does not optimize the loop over allele indexes to a single pass on the genotypes array (which it probably won't).
The extension to this proposed in https://github.com/pystatgen/sgkit/pull/36#issuecomment-656611356 would definitely solve the problem in a single pass if Dask supported counting rows like numpy does, but it currently doesn't.
There may be some other efficient ways to do it without dropping down to writing custom kernels but in any case, we should track the performance of this implementation (and others) as part of a benchmark suite like @alimanfoo mentioned in https://github.com/pystatgen/sgkit/pull/36#issuecomment-658893949 so we can measure the impact of future iterations more passively and prevent regressions.
The text was updated successfully, but these errors were encountered: