-
Notifications
You must be signed in to change notification settings - Fork 35
Garud H statistics #378
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Garud H statistics #378
Conversation
cc @ravwojdyla who has debugged GIL contention issues previously. |
Amazing!! |
Yes, this only makes sense for windowed data, I think it would be fine to require windows. |
Maybe there is a potential problem with hash collisions using dbx33a? https://gist.github.com/91f00a3d327ac07fb23be7cb2e332b4b |
It looks like the generator is creating duplicates, because it overflows a single-byte unsigned int. It works if you change |
Beautiful, thanks, sorry for the noise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great @tomwhite.
Given that this is a function that could be applied to very large datasets, I wonder if the reshaping of the genotype data into haplotypes is necessary/worth it. Would we always want to double the memory footprint just to compute the haplotype hashes (which it looks to me is what we're doing)?
We don't have to answer this now, but I thought it was worthwhile asking this question before we add the to-haplotyes function over in #377
sgkit/stats/popgen.py
Outdated
N_GARUD_H_STATS = 4 # H1, H12, H123, H2/H1 | ||
|
||
|
||
def _Garud_h(k: ArrayLike) -> ArrayLike: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having trouble understanding what k
is here - any chance of a more descriptive name?
Unless k is quite small, isn't sorted(collections.Counter(k.tolist()).values(), reverse=True)
going to be bottleneck?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to haplotypes
.
I thought that the sorted call would be a bottleneck too, but I haven't seen that during my MalariaGEN testing.
Thanks for the review @jeromekelleher. You are right about I've also removed the non-windowed path following @alimanfoo's suggestion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! This is super cool, thanks @tomwhite !
Codecov Report
@@ Coverage Diff @@
## master #378 +/- ##
==========================================
- Coverage 95.23% 95.15% -0.08%
==========================================
Files 31 31
Lines 2289 2334 +45
==========================================
+ Hits 2180 2221 +41
- Misses 109 113 +4
Continue to review full report at Codecov.
|
This fixes #231.
hash(x.tobytes())
on the numpy array columnx
. When I tried this on MalariaGEN-scale data using Dask the computation ground to a halt, which I think was due to issues with the GIL. To avoid this, I have written ahash_columns
function that is Numba-jit compiled, and outperforms the Python hash method by a factor of 5 in a single thread.