Garud H statistics #378

Merged: 3 commits merged into sgkit-dev:master on Nov 12, 2020
Conversation

tomwhite (Collaborator)

This fixes #231.

  • This PR depends on Add to_haplotype_calls function #377 for getting a haplotype representation of calls, so that should be merged first. The function there converts genotype calls of shape (variants, samples, ploidy) to an array of shape (variants, haplotypes), where haplotypes = samples * ploidy.
  • If the dataset is not windowed, the variants are assumed to be in a single window. This won't work for large numbers of variants (see the discussion about hashing below), so we could choose not to support this and require that the input be windowed. It would be good to hear thoughts on this @alimanfoo, @jeromekelleher.
  • All of the H statistics work by computing statistics on the frequency of occurrence of each haplotype in a cohort. In scikit-allel, the haplotypes (columns of calls) are hashed by calling hash(x.tobytes()) on each NumPy array column x. When I tried this on MalariaGEN-scale data using Dask, the computation ground to a halt, which I think was due to issues with the GIL. To avoid this, I have written a hash_columns function that is Numba-JIT compiled and outperforms the Python hash method by a factor of 5 in a single thread (a sketch of the idea appears after this list).
  • I've used this code in a notebook to compare with scikit-allel on MalariaGEN. The results for a single cohort and statistic (H12) are concordant (for the first 1000 windows at least), which is promising.
  • The H stats notebook that uses scikit-allel uses different window sizes for different cohorts, which is not easily possible in sgkit at the moment (see https://github.com/pystatgen/sgkit/issues/232#issuecomment-722377847 for the same problem with PBS). The most pragmatic fix will probably be to allow specifying the subset of cohorts for which to calculate the statistic.
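
As a rough illustration of the column-hashing idea, here is a minimal sketch assuming a DJBX33A-style rolling hash; hash_columns_sketch is a hypothetical name, not the PR's actual hash_columns implementation:

```python
# A minimal sketch, assuming a DJBX33A-style rolling hash;
# hash_columns_sketch is a hypothetical name, not the PR's actual
# hash_columns implementation.
import numpy as np
from numba import njit


@njit(nogil=True)
def hash_columns_sketch(x):
    # x: 2-d integer array of shape (variants, haplotypes).
    # Returns one int64 hash per column; int64 arithmetic wraps on
    # overflow, which is fine for hashing.
    n_rows, n_cols = x.shape
    out = np.empty(n_cols, dtype=np.int64)
    for j in range(n_cols):
        h = np.int64(5381)  # DJBX33A initial value
        for i in range(n_rows):
            h = h * np.int64(33) + np.int64(x[i, j])
        out[j] = h
    return out
```

With nogil=True the compiled loop releases the GIL, so Dask's threaded scheduler can hash columns in parallel rather than serializing on Python-level hash() calls, which matches the motivation described above.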

hammer (Contributor) commented Nov 10, 2020

> which I think was due to issues with the GIL.

cc @ravwojdyla who has debugged GIL contention issues previously.

ravwojdyla (Collaborator) commented Nov 10, 2020

Just dropping some tools I find useful when I suspect GIL issues:

  • py-spy, which is good for easy, ad-hoc, approximate profiling
  • gil_load, which is more accurate but requires setup

Please let me know if I can help in any way.

alimanfoo (Collaborator)

> To avoid this, I have written a hash_columns function that is Numba-JIT compiled and outperforms the Python hash method by a factor of 5 in a single thread.

Amazing!!

alimanfoo (Collaborator)

> If the dataset is not windowed, the variants are assumed to be in a single window. This won't work for large numbers of variants (see the discussion about hashing below), so we could choose not to support this and require that the input be windowed.

Yes, this only makes sense for windowed data; I think it would be fine to require windows.

alimanfoo (Collaborator)

Maybe there is a potential problem with hash collisions using dbx33a? https://gist.github.com/91f00a3d327ac07fb23be7cb2e332b4b

tomwhite (Collaborator, Author)

> Maybe there is a potential problem with hash collisions using dbx33a?

It looks like the generator is creating duplicates, because it overflows a single-byte unsigned int. It works if you change u1 to u4.

alimanfoo (Collaborator)

> Maybe there is a potential problem with hash collisions using dbx33a?
>
> It looks like the generator is creating duplicates, because it overflows a single-byte unsigned int. It works if you change u1 to u4.

Beautiful, thanks, sorry for the noise.
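
The gist is not reproduced here, but the overflow described above is easy to illustrate with a hypothetical two-liner: casting to a single-byte unsigned int (u1) wraps values modulo 256, so distinct values collide before they are ever hashed, while u4 keeps them distinct.

```python
# Hypothetical illustration (not the gist's code): casting to a
# single-byte unsigned int wraps modulo 256, so distinct values
# collide before they are ever hashed.
import numpy as np

vals = np.array([1, 257])
print(vals.astype("u1"))  # [1 1]     -> 257 wraps to 1: spurious duplicates
print(vals.astype("u4"))  # [  1 257] -> values stay distinct
```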

jeromekelleher (Collaborator) left a comment

Looks great @tomwhite.

Given that this is a function that could be applied to very large datasets, I wonder whether reshaping the genotype data into haplotypes is necessary or worth it. Would we always want to double the memory footprint just to compute the haplotype hashes (which, as far as I can tell, is what we're doing)?

We don't have to answer this now, but I thought it was worth asking before we add the to_haplotype_calls function over in #377.

```python
N_GARUD_H_STATS = 4  # H1, H12, H123, H2/H1


def _Garud_h(k: ArrayLike) -> ArrayLike:
```

jeromekelleher (Collaborator):

I'm having trouble understanding what k is here - any chance of a more descriptive name?

Unless k is quite small, isn't sorted(collections.Counter(k.tolist()).values(), reverse=True) going to be a bottleneck?

tomwhite (Collaborator, Author):

Changed to haplotypes.

I thought that the sorted call would be a bottleneck too, but I haven't seen that during my MalariaGEN testing.
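
For context, here is a sketch of how the four statistics can fall out of the sorted haplotype counts, following the standard Garud et al. (2015) definitions; garud_h_sketch is a hypothetical name, not necessarily the exact code in this PR.

```python
# A sketch following the standard Garud et al. (2015) definitions;
# garud_h_sketch is a hypothetical name, not necessarily the exact
# code in this PR.
import collections

import numpy as np


def garud_h_sketch(haplotypes):
    # Frequency of each distinct haplotype (hash), most common first.
    counts = sorted(collections.Counter(haplotypes.tolist()).values(), reverse=True)
    f = np.array(counts) / np.sum(counts)

    h1 = np.sum(f ** 2)                             # expected haplotype homozygosity
    h12 = np.sum(f[:2]) ** 2 + np.sum(f[2:] ** 2)   # top two classes pooled
    h123 = np.sum(f[:3]) ** 2 + np.sum(f[3:] ** 2)  # top three classes pooled
    h2_h1 = (h1 - f[0] ** 2) / h1                   # H2/H1
    return h1, h12, h123, h2_h1
```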

tomwhite (Collaborator, Author)

Thanks for the review @jeromekelleher. You are right about to_haplotype_calls: I don't think we need it. Following your suggestion, I have changed the Garud_h code to work directly on genotype calls. I did this by changing the hashing function to use guvectorize so that it can cope with arbitrarily shaped arrays (see the sketch below). The updated notebook is a lot simpler now too (while producing the same output), since there's no need to create a haplotype representation.

I've also removed the non-windowed path following @alimanfoo's suggestion.
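
A minimal sketch of what such a guvectorize-based hash might look like (hash_vector_sketch is a hypothetical name, not the PR's actual function): the core hashes a single 1-d vector, and the "(n)->()" layout signature lets NumPy broadcast it over the leading axes of an array of any shape.

```python
# A minimal sketch; hash_vector_sketch is a hypothetical name, not
# the PR's actual function.
import numpy as np
from numba import guvectorize


@guvectorize(["void(int8[:], int64[:])"], "(n)->()", nopython=True)
def hash_vector_sketch(x, out):
    # DJBX33A-style rolling hash of one 1-d vector. guvectorize passes
    # the scalar output as an array, hence the out[0] assignment.
    h = np.int64(5381)
    for i in range(x.shape[0]):
        h = h * np.int64(33) + np.int64(x[i])
    out[0] = h
```

Under this assumption, hashing genotype calls of shape (variants, samples, ploidy) along the variants axis would look something like hash_vector_sketch(np.moveaxis(calls, 0, -1)), yielding one hash per (sample, ploidy) haplotype with no reshaped copy of the data.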

jeromekelleher (Collaborator) left a comment

Awesome! This is super cool, thanks @tomwhite!

codecov-io

Codecov Report

Merging #378 (3b5194f) into master (238fc56) will decrease coverage by 0.07%.
The diff coverage is 91.83%.


```diff
@@            Coverage Diff             @@
##           master     #378      +/-   ##
==========================================
- Coverage   95.23%   95.15%   -0.08%
==========================================
  Files          31       31
  Lines        2289     2334      +45
==========================================
+ Hits         2180     2221      +41
- Misses        109      113       +4
```

Impacted Files          Coverage Δ
sgkit/utils.py          96.55% <50.00%> (-3.45%) ⬇️
sgkit/stats/popgen.py   72.82% <97.14%> (+5.49%) ⬆️
sgkit/__init__.py       100.00% <100.00%> (ø)
sgkit/variables.py      96.63% <100.00%> (+0.11%) ⬆️
sgkit/window.py         97.56% <100.00%> (+0.03%) ⬆️


jeromekelleher added the auto-merge label (Auto merge label for mergify test flight) on Nov 12, 2020
mergify bot merged commit f1cfd17 into sgkit-dev:master on Nov 12, 2020
Labels: auto-merge (Auto merge label for mergify test flight)
Development: Successfully merging this pull request may close: Garud H Haplotype diversity statistic
6 participants