Garud H statistics #378

Merged: 3 commits merged into sgkit-dev:master on Nov 12, 2020
Conversation

tomwhite (Collaborator)

This fixes #231.

  • This PR depends on Add to_haplotype_calls function #377 for getting a haplotype representation of calls, so that should be merged first. The function there converts genotype calls of shape (variants, samples, ploidy) to an array of shape (variants, haplotypes), where haplotypes = samples * ploidy.
  • If the dataset is not windowed, the variants are assumed to be in a single window. This won't work for large numbers of variants (see the discussion about hashing below), so we could choose not to support this and require that the input be windowed. It would be good to hear thoughts on this @alimanfoo, @jeromekelleher.
  • All of the H statistics work by computing statistics on the frequency of occurrence of each haplotype in a cohort. In scikit-allel, the haplotypes (columns of calls) are hashed by calling hash(x.tobytes()) on each NumPy array column x. When I tried this on MalariaGEN-scale data using Dask, the computation ground to a halt, which I think was due to issues with the GIL. To avoid this, I have written a hash_columns function that is Numba-JIT compiled and outperforms the Python hash method by a factor of 5 in a single thread (a sketch of the idea appears after this list).
  • I've used this code in a notebook to compare with scikit-allel on MalariaGEN. The results for a single cohort and statistic (H12) are concordant (for the first 1000 windows at least), which is promising.
  • The H stats notebook that uses scikit-allel uses different window sizes for different cohorts, which is not easily possible in sgkit at the moment (see https://github.com/pystatgen/sgkit/issues/232#issuecomment-722377847 for the same problem with PBS). The most pragmatic fix will probably be to allow specifying the subset of cohorts for which to calculate the statistic.
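
As a rough illustration of the column-hashing idea, here is a minimal sketch assuming a DJBX33A-style rolling hash; hash_columns_sketch is a hypothetical name, not the PR's actual hash_columns implementation:

```python
# A minimal sketch, assuming a DJBX33A-style rolling hash;
# hash_columns_sketch is a hypothetical name, not the PR's actual
# hash_columns implementation.
import numpy as np
from numba import njit


@njit(nogil=True)
def hash_columns_sketch(x):
    # x: 2-d integer array of shape (variants, haplotypes).
    # Returns one int64 hash per column; int64 arithmetic wraps on
    # overflow, which is fine for hashing.
    n_rows, n_cols = x.shape
    out = np.empty(n_cols, dtype=np.int64)
    for j in range(n_cols):
        h = np.int64(5381)  # DJBX33A initial value
        for i in range(n_rows):
            h = h * np.int64(33) + np.int64(x[i, j])
        out[j] = h
    return out
```

With nogil=True the compiled loop releases the GIL, so Dask's threaded scheduler can hash columns in parallel rather than serializing on Python-level hash() calls, which matches the motivation described above.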

hammer (Contributor) commented Nov 10, 2020

> which I think was due to issues with the GIL.

cc @ravwojdyla who has debugged GIL contention issues previously.

ravwojdyla (Collaborator) commented Nov 10, 2020

Just dropping some tools I find useful when I suspect GIL issues:

  • py-spy, which is good for easy, ad-hoc, approximate profiling
  • gil_load, which is more accurate but requires setup

Please let me know if I can help in any way.

alimanfoo (Collaborator)

> To avoid this, I have written a hash_columns function that is Numba-JIT compiled and outperforms the Python hash method by a factor of 5 in a single thread.

Amazing!!

alimanfoo (Collaborator)

> If the dataset is not windowed, the variants are assumed to be in a single window. This won't work for large numbers of variants (see the discussion about hashing below), so we could choose not to support this and require that the input be windowed.

Yes, this only makes sense for windowed data; I think it would be fine to require windows.

alimanfoo (Collaborator)

Maybe there is a potential problem with hash collisions using dbx33a? https://gist.github.com/91f00a3d327ac07fb23be7cb2e332b4b

tomwhite (Collaborator, Author)

> Maybe there is a potential problem with hash collisions using dbx33a?

It looks like the generator is creating duplicates, because it overflows a single-byte unsigned int. It works if you change u1 to u4.

alimanfoo (Collaborator)

> Maybe there is a potential problem with hash collisions using dbx33a?
>
> It looks like the generator is creating duplicates, because it overflows a single-byte unsigned int. It works if you change u1 to u4.

Beautiful, thanks, sorry for the noise.
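
The gist is not reproduced here, but the overflow described above is easy to illustrate with a hypothetical two-liner: casting to a single-byte unsigned int (u1) wraps values modulo 256, so distinct values collide before they are ever hashed, while u4 keeps them distinct.

```python
# Hypothetical illustration (not the gist's code): casting to a
# single-byte unsigned int wraps modulo 256, so distinct values
# collide before they are ever hashed.
import numpy as np

vals = np.array([1, 257])
print(vals.astype("u1"))  # [1 1]     -> 257 wraps to 1: spurious duplicates
print(vals.astype("u4"))  # [  1 257] -> values stay distinct
```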

jeromekelleher (Collaborator) left a comment

Looks great @tomwhite.

Given that this is a function that could be applied to very large datasets, I wonder whether reshaping the genotype data into haplotypes is necessary or worth it. Would we always want to double the memory footprint just to compute the haplotype hashes (which, as far as I can tell, is what we're doing)?

We don't have to answer this now, but I thought it was worth asking before we add the to_haplotype_calls function over in #377.

```python
N_GARUD_H_STATS = 4  # H1, H12, H123, H2/H1


def _Garud_h(k: ArrayLike) -> ArrayLike:
```

jeromekelleher (Collaborator):

I'm having trouble understanding what k is here - any chance of a more descriptive name?

Unless k is quite small, isn't sorted(collections.Counter(k.tolist()).values(), reverse=True) going to be a bottleneck?

tomwhite (Collaborator, Author):

Changed to haplotypes.

I thought that the sorted call would be a bottleneck too, but I haven't seen that during my MalariaGEN testing.
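
For context, here is a sketch of how the four statistics can fall out of the sorted haplotype counts, following the standard Garud et al. (2015) definitions; garud_h_sketch is a hypothetical name, not necessarily the exact code in this PR.

```python
# A sketch following the standard Garud et al. (2015) definitions;
# garud_h_sketch is a hypothetical name, not necessarily the exact
# code in this PR.
import collections

import numpy as np


def garud_h_sketch(haplotypes):
    # Frequency of each distinct haplotype (hash), most common first.
    counts = sorted(collections.Counter(haplotypes.tolist()).values(), reverse=True)
    f = np.array(counts) / np.sum(counts)

    h1 = np.sum(f ** 2)                             # expected haplotype homozygosity
    h12 = np.sum(f[:2]) ** 2 + np.sum(f[2:] ** 2)   # top two classes pooled
    h123 = np.sum(f[:3]) ** 2 + np.sum(f[3:] ** 2)  # top three classes pooled
    h2_h1 = (h1 - f[0] ** 2) / h1                   # H2/H1
    return h1, h12, h123, h2_h1
```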

tomwhite (Collaborator, Author)

Thanks for the review @jeromekelleher. You are right about to_haplotype_calls: I don't think we need it. Following your suggestion, I have changed the Garud_h code to work directly on genotype calls. I did this by changing the hashing function to use guvectorize so that it can cope with arbitrarily shaped arrays (see the sketch below). The updated notebook is a lot simpler now too (while producing the same output), since there's no need to create a haplotype representation.

I've also removed the non-windowed path following @alimanfoo's suggestion.
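
A minimal sketch of what such a guvectorize-based hash might look like (hash_vector_sketch is a hypothetical name, not the PR's actual function): the core hashes a single 1-d vector, and the "(n)->()" layout signature lets NumPy broadcast it over the leading axes of an array of any shape.

```python
# A minimal sketch; hash_vector_sketch is a hypothetical name, not
# the PR's actual function.
import numpy as np
from numba import guvectorize


@guvectorize(["void(int8[:], int64[:])"], "(n)->()", nopython=True)
def hash_vector_sketch(x, out):
    # DJBX33A-style rolling hash of one 1-d vector. guvectorize passes
    # the scalar output as an array, hence the out[0] assignment.
    h = np.int64(5381)
    for i in range(x.shape[0]):
        h = h * np.int64(33) + np.int64(x[i])
    out[0] = h
```

Under this assumption, hashing genotype calls of shape (variants, samples, ploidy) along the variants axis would look something like hash_vector_sketch(np.moveaxis(calls, 0, -1)), yielding one hash per (sample, ploidy) haplotype with no reshaped copy of the data.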

jeromekelleher (Collaborator) left a comment

Awesome! This is super cool, thanks @tomwhite!

codecov-io

Codecov Report

Merging #378 (3b5194f) into master (238fc56) will decrease coverage by 0.07%.
The diff coverage is 91.83%.


```diff
@@            Coverage Diff             @@
##           master     #378      +/-   ##
==========================================
- Coverage   95.23%   95.15%   -0.08%
==========================================
  Files          31       31
  Lines        2289     2334      +45
==========================================
+ Hits         2180     2221      +41
- Misses        109      113       +4
```

Impacted Files          Coverage Δ
sgkit/utils.py          96.55% <50.00%> (-3.45%) ⬇️
sgkit/stats/popgen.py   72.82% <97.14%> (+5.49%) ⬆️
sgkit/__init__.py       100.00% <100.00%> (ø)
sgkit/variables.py      96.63% <100.00%> (+0.11%) ⬆️
sgkit/window.py         97.56% <100.00%> (+0.03%) ⬆️


jeromekelleher added the auto-merge label (Auto merge label for mergify test flight) on Nov 12, 2020
mergify bot merged commit f1cfd17 into sgkit-dev:master on Nov 12, 2020
Labels: auto-merge (Auto merge label for mergify test flight)
Development: Successfully merging this pull request may close: Garud H Haplotype diversity statistic
6 participants