Tajima's D for cohorts uses segregating sites for entire dataset #1094
Comments
Update: I just realized that I can use
which gives
Does this mean cohort-based Tajimas_D is working as expected and that using
Hi @percyfal, thanks for opening an issue. Using The implementation in sgkit is based on tskit's implementation of Tajimas_D. The unit tests in https://github.com/pystatgen/sgkit/blob/main/sgkit/tests/test_popgen.py#L367-L390 check that the two give the same values, which might provide a bit more information for you.
Hi @tomwhite, thanks for the heads up. I will use I think my confusion stemmed from the fact that you can do diversity calculations on cohorts that are identical to those of the subpopulations. Building on the previous example:
made me think the grouping mechanism (cohorts) would apply to other statistics as well.
I have a question regarding the Tajima's D calculation for cohorts (related to #240). I have gone through the issues as best I could and couldn't find anything related to my observations. I have included an example comparing the results to those of scikit-allel for reference.
To summarize, when Tajima's D is calculated for cohorts, it relies on `stat_diversity`, which is calculated by cohort, but also on segregating sites, which is based on the allele counts for the entire dataset and not on cohort allele counts (i.e., `variant_allele_count` and not `cohort_allele_count`). Shouldn't it calculate the number of segregating sites by cohort? Also, the harmonic number is based on the total number of chromosomes. Related to this, is there any reason segregating sites and Watterson's theta are not provided as separate functions?
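For readers who want to see the arithmetic, here is a rough sketch of the per-cohort calculation described above. It is plain NumPy, not sgkit or scikit-allel code; the function names (`allele_counts`, `segregating_sites`, `diversity`, `wattersons_theta`, `tajimas_d`) and the toy haplotypes are made up for illustration, and it assumes biallelic sites with no missing calls. It is not the MWE from this issue.

```python
import numpy as np

def allele_counts(hap):
    """Per-site counts of alleles 0 and 1; hap has shape (variants, chromosomes)."""
    hap = np.asarray(hap)
    return np.stack([(hap == 0).sum(axis=1), (hap == 1).sum(axis=1)], axis=1)

def segregating_sites(ac):
    """Number of sites where more than one allele is observed."""
    return int(((ac > 0).sum(axis=1) > 1).sum())

def diversity(ac):
    """Sum over sites of the mean number of pairwise differences."""
    n = ac.sum(axis=1)
    identical = (ac * (ac - 1)).sum(axis=1) / (n * (n - 1))
    return float((1 - identical).sum())

def wattersons_theta(ac):
    """S / a1, with the harmonic number taken over this sample's chromosomes."""
    n = int(ac[0].sum())  # assumes no missing data, so n is the same at every site
    a1 = (1 / np.arange(1, n)).sum()
    return segregating_sites(ac) / a1

def tajimas_d(ac):
    """Tajima's D (Tajima 1989) using S and n from the given allele counts only."""
    n = int(ac[0].sum())
    S = segregating_sites(ac)
    if S == 0:
        return float("nan")
    i = np.arange(1, n)
    a1, a2 = (1 / i).sum(), (1 / i**2).sum()
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    return (diversity(ac) - S / a1) / np.sqrt(e1 * S + e2 * S * (S - 1))

# Toy data: 4 sites, 8 chromosomes; cohort A is columns 0-3, cohort B is columns 4-7.
hap = np.array([
    [0, 1, 0, 1, 0, 0, 0, 0],  # segregating in A only
    [0, 0, 0, 0, 1, 1, 0, 0],  # segregating in B only
    [1, 0, 0, 0, 0, 1, 0, 0],  # segregating in both
    [1, 1, 1, 1, 0, 0, 0, 0],  # fixed difference: segregating only in the pooled data
])
ac_all = allele_counts(hap)
ac_a = allele_counts(hap[:, :4])
ac_b = allele_counts(hap[:, 4:])

print("S pooled:", segregating_sites(ac_all))  # 4
print("S cohort A:", segregating_sites(ac_a))  # 2
print("S cohort B:", segregating_sites(ac_b))  # 2
print("D cohort A:", tajimas_d(ac_a))
print("D cohort B:", tajimas_d(ac_b))
```

In this toy data the pooled count of segregating sites is twice the per-cohort count, so plugging the pooled S (or the pooled number of chromosomes in the harmonic number) into a cohort's D would shift the result. Splitting out `segregating_sites` and `wattersons_theta` as their own functions is just one possible layout.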
A short MWE template follows - as long as some sites are monomorphic in either cohort similar results should ensue.
which on my computer produces