Skip to content

Tajima's D for cohorts uses segregating sites for entire dataset #1094

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
percyfal opened this issue Jun 1, 2023 · 3 comments
Closed

Tajima's D for cohorts uses segregating sites for entire dataset #1094

percyfal opened this issue Jun 1, 2023 · 3 comments

Comments

@percyfal
Copy link

percyfal commented Jun 1, 2023

I have a question regarding Tajima's D calculation for cohorts (related to #240). I have gone through the issues best I could and I couldn't find anything related to my observations. I have included an example comparing the results to those of scikit-allel for reference.

To summarize, when Tajima's D is calculated for cohorts, it relies on stat_diversity which is calculated by cohort, but also on segregating sites, which is based on the allele counts for the entire dataset and not on cohort allele counts (i.e., variant_allele_count and not cohort_allele_count). Shouldn't it calculate the number of segregating sites by cohort? Also, the harmonic number is based on the total number of chromosomes. Related to this,
is there any reason segregating sites and Watterson's theta are not provided as separate functions?

A short MWE template follows - as long as some sites are monomorphic in either cohort similar results should ensue.

import allel
import sgkit as sg
import xarray as xr
import numpy as np

ds = sg.simulate_genotype_call_dataset(n_variant=5, n_sample=4, seed=3234)
g = allel.GenotypeArray(ds.call_genotype)
print("Genotypes for 5 variants, 4 samples, with some monomorphic sites in cohorts (cohorts being [0, 1], [2, 3])")
print(g)
print()

# Position and windows for scikit-allel
pos = [1, 2, 3, 4, 5]
windows = [[1, 1], [2, 2], [3, 3], [4, 4], [5, 5]]

# All samples in one cohort
print("All samples in one cohort")
ds["sample_cohort"] = xr.DataArray(
    np.repeat([0], ds.dims["samples"]), dims="samples")

# Diversity
print("Diversity")
print("scikit-allel: ", allel.windowed_diversity(pos=pos, ac=g.count_alleles(), windows=windows)[0])
print("sgkit: ", sg.diversity(ds)["stat_diversity"].values.T)
print()

# Tajima's D
print("Tajima's")
print(allel.windowed_tajima_d(pos=pos, ac=g.count_alleles(), windows=windows, min_sites=1)[0])
print(sg.Tajimas_D(ds)["stat_Tajimas_D"].values.T)
print()

# Cohorts
print("Two cohorts ([0, 1], [2, 3])")
ds["sample_cohort"] = xr.DataArray(
    np.repeat([0, 1], ds.dims["samples"] // 2), dims="samples")
print("ds[\"sample_cohort\"]: ", ds["sample_cohort"])
print()

# Diversity
print("Diversity")
print("scikit-allel ([0, 1]) :", allel.windowed_diversity(pos=pos, ac=g.count_alleles(subpop=[0, 1]), windows=windows)[0])
print("scikit-allel ([2, 3]) :", allel.windowed_diversity(pos=pos, ac=g.count_alleles(subpop=[2, 3]), windows=windows)[0])
print("sgkit: ", sg.diversity(ds)["stat_diversity"].values.T)
print()

# Tajima's D
print("Tajima's")
print("scikit-allel ([0, 1]) :", allel.windowed_tajima_d(pos=pos, ac=g.count_alleles(subpop=[0, 1]), windows=windows, min_sites=1)[0])
print("scikit-allel ([2, 3]) :", allel.windowed_tajima_d(pos=pos, ac=g.count_alleles(subpop=[2, 3]), windows=windows, min_sites=1)[0])
print("sgkit: ", sg.Tajimas_D(ds)["stat_Tajimas_D"].values.T)

which on my computer produces

Genotypes for 5 variants, 4 samples, with some monomorphic sites in cohorts (cohorts being [0, 1], [2, 3])
0/1 0/0 0/0 0/0
0/0 1/0 1/0 0/0
0/0 1/1 1/1 0/1
0/1 0/0 0/0 0/1
0/0 0/0 0/0 0/1


All samples in one cohort
Diversity
scikit-allel:  [0.25       0.42857143 0.53571429 0.42857143 0.25      ]
sgkit:  [[0.25       0.42857143 0.53571429 0.42857143 0.25      ]]

Tajima's
[-1.05481911  0.33350336  1.16649684  0.33350336 -1.05481911]
Sgkit type:  <class 'xarray.core.dataarray.DataArray'>
[[-1.05481911  0.33350336  1.16649684  0.33350336 -1.05481911]]

Two cohorts ([0, 1], [2, 3])
ds["sample_cohort"]:  <xarray.DataArray 'sample_cohort' (samples: 4)>
array([0, 0, 1, 1])
Dimensions without coordinates: samples

Diversity
scikit-allel ([0, 1]) : [0.5        0.5        0.66666667 0.5        0.        ]
scikit-allel ([2, 3]) :  [0.  0.5 0.5 0.5 0.5]
sgkit:  [[0.5        0.5        0.66666667 0.5        0.        ]
 [0.         0.5        0.5        0.5        0.5       ]]

Tajima's
scikit-allel ([0, 1]) : [-0.61237244 -0.61237244  1.63299316 -0.61237244         nan]
scikit-allel ([2, 3]) : [        nan -0.61237244 -0.61237244 -0.61237244 -0.61237244]
Sgkit type:  <class 'xarray.core.dataarray.DataArray'>
sgkit:  [[ 0.88883234  0.88883234  2.18459998  0.88883234 -2.99847056]
 [-2.99847056  0.88883234  0.88883234  0.88883234  0.88883234]]
@percyfal
Copy link
Author

percyfal commented Jun 1, 2023

Update: I just realized that I can use ds.sel on the relevant samples to recreate the results from scikit-allel:

print("sgkit: ", sg.Tajimas_D(ds.sel(samples=[0, 1]))["stat_Tajimas_D"].values.T)
print("sgkit: ", sg.Tajimas_D(ds.sel(samples=[2, 3]))["stat_Tajimas_D"].values.T)

which gives

sgkit:  [[-0.61237244 -0.61237244  1.63299316 -0.61237244         nan]]
sgkit:  [[        nan -0.61237244 -0.61237244 -0.61237244 -0.61237244]]

Does this mean cohort-based Tajimas_D is working as expected and that using ds.sel is the way to go when looking at sample subsets?

@tomwhite
Copy link
Collaborator

tomwhite commented Jun 2, 2023

Hi @percyfal, thanks for opening an issue.

Using sel is the way to subset samples. (It seems like you are getting the values you would expect now?)

The implementation in sgkit is based on tskit's implementation of Tajimas_D. There are unit tests in https://github.com/pystatgen/sgkit/blob/main/sgkit/tests/test_popgen.py#L367-L390, which check the values are the same, which might provide a bit more information for you.

@percyfal
Copy link
Author

percyfal commented Jun 5, 2023

Hi @tomwhite , thanks for the heads up. I will use sel then to make the Tajima's calculations.

I think my confusion stemmed from the fact that you can do diversity calculations on cohorts that are identical to those of the subpopulations. Building on the previous example:

ds["sample_cohort"] = xr.DataArray(
    np.repeat([0, 1], ds.dims["samples"] // 2), dims="samples")
print("sgkit, cohorts: ", sg.diversity(ds)["stat_diversity"].values.T)
ds["sample_cohort"] = xr.DataArray(
    np.repeat([0], ds.dims["samples"]), dims="samples")
print("sgkit, samples=[0, 1]: ", sg.diversity(ds.sel(samples=[0, 1]))["stat_diversity"].values.T)
print("sgkit, samples=[2, 3]: ", sg.diversity(ds.sel(samples=[2, 3]))["stat_diversity"].values.T)
sgkit, cohorts:  [[0.5        0.5        0.66666667 0.5        0.        ]
 [0.         0.5        0.5        0.5        0.5       ]]
sgkit, samples=[0, 1]:  [[0.5        0.5        0.66666667 0.5        0.        ]]
sgkit, samples=[2, 3]:  [[0.  0.5 0.5 0.5 0.5]]

made me think the grouping mechanism (cohorts) would apply to other statistics as well.

@percyfal percyfal closed this as completed Jun 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants