Tajima's D for cohorts uses segregating sites for entire dataset #1094

percyfal · 2023-06-01T08:56:16Z

I have a question regarding Tajima's D calculation for cohorts (related to #240). I have gone through the issues best I could and I couldn't find anything related to my observations. I have included an example comparing the results to those of scikit-allel for reference.

To summarize, when Tajima's D is calculated for cohorts, it relies on stat_diversity which is calculated by cohort, but also on segregating sites, which is based on the allele counts for the entire dataset and not on cohort allele counts (i.e., variant_allele_count and not cohort_allele_count). Shouldn't it calculate the number of segregating sites by cohort? Also, the harmonic number is based on the total number of chromosomes. Related to this,
is there any reason segregating sites and Watterson's theta are not provided as separate functions?

A short MWE template follows - as long as some sites are monomorphic in either cohort similar results should ensue.

import allel
import sgkit as sg
import xarray as xr
import numpy as np

ds = sg.simulate_genotype_call_dataset(n_variant=5, n_sample=4, seed=3234)
g = allel.GenotypeArray(ds.call_genotype)
print("Genotypes for 5 variants, 4 samples, with some monomorphic sites in cohorts (cohorts being [0, 1], [2, 3])")
print(g)
print()

# Position and windows for scikit-allel
pos = [1, 2, 3, 4, 5]
windows = [[1, 1], [2, 2], [3, 3], [4, 4], [5, 5]]

# All samples in one cohort
print("All samples in one cohort")
ds["sample_cohort"] = xr.DataArray(
    np.repeat([0], ds.dims["samples"]), dims="samples")

# Diversity
print("Diversity")
print("scikit-allel: ", allel.windowed_diversity(pos=pos, ac=g.count_alleles(), windows=windows)[0])
print("sgkit: ", sg.diversity(ds)["stat_diversity"].values.T)
print()

# Tajima's D
print("Tajima's")
print(allel.windowed_tajima_d(pos=pos, ac=g.count_alleles(), windows=windows, min_sites=1)[0])
print(sg.Tajimas_D(ds)["stat_Tajimas_D"].values.T)
print()

# Cohorts
print("Two cohorts ([0, 1], [2, 3])")
ds["sample_cohort"] = xr.DataArray(
    np.repeat([0, 1], ds.dims["samples"] // 2), dims="samples")
print("ds[\"sample_cohort\"]: ", ds["sample_cohort"])
print()

# Diversity
print("Diversity")
print("scikit-allel ([0, 1]) :", allel.windowed_diversity(pos=pos, ac=g.count_alleles(subpop=[0, 1]), windows=windows)[0])
print("scikit-allel ([2, 3]) :", allel.windowed_diversity(pos=pos, ac=g.count_alleles(subpop=[2, 3]), windows=windows)[0])
print("sgkit: ", sg.diversity(ds)["stat_diversity"].values.T)
print()

# Tajima's D
print("Tajima's")
print("scikit-allel ([0, 1]) :", allel.windowed_tajima_d(pos=pos, ac=g.count_alleles(subpop=[0, 1]), windows=windows, min_sites=1)[0])
print("scikit-allel ([2, 3]) :", allel.windowed_tajima_d(pos=pos, ac=g.count_alleles(subpop=[2, 3]), windows=windows, min_sites=1)[0])
print("sgkit: ", sg.Tajimas_D(ds)["stat_Tajimas_D"].values.T)

which on my computer produces

Genotypes for 5 variants, 4 samples, with some monomorphic sites in cohorts (cohorts being [0, 1], [2, 3])
0/1 0/0 0/0 0/0
0/0 1/0 1/0 0/0
0/0 1/1 1/1 0/1
0/1 0/0 0/0 0/1
0/0 0/0 0/0 0/1


All samples in one cohort
Diversity
scikit-allel:  [0.25       0.42857143 0.53571429 0.42857143 0.25      ]
sgkit:  [[0.25       0.42857143 0.53571429 0.42857143 0.25      ]]

Tajima's
[-1.05481911  0.33350336  1.16649684  0.33350336 -1.05481911]
Sgkit type:  <class 'xarray.core.dataarray.DataArray'>
[[-1.05481911  0.33350336  1.16649684  0.33350336 -1.05481911]]

Two cohorts ([0, 1], [2, 3])
ds["sample_cohort"]:  <xarray.DataArray 'sample_cohort' (samples: 4)>
array([0, 0, 1, 1])
Dimensions without coordinates: samples

Diversity
scikit-allel ([0, 1]) : [0.5        0.5        0.66666667 0.5        0.        ]
scikit-allel ([2, 3]) :  [0.  0.5 0.5 0.5 0.5]
sgkit:  [[0.5        0.5        0.66666667 0.5        0.        ]
 [0.         0.5        0.5        0.5        0.5       ]]

Tajima's
scikit-allel ([0, 1]) : [-0.61237244 -0.61237244  1.63299316 -0.61237244         nan]
scikit-allel ([2, 3]) : [        nan -0.61237244 -0.61237244 -0.61237244 -0.61237244]
Sgkit type:  <class 'xarray.core.dataarray.DataArray'>
sgkit:  [[ 0.88883234  0.88883234  2.18459998  0.88883234 -2.99847056]
 [-2.99847056  0.88883234  0.88883234  0.88883234  0.88883234]]

The text was updated successfully, but these errors were encountered:

percyfal · 2023-06-01T12:49:34Z

Update: I just realized that I can use ds.sel on the relevant samples to recreate the results from scikit-allel:

print("sgkit: ", sg.Tajimas_D(ds.sel(samples=[0, 1]))["stat_Tajimas_D"].values.T)
print("sgkit: ", sg.Tajimas_D(ds.sel(samples=[2, 3]))["stat_Tajimas_D"].values.T)

which gives

sgkit:  [[-0.61237244 -0.61237244  1.63299316 -0.61237244         nan]]
sgkit:  [[        nan -0.61237244 -0.61237244 -0.61237244 -0.61237244]]

Does this mean cohort-based Tajimas_D is working as expected and that using ds.sel is the way to go when looking at sample subsets?

tomwhite · 2023-06-02T13:55:06Z

Hi @percyfal, thanks for opening an issue.

Using sel is the way to subset samples. (It seems like you are getting the values you would expect now?)

The implementation in sgkit is based on tskit's implementation of Tajimas_D. There are unit tests in https://github.com/pystatgen/sgkit/blob/main/sgkit/tests/test_popgen.py#L367-L390, which check the values are the same, which might provide a bit more information for you.

percyfal · 2023-06-05T07:25:04Z

Hi @tomwhite , thanks for the heads up. I will use sel then to make the Tajima's calculations.

I think my confusion stemmed from the fact that you can do diversity calculations on cohorts that are identical to those of the subpopulations. Building on the previous example:

ds["sample_cohort"] = xr.DataArray(
    np.repeat([0, 1], ds.dims["samples"] // 2), dims="samples")
print("sgkit, cohorts: ", sg.diversity(ds)["stat_diversity"].values.T)
ds["sample_cohort"] = xr.DataArray(
    np.repeat([0], ds.dims["samples"]), dims="samples")
print("sgkit, samples=[0, 1]: ", sg.diversity(ds.sel(samples=[0, 1]))["stat_diversity"].values.T)
print("sgkit, samples=[2, 3]: ", sg.diversity(ds.sel(samples=[2, 3]))["stat_diversity"].values.T)

sgkit, cohorts:  [[0.5        0.5        0.66666667 0.5        0.        ]
 [0.         0.5        0.5        0.5        0.5       ]]
sgkit, samples=[0, 1]:  [[0.5        0.5        0.66666667 0.5        0.        ]]
sgkit, samples=[2, 3]:  [[0.  0.5 0.5 0.5 0.5]]

made me think the grouping mechanism (cohorts) would apply to other statistics as well.

percyfal closed this as completed Jun 5, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Tajima's D for cohorts uses segregating sites for entire dataset #1094

Tajima's D for cohorts uses segregating sites for entire dataset #1094

percyfal commented Jun 1, 2023 •

edited

Loading

percyfal commented Jun 1, 2023

Uh oh!

tomwhite commented Jun 2, 2023

Uh oh!

percyfal commented Jun 5, 2023

Uh oh!

Tajima's D for cohorts uses segregating sites for entire dataset #1094

Tajima's D for cohorts uses segregating sites for entire dataset #1094

Comments

percyfal commented Jun 1, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

percyfal commented Jun 1, 2023

Uh oh!

tomwhite commented Jun 2, 2023

Uh oh!

percyfal commented Jun 5, 2023

Uh oh!

percyfal commented Jun 1, 2023 •

edited

Loading