@@ -23,19 +23,19 @@ kernelspec:
23
23
24
24
## Toy example
25
25
26
- _ Tsinfer_ takes as input a [ Zarr] ( https://zarr.readthedocs.io/ ) file, with phased variant data encoded in the
27
- [ VCF Zarr] ( https://github.com/sgkit-dev/vcf-zarr-spec/ ) (.vcz) format. The standard
28
- route to create such a file is by conversion from a VCF file, e.g. using
29
- [ vcf2zarr ] ( https://sgkit-dev.github.io/bio2zarr/vcf2zarr/overview.html ) as described later in this
30
- document. However, for the moment we'll just use a pre-generated dataset:
26
+ _ Tsinfer_ takes as input a [ Zarr] ( https://zarr.readthedocs.io/ ) file, with phased variant
27
+ data encoded in the [ VCF Zarr] ( https://github.com/sgkit-dev/vcf-zarr-spec/ ) (.vcz) format.
28
+ For simplicity, we first demonstrate using a pre-generated .vcz file (later, in the
29
+ {ref} ` sec_usage_data_example ` section, we describe how to create such a file e.g. from
30
+ a VCF using [ vcf2zarr ] ( https://sgkit-dev.github.io/bio2zarr/vcf2zarr/overview.html ) ).
31
31
32
32
33
33
``` {code-cell} ipython3
34
34
import zarr
35
35
vcf_zarr = zarr.open("_static/example_data.vcz")
36
36
```
37
37
38
- This is what the genotypes stored in that datafile look like:
38
+ Here's what the genotypes stored in that datafile look like:
39
39
40
40
``` {code-cell}
41
41
:"tags": ["remove-input"]
@@ -63,31 +63,64 @@ for site_id, pos in enumerate(positions):
63
63
64
64
:::{note}
65
65
The last site, at position 95, is an indel (insertion or deletion). Indels can be used
66
- for inference as long as the indel does not overlap with other variants, only 2
66
+ for inference as long as the indel does not overlap with other variants, only two
67
67
alleles exist, and the ancestral state is known.
68
68
:::
69
69
70
- ### VariantData and ancestral alleles
70
+ (sec_usage_toy_example_variant_data)=
71
71
72
- We wish to infer a genealogy that could have given rise to this data set. To run _ tsinfer_
73
- we wrap the ` .vcz ` file in a {class}` tsinfer.VariantData ` object. This requires an
74
- * ancestral state* to be specified for each site; there are
75
- many methods for calculating these: details are outside the scope of this manual, but we
76
- have started a [ discussion topic] ( https://github.com/tskit-dev/tsinfer/discussions/523 )
72
+
73
+ ### VariantData
74
+
75
+ _ Tsinfer_ produces a genealogy that could have given rise to the data set above,
76
+ based on the sites that vary between the samples. To provide extra information
77
+ to the algorithm, you must wrap the .vcz file in a lightweight
78
+ {class}` tsinfer.VariantData ` object, using syntax like:
79
+
80
+ ``` {code} python
81
+ vdata = tsinfer.VariantData("file.vcz", ancestral_state=***, ...)
82
+ ```
83
+
84
+ #### Ancestral states
85
+
86
+ Importantly, the {class}` ~tsinfer.VariantData ` object requires an
87
+ * ancestral state* to be provided for each site used in inference.
88
+ There are many methods for determing ancestral states: details are outside
89
+ the scope of this manual, but we have started a
90
+ [ discussion topic] ( https://github.com/tskit-dev/tsinfer/discussions/523 )
77
91
on this issue to provide some recommendations.
78
92
79
- Sometimes VCF files will contain the
80
- ancestral state in the "AA" ("ancestral allele") info field, in which case it will be encoded
81
- in the ` variant_AA ` field of the .vcz file. It's also possible to provide a numpy array
82
- of ancestral alleles, of the same length as the number of selected variants. If you have a string
83
- of the ancestral states (e.g. from FASTA) the {meth}` add_ancestral_state_array `
84
- method can be used to convert and save this to the VCF Zarr dataset (under the name
85
- ` ancestral_state ` ). Note that this method assumes that the string uses zero-based
86
- indexing, so that the first letter corresponds to a site at position 0. If,
87
- as is typical, the first letter in the string denotes the ancestral state of
88
- the site at position 1 in the .vcz file, you should prepend an "X" to the string.
89
- Alleles that are not in the list of alleles for their respective site are
90
- treated as unknown and not used for inference (with a warning given).
93
+ Ancestral states can be specified in several ways:
94
+
95
+ * Using an existing array in the .vcz file. Some VCF files already contain an ancestral
96
+ state, e.g. in the ` AA ` ("ancestral allele") ` info ` field. In this case the ` variant_AA `
97
+ field of the .vcz file can be specified, as follows.
98
+ ``` {code}
99
+ vdata = tsinfer.VariantData("file.vcz", ancestral_state="variant_AA")
100
+ ```
101
+ A worked example is shown in the {ref}` real data example<sec_usage_read_vcf_inference> `
102
+ later on this page.
103
+ * Using a numpy array of ancestral state strings, of the same length as the number
104
+ of unmasked variants. For example, if you wish to treat the reference ("REF") allele
105
+ as the ancestral state, you can take advantage of the fact that the .vcz format
106
+ always stores the zeroth allele as the REF, suggesting the following syntax:
107
+ ``` {code}
108
+ vcf_zarr = zarr.load("file.vcz")
109
+ vdata = tsinfer.VariantData("file.vcz", ancestral_state=vcf_zarr["variant_allele"][:, 0])
110
+ ```
111
+ A worked example of using an array of strings is provided in the
112
+ {ref}` simulation example<sec_usage_simulation_example_inference> ` later on this page.
113
+ * Using a single string of ancestral states, e.g. from a FASTA file. The string
114
+ should cover the entire genetic sequence, such that the ` i ` th character in the
115
+ string is taken as the ancestral state for an inference site at position ` i ` . In this
116
+ case, the {meth}` add_ancestral_state_array ` method can be used to extract the states
117
+ and save them to the VCF Zarr dataset, under the name ` ancestral_state ` . Note
118
+ that if, as is common, variant positions in the .vcz file are one-based (starting at 1), rather than zero-based, you should add a padding character at the start of the string.
119
+
120
+
121
+ Below we illustrate the single string method, using a stored FASTA file.
122
+ In this file, the 16th, 44th, 50th, 55th, 71st, 75th, 85th, and 95th characters are
123
+ ` G ` , ` G ` , ` C ` , ` T ` , ` C ` , ` A ` , ` T ` , and ` A ` (note that the 85th character, ` T ` , does not match any of the alleles in the .vcz genotypes for position 85).
91
124
92
125
``` {code-cell}
93
126
import tsinfer
@@ -98,47 +131,46 @@ vcf_zarr = zarr.open("_static/example_data.vcz")
98
131
99
132
reader = pyfaidx.Fasta("_static/example_ancestral_state.fa")
100
133
ancestral_str = str(reader["chr1"])
101
- # Our positions are zero-based, if they were one-based we would
102
- # prepend an X here.
103
- # e.g. ancestral_str = "X" + ancestral_str
104
-
105
- # Set the ancestral state at the last known position to "N", for demonstration
106
- last_pos = vcf_zarr['variant_position'][-1]
107
- ancestral_str = ancestral_str[:last_pos] + "N" + ancestral_str[(last_pos + 1):]
134
+ # We consider positions in the .vcz file to be one-based, so we prepend an X to the string
135
+ ancestral_str = "X" + ancestral_str
108
136
109
137
tsinfer.add_ancestral_state_array(vcf_zarr, ancestral_str)
110
138
vdata = tsinfer.VariantData("_static/example_data.vcz", ancestral_state="ancestral_state")
111
139
```
112
140
113
- The {class}` VariantData ` object is a lightweight wrapper around the .vcz file.
114
- We'll use it to infer a tree sequence on the basis of the sites that vary between the
115
- different samples. However, note that certain sites are not used by _ tsinfer_ for inferring
116
- the genealogy (although they are still encoded in the final tree sequence), These are:
141
+ Because the ancestral state at position 85 does not match any of the
142
+ alleles at that site, a warning has been given that the ancestral state
143
+ will be considered unknown. As a consequence, although it will appear in
144
+ the inferred tree sequence, the site at position 85 will be treated as a
145
+ "noninference" site.
146
+
147
+ ### Inference sites
148
+
149
+ Certain sites are not used by _ tsinfer_ for inferring the genealogy.
150
+ These _ noninference_ sites are nevertheless included in the final
151
+ tree sequence, but with their mutations placed by
152
+ {meth}` parsimony<tskit.Tree.map_mutations> ` . Such sites include:
117
153
118
154
* Non-variable (fixed) sites, e.g. the site at position 71 above
119
155
* Singleton sites, where only one genome has the derived allele
120
156
e.g. the site at position 75 above
121
- * Sites where the ancestral allele is unknown, e.g. demonstrated above when setting the
122
- ancestral state of the site at position 95 to ` N ` .
157
+ * Sites where the ancestral allele is unknown, e.g. site 85 (see above).
123
158
* Multialleleic sites, with more than 2 alleles (but see
124
159
[ here] ( https://github.com/tskit-dev/tsinfer/issues/670 ) for a workaround)
125
160
126
- Additionally, during the inference step, additional sites can be flagged as not for use in
127
- inference, for example if they are deemed unreliable (this is done
128
- via the ` exclude_positions ` parameter).
129
-
130
- Sites which are not used for inference will
131
- still be included in the final tree sequence, with mutations at those sites being placed
132
- onto branches by {meth}` parsimony<tskit.Tree.map_mutations> ` .
133
-
161
+ Additional sites can be deliberately flagged as not-for-use in inference,
162
+ for example if their genotypes or ancestral states are deemed unreliable,
163
+ via the ` exclude_positions ` parameter when running the inference
164
+ step of _ tsinfer_ .
134
165
135
166
### Topology inference
136
167
137
- Once we have stored our data in a ` .VariantData ` object, we can easily infer
138
- a {ref}` tree sequence<sec_python_api_trees_and_tree_sequences> ` using the Python
139
- API. Note that each sample in the original .vcz file will correspond to an * individual*
140
- in the resulting tree sequence. Since these three individuals are diploid, the resulting
141
- tree sequence will have ` ts.num_samples == 6 ` (i.e. unlike in a .vcz file, a "sample" in
168
+ Once our data is wrapped in a {class}` ~tsinfer.VariantData ` object, we can infer
169
+ a {ref}` tree sequence<sec_python_api_trees_and_tree_sequences> ` e.g. using
170
+ _ tsinfer_ 's {ref}` Python API<sec_api> ` . Note that each sample in the original
171
+ .vcz file will correspond to an * individual* in the resulting tree sequence.
172
+ Since these three individuals are diploid, the resulting
173
+ tree sequence will have ` ts.num_samples == 6 ` (unlike in a .vcz file, a "sample" in
142
174
tskit refers to a haploid genome, not a diploid individual).
143
175
144
176
``` {code-cell} ipython3
@@ -177,7 +209,7 @@ for v_orig, v_inferred in zip(vdata.variants(), inferred_ts.variants()):
177
209
np.array(v_inferred.alleles)[v_inferred.genotypes]
178
210
):
179
211
raise ValueError("Genotypes in original dataset and inferred tree seq not equal")
180
- print("** Genotypes in original dataset and inferred tree sequence are the same **")
212
+ print("** Genotypes in original dataset and inferred ts are identical **")
181
213
```
182
214
183
215
We can examine the inferred genetic genealogy, in the form of
@@ -230,11 +262,14 @@ software such as [tsdate](https://tskit.dev/software/tsdate.html): the _tsinfer_
230
262
algorithm is only intended to infer the genetic relationships between the samples
231
263
(i.e. the * topology* of the tree sequence).
232
264
265
+ (sec_usage_toy_example_masks)=
266
+
233
267
### Masks
234
268
235
- It is possible to * completely* exclude sites and samples, by specifing a boolean
236
- ` site_mask ` and/or a ` sample_mask ` when creating the ` VariantData ` object. Sites or samples with
237
- a mask value of ` True ` will be completely omitted both from inference and the final tree sequence.
269
+ As well as marking sites as not-for-inference, is possible to * completely* exclude
270
+ sites and samples, by specifing a boolean ` site_mask ` and/or a ` sample_mask ` when
271
+ creating the ` VariantData ` object. Sites or samples with a mask value of ` True ` will
272
+ be completely omitted both from inference and the final tree sequence.
238
273
This can be useful, for example, if you wish to select only a subset of the chromosome for
239
274
inference, e.g. to reduce computational load. You can also use it to subset inference to a
240
275
particular contig, if your dataset contains multiple contigs. Note that if a ` site_mask ` is provided,
@@ -392,6 +427,8 @@ In practice this means we can keep such files lying around without taking up too
392
427
393
428
Once we have our ` .vcz ` file created, running the inference is straightforward.
394
429
430
+ (sec_usage_simulation_example_inference)=
431
+
395
432
``` {code-cell} ipython3
396
433
# Infer & save a ts from the notebook simulation.
397
434
ancestral_states = np.load(f"{name}-AA.npy")
@@ -499,9 +536,9 @@ source and inferred tree sequences.
499
536
500
537
## Data example
501
538
502
- Inputting real data for inference is similar in principle to the examples above .
539
+ Inputting real data for inference is similar in principle to the previous examples .
503
540
All that is required is a .vcz file, which can be created using
504
- [ vcf2zarr] ( https://sgkit-dev.github.io/bio2zarr/vcf2zarr/overview.html ) as above
541
+ [ vcf2zarr] ( https://sgkit-dev.github.io/bio2zarr/vcf2zarr/overview.html ) as above.
505
542
506
543
(sec_usage_read_vcf)=
507
544
@@ -519,9 +556,10 @@ vcf_location = "_static/P_dom_chr24_phased.vcf.gz"
519
556
!python -m bio2zarr vcf2zarr convert --force {vcf_location} sparrows.vcz
520
557
```
521
558
522
- This creates the ` sparrows.vcz ` datastore, which we open using ` tsinfer.VariantData ` .
523
- The original VCF had the ancestral allelic state specified in the ` AA ` INFO field,
524
- so we can simply provide the string ` "variant_AA" ` as the ancestral_state parameter.
559
+ This creates the ` sparrows.vcz ` datastore, which we open using
560
+ {class}` tsinfer.VariantData ` . The original VCF had the ancestral allelic
561
+ state specified in the ` AA ` INFO field, so we can simply provide the
562
+ string ` "variant_AA" ` as the ancestral_state parameter.
525
563
526
564
``` {code-cell} ipython3
527
565
# Do the inference: this VCF has ancestral states in the AA field
@@ -536,7 +574,7 @@ print(
536
574
537
575
On a modern computer, this should only take a few seconds to run.
538
576
539
- ### Adding more metadata
577
+ #### Adding more metadata
540
578
541
579
We can add additional data to the zarr file, which will make it through to the tree sequence.
542
580
For instance, we might want to mark which population each individual comes from.
@@ -573,8 +611,21 @@ for i, name in enumerate(vcf_zarr["sample_id"]):
573
611
zarr.save("sparrows.vcz/individuals_population", individuals_population)
574
612
```
575
613
576
- Now when we carry out the inference, we get a tree sequence in which the nodes are
577
- correctly assigned to named populations
614
+ (sec_usage_read_vcf_inference)=
615
+
616
+ ### _ Tsinfer_ inference
617
+
618
+ Note that the steps above to generate a .vcz file are not strictly part of
619
+ _ tsinfer_ . We only invoke _ tsinfer_ subsequently, when creating a
620
+ {class}` ~tsinfer.VariantData ` object. Moreover, _ tsinfer_ treats the
621
+ .vcz information as read-only, and does not make a copy of it.
622
+ This means _ tsinfer_ is well-suited to using publicly provided, read-only
623
+ .vcz datafiles. Furthermore, {ref}` sec_usage_toy_example_masks ` make it easy
624
+ to use subsets of such datafiles, e.g. if they contain multiple chromosomes
625
+ or more samples than are required for your analysis.
626
+
627
+ As the .vcz file we are now using contains population metadata, ` tsinfer ` will create
628
+ a tree sequence whose sample nodes are correctly assigned to named populations:
578
629
579
630
``` {code-cell} ipython3
580
631
vdata = tsinfer.VariantData("sparrows.vcz", ancestral_state="variant_AA", individuals_population="individuals_population")
@@ -600,6 +651,7 @@ the [tskit](https://tskit.dev/tskit/docs/stable/) library. The
600
651
{ref}` tskit tutorial<sec_tutorial_stats> ` provides much more detail. Below we just give a
601
652
flavour of the possibilities.
602
653
654
+
603
655
To quickly eyeball small datasets, we can draw the entire tree sequence, or
604
656
{meth}` ~tskit.Tree.draw ` the tree at any particular genomic position. The following
605
657
code demonstrates how to use the {meth}` tskit.TreeSequence.at ` method to obtain the tree
@@ -695,7 +747,6 @@ print(gnn_table)
695
747
print(gnn_table.groupby(level="Country").mean())
696
748
```
697
749
698
-
699
750
From this, it can be seen that the genealogical nearest neighbours of birds in Norway
700
751
tend also to be in Norway, and vice versa for birds from France. In other words, there is
701
752
a small but noticable degree of population structure in the data. The bird `` 8L19766 ``
0 commit comments