Skip to content

Commit 20788d3

Browse files
hyanwongmergify[bot]
authored andcommitted
Tidy the usage docs, including ancestral state specs
Also fixes #1024
1 parent 44827ed commit 20788d3

File tree

3 files changed

+117
-66
lines changed

3 files changed

+117
-66
lines changed
Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,2 +1,2 @@
11
>chr1
2-
nnnnnnnnnnnnnnnnGnnnnnnnnnnnnnnnnnnnnnnnnnnnGnnnnnCnnnnTnnnnnnnnnnnnnnnCnnnAnnnnnnnnnGnnnnnnnnnAnnnn
2+
nnnnnnnnnnnnnnnGnnnnnnnnnnnnnnnnnnnnnnnnnnnGnnnnnCnnnnTnnnnnnnnnnnnnnnCnnnAnnnnnnnnnTnnnnnnnnnAnnnn
Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1 +1 @@
1-
chr1 100 6 100 100
1+
chr1 99 6 99 99

docs/usage.md

Lines changed: 115 additions & 64 deletions
Original file line numberDiff line numberDiff line change
@@ -23,19 +23,19 @@ kernelspec:
2323

2424
## Toy example
2525

26-
_Tsinfer_ takes as input a [Zarr](https://zarr.readthedocs.io/) file, with phased variant data encoded in the
27-
[VCF Zarr](https://github.com/sgkit-dev/vcf-zarr-spec/) (.vcz) format. The standard
28-
route to create such a file is by conversion from a VCF file, e.g. using
29-
[vcf2zarr](https://sgkit-dev.github.io/bio2zarr/vcf2zarr/overview.html) as described later in this
30-
document. However, for the moment we'll just use a pre-generated dataset:
26+
_Tsinfer_ takes as input a [Zarr](https://zarr.readthedocs.io/) file, with phased variant
27+
data encoded in the [VCF Zarr](https://github.com/sgkit-dev/vcf-zarr-spec/) (.vcz) format.
28+
For simplicity, we first demonstrate using a pre-generated .vcz file (later, in the
29+
{ref}`sec_usage_data_example` section, we describe how to create such a file e.g. from
30+
a VCF using [vcf2zarr](https://sgkit-dev.github.io/bio2zarr/vcf2zarr/overview.html)).
3131

3232

3333
```{code-cell} ipython3
3434
import zarr
3535
vcf_zarr = zarr.open("_static/example_data.vcz")
3636
```
3737

38-
This is what the genotypes stored in that datafile look like:
38+
Here's what the genotypes stored in that datafile look like:
3939

4040
```{code-cell}
4141
:"tags": ["remove-input"]
@@ -63,31 +63,64 @@ for site_id, pos in enumerate(positions):
6363

6464
:::{note}
6565
The last site, at position 95, is an indel (insertion or deletion). Indels can be used
66-
for inference as long as the indel does not overlap with other variants, only 2
66+
for inference as long as the indel does not overlap with other variants, only two
6767
alleles exist, and the ancestral state is known.
6868
:::
6969

70-
### VariantData and ancestral alleles
70+
(sec_usage_toy_example_variant_data)=
7171

72-
We wish to infer a genealogy that could have given rise to this data set. To run _tsinfer_
73-
we wrap the `.vcz` file in a {class}`tsinfer.VariantData` object. This requires an
74-
*ancestral state* to be specified for each site; there are
75-
many methods for calculating these: details are outside the scope of this manual, but we
76-
have started a [discussion topic](https://github.com/tskit-dev/tsinfer/discussions/523)
72+
73+
### VariantData
74+
75+
_Tsinfer_ produces a genealogy that could have given rise to the data set above,
76+
based on the sites that vary between the samples. To provide extra information
77+
to the algorithm, you must wrap the .vcz file in a lightweight
78+
{class}`tsinfer.VariantData` object, using syntax like:
79+
80+
```{code} python
81+
vdata = tsinfer.VariantData("file.vcz", ancestral_state=***, ...)
82+
```
83+
84+
#### Ancestral states
85+
86+
Importantly, the {class}`~tsinfer.VariantData` object requires an
87+
*ancestral state* to be provided for each site used in inference.
88+
There are many methods for determing ancestral states: details are outside
89+
the scope of this manual, but we have started a
90+
[discussion topic](https://github.com/tskit-dev/tsinfer/discussions/523)
7791
on this issue to provide some recommendations.
7892

79-
Sometimes VCF files will contain the
80-
ancestral state in the "AA" ("ancestral allele") info field, in which case it will be encoded
81-
in the `variant_AA` field of the .vcz file. It's also possible to provide a numpy array
82-
of ancestral alleles, of the same length as the number of selected variants. If you have a string
83-
of the ancestral states (e.g. from FASTA) the {meth}`add_ancestral_state_array`
84-
method can be used to convert and save this to the VCF Zarr dataset (under the name
85-
`ancestral_state`). Note that this method assumes that the string uses zero-based
86-
indexing, so that the first letter corresponds to a site at position 0. If,
87-
as is typical, the first letter in the string denotes the ancestral state of
88-
the site at position 1 in the .vcz file, you should prepend an "X" to the string.
89-
Alleles that are not in the list of alleles for their respective site are
90-
treated as unknown and not used for inference (with a warning given).
93+
Ancestral states can be specified in several ways:
94+
95+
* Using an existing array in the .vcz file. Some VCF files already contain an ancestral
96+
state, e.g. in the `AA` ("ancestral allele") `info` field. In this case the `variant_AA`
97+
field of the .vcz file can be specified, as follows.
98+
```{code}
99+
vdata = tsinfer.VariantData("file.vcz", ancestral_state="variant_AA")
100+
```
101+
A worked example is shown in the {ref}`real data example<sec_usage_read_vcf_inference>`
102+
later on this page.
103+
* Using a numpy array of ancestral state strings, of the same length as the number
104+
of unmasked variants. For example, if you wish to treat the reference ("REF") allele
105+
as the ancestral state, you can take advantage of the fact that the .vcz format
106+
always stores the zeroth allele as the REF, suggesting the following syntax:
107+
```{code}
108+
vcf_zarr = zarr.load("file.vcz")
109+
vdata = tsinfer.VariantData("file.vcz", ancestral_state=vcf_zarr["variant_allele"][:, 0])
110+
```
111+
A worked example of using an array of strings is provided in the
112+
{ref}`simulation example<sec_usage_simulation_example_inference>` later on this page.
113+
* Using a single string of ancestral states, e.g. from a FASTA file. The string
114+
should cover the entire genetic sequence, such that the `i`th character in the
115+
string is taken as the ancestral state for an inference site at position `i`. In this
116+
case, the {meth}`add_ancestral_state_array` method can be used to extract the states
117+
and save them to the VCF Zarr dataset, under the name `ancestral_state`. Note
118+
that if, as is common, variant positions in the .vcz file are one-based (starting at 1), rather than zero-based, you should add a padding character at the start of the string.
119+
120+
121+
Below we illustrate the single string method, using a stored FASTA file.
122+
In this file, the 16th, 44th, 50th, 55th, 71st, 75th, 85th, and 95th characters are
123+
`G`, `G`, `C`, `T`, `C`, `A`, `T`, and `A` (note that the 85th character, `T`, does not match any of the alleles in the .vcz genotypes for position 85).
91124

92125
```{code-cell}
93126
import tsinfer
@@ -98,47 +131,46 @@ vcf_zarr = zarr.open("_static/example_data.vcz")
98131
99132
reader = pyfaidx.Fasta("_static/example_ancestral_state.fa")
100133
ancestral_str = str(reader["chr1"])
101-
# Our positions are zero-based, if they were one-based we would
102-
# prepend an X here.
103-
# e.g. ancestral_str = "X" + ancestral_str
104-
105-
# Set the ancestral state at the last known position to "N", for demonstration
106-
last_pos = vcf_zarr['variant_position'][-1]
107-
ancestral_str = ancestral_str[:last_pos] + "N" + ancestral_str[(last_pos + 1):]
134+
# We consider positions in the .vcz file to be one-based, so we prepend an X to the string
135+
ancestral_str = "X" + ancestral_str
108136
109137
tsinfer.add_ancestral_state_array(vcf_zarr, ancestral_str)
110138
vdata = tsinfer.VariantData("_static/example_data.vcz", ancestral_state="ancestral_state")
111139
```
112140

113-
The {class}`VariantData` object is a lightweight wrapper around the .vcz file.
114-
We'll use it to infer a tree sequence on the basis of the sites that vary between the
115-
different samples. However, note that certain sites are not used by _tsinfer_ for inferring
116-
the genealogy (although they are still encoded in the final tree sequence), These are:
141+
Because the ancestral state at position 85 does not match any of the
142+
alleles at that site, a warning has been given that the ancestral state
143+
will be considered unknown. As a consequence, although it will appear in
144+
the inferred tree sequence, the site at position 85 will be treated as a
145+
"noninference" site.
146+
147+
### Inference sites
148+
149+
Certain sites are not used by _tsinfer_ for inferring the genealogy.
150+
These _noninference_ sites are nevertheless included in the final
151+
tree sequence, but with their mutations placed by
152+
{meth}`parsimony<tskit.Tree.map_mutations>`. Such sites include:
117153

118154
* Non-variable (fixed) sites, e.g. the site at position 71 above
119155
* Singleton sites, where only one genome has the derived allele
120156
e.g. the site at position 75 above
121-
* Sites where the ancestral allele is unknown, e.g. demonstrated above when setting the
122-
ancestral state of the site at position 95 to `N`.
157+
* Sites where the ancestral allele is unknown, e.g. site 85 (see above).
123158
* Multialleleic sites, with more than 2 alleles (but see
124159
[here](https://github.com/tskit-dev/tsinfer/issues/670) for a workaround)
125160

126-
Additionally, during the inference step, additional sites can be flagged as not for use in
127-
inference, for example if they are deemed unreliable (this is done
128-
via the `exclude_positions` parameter).
129-
130-
Sites which are not used for inference will
131-
still be included in the final tree sequence, with mutations at those sites being placed
132-
onto branches by {meth}`parsimony<tskit.Tree.map_mutations>`.
133-
161+
Additional sites can be deliberately flagged as not-for-use in inference,
162+
for example if their genotypes or ancestral states are deemed unreliable,
163+
via the `exclude_positions` parameter when running the inference
164+
step of _tsinfer_.
134165

135166
### Topology inference
136167

137-
Once we have stored our data in a `.VariantData` object, we can easily infer
138-
a {ref}`tree sequence<sec_python_api_trees_and_tree_sequences>` using the Python
139-
API. Note that each sample in the original .vcz file will correspond to an *individual*
140-
in the resulting tree sequence. Since these three individuals are diploid, the resulting
141-
tree sequence will have `ts.num_samples == 6` (i.e. unlike in a .vcz file, a "sample" in
168+
Once our data is wrapped in a {class}`~tsinfer.VariantData` object, we can infer
169+
a {ref}`tree sequence<sec_python_api_trees_and_tree_sequences>` e.g. using
170+
_tsinfer_'s {ref}`Python API<sec_api>`. Note that each sample in the original
171+
.vcz file will correspond to an *individual* in the resulting tree sequence.
172+
Since these three individuals are diploid, the resulting
173+
tree sequence will have `ts.num_samples == 6` (unlike in a .vcz file, a "sample" in
142174
tskit refers to a haploid genome, not a diploid individual).
143175

144176
```{code-cell} ipython3
@@ -177,7 +209,7 @@ for v_orig, v_inferred in zip(vdata.variants(), inferred_ts.variants()):
177209
np.array(v_inferred.alleles)[v_inferred.genotypes]
178210
):
179211
raise ValueError("Genotypes in original dataset and inferred tree seq not equal")
180-
print("** Genotypes in original dataset and inferred tree sequence are the same **")
212+
print("** Genotypes in original dataset and inferred ts are identical **")
181213
```
182214

183215
We can examine the inferred genetic genealogy, in the form of
@@ -230,11 +262,14 @@ software such as [tsdate](https://tskit.dev/software/tsdate.html): the _tsinfer_
230262
algorithm is only intended to infer the genetic relationships between the samples
231263
(i.e. the *topology* of the tree sequence).
232264

265+
(sec_usage_toy_example_masks)=
266+
233267
### Masks
234268

235-
It is possible to *completely* exclude sites and samples, by specifing a boolean
236-
`site_mask` and/or a `sample_mask` when creating the `VariantData` object. Sites or samples with
237-
a mask value of `True` will be completely omitted both from inference and the final tree sequence.
269+
As well as marking sites as not-for-inference, is possible to *completely* exclude
270+
sites and samples, by specifing a boolean `site_mask` and/or a `sample_mask` when
271+
creating the `VariantData` object. Sites or samples with a mask value of `True` will
272+
be completely omitted both from inference and the final tree sequence.
238273
This can be useful, for example, if you wish to select only a subset of the chromosome for
239274
inference, e.g. to reduce computational load. You can also use it to subset inference to a
240275
particular contig, if your dataset contains multiple contigs. Note that if a `site_mask` is provided,
@@ -392,6 +427,8 @@ In practice this means we can keep such files lying around without taking up too
392427

393428
Once we have our `.vcz` file created, running the inference is straightforward.
394429

430+
(sec_usage_simulation_example_inference)=
431+
395432
```{code-cell} ipython3
396433
# Infer & save a ts from the notebook simulation.
397434
ancestral_states = np.load(f"{name}-AA.npy")
@@ -499,9 +536,9 @@ source and inferred tree sequences.
499536

500537
## Data example
501538

502-
Inputting real data for inference is similar in principle to the examples above.
539+
Inputting real data for inference is similar in principle to the previous examples.
503540
All that is required is a .vcz file, which can be created using
504-
[vcf2zarr](https://sgkit-dev.github.io/bio2zarr/vcf2zarr/overview.html) as above
541+
[vcf2zarr](https://sgkit-dev.github.io/bio2zarr/vcf2zarr/overview.html) as above.
505542

506543
(sec_usage_read_vcf)=
507544

@@ -519,9 +556,10 @@ vcf_location = "_static/P_dom_chr24_phased.vcf.gz"
519556
!python -m bio2zarr vcf2zarr convert --force {vcf_location} sparrows.vcz
520557
```
521558

522-
This creates the `sparrows.vcz` datastore, which we open using `tsinfer.VariantData`.
523-
The original VCF had the ancestral allelic state specified in the `AA` INFO field,
524-
so we can simply provide the string `"variant_AA"` as the ancestral_state parameter.
559+
This creates the `sparrows.vcz` datastore, which we open using
560+
{class}`tsinfer.VariantData`. The original VCF had the ancestral allelic
561+
state specified in the `AA` INFO field, so we can simply provide the
562+
string `"variant_AA"` as the ancestral_state parameter.
525563

526564
```{code-cell} ipython3
527565
# Do the inference: this VCF has ancestral states in the AA field
@@ -536,7 +574,7 @@ print(
536574

537575
On a modern computer, this should only take a few seconds to run.
538576

539-
### Adding more metadata
577+
#### Adding more metadata
540578

541579
We can add additional data to the zarr file, which will make it through to the tree sequence.
542580
For instance, we might want to mark which population each individual comes from.
@@ -573,8 +611,21 @@ for i, name in enumerate(vcf_zarr["sample_id"]):
573611
zarr.save("sparrows.vcz/individuals_population", individuals_population)
574612
```
575613

576-
Now when we carry out the inference, we get a tree sequence in which the nodes are
577-
correctly assigned to named populations
614+
(sec_usage_read_vcf_inference)=
615+
616+
### _Tsinfer_ inference
617+
618+
Note that the steps above to generate a .vcz file are not strictly part of
619+
_tsinfer_. We only invoke _tsinfer_ subsequently, when creating a
620+
{class}`~tsinfer.VariantData` object. Moreover, _tsinfer_ treats the
621+
.vcz information as read-only, and does not make a copy of it.
622+
This means _tsinfer_ is well-suited to using publicly provided, read-only
623+
.vcz datafiles. Furthermore, {ref}`sec_usage_toy_example_masks` make it easy
624+
to use subsets of such datafiles, e.g. if they contain multiple chromosomes
625+
or more samples than are required for your analysis.
626+
627+
As the .vcz file we are now using contains population metadata, `tsinfer` will create
628+
a tree sequence whose sample nodes are correctly assigned to named populations:
578629

579630
```{code-cell} ipython3
580631
vdata = tsinfer.VariantData("sparrows.vcz", ancestral_state="variant_AA", individuals_population="individuals_population")
@@ -600,6 +651,7 @@ the [tskit](https://tskit.dev/tskit/docs/stable/) library. The
600651
{ref}`tskit tutorial<sec_tutorial_stats>` provides much more detail. Below we just give a
601652
flavour of the possibilities.
602653

654+
603655
To quickly eyeball small datasets, we can draw the entire tree sequence, or
604656
{meth}`~tskit.Tree.draw` the tree at any particular genomic position. The following
605657
code demonstrates how to use the {meth}`tskit.TreeSequence.at` method to obtain the tree
@@ -695,7 +747,6 @@ print(gnn_table)
695747
print(gnn_table.groupby(level="Country").mean())
696748
```
697749

698-
699750
From this, it can be seen that the genealogical nearest neighbours of birds in Norway
700751
tend also to be in Norway, and vice versa for birds from France. In other words, there is
701752
a small but noticable degree of population structure in the data. The bird ``8L19766``

0 commit comments

Comments
 (0)