diff --git a/_config.yml b/_config.yml index 8f31b82a..32833dbc 100644 --- a/_config.yml +++ b/_config.yml @@ -33,6 +33,7 @@ sphinx: tskit: ["https://tskit.dev/tskit/docs/stable", null] msprime: ["https://tskit.dev/msprime/docs/stable", null] pyslim: ["https://pyslim.readthedocs.io/en/stable", null] + numpy: ["https://numpy.org/doc/stable/", null] myst_enable_extensions: - colon_fence - deflist diff --git a/_toc.yml b/_toc.yml index 9bdd798c..ae0e00e0 100644 --- a/_toc.yml +++ b/_toc.yml @@ -13,6 +13,10 @@ parts: - file: simplification - file: metadata - file: tskitr +- caption: Analysis + chapters: + - file: analysing_tree_sequences + - file: analysing_trees - caption: Backward time simulations # TODO this will be broken down more finely as we add more content chapters: diff --git a/analysing_tree_sequences.md b/analysing_tree_sequences.md new file mode 100644 index 00000000..02fec1de --- /dev/null +++ b/analysing_tree_sequences.md @@ -0,0 +1,25 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.12 + jupytext_version: 1.9.1 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- + +```{currentmodule} tskit +``` + +(sec_analysing_tree_sequences)= + +# Analysing tree sequences + +:::{todo} +Add all stats material from https://tskit.dev/tskit/docs/stable/tutorial.html + +Should we rename this "calculating statistics", or something else with the word "statistics" in the name? +::: \ No newline at end of file diff --git a/analysing_trees.md b/analysing_trees.md new file mode 100644 index 00000000..819c3061 --- /dev/null +++ b/analysing_trees.md @@ -0,0 +1,409 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.12 + jupytext_version: 1.9.1 +kernelspec: + display_name: Python 3 + language: python + name: python3 +--- + +```{currentmodule} tskit +``` + +```{code-cell} ipython3 +:tags: [remove-cell] +import io +import pickle + +import msprime +import tskit + +def tree_traversals(): + nodes = """\ + id is_sample time + 0 1 0 + 1 1 0 + 2 1 0 + 3 1 0 + 4 1 0 + 5 0 1 + 6 0 2 + 7 0 3 + """ + edges = """\ + left right parent child + 0 1 5 0,1,2 + 0 1 6 3,4 + 0 1 7 5,6 + """ + # NB same tree as used above, and we're using the same diagram. + ts = tskit.load_text( + nodes=io.StringIO(nodes), edges=io.StringIO(edges), strict=False + ) + ts.dump("data/tree_traversals.trees") + +def different_time_samples(): + samples = [ + msprime.Sample(0, 0), + msprime.Sample(0, 1), + msprime.Sample(0, 20), + ] + ts = msprime.simulate( + Ne=1e6, + samples=samples, + demographic_events=[ + msprime.PopulationParametersChange(time=10, growth_rate=2, population_id=0), + ], + random_seed=42, + ) + ts.dump("data/different_time_samples.trees") + +def parsimony_simple(): + ts = msprime.sim_ancestry(3, random_seed=42) + ts.dump("data/parsimony_simple.trees") + +def parsimony_map(): + # pick a seed that gives the tips in the right order + ts = msprime.sim_ancestry(3, sequence_length=100, random_seed=38) + ts.dump("data/parsimony_map.trees") + ts = msprime.sim_mutations( + ts, rate=0.01, random_seed=123, discrete_genome=False) + data = [ + {'pos':v.site.position, 'alleles': v.alleles, 'genotypes':v.genotypes} + for v in ts.variants()] + with open("data/parsimony_map.pickle", "wb") as f: + pickle.dump(data, f) + + +def create_notebook_data(): + tree_traversals() + different_time_samples() + parsimony_simple() + parsimony_map() + +# create_notebook_data() # uncomment to recreate the tree seqs used in this notebook +``` + +(sec_analysing_trees)= +# Analysing trees + +There are a number of different ways we might want to analyse a single {class}`Tree`. +Most involve some sort of traversal over the nodes, mutations, or branches in the tree. +{program}`tskit` provides various way of traversing through a tree, and also some +built in phylogenetic algorithms such as {meth}`Tree.map_mutations` which efficently +places mutations ("characters" in phylogenetic terminology) on a given tree. + + +(sec_analysing_trees_traversals)= + +## Tree traversals + +There are various a single {class}`Tree`, traversals in various orders are possible using the +{meth}`~Tree.nodes` iterator. For example, in the following tree we can visit the +nodes in different orders: + + +```{code-cell} ipython3 +import tskit +from IPython.display import SVG, display + +ts = tskit.load("data/tree_traversals.trees") +tree = ts.first() +display(SVG(tree.draw_svg())) + +for order in ["preorder", "inorder", "postorder"]: + print(f"{order}:\t", list(tree.nodes(order=order))) +``` + +Much of the time, the specific ordering of the nodes is not important +and we can leave it out (defaulting to preorder traversal). For example, +here we compute the total branch length of a tree: + +```{code-cell} ipython3 +total_branch_length = sum(tree.branch_length(u) for u in tree.nodes()) +print(f"Total branch length: {total_branch_length}") +``` + +Note that this is also available as the {attr}`Tree.total_branch_length` attribute. + +### Traversing upwards + +For many applications it is useful to be able to traverse upwards from the +leaves. We can do this using the {meth}`Tree.parent` method, which +returns the parent of a node. For example, we can traverse upwards from +each of the samples in the tree: + +```{code-cell} ipython3 +for u in tree.samples(): + path = [] + v = u + while v != tskit.NULL: + path.append(v) + v = tree.parent(v) + print(u, "->", path) +``` + +### Traversals with information + +Sometimes we will need to traverse down the tree while maintaining +some information about the nodes that are above it. While this +can be done using recursive algorithms, it is often more convenient +and efficient to use an iterative approach. Here, for example, +we define an iterator that yields all nodes in preorder along with +their path length to root: + +```{code-cell} ipython3 +def preorder_dist(tree): + for root in tree.roots: + stack = [(root, 0)] + while len(stack) > 0: + u, distance = stack.pop() + yield u, distance + for v in tree.children(u): + stack.append((v, distance + 1)) + +print(list(preorder_dist(tree))) +``` + + +(sec_tutorial_networkx)= + +## Networkx + +Traversals and other network analysis can also be performed using the sizeable +[networkx](https://networkx.github.io/documentation/stable/index.html) +library. This can be achieved by calling {meth}`Tree.as_dict_of_dicts` to +convert a {class}`Tree` instance to a format that can be imported by networkx to +create a graph: + +```{code-cell} ipython3 +import networkx as nx + +g = nx.DiGraph(tree.as_dict_of_dicts()) +print(sorted(g.edges)) +``` + +### Traversing upwards in networkx + +We can revisit the above examples and traverse upwards with +networkx using a depth-first search algorithm: + +```{code-cell} ipython3 +g = nx.DiGraph(tree.as_dict_of_dicts()) +for u in tree.samples(): + path = [u] + [parent for parent, child, _ in + nx.edge_dfs(g, source=u, orientation="reverse")] + print(u, "->", path) +``` + + +### Calculating distances to the root + +Similarly, we can yield the nodes of a tree along with their distance to the +root in pre-order in networkx as well. Running this on the example above gives us +the same result as before: + +```{code-cell} ipython3 +g = nx.DiGraph(tree.as_dict_of_dicts()) +for root in tree.roots: + print(sorted(list(nx.shortest_path_length(g, source=root).items()))) +``` + + +### Finding nearest neighbors + +If some samples in a tree are not at time 0, then finding the nearest neighbor +of a sample is a bit more involved. Instead of writing our own traversal code +we can again draw on a networkx algorithm. +Let us start with an example tree with three samples that were sampled at +different time points: + +```{code-cell} ipython3 +ts = tskit.load("data/different_time_samples.trees") +tree = ts.first() +SVG(tree.draw_svg(y_axis=True, time_scale="rank")) +``` + +The generation times for these nodes are as follows: + +```{code-cell} ipython3 +for u in tree.nodes(): + print(f"Node {u}: time {tree.time(u)}") +``` + +Note that samples 0 and 1 are about 35 generations apart from each other even though +they were sampled at almost the same time. This is why samples 0 and 1 are +closer to sample 2 than to each other. + +For this nearest neighbor search we will be traversing up and down the tree, +so it is easier to treat the tree as an undirected graph: + +```{code-cell} ipython3 +g = nx.Graph(tree.as_dict_of_dicts()) +``` + +When converting the tree to a networkx graph the edges are annotated with their +branch length: + +```{code-cell} ipython3 +for e in g.edges(data=True): + print(e) +``` + +We can now use the "branch_length" field as a weight for a weighted shortest path +search: + +```{code-cell} ipython3 +import collections +import itertools + +# a dictionary of dictionaries to represent our distance matrix +dist_dod = collections.defaultdict(dict) +for source, target in itertools.combinations(tree.samples(), 2): + dist_dod[source][target] = nx.shortest_path_length( + g, source=source, target=target, weight="branch_length" + ) + dist_dod[target][source] = dist_dod[source][target] + +# extract the nearest neighbor of nodes 0, 1, and 2 +nearest_neighbor_of = [min(dist_dod[u], key=dist_dod[u].get) for u in range(3)] + +print(dict(zip(range(3), nearest_neighbor_of))) +``` + + +(sec_analysing_trees_parsimony)= + +## Parsimony + +The {meth}`Tree.map_mutations` method finds a parsimonious explanation for a +set of discrete character observations on the samples in a tree using classical +phylogenetic algorithms. + +```{code-cell} ipython3 +tree = tskit.load("data/parsimony_simple.trees").first() +alleles = ["red", "blue", "green"] +genotypes = [0, 0, 0, 0, 1, 2] +styles = [f".n{j} > .sym {{fill: {alleles[g]}}}" for j, g in enumerate(genotypes)] +display(SVG(tree.draw_svg(style="".join(styles)))) + +ancestral_state, mutations = tree.map_mutations(genotypes, alleles) +print("Ancestral state = ", ancestral_state) +for mut in mutations: + print(f"Mutation: node = {mut.node} derived_state = {mut.derived_state}") +``` + +So, the algorithm has concluded, quite reasonably, that the most parsimonious +description of this state is that the ancestral state is red and there was +a mutation to blue and green over nodes 4 and 5. + +### Building tables + +One of the main uses of {meth}`Tree.map_mutations` is to position mutations on a tree +to encode observed data. In the following example we show how a set +of tables can be updated using the {ref}`Tables API`; here we +infer the location of mutations in an simulated tree sequence of one tree, +given a set of site positions with their genotypes and allelic states: + +```{code-cell} ipython3 +import pickle +ts = tskit.load("data/parsimony_map.trees") +with open("data/parsimony_map.pickle", "rb") as file: + data = pickle.load(file) # Load saved variant data from a file +display(SVG(ts.draw_svg(size=(500, 300), time_scale="rank"))) +print("Variant data: pos, genotypes & alleles as described by the ts.variants() iterator:") +for i, v in enumerate(data): + print(f"Site {i} (pos {v['pos']:7.4f}): alleles: {v['alleles']}, genotypes: {v['genotypes']}") +print() +tree = ts.first() # there's only one tree anyway +tables = ts.dump_tables() +# Infer the sites and mutations from the variants. +for variant in data: + ancestral_state, mutations = tree.map_mutations(variant["genotypes"], variant['alleles']) + site_id = tables.sites.add_row(variant['pos'], ancestral_state=ancestral_state) + info = f"Site {site_id}: parsimony sets ancestral state to {ancestral_state}" + parent_offset = len(tables.mutations) + for mut in mutations: + parent = mut.parent + if parent != tskit.NULL: + parent += parent_offset + mut_id = tables.mutations.add_row( + site_id, node=mut.node, parent=parent, derived_state=mut.derived_state) + info += f", and places mutation {mut.id} to {mut.derived_state} above node {mut.node}" + print(info) +new_ts = tables.tree_sequence() +``` + +```{code-cell} ipython3 +mut_labels = {} # An array of labels for the mutations +for mut in new_ts.mutations(): # Make pretty labels showing the change in state + site = new_ts.site(mut.site) + older_mut = mut.parent >= 0 # is there an older mutation at the same position? + prev = new_ts.mutation(mut.parent).derived_state if older_mut else site.ancestral_state + mut_labels[site.id] = f"{mut.id}: {prev}→{mut.derived_state}" + +display(SVG(new_ts.draw_svg(size=(500, 300), mutation_labels=mut_labels, time_scale="rank"))) +``` + + +### Parsimony and missing data + +The Hartigan parsimony algorithm in {meth}`Tree.map_mutations` can also take missing data +into account when finding a set of parsimonious state transitions. We do this by +specifying the special value {data}`tskit.MISSING_DATA` (-1) as the state, which is +treated by the algorithm as "could be anything". + +For example, here we state that sample 0 is missing, and use the colour white to indicate +this: + +```{code-cell} ipython3 +tree = tskit.load("data/parsimony_simple.trees").first() +alleles = ["red", "blue", "green", "white"] +genotypes = [tskit.MISSING_DATA, 0, 0, 0, 1, 2] +styles = [f".n{j} > .sym {{fill: {alleles[g]}}}" for j, g in enumerate(genotypes)] +display(SVG(tree.draw_svg(style="".join(styles)))) + +ancestral_state, mutations = tree.map_mutations(genotypes, alleles) +print("Ancestral state = ", ancestral_state) +for mut in mutations: + print(f"Mutation: node = {mut.node} derived_state = {mut.derived_state}") +``` + + +The algorithm decided, again, quite reasonably, that the most parsimonious explanation +for the input data is the same as before. Thus, if we used this information to fill +out mutation table as above, we would impute the missing value for 0 as red. + +The output of the algorithm can be a little surprising at times. Consider this example:: + +```{code-cell} ipython3 +tree = msprime.simulate(6, random_seed=42).first() +alleles = ["red", "blue", "white"] +genotypes = [1, -1, 0, 0, 0, 0] +node_colours = {j: alleles[g] for j, g in enumerate(genotypes)} +ancestral_state, mutations = tree.map_mutations(genotypes, alleles) +print("Ancestral state = ", ancestral_state) +for mut in mutations: + print(f"Mutation: node = {mut.node} derived_state = {mut.derived_state}") +SVG(tree.draw(node_colours=node_colours)) +``` + + +Note that this is putting a mutation to blue over node 6, **not** node 0 as +we might expect. Thus, we impute here that node 1 is blue. It is important +to remember that the algorithm is minimising the number of state transitions; +this may not correspond always to what we might consider the most parsimonious +explanation. + +## Fast tree-based algorithms using numba + +:::{todo} +Add a few examples here of how to use numba to speed up tree based dynamic programming +algorithms. There are a good number of worked-up examples with timings in +[issue #63](https://github.com/tskit-dev/tutorials/issues/63) +::: + diff --git a/data/construction_example.trees b/data/construction_example.trees new file mode 100644 index 00000000..5c7fc9af Binary files /dev/null and b/data/construction_example.trees differ diff --git a/data/different_time_samples.trees b/data/different_time_samples.trees new file mode 100644 index 00000000..233cd155 Binary files /dev/null and b/data/different_time_samples.trees differ diff --git a/data/parsimony_map.pickle b/data/parsimony_map.pickle new file mode 100644 index 00000000..55dba4ef Binary files /dev/null and b/data/parsimony_map.pickle differ diff --git a/data/parsimony_map.trees b/data/parsimony_map.trees new file mode 100644 index 00000000..0ac459ab Binary files /dev/null and b/data/parsimony_map.trees differ diff --git a/data/parsimony_simple.trees b/data/parsimony_simple.trees new file mode 100644 index 00000000..cfb30429 Binary files /dev/null and b/data/parsimony_simple.trees differ diff --git a/data/tables_example.trees b/data/tables_example.trees new file mode 100644 index 00000000..51cda181 Binary files /dev/null and b/data/tables_example.trees differ diff --git a/data/tables_example_muts.trees b/data/tables_example_muts.trees new file mode 100644 index 00000000..30c71489 Binary files /dev/null and b/data/tables_example_muts.trees differ diff --git a/data/tree_traversals.trees b/data/tree_traversals.trees new file mode 100644 index 00000000..cdc41671 Binary files /dev/null and b/data/tree_traversals.trees differ diff --git a/data_structures.md b/data_structures.md index 568185dd..3c87de18 100644 --- a/data_structures.md +++ b/data_structures.md @@ -11,12 +11,529 @@ kernelspec: name: python3 --- +```{currentmodule} tskit +``` + +```{code-cell} ipython3 +:tags: [remove-cell] +import io +import pickle + +import msprime +import tskit + +def tables_examples(): + nodes = """\ + id is_sample time + 0 1 0 + 1 1 0 + 2 1 0 + 3 0 0.15 + 4 0 0.6 + 5 0 0.8 + 6 0 1.0 + """ + edges = """\ + left right parent child + 20 80 3 0 + 20 80 3 2 + 0 100 4 1 + 0 20 4 2 + 80 100 4 2 + 20 80 4 3 + 80 100 5 0 + 80 100 5 4 + 0 20 6 0 + 0 20 6 4 + """ + sites = """\ + id position ancestral_state + 0 15 A + 1 42 G + 2 60 T + """ + mutations = """\ + site node derived_state parent time + 0 4 G -1 0.9 + 1 1 A -1 0.4 + 2 3 C -1 0.55 + 2 0 T 2 0.1 + """ + ts = tskit.load_text( + nodes=io.StringIO(nodes), + edges=io.StringIO(edges), + strict=False, + ) + ts.dump("data/tables_example.trees") + ts = tskit.load_text( + nodes=io.StringIO(nodes), + edges=io.StringIO(edges), + sites=io.StringIO(sites), + mutations=io.StringIO(mutations), + strict=False, + ) + ts.dump("data/tables_example_muts.trees") + +def construction_example(): + nodes = """\ + id is_sample time + 0 1 0 + 1 1 0 + 2 0 10 + 3 1 0 + 4 0 20 + """ + edges = """\ + left right parent child + 0 1000 2 0 + 0 1000 2 1 + 0 1000 4 2 + 0 1000 4 3 + """ + sites = """\ + id position ancestral_state + 0 500 A + """ + mutations = """\ + site node derived_state parent + 0 3 G -1 + """ + ts = tskit.load_text( + nodes=io.StringIO(nodes), + edges=io.StringIO(edges), + sites=io.StringIO(sites), + mutations=io.StringIO(mutations), + strict=False, + ) + ts.dump("data/construction_example.trees") + + +def create_notebook_data(): + tables_examples() + construction_example() + +# create_notebook_data() # uncomment to recreate the tree seqs used in this notebook +``` + + (sec_data_structures)= # Data structures +(sec_data_structures_tables)= ## Tables +Internally, a tree sequence can be thought of as a set of tables, and {program}`tskit` +provides an interface for dealing with these tables directly. This is particularly +relevant when you wish to {ref}`edit or otherwise modify` +a tree sequence, although tables are also useful for bulk access to data contained in +the tree sequence, such as the times of all nodes. + +There are eight tables that together define a tree sequence, although some may be empty, +and together they form a {class}`TableCollection`. +The tables are defined in the official {program}`tskit` documentation for +{ref}`Table Definitions `, and the +{ref}`Tables API ` section in the docs describes how to work with +them. In this tutorial we give some pointers about what you can and cannot do with them. + +### Correspondance between tables and trees + +Consider the following sequence of trees: + +```{code-cell} ipython3 +from IPython.display import SVG +ts = tskit.load("data/tables_example.trees") +SVG(ts.draw_svg(y_axis=True)) +``` + +Ancestral recombination events have produced three different trees +that relate the three sampled genomes ``0``, ``1``, and ``2`` to each other +along the chromosome of length 100. + +#### Node and edge tables + +Each node in each of the above trees represents a particular ancestral genome +(a *haploid* genome; diploid individuals would be represented by two nodes). +Details about each node, including the time it lived, are stored in a +{class}`NodeTable` (details {ref}`here`) + +```{code-cell} ipython3 +ts.tables.nodes +``` + +Importantly, the first column, ``id``, is not actually recorded, and is +only shown when printing out node tables (as here) for convenience. +The second column, ``flags`` records a ``1`` for the individuals that are *samples*, +i.e., whose entire genealogical history is recorded by these trees. +(Note that the trees above record that node 3 inherited from node 4 +on the middle portion of the genome, but not on the ends.) + +The way the nodes are related to each other (i.e. the tree topology) is stored +in an {class}`EdgeTable` (details {ref}`here`). +Since some edges are present in more than one tree (e.g., node 1 inherits from node 4 +across the entire sequence), each edge contains not only the IDs of a parent and a child +node, but also the left and right positions which define the genomic region for which it +appears in the trees: + +```{code-cell} ipython3 +ts.tables.edges +``` + +Since node 3 is most recent, the edge that says that nodes 0 and 2 inherit +from node 3 on the interval between 0.2 and 0.8 comes first. Next are the +edges from node 4: there are four of these, as the edge from node 4 to node +1 is shared across the entire sequence, and for each of the three +genomic intervals there is an additional child node. At this +point, we know the full tree on the middle interval. Finally, edges +specifying the common ancestor of 0 and 4 on the remaining intervals (parents 6 +and 5 respectively) allow us to construct all trees across the entire interval. + + + +#### Sites and mutations tables + +Most tree sequences have DNA variation data associated with them, +{ref}`stored as mutations overlaid on the trees`: + +```{code-cell} ipython3 +ts = tskit.load("data/tables_example_muts.trees") +mut_labels = {} # An array of labels for the mutations, listing position & allele change +for mut in ts.mutations(): # This entire loop is just to make pretty labels + site = ts.site(mut.site) + older_mut = mut.parent >= 0 # is there an older mutation at the same position? + prev = ts.mutation(mut.parent).derived_state if older_mut else site.ancestral_state + mut_labels[mut.id] = "{}→{} @{:g}".format(prev, mut.derived_state, site.position) + +SVG(ts.draw_svg(y_axis=True, mutation_labels=mut_labels)) +``` + +There are four mutations in the depiction above, +marked by red crosses: one above node ``4`` on the first tree which records an A to G +transition at position 15, another above node ``1`` in the second tree which records a G +to A transition at position 45, and the final two above nodes ``2`` and ``0`` recording +transitions, both at position 60, on the second tree. The positions are recorded in the +{class}`SiteTable` (details {ref}`here`): + +```{code-cell} ipython3 +ts.tables.sites +``` + +As with node tables, the ``id`` column is **not** actually recorded, but is +implied by the position in the table. The mutations themselves are recorded in the +{class}`MutationTable` (details {ref}`here`). +This associates each mutation with a site ID, the ID of the node above which the +mutation occurs, the derived state to which the allele has mutated, and (optionally) a +time at which the mutation occured. + + +```{code-cell} ipython3 +ts.tables.mutations +``` + +Where there are multiple mutations at the same site, there can be mutations +stacked on top of each other. The "parent" column therefore contains the ID of the +mutation immediately above the current one at the same site, or -1 if there is no parent. + +These sites and mutations allow us to calculate the DNA sequence, or haplotype, for each +of the sample nodes: + +```{code-cell} ipython3 +for sample, h in enumerate(ts.haplotypes()): + print(f"Sample {sample}: {h}") +``` + +#### Other tables + +The other tables in a {class}`TableCollection` are the {class}`IndividualTable` (which +records which {ref}`individual organism` a node is contained +in), the {class}`PopulationTable` and the {class}`MigrationTable`, and the +{class}`ProvenanceTable` which records the {ref}`tskit:sec_provenance` of the data in +a tree sequence. These won't be covered in this tutorial. + +#### Metadata + +You may have noticed ``metadata`` columns in all the tables above. All tables can have +(optional) metadata attached to their rows, as described in the {ref}`sec_metadata` +section of the official {program}`tskit` documentation. +The {ref}`metadata tutorial` provides more information about how +to set and use these metadata columns in tree sequence tables. + + +### Accessing table data + +To look at a the contents of a table, you can simply `print` it: + +```{code-cell} ipython3 +print(ts.tables.mutations) +``` + +But {program}`tskit` also provides access to the rows and columns of each table. + +#### Row access + +Rows can be accessed using square braces, which will return an object containing the +raw values: + +```{code-cell} ipython3 +row = ts.tables.mutations[1] +print(f"A mutation at site {row.site} exists at time {row.time}") +print(row) +``` + +Additionally, many rows can be extracted into a new table using slices or +{ref}`numpy indexing` + +```{code-cell} ipython3 +ts.tables.mutations[2:4] +``` + +:::{note} +When manipulating table data, it is quite possible to create a table collection that +does not correspond to a valid tree sequence. For instance, if we replaced the mutations +table in our original tree sequence with the table above, the parent column +would refer to an invalid mutation ID (ID 2, when in the new tables we only have +mutations with IDs 0 and 1). In this particular case there is a method, +{meth}`TableCollection.compute_mutation_parents`, which will recalculate the parent IDs +from the trees, but there is no general way to ensure that a manipulated table will +remain valid. +::: + +Rows can also be *added* to a table using ``.add_row()``. However, when modifying tables +from a tree sequence, you should always modify a *copy* of the original data, +using the {meth}`TreeSequence.dump_tables` method: + +```{code-cell} ipython3 +new_tables = ts.dump_tables() +new_pos = 10 +new_site_id = new_tables.sites.add_row(position=new_pos, ancestral_state="G") +print("New empty site allocated at position {new_pos} with ID {new_site_id}") +new_tables.sites +``` + +Note that one of the requirements of a tree sequence is that the sites are listed in +order of position, so this new table will require sorting (see +{ref}`sec_data_structures_tables_creating`) + + +#### Column access + +An entire column in a table can be extracted as a {program}`numpy` array from the table +object. For instance, if ``n`` is a {class}`NodeTable`, then ``n.time`` +will return an array containing the birth times of the individuals whose genomes +are represented by the nodes in the table. *However*, it is important to note that this +is a copy of the data, and modifying individual elements of ``n.time`` will *not* change +the node table ``n``. To change the column data, you need to take a copy, modify it, +and assign it back in. For example, here we add 0.25 to every ``time`` except the first +in the node table from our current example: + +```{code-cell} ipython3 +node_table = new_tables.nodes +times = node_table.time +print("Old node times:", times) +times[1:] = times[1:] + 0.25 +node_table.time = times +node_table +``` + +When assigning columns like this, an error will be raised if the column is not of the +expected length: + +```{code-cell} ipython3 +:tags: [raises-exception] +node_table.time = times[2:] +``` + +(sec_data_structures_tables_creating)= + +### Turning tables into a tree sequence + +The {meth}`TableCollection.tree_sequence` method will attempt to turn a table collection +into a tree sequence. This is not guaranteed to work: it's possible you have created a +nonsensical tree sequence where, for example, a child node has multiple parents at +a given position. The {ref}`tskit:sec_valid_tree_sequence_requirements` section of the +official tskit docs lists the requirements; these include the correct order for rows in +a number of tables. For instance, the sites table is +{ref}`required` to be sorted in order of position. Since +this is not true of the sites table in the ``new_tables`` collection we have just +created, we need to {meth}`~TableCollection.sort` the table collection before turning +it into a tree sequence + +```{code-cell} ipython3 +# new_ts = new_tables.tree_sequence() # This won't work +new_tables.sort() +new_ts = new_tables.tree_sequence() # Now it will +# Plot without mutation labels, for clarity +SVG(new_ts.draw_svg(y_axis=True, y_gridlines=True, mutation_labels={})) +``` + +Here you can see that a new empty site has been added at position 10 (represented by a +tickmark above the X axis with no mutations on it), and all the nodes except node 0 +have had their times increased by 0.25. + +### Constructing a tree sequence + +With the tools above in mind, we can now see how to construct a tree sequence by hand. +It's unlikely that you would ever need to do this from scratch, but it's helpful to +understand how tables work. We'll build the simplest possible tree sequence, a single +tree that looks like this: + +```{code-cell} ipython3 +SVG(tskit.load("data/construction_example.trees").draw_svg(y_axis=True)) +``` + +Starting with an empty set of tables, we can fill, say, the node information by using +{meth}`NodeTable.add_row` as follows: + +```{code-cell} ipython3 +tables = tskit.TableCollection(sequence_length=1e3) +node_table = tables.nodes # set up an alias, for efficiency +node_table.add_row(flags=tskit.NODE_IS_SAMPLE) # Node 0: defaults to time 0 +node_table.add_row(flags=tskit.NODE_IS_SAMPLE) # Node 1: defaults to time 0 +node_table.add_row(time=10) # Node 2: not a sample +node_table.add_row(flags=tskit.NODE_IS_SAMPLE) # Node 3 +node_table.add_row(time=20) # Node 4 +node_table +``` + +The ``.add_row()`` method is natural (and should be reasonably efficient) if +new records appear one-by-one. Alternatively ``.set_columns()`` can be used to +set columns for all the rows at once, by passing in numpy arrays of the appropriate +type. We'll use that for the edges: + +```{code-cell} ipython3 +import numpy as np +edge_table = tables.edges +edge_table.set_columns( + left=np.array([0.0, 0.0, 0.0, 0.0]), + right=np.array([1e3, 1e3, 1e3, 1e3]), + parent=np.array([2, 2, 4, 4], dtype=np.int32), # References IDs in the node table + child=np.array([0, 1, 2, 3], dtype=np.int32), # References IDs in the node table +) +edge_table +``` + +And finally we can add a site and a mutation: here we'll use 0/1 mutations rather than +ATGC. + +```{code-cell} ipython3 +site_id = tables.sites.add_row(position=500.0, ancestral_state='0') +tables.mutations.add_row(site=site_id, node=2, derived_state='1') +ts = tables.tree_sequence() +print("A hand-built tree sequence!") +SVG(ts.draw_svg(y_axis=True)) +``` + +:::{note} +The ordering requirements for a valid tree sequence +{ref}`do not specify` that rows in the +node table have to be sorted in any particular order, e.g. by time. By convention, +sample nodes are often the first ones listed in the node table, and this is the node +order returned by {meth}`~TreeSequence.simplify`, but the example above shows that +sample nodes need not necessarily be those with the IDs $0..n$. +::: + +(sec_data_structures_editing)= + +## Editing tree sequences + +Sometimes we wish to make some minor modifications to a tree sequence that has +been generated by a simulation. However, tree sequence objects are **immutable** +and so we cannot edit them in place. + + +Several {program}`tskit` methods will return a new tree sequence that has been +modified in some way. For example: +- {meth}`TreeSequence.delete_sites` returns a tree sequence with certain sites + deleted +- {meth}`TreeSequence.delete_intervals` and {meth}`TreeSequence.keep_intervals` + return a tree sequence whose trees cover a smaller fraction of the genome (and which can + be combined with {meth}`TreeSequence.trim` to change the coordinate system) +- {meth}`TreeSequence.simplify` returns a new tree sequence with some sample nodes removed + (see the {ref}`simplification tutorial`). +- {meth}`TreeSequence.union` returns a new tree sequence formed by merging two other tree + sequences together. + +However, if you want to do something not covered by those built-in methods, you will +need to edit a copy of the underlying tables and then create a new tree sequence from +the modified tables. In the following example, we use this +approach to remove all singleton sites from a given tree sequence. + +```{code-cell} ipython3 +def strip_singletons(ts): + tables = ts.dump_tables() + tables.sites.clear() + tables.mutations.clear() + for tree in ts.trees(): + for site in tree.sites(): + assert len(site.mutations) == 1 # Only supports infinite sites muts. + mut = site.mutations[0] + if tree.num_samples(mut.node) > 1: + site_id = tables.sites.append(site) + mut = mut.replace(site=site_id) # set the new site id + tables.mutations.append(mut) + return tables.tree_sequence() +``` + +This function takes a tree sequence containing some infinite sites mutations as +input, and returns a copy in which all singleton sites have been removed. +The approach is very simple: we get a copy of the underlying +table data in a {class}`TableCollection` object, and first clear the +site and mutation tables. We then consider each site in turn, +and if the number of samples with +the mutation is greater than one, we add the site and mutation to our +output tables using {meth}`SiteTable.append` and {meth}`MutationTable.append`. These +functions act exactly like {meth}`SiteTable.add_row` and {meth}`MutationTable.add_row` +but they take an existing site or mutation and add all its details as a new row. + + +:::{note} +In this code we consider only simple infinite sites mutations, +where we cannot have back or recurrent mutations. These would require a slightly +more involved approach where we keep a map of mutation IDs so that +mutation ``parent`` values could be computed. +::: + +After considering each site, we then create a new tree sequence using +the {meth}`TableCollection.tree_sequence` method on our updated tables. +We can test this on a tree sequence that has been mutated using +{func}`msprime.sim_mutations` with the ``discrete_genome`` +parameter set to ``False``, which creates infinite sites mutations: + +```{code-cell} ipython3 +import msprime + +ts = msprime.sim_ancestry(10, random_seed=123) +ts = msprime.sim_mutations(ts, rate=10, discrete_genome=False, random_seed=123) +print(ts.num_sites, "sites in the simulated tree sequence") + +ts_new = strip_singletons(ts) +print(ts_new.num_sites, "sites after removing singletons") +``` + +:::{todo} +Add another example here where we use the array oriented API to edit +the nodes and edges of a tree sequence. Perhaps recapitating would be a +good example? +::: + +(sec_data_structures_trees)= ## Trees + +A {class}`Tree` represents a single tree in a {class}`TreeSequence`. +The {program}`tskit` Tree implementation differs from most tree libraries by +using **integer IDs** to refer to nodes rather than objects. Thus, when we wish to +find the parent of the node with ID '0', we use ``tree.parent(0)``, which +returns another integer. If '0' does not have a parent in the current tree +(e.g., if it is a root), then the special value {data}`tskit.NULL` +({math}`-1`) is returned. The children of a node are found using the +{meth}`Tree.children` method. To obtain information about a particular node, +one may either use ``tree.tree_sequence.node(u)`` to which returns the +corresponding {class}`Node` instance, or use the {attr}`Tree.time` or +{attr}`Tree.population` shorthands. + +:::{todo} +Add an example of using the tree structure. Note that traversing through the +structure is covered in a different tutorial. +::: diff --git a/getting_started.md b/getting_started.md index 5f6f08ea..929e62b7 100644 --- a/getting_started.md +++ b/getting_started.md @@ -56,6 +56,8 @@ ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=4321) ts ``` +You can see that there are many thousands of trees in this tree sequence. + :::{note} Since we simulated the ancestry of 20 *diploid* individuals, our tree sequence contains 40 *sample nodes*, one for each genome. @@ -68,25 +70,61 @@ sequence. This process underlies many tree sequence algorithms, including those encounter later in this tutorial for calculating {ref}`population genetic statistics`. To iterate over your own tree sequence you can use -{meth}`TreeSequence.trees()`. Here, for example, is -how to check whether all trees in the tree sequence have fully coalesced (which is to be -expected in reverse-time, coalescent simulations, but not always for tree sequences -produced by {ref}`forward simulation `). +{meth}`TreeSequence.trees()`. + +```{code-cell} ipython3 +print(f"== Tree sequence has {ts.num_trees} trees ==") +for tree in ts.trees(): + print(f"Tree {tree.index} covers {tree.interval}") + if tree.index >= 4: + print("...") + break +print(f"Tree {ts.last().index} covers {ts.last().interval}") +``` + +Here we've also used {meth}`TreeSequence.last()` to access +the last tree directly; it may not surprise you that there's a corresponding +{meth}`TreeSequence.first()` method to return the first tree. + +:::{warning} +The code above doesn't store the tree anywhere, but merely performs a calculation on +each tree within the ``for`` loop. That's because, for efficiency, the +{meth}`trees()` method repeatedly returns the same tree object, +updated internally to reflect the (usually small) changes from tree to tree along the +sequence. For that reason, this won't work: + +``` +list(ts.trees()) # Don't do this! Every tree in the list will be an identical "null" tree +``` + +If you really do want separate instances of each tree (inefficient, and for large tree +sequences risks using up all your computer memory), you can use the +{meth}`TreeSequence.aslist()` method. +::: + +Above, we stopped iterating after Tree 4 to limit the printed output, but iterating +forwards through trees in a tree sequence (or indeed backwards using the standard Python +``reversed`` function) is efficient. That means it's quick, for example to check if all +the trees in a tree sequence have fully coalesced (which is to be expected in +reverse-time, coalescent simulations, but not always for tree sequences produced by +{ref}`forward simulation `). ```{code-cell} ipython3 +import time +start = time.time() for tree in ts.trees(): if tree.has_multiple_roots: print("Tree {tree.index} has not coalesced") break else: - print("All trees in the tree sequence have coalesced") + print(f"All {ts.num_trees} trees have coalesced. Checked in {time.time()-start} seconds") ``` -Since all the trees *have* coalesced, at each position in the genome this tree sequence -must have only one most recent common ancestor (MRCA) of the 40 sample nodes. Below, we -iterate over the trees again, finding the IDs of the root (MRCA) node for each tree. The +Now that we know all trees have coalesced, we know that at each position in the genome +all the 40 sample nodes must have one most recent common ancestor (MRCA). Below, we +iterate over the trees, finding the IDs of the root (MRCA) node for each tree. The time of this root node can be found via the {meth}`tskit.TreeSequence.node` method, which -returns a {class}`node object` with attributes including the node time: +returns a {class}`node object` whose attributes include the node time: ```{code-cell} ipython3 import matplotlib.pyplot as plt @@ -105,17 +143,21 @@ plt.show() ``` It's obvious that there's something unusual about the trees in the middle of this -chromosome, where the selective sweep occurred. It's not particularly efficient to pull -out a tree from the middle of a tree sequence, but it *can* be done, via the -{meth}`TreeSequence.at()` method. Here's the tree at the location -of the sweep, drawn using the {meth}`Tree.draw_svg()` method -(see the {ref}`visualization tutorial ` for more visualization -possibilities): +chromosome, where the selective sweep occurred. + +Currently, it's not particularly efficient to pull out individual trees from the middle +of a tree sequence +(please comment on [this issue](https://github.com/tskit-dev/tskit/issues/684) if you +have a need for this to be efficient) but it *can* be done, via the +{meth}`TreeSequence.at()` method. Here's the tree at location +$5\ 000\ 000$ --- the position of the sweep --- drawn using the +{meth}`Tree.draw_svg()` method (see the +{ref}`visualization tutorial ` for more visualization possibilities): ```{code-cell} ipython3 from IPython.display import SVG -swept_tree = ts.at(5_000_000) # or you can get e.g. the first tree using ts.first() +swept_tree = ts.at(5_000_000) # or you can get e.g. the nth tree using ts.at_index(n) intvl = swept_tree.interval print(f"Tree number {swept_tree.index}, which runs from position {intvl.left} to {intvl.right}:") # Draw it at a wide size, to make room for all 40 tips diff --git a/requirements.txt b/requirements.txt index c49632c0..c833bd18 100644 --- a/requirements.txt +++ b/requirements.txt @@ -3,4 +3,5 @@ matplotlib seaborn pandas numpy +networkx msprime>=1.0