Find edge for a given mutation #685
Comments
Isn't this one a bit different from #684, because we know that we're only looking for a subset of nodes? A mutation on an edge is above the child node of that edge, and so on. So, given the mutation's position and node, we should be able to find the entries in each index that comprise the edge. That still doesn't give you the index of the edge table row, but that only matters if you want the metadata.
I don't think that helps, @molpopgen. Think about the long-edge example: we have an edge spanning (0, L) in a large tree sequence with a crap ton of edges, and a mutation at position L/2. Finding the edges that start and end at L/2 is of no help in finding the long edge. You'd have to do a linear sweep in the worst case.
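To make the long-edge point concrete, here is a minimal sketch on a toy edge list (plain tuples, not tskit tables; all values are made up for illustration). Looking only at edges whose endpoints touch the mutation position misses the long edge, while a linear scan over all edges finds it.

```python
# Toy edges as (left, right, parent, child); values are illustrative only.
L = 100.0
edges = [(0.0, L, 10, 0)]  # edge 0: the long edge spanning (0, L)
# Many short edges above another node, each one unit wide.
edges += [(float(x), float(x + 1), 11 + x % 2, 1) for x in range(int(L))]

position = L / 2  # mutation position

# Edges that start or end exactly at the position.
by_endpoint = [
    e for e, (left, right, _, _) in enumerate(edges)
    if left == position or right == position
]

# Edges that actually cover the position: requires a linear sweep here.
covering = [
    e for e, (left, right, _, _) in enumerate(edges)
    if left <= position < right
]

print(0 in by_endpoint, 0 in covering)
```

The long edge (ID 0) only shows up in the second list, which is the case the existing start/end orderings cannot answer without scanning.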
It looks like one needs a different sorting. Take this simulation:

    import msprime

    ts = msprime.simulate(100, recombination_rate=10.0, mutation_rate=10.0, random_seed=1)
    ts.dump("treefile.trees")

Find the edges via brute force:

    import tskit

    ts = tskit.load("treefile.trees")
    # Take one copy of the tables up front; each ts.tables access makes a new copy.
    tables = ts.tables
    for row, mnode in enumerate(tables.mutations.node):
        mnode_time = tables.nodes.time[mnode]
        mpos = tables.sites.position[tables.mutations.site[row]]
        # Linear scan over every edge for every mutation.
        for E in range(len(tables.edges)):
            if tables.edges.child[E] == mnode:
                if mpos >= tables.edges.left[E] and mpos < tables.edges.right[E]:
                    c = tables.edges.child[E]
                    l = tables.edges.left[E]
                    r = tables.edges.right[E]
                    print(f"{mnode} {mnode_time} {mpos} | {E} {c} {l} {r}")

And then try to find them with binary search to do an "equal range" kinda search with a linear follow up:

    #include <cstddef>
    #include <cstdio>
    #include <algorithm>
    #include <iostream>
    #include <tuple> // for std::tie
    #include <tskit.h>

    int
    main(int argc, char **argv)
    {
        tsk_table_collection_t tables;
        tsk_table_collection_init(&tables, 0);
        tsk_table_collection_load(&tables, "treefile.trees", 0);
        // NOTE: I am breaking the indexes now. edge_insertion_order is
        // re-sorted in place by (child, left).
        std::sort(tables.indexes.edge_insertion_order,
            tables.indexes.edge_insertion_order + tables.edges.num_rows,
            [&tables](auto i, auto j) {
                return std::tie(tables.edges.child[i], tables.edges.left[i])
                       < std::tie(tables.edges.child[j], tables.edges.left[j]);
            });
        auto b = tables.indexes.edge_insertion_order;
        auto e = b + tables.edges.num_rows;
        for (tsk_size_t i = 0; i < tables.mutations.num_rows; ++i) {
            auto mnode = tables.mutations.node[i];
            auto mnode_time = tables.nodes.time[mnode];
            auto mpos = tables.sites.position[tables.mutations.site[i]];
            // Binary search for the range of edges whose child is mnode.
            auto l = std::lower_bound(b, e, mnode, [&tables](int edge, int node) {
                return tables.edges.child[edge] < node;
            });
            auto u = std::upper_bound(l, e, mnode, [&tables](int node, int edge) {
                return node < tables.edges.child[edge];
            });
            // Linear follow-up over that range to find the covering interval.
            for (auto j = l; j < u; ++j) {
                auto c = tables.edges.child[*j];
                auto left = tables.edges.left[*j];
                auto right = tables.edges.right[*j];
                if (mpos >= left && mpos < right) {
                    std::cout << mnode << ' ' << mnode_time << ' ' << mpos << " | " << *j
                              << ' ' << c << ' ' << left << ' '
                              << right << '\n';
                }
            }
        }
        tsk_table_collection_free(&tables);
        return 0;
    }

We get the same results modulo rounding issues in the default prints. Python:
And the C++:
Quick comments:
For posterity, the
I see, so if we had another index where we sort by (child, left), we can find the edge in log time because we can search for the node first. Then, as a node cannot be a child on two overlapping intervals, we can also avoid the general interval search case on the subset. That's clever! I wonder if it's worth the extra index to support this query, though, when we'll need the more general interval overlap query at some point for #684 anyway?
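A rough sketch of that (child, left) lookup in plain Python, assuming a toy edge list rather than real tskit tables. The names `child_left_order` and `find_edge` are hypothetical and not part of tskit. Because a node's edges never overlap, a single `bisect_right` on the (child, left) key lands on the only candidate edge, and one interval check finishes the job.

```python
from bisect import bisect_right

# Toy edges as (left, right, parent, child); values are illustrative only.
edges = [
    (0.0, 10.0, 4, 0),  # long edge spanning the whole sequence
    (0.0, 3.0, 4, 1),
    (3.0, 10.0, 5, 1),
    (0.0, 6.0, 5, 2),
    (6.0, 10.0, 4, 2),
    (0.0, 10.0, 6, 4),
]

# Hypothetical extra index: edge IDs sorted by (child, left).
child_left_order = sorted(
    range(len(edges)), key=lambda e: (edges[e][3], edges[e][0])
)
keys = [(edges[e][3], edges[e][0]) for e in child_left_order]


def find_edge(node, position):
    """Return the ID of the edge above `node` covering `position`, or -1."""
    # Last index entry whose (child, left) key is <= (node, position).
    i = bisect_right(keys, (node, position)) - 1
    if i < 0:
        return -1
    e = child_left_order[i]
    left, right, _, child = edges[e]
    if child == node and left <= position < right:
        return e
    return -1  # node is a root here, or not in the tree at this position


print(find_edge(0, 5.0), find_edge(1, 3.0), find_edge(6, 1.0))
```

The -1 return covers both mutations above roots and nodes with no edge at that position.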
I think that the more general solution is preferable, although this is a useful method to know about. It can be improved in a few ways too, I think. For example, mapping each unique mutation node to its mutation row indexes would cut down the number of binary searches.
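One way to read the node-to-rows suggestion, as a small sketch on a made-up `mutations.node` column (the name `rows_by_node` is hypothetical): group the mutation rows by node first, so the search for a node's edge range runs once per unique node rather than once per mutation.

```python
from collections import defaultdict

# Toy stand-in for the mutations.node column; values are illustrative only.
mutation_node = [3, 7, 3, 3, 9, 7]

# Map each unique mutation node to the mutation rows that sit on it.
rows_by_node = defaultdict(list)
for row, node in enumerate(mutation_node):
    rows_by_node[node].append(row)

# One search per key of rows_by_node would then serve all of its rows.
print(len(mutation_node), "mutations,", len(rows_by_node), "unique nodes")
```

How much this saves depends on how many mutations share a node, which is data dependent.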
There is a straightforward way to derive this information during the standard tree algorithm:

    import numpy as np


    def algorithm_T(ts):
        sequence_length = ts.sequence_length
        edges = list(ts.edges())
        M = len(edges)
        in_order = ts.tables.indexes.edge_insertion_order
        out_order = ts.tables.indexes.edge_removal_order
        sites = ts.tables.sites
        mutations = ts.tables.mutations
        mutation_edge = np.zeros_like(ts.tables.mutations.node) - 1
        parent = np.zeros(ts.num_nodes, dtype=int) - 1
        # Map the child node to the ID of its edge in the current tree.
        node_edge_map = np.zeros(ts.num_nodes, dtype=int) - 1
        j = 0
        k = 0
        left = 0
        site_id = 0
        mutation_id = 0
        while j < M or left < sequence_length:
            # Remove the edges that end at the left of this tree.
            while k < M and edges[out_order[k]].right == left:
                edge = edges[out_order[k]]
                parent[edge.child] = -1
                node_edge_map[edge.child] = -1
                k += 1
            # Insert the edges that start at the left of this tree.
            while j < M and edges[in_order[j]].left == left:
                edge = edges[in_order[j]]
                node_edge_map[edge.child] = in_order[j]
                parent[edge.child] = edge.parent
                j += 1
            right = sequence_length
            if j < M:
                right = min(right, edges[in_order[j]].left)
            if k < M:
                right = min(right, edges[out_order[k]].right)
            # Assign an edge to every mutation at a site within this tree.
            while site_id < ts.num_sites and sites.position[site_id] < right:
                assert sites.position[site_id] >= left
                while (
                    mutation_id < ts.num_mutations
                    and mutations.site[mutation_id] == site_id
                ):
                    mutation_edge[mutation_id] = node_edge_map[mutations.node[mutation_id]]
                    mutation_id += 1
                site_id += 1
            yield (left, right), parent, mutation_edge
            left = right

Basically, we maintain a map of child node -> edge ID for each tree, and then use this map to fill out the I suggest we add a We would be careful to make sure that any mutations that are not on edges (i.e. above roots or just not on tree nodes) are assigned -1. Any thoughts?
Seems straightforward!
Very nice. Much simpler than going via the random-access-tree that was discussed before.
In #668 we explored the possibility of associating an edge ID with a mutation instead of the most recent node. We decided that this would not be a good idea.
It would still be useful to be able to do this, though. It seems like it should be possible to do this using the existing indexes, which are defined as follows:
A mutation is at a site, which has a given position. Can we find the edges that intersect with a given position efficiently using these indexes, or do we need a different index?
(This is basically the same question as efficiently seeking to a given tree, #684)