Find edge for a given mutation #685

jeromekelleher · 2020-06-18T07:35:17Z

In #668 we explored the possibility of associating an edge ID with a mutation instead of the most recent node. We decided that this would not be a good idea.

It would still be useful to be able to do this, though. It seems like it should be possible to do this using the existing indexes, which are defined as follows:

      range(M),
        key=lambda j: (edges[j].left, time[j], edges[j].parent, edges[j].child),
    )
    out_order = sorted(
        range(M),
        key=lambda j: (edges[j].right, -time[j], -edges[j].parent, -edges[j].child),
    )

A mutation is at a site, which has a given position. Can we find the edges that intersect with a given position efficiently using these indexes, or do we need a different index?

(This is basically the same question as efficiently seeking to a given tree, #684)

molpopgen · 2020-06-18T18:17:25Z

Isn't this one a bit different from #684 because we know that we're only looking for a subset of nodes? A mutation on an edge is above the child node of that edge, etc.. So, given the mutation's position and node, we should be able to find the entries in each index that comprise the edge? That still doesn't give you the index to the edge table row, but that only matters if you want the metadata.

jeromekelleher · 2020-06-18T19:29:38Z

I don't think that helps @molpopgen - think about the long edge example. We have an edge spanning (0, L) large tree sequence with a crap ton of edges. We have a mutation at position L/2 - finding the edges that start and end at L / 2 is of no help in finding the long edge. You'd have to do a linear sweep in worst case.

molpopgen · 2020-06-18T21:01:55Z

It looks like one needs a different sorting. Take this simulation:

import msprime

ts = msprime.simulate(100, recombination_rate = 10., mutation_rate=10., random_seed=1)

ts.dump("treefile.trees")

Find the edges via brute force:

import tskit

ts = tskit.load("treefile.trees")

for row, mnode in enumerate(ts.tables.mutations.node):
    mnode_time = ts.tables.nodes.time[mnode]
    mpos = ts.tables.sites.position[ts.tables.mutations.site[row]]
    for E in range(len(ts.tables.edges)):
        if ts.tables.edges.child[E] == mnode:
            if mpos >= ts.tables.edges.left[E] and mpos < ts.tables.edges.right[E]:
                c = ts.tables.edges.child[E]
                l = ts.tables.edges.left[E]
                r = ts.tables.edges.right[E]
                print(f"{mnode} {mnode_time} {mpos} | {E} {c} {l} {r}")

And then try to find them with binary search to do an "equal range" kinda search with a linear follow up:

#include <cstddef>
#include <cstdio>
#include <algorithm>
#include <iostream>
#include <tuple> // for std::tie
#include <tskit.h>

int
main(int argc, char **argv)
{
    tsk_table_collection_t tables;

    tsk_table_collection_init(&tables, 0);

    tsk_table_collection_load(&tables, "treefile.trees", 0);

    // NOTE: I am breaking the indexes now.

    std::sort(tables.indexes.edge_insertion_order,
        tables.indexes.edge_insertion_order + tables.edges.num_rows,
        [&tables](auto i, auto j) {
            return std::tie(tables.edges.child[i], tables.edges.left[i])
                   < std::tie(tables.edges.child[j], tables.edges.left[j]);
        });

    auto b = tables.indexes.edge_insertion_order;
    auto e = b + tables.edges.num_rows;
    for (int i = 0; i < tables.mutations.num_rows; ++i) {
        auto mnode = tables.mutations.node[i];
        auto mnode_time = tables.nodes.time[mnode];
        auto mpos = tables.sites.position[tables.mutations.site[i]];

        auto l = std::lower_bound(b, e, mnode, [&tables](int i, int j) {
            return tables.edges.child[i] < j; 
        });
        auto u = std::upper_bound(l, e, mnode, [&tables](int i, int j) {
            return i < tables.edges.child[j];
        });
        for (auto j = l; j < u; ++j) {
            auto c = tables.edges.child[*j];
            auto l = tables.edges.left[*j];
            auto r = tables.edges.right[*j];
            if (mpos >= l && mpos < r) {
                std::cout << mnode << ' ' << mnode_time << ' ' << mpos << " | " << *j
                          << ' ' << c << ' ' << ' ' << l << ' '
                          << r << '\n';
            }
        }
    }
}

We get the same results modulo rounding issues in the default prints.

Python:

kevin@pop-os:~/src/tskit/c$ python3 brute.py | head
143 0.024513978147499782 0.0022075352252396234 | 270 143 0.0 0.27172607864991905
257 0.49180389180866346 0.004779930578298082 | 607 257 0.002379634650978175 0.06532951352776249
244 0.2848507719340829 0.005414140198764588 | 425 244 0.002379634650978175 0.30973433198168276
238 0.2520305902995159 0.009142651786761517 | 557 238 0.0 0.052296985884266206
26 0.0 0.010235610461332683 | 391 26 0.0 0.10729980727361976
281 1.2453351951376384 0.013152798965435501 | 678 281 0.0 0.04681452131988819
287 1.9008354101022815 0.01590878161048165 | 680 287 0.0 0.04681452131988819
257 0.49180389180866346 0.02035719231006064 | 607 257 0.002379634650978175 0.06532951352776249
196 0.09521076845311771 0.021705921944213807 | 329 196 0.0 0.2290453513220524
287 1.9008354101022815 0.022522078214603933 | 680 287 0.0 0.04681452131988819
Traceback (most recent call last):
  File "brute.py", line 14, in <module>
    print(f"{mnode} {mnode_time} {mpos} | {E} {c} {l} {r}")
BrokenPipeError: [Errno 32] Broken pipe
kevin@pop-os:~/src/tskit/c$ python3 brute.py | tail
158 0.03297751984172531 0.978230091097973 | 322 158 0.7905863790750938 1.0
26 0.0 0.9784383524452529 | 454 26 0.9735316326780086 1.0
283 1.3639257637964965 0.9790288701729686 | 659 283 0.9627552900842287 1.0
211 0.1297209823650855 0.9851306977915976 | 464 211 0.5879051400320383 1.0
182 0.06982943775814676 0.9851828730791563 | 455 182 0.7693717178817306 1.0
262 0.5774718362345708 0.9872741755374664 | 571 262 0.8986005033130293 1.0
255 0.4376531947469375 0.9925291324113008 | 481 255 0.5849875618203049 1.0
108 0.001565437976074462 0.9963482566189535 | 367 108 0.914288541049298 1.0
236 0.2386056286352675 0.9975868605115437 | 570 236 0.914288541049298 1.0
283 1.3639257637964965 0.9980320577390316 | 659 283 0.9627552900842287 1.0

And the C++:

./search | head
143 0.024514 0.00220754 | 270 143  0 0.271726
257 0.491804 0.00477993 | 607 257  0.00237963 0.0653295
244 0.284851 0.00541414 | 425 244  0.00237963 0.309734
238 0.252031 0.00914265 | 557 238  0 0.052297
26 0 0.0102356 | 391 26  0 0.1073
281 1.24534 0.0131528 | 678 281  0 0.0468145
287 1.90084 0.0159088 | 680 287  0 0.0468145
257 0.491804 0.0203572 | 607 257  0.00237963 0.0653295
196 0.0952108 0.0217059 | 329 196  0 0.229045
287 1.90084 0.0225221 | 680 287  0 0.0468145
kevin@pop-os:~/src/tskit/c$ ./search | tail
158 0.0329775 0.97823 | 322 158  0.790586 1
26 0 0.978438 | 454 26  0.973532 1
283 1.36393 0.979029 | 659 283  0.962755 1
211 0.129721 0.985131 | 464 211  0.587905 1
182 0.0698294 0.985183 | 455 182  0.769372 1
262 0.577472 0.987274 | 571 262  0.898601 1
255 0.437653 0.992529 | 481 255  0.584988 1
108 0.00156544 0.996348 | 367 108  0.914289 1
236 0.238606 0.997587 | 570 236  0.914289 1
283 1.36393 0.998032 | 659 283  0.962755 1

Quick comments:

I'd forgotten that the tskit indexes were just arrays of indexes. That's not true in fwdpp, so withtskit, you can get back to edge metadata or whatnot.
The second linear search within [l, u) can also be done with a binary search, but I omitted it to keep it short.

For posterity, the Makefile:

CC=gcc
CXX=g++

CXXFLAGS=-I. -Isubprojects/kastore
CFLAGS=-I. -Isubprojects/kastore

TSK_OBJECTS=tskit/convert.o tskit/core.o tskit/genotypes.o tskit/haplotype_matching.o tskit/stats.o tskit/tables.o tskit/trees.o subprojects/kastore/kastore.o

all: search.o $(TSK_OBJECTS)
	$(CXX) -o search search.o $(TSK_OBJECTS)

jeromekelleher · 2020-06-19T07:23:15Z

I see, so if we had another index where we sort by (child, left), we can find edge in log time because we can search for the node first. Then, as a node cannot be a child on two overlapping intervals, we can also avoid the general interval search case on the subset. That's clever!

I wonder if it's worth the extra index to support this query though, when we'll need the more general interval overlap query at some point for #684 anyway?

molpopgen · 2020-06-19T14:25:45Z

I think that the more general solution is preferable, although this is a useful method to know about. It can be improved in a few ways, I think, too. For example, mapping unique mutation node to mutation row indexes would cut down the number of binary searches.

jeromekelleher · 2022-05-17T08:52:11Z

There is a straightforward way to derive this information during the standard tree algorithm:

def algorithm_T(ts):
    sequence_length = ts.sequence_length
    edges = list(ts.edges())
    M = len(edges)
    in_order = ts.tables.indexes.edge_insertion_order
    out_order = ts.tables.indexes.edge_removal_order
    sites = ts.tables.sites
    mutations = ts.tables.mutations

    mutation_edge = np.zeros_like(ts.tables.mutations.node) - 1
    parent = np.zeros(ts.num_nodes, dtype=int) - 1
    # Map the child node to the ID of its edge in the current tree.
    node_edge_map = np.zeros(ts.num_nodes, dtype=int) - 1

    j = 0
    k = 0
    left = 0
    site_id = 0
    mutation_id = 0
    while j < M or left < sequence_length:
        while k < M and edges[out_order[k]].right == left:
            edge = edges[out_order[k]]
            parent[edge.child] = -1
            node_edge_map[edge.child] = -1
            k += 1
        while j < M and edges[in_order[j]].left == left:
            edge = edges[in_order[j]]
            node_edge_map[edge.child] = in_order[j]
            parent[edge.child] = edge.parent
            j += 1
        right = sequence_length
        if j < M:
            right = min(right, edges[in_order[j]].left)
        if k < M:
            right = min(right, edges[out_order[k]].right)

        while site_id < ts.num_sites and sites.position[site_id] < right:
            assert sites.position[site_id] >= left
            while (
                mutation_id < ts.num_mutations
                and mutations.site[mutation_id] == site_id
            ):
                mutation_edge[mutation_id] = node_edge_map[mutations.node[mutation_id]]
                mutation_id += 1

            site_id += 1

        yield (left, right), parent, mutation_edge
        left = right

Basically, we maintain a map of child node -> edge ID for each tree, and then use this map to fill out the mutation_edge map from mutation ID to edge id.

I suggest we add a edge attribute to the tsk_mutation_t struct, and fill this information out during the initialisation step of treeseq_t. Specifically, we'd update the tsk_treeseq_init_trees function to perform this logic, and store the edge ID for each mutation there. It requires a bit of extra memory during startup, but I think it's well worth it as this is a recurring headache.

We would be careful to make sure that any mutations that are not on edges (i.e. above roots or just not on tree nodes) are assigned -1.

Any thoughts?

molpopgen · 2022-05-17T15:14:03Z

Seems straightforward!

benjeffery · 2022-05-17T16:58:10Z

Very nice. Much simpler than going via the random-access-tree that was discussed before.

Closes tskit-dev#685

jeromekelleher mentioned this issue Jun 18, 2020

Index required for efficient random access for trees #684

Closed

benjeffery added enhancement New feature or request C API Issue is about the C API Python API Issue is about the Python API labels Sep 29, 2020

jeromekelleher mentioned this issue Sep 29, 2020

Add edge ids to the SVG representation of a tree #557

Open

hyanwong mentioned this issue Mar 17, 2021

Plot mutations on a single tree even if they are outside the tree span #1253

Closed

jeromekelleher mentioned this issue May 17, 2022

Add TableCollection.split_edges #2276

Closed

jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue May 17, 2022

Add mutation.edge for tsk_treeseq accessors.

726a28b

Closes tskit-dev#685

jeromekelleher mentioned this issue May 17, 2022

Add mutation.edge for tsk_treeseq accessors. #2279

Merged

jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue May 17, 2022

Add mutation.edge for tsk_treeseq accessors.

8b4ffae

Closes tskit-dev#685

jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue May 18, 2022

Add mutation.edge for tsk_treeseq accessors.

d88090b

Closes tskit-dev#685

jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue May 20, 2022

Add mutation.edge for tsk_treeseq accessors.

82afcde

Closes tskit-dev#685

mergify bot closed this as completed in 4f675f0 May 20, 2022

hyanwong mentioned this issue May 27, 2022

Add edge to Tree arrays #2304

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Find edge for a given mutation #685

Find edge for a given mutation #685

jeromekelleher commented Jun 18, 2020

molpopgen commented Jun 18, 2020

Uh oh!

jeromekelleher commented Jun 18, 2020

Uh oh!

molpopgen commented Jun 18, 2020 •

edited

Loading

Uh oh!

jeromekelleher commented Jun 19, 2020

Uh oh!

molpopgen commented Jun 19, 2020

Uh oh!

jeromekelleher commented May 17, 2022

Uh oh!

molpopgen commented May 17, 2022

Uh oh!

benjeffery commented May 17, 2022

Uh oh!

Find edge for a given mutation #685

Find edge for a given mutation #685

Comments

jeromekelleher commented Jun 18, 2020

molpopgen commented Jun 18, 2020

Uh oh!

jeromekelleher commented Jun 18, 2020

Uh oh!

molpopgen commented Jun 18, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher commented Jun 19, 2020

Uh oh!

molpopgen commented Jun 19, 2020

Uh oh!

jeromekelleher commented May 17, 2022

Uh oh!

molpopgen commented May 17, 2022

Uh oh!

benjeffery commented May 17, 2022

Uh oh!

molpopgen commented Jun 18, 2020 •

edited

Loading