Skip to content

Find edge for a given mutation #685

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jeromekelleher opened this issue Jun 18, 2020 · 8 comments
Closed

Find edge for a given mutation #685

jeromekelleher opened this issue Jun 18, 2020 · 8 comments
Labels
C API Issue is about the C API enhancement New feature or request Python API Issue is about the Python API

Comments

@jeromekelleher
Copy link
Member

In #668 we explored the possibility of associating an edge ID with a mutation instead of the most recent node. We decided that this would not be a good idea.

It would still be useful to be able to do this, though. It seems like it should be possible to do this using the existing indexes, which are defined as follows:

      range(M),
        key=lambda j: (edges[j].left, time[j], edges[j].parent, edges[j].child),
    )
    out_order = sorted(
        range(M),
        key=lambda j: (edges[j].right, -time[j], -edges[j].parent, -edges[j].child),
    )

A mutation is at a site, which has a given position. Can we find the edges that intersect with a given position efficiently using these indexes, or do we need a different index?

(This is basically the same question as efficiently seeking to a given tree, #684)

@molpopgen
Copy link
Member

Isn't this one a bit different from #684 because we know that we're only looking for a subset of nodes? A mutation on an edge is above the child node of that edge, etc.. So, given the mutation's position and node, we should be able to find the entries in each index that comprise the edge? That still doesn't give you the index to the edge table row, but that only matters if you want the metadata.

@jeromekelleher
Copy link
Member Author

I don't think that helps @molpopgen - think about the long edge example. We have an edge spanning (0, L) large tree sequence with a crap ton of edges. We have a mutation at position L/2 - finding the edges that start and end at L / 2 is of no help in finding the long edge. You'd have to do a linear sweep in worst case.

@molpopgen
Copy link
Member

molpopgen commented Jun 18, 2020

It looks like one needs a different sorting. Take this simulation:

import msprime

ts = msprime.simulate(100, recombination_rate = 10., mutation_rate=10., random_seed=1)

ts.dump("treefile.trees")

Find the edges via brute force:

import tskit

ts = tskit.load("treefile.trees")

for row, mnode in enumerate(ts.tables.mutations.node):
    mnode_time = ts.tables.nodes.time[mnode]
    mpos = ts.tables.sites.position[ts.tables.mutations.site[row]]
    for E in range(len(ts.tables.edges)):
        if ts.tables.edges.child[E] == mnode:
            if mpos >= ts.tables.edges.left[E] and mpos < ts.tables.edges.right[E]:
                c = ts.tables.edges.child[E]
                l = ts.tables.edges.left[E]
                r = ts.tables.edges.right[E]
                print(f"{mnode} {mnode_time} {mpos} | {E} {c} {l} {r}")

And then try to find them with binary search to do an "equal range" kinda search with a linear follow up:

#include <cstddef>
#include <cstdio>
#include <algorithm>
#include <iostream>
#include <tuple> // for std::tie
#include <tskit.h>

int
main(int argc, char **argv)
{
    tsk_table_collection_t tables;

    tsk_table_collection_init(&tables, 0);

    tsk_table_collection_load(&tables, "treefile.trees", 0);

    // NOTE: I am breaking the indexes now.

    std::sort(tables.indexes.edge_insertion_order,
        tables.indexes.edge_insertion_order + tables.edges.num_rows,
        [&tables](auto i, auto j) {
            return std::tie(tables.edges.child[i], tables.edges.left[i])
                   < std::tie(tables.edges.child[j], tables.edges.left[j]);
        });

    auto b = tables.indexes.edge_insertion_order;
    auto e = b + tables.edges.num_rows;
    for (int i = 0; i < tables.mutations.num_rows; ++i) {
        auto mnode = tables.mutations.node[i];
        auto mnode_time = tables.nodes.time[mnode];
        auto mpos = tables.sites.position[tables.mutations.site[i]];

        auto l = std::lower_bound(b, e, mnode, [&tables](int i, int j) {
            return tables.edges.child[i] < j; 
        });
        auto u = std::upper_bound(l, e, mnode, [&tables](int i, int j) {
            return i < tables.edges.child[j];
        });
        for (auto j = l; j < u; ++j) {
            auto c = tables.edges.child[*j];
            auto l = tables.edges.left[*j];
            auto r = tables.edges.right[*j];
            if (mpos >= l && mpos < r) {
                std::cout << mnode << ' ' << mnode_time << ' ' << mpos << " | " << *j
                          << ' ' << c << ' ' << ' ' << l << ' '
                          << r << '\n';
            }
        }
    }
}

We get the same results modulo rounding issues in the default prints.

Python:

kevin@pop-os:~/src/tskit/c$ python3 brute.py | head
143 0.024513978147499782 0.0022075352252396234 | 270 143 0.0 0.27172607864991905
257 0.49180389180866346 0.004779930578298082 | 607 257 0.002379634650978175 0.06532951352776249
244 0.2848507719340829 0.005414140198764588 | 425 244 0.002379634650978175 0.30973433198168276
238 0.2520305902995159 0.009142651786761517 | 557 238 0.0 0.052296985884266206
26 0.0 0.010235610461332683 | 391 26 0.0 0.10729980727361976
281 1.2453351951376384 0.013152798965435501 | 678 281 0.0 0.04681452131988819
287 1.9008354101022815 0.01590878161048165 | 680 287 0.0 0.04681452131988819
257 0.49180389180866346 0.02035719231006064 | 607 257 0.002379634650978175 0.06532951352776249
196 0.09521076845311771 0.021705921944213807 | 329 196 0.0 0.2290453513220524
287 1.9008354101022815 0.022522078214603933 | 680 287 0.0 0.04681452131988819
Traceback (most recent call last):
  File "brute.py", line 14, in <module>
    print(f"{mnode} {mnode_time} {mpos} | {E} {c} {l} {r}")
BrokenPipeError: [Errno 32] Broken pipe
kevin@pop-os:~/src/tskit/c$ python3 brute.py | tail
158 0.03297751984172531 0.978230091097973 | 322 158 0.7905863790750938 1.0
26 0.0 0.9784383524452529 | 454 26 0.9735316326780086 1.0
283 1.3639257637964965 0.9790288701729686 | 659 283 0.9627552900842287 1.0
211 0.1297209823650855 0.9851306977915976 | 464 211 0.5879051400320383 1.0
182 0.06982943775814676 0.9851828730791563 | 455 182 0.7693717178817306 1.0
262 0.5774718362345708 0.9872741755374664 | 571 262 0.8986005033130293 1.0
255 0.4376531947469375 0.9925291324113008 | 481 255 0.5849875618203049 1.0
108 0.001565437976074462 0.9963482566189535 | 367 108 0.914288541049298 1.0
236 0.2386056286352675 0.9975868605115437 | 570 236 0.914288541049298 1.0
283 1.3639257637964965 0.9980320577390316 | 659 283 0.9627552900842287 1.0

And the C++:

./search | head
143 0.024514 0.00220754 | 270 143  0 0.271726
257 0.491804 0.00477993 | 607 257  0.00237963 0.0653295
244 0.284851 0.00541414 | 425 244  0.00237963 0.309734
238 0.252031 0.00914265 | 557 238  0 0.052297
26 0 0.0102356 | 391 26  0 0.1073
281 1.24534 0.0131528 | 678 281  0 0.0468145
287 1.90084 0.0159088 | 680 287  0 0.0468145
257 0.491804 0.0203572 | 607 257  0.00237963 0.0653295
196 0.0952108 0.0217059 | 329 196  0 0.229045
287 1.90084 0.0225221 | 680 287  0 0.0468145
kevin@pop-os:~/src/tskit/c$ ./search | tail
158 0.0329775 0.97823 | 322 158  0.790586 1
26 0 0.978438 | 454 26  0.973532 1
283 1.36393 0.979029 | 659 283  0.962755 1
211 0.129721 0.985131 | 464 211  0.587905 1
182 0.0698294 0.985183 | 455 182  0.769372 1
262 0.577472 0.987274 | 571 262  0.898601 1
255 0.437653 0.992529 | 481 255  0.584988 1
108 0.00156544 0.996348 | 367 108  0.914289 1
236 0.238606 0.997587 | 570 236  0.914289 1
283 1.36393 0.998032 | 659 283  0.962755 1

Quick comments:

  • I'd forgotten that the tskit indexes were just arrays of indexes. That's not true in fwdpp, so withtskit, you can get back to edge metadata or whatnot.
  • The second linear search within [l, u) can also be done with a binary search, but I omitted it to keep it short.

For posterity, the Makefile:

CC=gcc
CXX=g++

CXXFLAGS=-I. -Isubprojects/kastore
CFLAGS=-I. -Isubprojects/kastore

TSK_OBJECTS=tskit/convert.o tskit/core.o tskit/genotypes.o tskit/haplotype_matching.o tskit/stats.o tskit/tables.o tskit/trees.o subprojects/kastore/kastore.o

all: search.o $(TSK_OBJECTS)
	$(CXX) -o search search.o $(TSK_OBJECTS)

@jeromekelleher
Copy link
Member Author

I see, so if we had another index where we sort by (child, left), we can find edge in log time because we can search for the node first. Then, as a node cannot be a child on two overlapping intervals, we can also avoid the general interval search case on the subset. That's clever!

I wonder if it's worth the extra index to support this query though, when we'll need the more general interval overlap query at some point for #684 anyway?

@molpopgen
Copy link
Member

I think that the more general solution is preferable, although this is a useful method to know about. It can be improved in a few ways, I think, too. For example, mapping unique mutation node to mutation row indexes would cut down the number of binary searches.

@jeromekelleher
Copy link
Member Author

There is a straightforward way to derive this information during the standard tree algorithm:

def algorithm_T(ts):
    sequence_length = ts.sequence_length
    edges = list(ts.edges())
    M = len(edges)
    in_order = ts.tables.indexes.edge_insertion_order
    out_order = ts.tables.indexes.edge_removal_order
    sites = ts.tables.sites
    mutations = ts.tables.mutations

    mutation_edge = np.zeros_like(ts.tables.mutations.node) - 1
    parent = np.zeros(ts.num_nodes, dtype=int) - 1
    # Map the child node to the ID of its edge in the current tree.
    node_edge_map = np.zeros(ts.num_nodes, dtype=int) - 1

    j = 0
    k = 0
    left = 0
    site_id = 0
    mutation_id = 0
    while j < M or left < sequence_length:
        while k < M and edges[out_order[k]].right == left:
            edge = edges[out_order[k]]
            parent[edge.child] = -1
            node_edge_map[edge.child] = -1
            k += 1
        while j < M and edges[in_order[j]].left == left:
            edge = edges[in_order[j]]
            node_edge_map[edge.child] = in_order[j]
            parent[edge.child] = edge.parent
            j += 1
        right = sequence_length
        if j < M:
            right = min(right, edges[in_order[j]].left)
        if k < M:
            right = min(right, edges[out_order[k]].right)

        while site_id < ts.num_sites and sites.position[site_id] < right:
            assert sites.position[site_id] >= left
            while (
                mutation_id < ts.num_mutations
                and mutations.site[mutation_id] == site_id
            ):
                mutation_edge[mutation_id] = node_edge_map[mutations.node[mutation_id]]
                mutation_id += 1

            site_id += 1

        yield (left, right), parent, mutation_edge
        left = right

Basically, we maintain a map of child node -> edge ID for each tree, and then use this map to fill out the mutation_edge map from mutation ID to edge id.

I suggest we add a edge attribute to the tsk_mutation_t struct, and fill this information out during the initialisation step of treeseq_t. Specifically, we'd update the tsk_treeseq_init_trees function to perform this logic, and store the edge ID for each mutation there. It requires a bit of extra memory during startup, but I think it's well worth it as this is a recurring headache.

We would be careful to make sure that any mutations that are not on edges (i.e. above roots or just not on tree nodes) are assigned -1.

Any thoughts?

@molpopgen
Copy link
Member

Seems straightforward!

@benjeffery
Copy link
Member

Very nice. Much simpler than going via the random-access-tree that was discussed before.

jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue May 17, 2022
jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue May 17, 2022
jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue May 18, 2022
jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue May 20, 2022
@mergify mergify bot closed this as completed in 4f675f0 May 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C API Issue is about the C API enhancement New feature or request Python API Issue is about the Python API
Projects
None yet
Development

No branches or pull requests

3 participants