Random split polytomy #815


Merged 2 commits on Nov 21, 2020

Conversation

hyanwong
Member

Fixes #809

@hyanwong
Member Author

hyanwong commented Aug 28, 2020

Currently only a draft, as there are basically no tests, rather a long method name (but I feel we need the word "random" in there, since AFAIK this is the only routine in tskit which involves randomness), and I only have a horribly inefficient (and untested) way to check the time of mutations above child nodes. I'm sure the latter could be optimised - suggestions for approaches are welcome.

@benjeffery
Member

benjeffery commented Aug 28, 2020

Great stuff - haven't reviewed the code yet, but one question is where we get the RNG on the C side (if indeed we intend to implement this C-side).

@hyanwong
Member Author

hyanwong commented Sep 3, 2020

What RNG is used in msprime on the C side? If we do want a C side implementation, I assume we just copy that. Having said which, profiling the code suggests (surprisingly) that the majority of the time is spent in table.add_row(), so a C implementation would mainly be to allow C users to access the functionality.

@benjeffery
Member

benjeffery commented Sep 3, 2020

What RNG is used in msprime on the C side?

GSL is used, which we can't use here due to licensing restrictions (it's GPL whereas we're MIT), hence it being an issue. One idea we had for another method was to pass in a random array, but I'm guessing here we don't know how many random numbers will be needed? In the other use case it was known.

@benjeffery
Member

the majority of the time is spent in table.add_row()

I'd assume that most of the time is spent in the python machinery of add_row so a C implementation would still be faster, but I assume this doesn't take that long and is roughly O(num_polytomies)?

@grahamgower
Member

PCG family of random generators could be an option: https://www.pcg-random.org/download.html

@hyanwong
Member Author

hyanwong commented Sep 4, 2020

PCG family of random generators could be an option: https://www.pcg-random.org/download.html

Numpy currently uses PCG64 as its default PRNG, which is good enough for me!

@hyanwong force-pushed the random-split-polytomy branch 2 times, most recently from 3676c5a to 9cb5dbb on September 11, 2020 13:16
@AdminBot-tskit
Collaborator

📖 Docs for this PR can be previewed here

@hyanwong force-pushed the random-split-polytomy branch from 489eb43 to 24751ea on October 19, 2020 13:36
@hyanwong force-pushed the random-split-polytomy branch 3 times, most recently from 86541d5 to 7e62990 on November 17, 2020 21:35
@hyanwong marked this pull request as ready for review November 17, 2020 21:36
@hyanwong force-pushed the random-split-polytomy branch 2 times, most recently from 99b2362 to 5fbdd65 on November 17, 2020 22:06
@hyanwong
Member Author

@benjeffery - I've revised this after discussion with Jerome, so that the code is placed in combinatorics.py and doesn't clog up trees.py. I've also removed the ability to modify the epsilon value, and now we just error out if epsilon is too large. There's a helpful error message if this is because of node times, and a less helpful one (at the end of the function) which traps messages about mutation times and raises an error mentioning epsilon. I'm not sure this is the right way to do it, but it avoids having to explicitly check for mutation times, which was a pain in the previous version. The RNG is the one from numpy, so porting this to C might be less obvious. The only "method" is "random" for the moment, but I think it's worth having that argument in as it emphasises the fact that by default there is a stochastic component to this algorithm.

I've added Jerome as a reviewer, as much of this was his suggestion for reworking, but I don't know how much he wants to check over it all?

@benjeffery
Member

I've just started reviewing.

Member

@jeromekelleher left a comment


It basically looks good, although I haven't really grokked the algorithm. This is quite a tricky thing, which I would imagine needs quite a lot of validation and testing to make sure it has the right properties in all situations.

My inclination would be to merge this ASAP but "undocument" it ahead of the upcoming release. That way, we can use it internally for our own projects without committing ourselves to any particular API or semantics, and also give us a chance to consider how we'll validate stuff like this which we can't just do simple unit tests on.

@jeromekelleher
Member

I started out thinking about an adaptive epsilon like this, but it's hard to do because of mutation times, I think? There's no easy way to get the time of the oldest mutation above a node, without actually just going through the mutations one by one. I think we said that would probably not be too much of a hit, but it seemed easier to just require epsilon to be set manually.

It's easy enough to get the sites and mutations - we just iterate through them all for the tree. We're already doing a O(n) traversal of the nodes, so there's no point in worrying about efficiency there.
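The per-tree scan described here could look roughly like this (a sketch against an assumed tskit-like interface, using `tree.sites()`, `mutation.node` and `mutation.time`; `oldest_mutation_above` is a hypothetical helper name, and it assumes mutation times are populated rather than unknown):

```python
def oldest_mutation_above(tree):
    """
    Return a dict mapping node ID -> the oldest mutation time directly
    above that node in this tree, in a single pass over the mutations.
    """
    oldest = {}
    for site in tree.sites():
        for mutation in site.mutations:
            t = mutation.time
            if mutation.node not in oldest or t > oldest[mutation.node]:
                oldest[mutation.node] = t
    return oldest
```

Since the split already does an O(n) traversal of the nodes, an extra O(num_mutations) pass like this shouldn't change the overall complexity.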

implementation of Algorithm R, because we are interested in
the leaf node labellings.

The pre-fasicle text is available here, page 16:
Member Author


I've never heard of a fasicle?! Looks like it has an extra c in the spelling anyway!

@hyanwong
Member Author

I started out thinking about an adaptive epsilon like this, but it's hard to do because of mutation times, I think? There's no easy way to get the time of the oldest mutation above a node, without actually just going through the mutations one by one. I think we said that would probably not be too much of a hit, but it seemed easier to just require epsilon to be set manually.

It's easy enough to get the sites and mutations - we just iterate through them all for the tree. We're already doing a O(n) traversal of the nodes, so there's no point in worrying about efficiency there.

OK, so we'll presumably have to add that logic and space the nodes out evenly between the oldest mutation above a child, or, if there are no mutation times, the oldest child node time?

Member

@brianzhang01 left a comment


I had these comments queued up from a review I started but didn't finish. Hope the comments still show up (I don't even know what they are at this moment, haven't figured out how to view them).

    and details of parameters.
    """
    if epsilon is None:
        epsilon = 1e-10
Member


It seems like the advantage of this design, rather than just putting epsilon = 1e-10 as the default, is that otherwise we'd need to change it in trees.py as well?

existing_node_time = node_table.time

# We can save a lot of effort if we don't need to check the time of mutations
# We definitely don't need to check on the first iteration, a
Member


Incomplete comment

    ts = tables.tree_sequence()
except tskit.LibraryError as e:
    msg = str(e)
    if msg.startswith(
Member Author


If we are checking on the time of children or mutations above children, then I think it would be better to raise these errors when we know we are going to violate the time constraints, as then we can say which nodes or mutations are causing the problems, which is much more helpful to the user.

I didn't do this for mutation times because I removed the code for checking mutation times, but if we are going to reinstate it, we should raise the error there.

if x.parent is not None:
    index = x.parent.children.index(x)
    x.parent.children[index] = internal
rng.shuffle(internal.children)
Member Author

@hyanwong Nov 19, 2020


I must admit that I don't quite understand why this extra shuffle is in there? Won't the first rng.choice make the selection random anyway?

Member Author

@hyanwong Nov 19, 2020


(by the way, I think this is essentially the same edge-addition algorithm that I originally implemented, but you are right that this one is neater - I think as a result of focussing on nodes rather than focussing on edges)

Member


We choose a node x to modify randomly in the first step, adding a new internal node which will be the parent of x and the n'th leaf. We must randomise the order of [x, n] because the leaf would always be either the left or right child, which would (I think) mean we don't generate all possible trees.

I'll add a few comments to clarify, as it's a simple algorithm and worth explaining for future reference.
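For the record, the growth step described above can be sketched as a standalone toy (an illustrative rewrite, not the tskit code; `Node`, `random_binary_tree` and `newick` are hypothetical names):

```python
import random


class Node:
    def __init__(self, label=None, children=None):
        self.label = label
        self.children = children if children is not None else []
        self.parent = None


def random_binary_tree(labels, seed=None):
    # Grow a random leaf-labelled binary tree one leaf at a time: pick a
    # uniform random existing node x, then insert a new internal node
    # above x whose (shuffled) children are x and the new leaf.
    rng = random.Random(seed)
    root = Node(label=labels[0])
    all_nodes = [root]
    for label in labels[1:]:
        x = rng.choice(all_nodes)
        leaf = Node(label=label)
        internal = Node(children=[x, leaf])
        rng.shuffle(internal.children)  # the extra shuffle discussed here
        internal.parent = x.parent
        if x.parent is not None:
            index = x.parent.children.index(x)
            x.parent.children[index] = internal
        else:
            root = internal
        x.parent = internal
        leaf.parent = internal
        all_nodes.extend([leaf, internal])
    return root


def newick(node):
    # Minimal newick-style rendering, for eyeballing the result.
    if not node.children:
        return str(node.label)
    return "(" + ",".join(newick(child) for child in node.children) + ")"
```

For example, `newick(random_binary_tree(["a", "b", "c", "d"], seed=1))` returns one of the 15 possible leaf-labelled topologies on four leaves.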

Member Author

@hyanwong Nov 19, 2020


I think it does generate all possible trees, even if the child order isn't shuffled. At least, it still did in my tests when I removed that line (only tested for polytomies up to degree 6, though). And from reading the Knuth explanation, it doesn't seem necessary (I think?)

@brianzhang01
Member

Having read through the discussion...

  • I am partial towards Jerome's suggestion to resolve individual trees on-the-fly. It's simpler and the KC metric operates tree-wise and then aggregates, so it would make for conceptually simpler KC code as well. The maintenance burden would be lower as more people would understand the code. Examples of maintenance include non-random resolution and different distributions of randomly resolving trees (e.g. rtree vs. rcoal resolution in ape: https://www.rdocumentation.org/packages/ape/versions/5.4-1/topics/rtree, current behavior is rtree)
  • It seems like Yan's function would still have a nice place to live: tsinfer. Besides metrics, what is nice about having and serializing a tree sequence with polytomies resolved?
  • The coiterate function in tsutil looks useful for metrics purposes
  • For what it's worth, I currently have a KC with random resolution of polytomies that more-or-less resolves trees-on-the-fly and computes the KC between two trees at a particular position. Then I sample many positions and compute a Monte Carlo-style average. It's written in my own C++ codebase. The point here is that to cut down on the time of computing a tree-sequence-wide KC, this might be an option, but you'd want to measure the effects of the Monte Carlo approximation.

@hyanwong
Member Author

hyanwong commented Nov 19, 2020

Having read through the discussion...

  • I am partial towards Jerome's suggestion to resolve individual trees on-the-fly. It's simpler and the KC metric operates tree-wise and then aggregates, so it would make for conceptually simpler KC code as well.

But I don't think that's what the KC metric on a tree sequence does, @brianzhang01. Daniel coded up an incremental algorithm, didn't he? I'm afraid that I still think that it's a bit nuts to re-resolve the polytomy every time anything anywhere on the tree changes: it doesn't feel like a natural thing to do, because the tree boundaries are really nothing to do with the polytomy itself. I suspect that it also won't scale to large trees, where pretty much every base could cause a tree change somewhere in the TS, probably miles away from the polytomy you are trying to resolve?

Having said which, I can see that for metric calculations, it's a convenient way to remove some non-independence in the calculations.

The coiterate function in tsutil looks useful for metrics purposes

Yes, I didn't know about this. It's great! Should it be extended to multiple tree sequences, do you think?

parent of the edge on which they lie. If ``epsilon`` is not small enough,
compared to the distance between a polytomy and its oldest child (or oldest
child mutation) these requirements may not be met. In this case an error is
raised, recommending a smaller epsilon value be used.
Member


An alternative to epsilon would be to use https://numpy.org/doc/stable/reference/generated/numpy.nextafter.html. That should give finer resolution than trying to add a small floating point value.
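A standalone illustration of what `nextafter` buys, and its catch (not part of this PR's code):

```python
import numpy as np

t = 1.0  # a hypothetical parent node time
t_child = np.nextafter(t, -np.inf)  # largest representable float below t
print(t - t_child)  # ~1.1e-16, far finer than the 1e-10 epsilon default

# The catch: there is no representable time left between t_child and t,
# so nothing (e.g. a mutation) can be wedged in between them later.
assert np.nextafter(t_child, np.inf) == t
```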

Member Author


I didn't know about nextafter. This is nice, but I'm cautious about using it, because we have been bitten in the past by creating nodes fractionally smaller than something, and then not being able to wedge in extra stuff later on (e.g. mutations). I can't see why you might want to do this in this case, but I'm still wary. Anyway, Jerome's suggestion of defaulting to an even spread of times seems good to me.

Member


I agree with Yan here - if folks want to have a super-fine epsilon then they can provide it as a parameter, but as a default nextafter is a bit tricky. In practice 1e-10 is a reasonable default, I think, leaving enough space in between nodes for mutations (if they are needed) while still being very small.

I've changed my mind on the auto-spacing after thinking more about it @hyanwong - it'll be quite hard to test, and actually you probably do want super short branches on the randomly resolved polytomies.

@brianzhang01
Member

But I don't think that's what the KC metric on a tree sequence does, @brianzhang01. Daniel coded up an incremental algorithm, didn't he? I'm afraid that I still think that it's a bit nuts to re-resolve the polytomy every time anything anywhere on the tree changes: it doesn't feel like a natural thing to do, because the tree boundaries are really nothing to do with the polytomy itself. I suspect that it also won't scale to large trees, where pretty much every base could cause a tree change somewhere in the TS, probably miles away from the polytomy you are trying to resolve?

I see, I just skimmed #548 a bit. I guess I meant that the clearest mathematical definition of tree-sequence KC is as an average over the tree. But OK, if you want to use Daniel's function with polytomy breaking then I guess you would need a resolved tree-sequence.

Yes, I didn't know about this. It's great! Should it be extended to multiple tree sequences, do you think?

I can't think of a clear use case off the top of my head.

@jeromekelleher
Member

Thanks for the comments @hyanwong and @brianzhang01, this is all very helpful. I'll have to leave this for a few days, so will respond to your individual comments when I can get back to it.

However, I'm pretty solidly against the idea of doing resolutions tree-sequence-wide now, as I'm sure there's a bunch of complexities we won't have considered, and we don't need it right now.


@jeromekelleher
Member

This is ready to go, I think. Here's some evaluation code, for the record.

import math

import tskit

import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt


def num_leaf_labelled_binary_trees(n):
    """
    Returns the number of leaf labelled binary trees with n leaves.

    https://oeis.org/A005373/
    """
    return math.factorial(2 * n - 3) // (2 ** (n - 2) * math.factorial(n - 2))


for n in range(2, 7):
    N = num_leaf_labelled_binary_trees(n)
    print(f"Running n = {n}, {N} topologies")
    ranks = []
    tree = tskit.Tree.generate_star(n)
    for seed in range(1000 * N):
        split_tree = tree.split_polytomies(random_seed=seed)
        ranks.append(split_tree.rank())
    df = pd.DataFrame({"rank": ranks})
    assert len(df["rank"].unique()) == N
    ax = sns.countplot(x="rank", data=df)
    plt.savefig(f"n={n}.png")
    plt.clf()

[Count plots of topology ranks for n = 3, n = 4 and n = 5]

@jeromekelleher added the AUTOMERGE-REQUESTED label Nov 21, 2020
@mergify bot merged commit fca033d into tskit-dev:main Nov 21, 2020
@mergify bot removed the AUTOMERGE-REQUESTED label Nov 21, 2020
@hyanwong
Member Author

Thanks @jeromekelleher - did you remove the extra shuffle? You can probably use your tests above to convince yourself that it isn't needed.

@jeromekelleher
Member

Thanks @jeromekelleher - did you remove the extra shuffle? You can probably use your tests above to convince yourself that it isn't needed.

No, I left it in despite not being able to see any difference in having it from the tests. According to Knuth, there are 4n - 2 ways of creating a binary tree with n labelled leaves from one with n - 1. But we're only choosing from 2n - 1 nodes when we select the focal node x, so it seems to me we have to have the shuffle to get the right number: each of the 2n - 1 choices of x, times the 2 orderings of [x, leaf], gives 2(2n - 1) = 4n - 2.

As I say, I couldn't see any difference in the tests, but maybe there's something subtle in the leaf labellings that we're missing - we're only ever testing on the ordered labels, so maybe something would show up if we spent more time on that. But, there didn't seem much point to me, since the shuffle can't do any harm and there's not actually any good reason from the algorithm description's perspective for leaving it out.

@hyanwong
Member Author

Sorry, just picking up a comment I made above @jeromekelleher - if possible it's really helpful to give some feedback about (a) what epsilon value to use, and (b) where the problem is when epsilon is too large. For node times (but not mutation times) we should be able to do that during the split algorithm, rather than right at the end of the function, when we have lost the information about where the problem is.

@jeromekelleher
Member

Can you open an issue to track this please @hyanwong?

@hyanwong
Member Author

Just to follow up: switching to splitting polytomies per tree rather than per polytomous node means that splitting polytomies now takes longer than the inference process itself (and with better tree sequences, I suspect it will end up taking up more and more of the time), so I am still pretty keen on considering a tree-sequence-wide polytomy resolution algorithm.

@hyanwong
Member Author

Can you open an issue to track this please @hyanwong?

Opened at #1088

Successfully merging this pull request may close these issues.

Randomly resolve polytomies
6 participants