Random split polytomy #815


Merged 2 commits on Nov 21, 2020

Conversation

hyanwong
Member

Fixes #809

@hyanwong
Member Author

hyanwong commented Aug 28, 2020

Currently only a draft, as there are basically no tests, rather a long method name (but I feel we need the word "random" in there, since AFAIK this is the only routine in tskit which involves randomness), and I only have a horribly inefficient (and untested) way to check the time of mutations above child nodes. I'm sure the latter could be optimised - suggestions for approaches are welcome.

@benjeffery
Member

benjeffery commented Aug 28, 2020

Great stuff - haven't reviewed the code yet, but one question is where we get the RNG on the C side (if indeed we intend to implement this C-side).

@hyanwong
Member Author

hyanwong commented Sep 3, 2020

What RNG is used in msprime on the C side? If we do want a C side implementation, I assume we just copy that. Having said which, profiling the code suggests (surprisingly) that the majority of the time is spent in table.add_row(), so a C implementation would mainly be to allow C users to access the functionality.

@benjeffery
Member

benjeffery commented Sep 3, 2020

What RNG is used in msprime on the C side?

GSL is used, which we can't use here due to licensing restrictions (it's GPL whereas we're MIT), hence it being an issue. One idea we had for another method was to pass in a random array, but I'm guessing here we don't know how many random numbers will be needed? In the other use case it was known.

@benjeffery
Member

the majority of the time is spent in table.add_row()

I'd assume that most of the time is spent in the python machinery of add_row so a C implementation would still be faster, but I assume this doesn't take that long and is roughly O(num_polytomies)?

@grahamgower
Member

PCG family of random generators could be an option: https://www.pcg-random.org/download.html

@hyanwong
Member Author

hyanwong commented Sep 4, 2020

PCG family of random generators could be an option: https://www.pcg-random.org/download.html

Numpy currently uses PCG64 as its default PRNG, which is good enough for me!

@hyanwong force-pushed the random-split-polytomy branch 2 times, most recently from 3676c5a to 9cb5dbb on September 11, 2020 13:16
@AdminBot-tskit
Collaborator

📖 Docs for this PR can be previewed here

@hyanwong force-pushed the random-split-polytomy branch from 489eb43 to 24751ea on October 19, 2020 13:36
@hyanwong force-pushed the random-split-polytomy branch 3 times, most recently from 86541d5 to 7e62990 on November 17, 2020 21:35
@hyanwong marked this pull request as ready for review November 17, 2020 21:36
@hyanwong force-pushed the random-split-polytomy branch 2 times, most recently from 99b2362 to 5fbdd65 on November 17, 2020 22:06
@hyanwong
Member Author

@benjeffery - I've revised this after discussion with Jerome, so that the code is placed in combinatorics.py and doesn't clog up trees.py. I've also removed the ability to modify the epsilon value, and now we just error out if epsilon is too large. There's a helpful error message if this is because of node times, and a less helpful one (at the end of the function) which traps messages about mutation times and raises an error mentioning epsilon. I'm not sure this is the right way to do it, but it avoids having to explicitly check for mutation times, which was a pain in the previous version. The RNG is the one from numpy, so porting this to C might be less obvious. The only "method" is "random" for the moment, but I think it's worth having that argument in as it emphasises the fact that by default there is a stochastic component to this algorithm.

I've added Jerome as a reviewer, as much of this was his suggestion for reworking, but I don't know how much he wants to check over it all?

@benjeffery
Member

I've just started reviewing.

Member

@jeromekelleher left a comment


It basically looks good, although I haven't really grokked the algorithm. This is quite a tricky thing, which I would imagine needs quite a lot of validation and testing to make sure it has the right properties in all situations.

My inclination would be to merge this ASAP but "undocument" it ahead of the upcoming release. That way, we can use it internally for our own projects without committing ourselves to any particular API or semantics, and also give us a chance to consider how we'll validate stuff like this which we can't just do simple unit tests on.

@jeromekelleher
Member

I started out thinking about an adaptive epsilon like this, but it's hard to do because of mutation times, I think? There's no easy way to get the time of the oldest mutation above a node, without actually just going through the mutations one by one. I think we said that would probably not be too much of a hit, but it seemed easier to just require epsilon to be set manually.

It's easy enough to get the sites and mutations - we just iterate through them all for the tree. We're already doing a O(n) traversal of the nodes, so there's no point in worrying about efficiency there.
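The per-tree scan described here could look roughly like this (a sketch against an assumed tskit-like interface, using `tree.sites()`, `mutation.node` and `mutation.time`; `oldest_mutation_above` is a hypothetical helper name, and it assumes mutation times are populated rather than unknown):

```python
def oldest_mutation_above(tree):
    """
    Return a dict mapping node ID -> the oldest mutation time directly
    above that node in this tree, in a single pass over the mutations.
    """
    oldest = {}
    for site in tree.sites():
        for mutation in site.mutations:
            t = mutation.time
            if mutation.node not in oldest or t > oldest[mutation.node]:
                oldest[mutation.node] = t
    return oldest
```

Since the split already does an O(n) traversal of the nodes, an extra O(num_mutations) pass like this shouldn't change the overall complexity.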

implementation of Algorithm R, because we are interested in
the leaf node labellings.

The pre-fasicle text is available here, page 16:
Member Author


I've never heard of a fasicle?! Looks like it has an extra c in the spelling anyway!

@hyanwong
Member Author

I started out thinking about an adaptive epsilon like this, but it's hard to do because of mutation times, I think? There's no easy way to get the time of the oldest mutation above a node, without actually just going through the mutations one by one. I think we said that would probably not be too much of a hit, but it seemed easier to just require epsilon to be set manually.

It's easy enough to get the sites and mutations - we just iterate through them all for the tree. We're already doing a O(n) traversal of the nodes, so there's no point in worrying about efficiency there.

OK, so we'll presumably have to add that logic and space the nodes out evenly between the oldest mutation above a child, or, if there are no mutation times, the oldest child node time?

Member

@brianzhang01 left a comment


I had these comments queued up from a review I started but didn't finish. Hope the comments still show up (I don't even know what they are at this moment, haven't figured out how to view them).

    and details of parameters.
    """
    if epsilon is None:
        epsilon = 1e-10
Member


It seems like the advantage of this design, rather than just putting epsilon = 1e-10 as the default, is that otherwise we'd need to change it in trees.py as well?

existing_node_time = node_table.time

# We can save a lot of effort if we don't need to check the time of mutations
# We definitely don't need to check on the first iteration, a
Member


Incomplete comment

    ts = tables.tree_sequence()
except tskit.LibraryError as e:
    msg = str(e)
    if msg.startswith(
Member Author


If we are checking on the time of children or mutations above children, then I think it would be better to raise these errors when we know we are going to violate the time constraints, as then we can say which nodes or mutations are causing the problems, which is much more helpful to the user.

I didn't do this for mutation times because I removed the code for checking mutation times, but if we are going to reinstate it, we should raise the error there.

if x.parent is not None:
    index = x.parent.children.index(x)
    x.parent.children[index] = internal
rng.shuffle(internal.children)
Member Author

@hyanwong Nov 19, 2020


I must admit that I don't quite understand why this extra shuffle is in there? Won't the first rng.choice make the selection random anyway?

Member Author

@hyanwong Nov 19, 2020


(by the way, I think this is essentially the same edge-addition algorithm that I originally implemented, but you are right that this one is neater - I think as a result of focussing on nodes rather than focussing on edges)

Member


We choose a node x to modify randomly in the first step, adding a new internal node which will be the parent of x and the n'th leaf. We must randomise the order of [x, n] because the leaf would always be either the left or right child, which would (I think) mean we don't generate all possible trees.

I'll add a few comments to clarify, as it's a simple algorithm and worth explaining for future reference.
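For the record, the growth step described above can be sketched as a standalone toy (an illustrative rewrite, not the tskit code; `Node`, `random_binary_tree` and `newick` are hypothetical names):

```python
import random


class Node:
    def __init__(self, label=None, children=None):
        self.label = label
        self.children = children if children is not None else []
        self.parent = None


def random_binary_tree(labels, seed=None):
    # Grow a random leaf-labelled binary tree one leaf at a time: pick a
    # uniform random existing node x, then insert a new internal node
    # above x whose (shuffled) children are x and the new leaf.
    rng = random.Random(seed)
    root = Node(label=labels[0])
    all_nodes = [root]
    for label in labels[1:]:
        x = rng.choice(all_nodes)
        leaf = Node(label=label)
        internal = Node(children=[x, leaf])
        rng.shuffle(internal.children)  # the extra shuffle discussed here
        internal.parent = x.parent
        if x.parent is not None:
            index = x.parent.children.index(x)
            x.parent.children[index] = internal
        else:
            root = internal
        x.parent = internal
        leaf.parent = internal
        all_nodes.extend([leaf, internal])
    return root


def newick(node):
    # Minimal newick-style rendering, for eyeballing the result.
    if not node.children:
        return str(node.label)
    return "(" + ",".join(newick(child) for child in node.children) + ")"
```

For example, `newick(random_binary_tree(["a", "b", "c", "d"], seed=1))` returns one of the 15 possible leaf-labelled topologies on four leaves.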

Member Author

@hyanwong Nov 19, 2020


I think it does generate all possible trees, even if the child order isn't shuffled. At least, it still did in my tests when I removed that line (only tested for polytomies up to degree 6, though). And from reading the Knuth explanation, it doesn't seem necessary (I think?)

@brianzhang01
Member

Having read through the discussion...

  • I am partial towards Jerome's suggestion to resolve individual trees on-the-fly. It's simpler and the KC metric operates tree-wise and then aggregates, so it would make for conceptually simpler KC code as well. The maintenance burden would be lower as more people would understand the code. Examples of maintenance include non-random resolution and different distributions of randomly resolving trees (e.g. rtree vs. rcoal resolution in ape: https://www.rdocumentation.org/packages/ape/versions/5.4-1/topics/rtree, current behavior is rtree)
  • It seems like Yan's function would still have a nice place to live: tsinfer. Besides metrics, what is nice about having and serializing a tree sequence with polytomies resolved?
  • The coiterate function in tsutil looks useful for metrics purposes
  • For what it's worth, I currently have a KC with random resolution of polytomies that more-or-less resolves trees-on-the-fly and computes the KC between two trees at a particular position. Then I sample many positions and compute a Monte Carlo-style average. It's written in my own C++ codebase. The point here is that to cut down on the time of computing a tree-sequence-wide KC, this might be an option, but you'd want to measure the effects of the Monte Carlo approximation.

@hyanwong
Member Author

hyanwong commented Nov 19, 2020

Having read through the discussion...

  • I am partial towards Jerome's suggestion to resolve individual trees on-the-fly. It's simpler and the KC metric operates tree-wise and then aggregates, so it would make for conceptually simpler KC code as well.

But I don't think that's what the KC metric on a tree sequence does, @brianzhang01. Daniel coded up an incremental algorithm, didn't he? I'm afraid that I still think that it's a bit nuts to re-resolve the polytomy every time anything anywhere on the tree changes: it doesn't feel like a natural thing to do, because the tree boundaries are really nothing to do with the polytomy itself. I suspect that it also won't scale to large trees, where pretty much every base could cause a tree change somewhere in the TS, probably miles away from the polytomy you are trying to resolve?

Having said which, I can see that for metric calculations, it's a convenient way to remove some non-independence in the calculations.

The coiterate function in tsutil looks useful for metrics purposes

Yes, I didn't know about this. It's great! Should it be extended to multiple tree sequences, do you think?

parent of the edge on which they lie. If ``epsilon`` is not small enough,
compared to the distance between a polytomy and its oldest child (or oldest
child mutation) these requirements may not be met. In this case an error is
raised, recommending a smaller epsilon value be used.
Member


An alternative to epsilon would be to use https://numpy.org/doc/stable/reference/generated/numpy.nextafter.html. That should give finer resolution than trying to add a small floating point value.
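A standalone illustration of what `nextafter` buys, and its catch (not part of this PR's code):

```python
import numpy as np

t = 1.0  # a hypothetical parent node time
t_child = np.nextafter(t, -np.inf)  # largest representable float below t
print(t - t_child)  # ~1.1e-16, far finer than the 1e-10 epsilon default

# The catch: there is no representable time left between t_child and t,
# so nothing (e.g. a mutation) can be wedged in between them later.
assert np.nextafter(t_child, np.inf) == t
```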

Member Author


I didn't know about nextafter. This is nice, but I'm cautious about using it, because we have been bitten in the past by creating nodes fractionally smaller than something, and then not being able to wedge in extra stuff later on (e.g. mutations). I can't see why you might want to do this in this case, but I'm still wary. Anyway, Jerome's suggestion of defaulting to an even spread of times seems good to me.

Member


I agree with Yan here - if folks want to have a super-fine epsilon then they can provide it as a parameter, but as a default nextafter is a bit tricky. In practice 1e-10 is a reasonable default, I think, leaving enough space in between nodes for mutations (if they are needed) while still being very small.

I've changed my mind on the auto-spacing after thinking more about it @hyanwong - it'll be quite hard to test, and actually you probably do want super short branches on the randomly resolved polytomies.

@brianzhang01
Member

But I don't think that's what the KC metric on a tree sequence does, @brianzhang01. Daniel coded up an incremental algorithm, didn't he? I'm afraid that I still think that it's a bit nuts to re-resolve the polytomy every time anything anywhere on the tree changes: it doesn't feel like a natural thing to do, because the tree boundaries are really nothing to do with the polytomy itself. I suspect that it also won't scale to large trees, where pretty much every base could cause a tree change somewhere in the TS, probably miles away from the polytomy you are trying to resolve?

I see, I just skimmed #548 a bit. I guess I meant that the clearest mathematical definition of tree-sequence KC is as an average over the tree. But OK, if you want to use Daniel's function with polytomy breaking then I guess you would need a resolved tree-sequence.

Yes, I didn't know about this. It's great! Should it be extended to multiple tree sequences, do you think?

I can't think of a clear use case off the top of my head.

@jeromekelleher
Member

Thanks for the comments @hyanwong and @brianzhang01, this is all very helpful. I'll have to leave this for a few days, so will respond to your individual comments when I can get back to it.

However, I'm pretty solidly against the idea of doing resolutions tree-sequence-wide now, as I'm sure there's a bunch of complexities we won't have considered, and we don't need it right now.


@jeromekelleher
Member

This is ready to go, I think. Here's some evaluation code, for the record.

import math

import tskit

import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt


def num_leaf_labelled_binary_trees(n):
    """
    Returns the number of leaf labelled binary trees with n leaves.

    https://oeis.org/A005373/
    """
    return math.factorial(2 * n - 3) // (2 ** (n - 2) * math.factorial(n - 2))


for n in range(2, 7):
    N = num_leaf_labelled_binary_trees(n)
    print(f"Running n = {n}, {N} topologies")
    ranks = []
    tree = tskit.Tree.generate_star(n)
    for seed in range(1000 * N):
        split_tree = tree.split_polytomies(random_seed=seed)
        ranks.append(split_tree.rank())
    df = pd.DataFrame({"rank": ranks})
    assert len(df["rank"].unique()) == N
    ax = sns.countplot(x="rank", data=df)
    plt.savefig(f"n={n}.png")
    plt.clf()

[Count plots of topology ranks for n = 3, n = 4 and n = 5]

@jeromekelleher added the AUTOMERGE-REQUESTED label Nov 21, 2020
@mergify bot merged commit fca033d into tskit-dev:main Nov 21, 2020
@mergify bot removed the AUTOMERGE-REQUESTED label Nov 21, 2020
@hyanwong
Member Author

Thanks @jeromekelleher - did you remove the extra shuffle? You can probably use your tests above to convince yourself that it isn't needed.

@jeromekelleher
Member

Thanks @jeromekelleher - did you remove the extra shuffle? You can probably use your tests above to convince yourself that it isn't needed.

No, I left it in despite not being able to see any difference in having it from the tests. According to Knuth, there are 4n - 2 ways of creating a binary tree with n labelled leaves from one with n - 1. But we're only choosing from 2n - 1 nodes when we select the focal node x, so it seems to me we have to have the shuffle to get the right number: each of the 2n - 1 choices of x, times the 2 orderings of [x, leaf], gives 2(2n - 1) = 4n - 2.

As I say, I couldn't see any difference in the tests, but maybe there's something subtle in the leaf labellings that we're missing - we're only ever testing on the ordered labels, so maybe something would show up if we spent more time on that. But, there didn't seem much point to me, since the shuffle can't do any harm and there's not actually any good reason from the algorithm description's perspective for leaving it out.

@hyanwong
Member Author

Sorry, just picking up a comment I made above @jeromekelleher - if possible it's really helpful to give some feedback about (a) what epsilon value to use, and (b) where the problem is when epsilon is too large. For node times (but not mutation times) we should be able to do that during the split algorithm, rather than right at the end of the function, when we have lost the information about where the problem is.

@jeromekelleher
Member

Can you open an issue to track this please @hyanwong?

@hyanwong
Member Author

Just to follow up: switching to splitting polytomies per tree rather than per polytomous node means that splitting polytomies now takes longer than the inference process itself (and with better tree sequences, I suspect it will end up taking up more and more of the time), so I am still pretty keen on considering a tree-sequence-wide polytomy resolution algorithm.

@hyanwong
Member Author

Can you open an issue to track this please @hyanwong?

Opened at #1088

Successfully merging this pull request may close these issues.

Randomly resolve polytomies
6 participants