Random split polytomy #815
Currently only a draft, as there are basically no tests, a rather long method name (but I feel we need the word "random" in there, since AFAIK this is the only routine in tskit which involves randomness), and I only have a horribly inefficient (and untested) way to check the time of mutations above child nodes. I'm sure the latter could be optimised - suggestions for approaches are welcome.
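(Editorial sketch, not part of the PR: one straightforward, if naive, way to do that per-node check using only the public tskit tree API. The helper name is invented, and it rescans every site on the tree for each polytomy, which is exactly the kind of inefficiency being described.)

import tskit

def oldest_mutation_time_above_children(tree, polytomy_node):
    """
    Invented helper: return the time of the oldest dated mutation on any edge
    directly above a child of polytomy_node, or -inf if there are none. New
    internal nodes created by the split must be older than any such mutation.
    """
    children = set(tree.children(polytomy_node))
    oldest = float("-inf")
    for site in tree.sites():
        for mutation in site.mutations:
            # mutation.node is the node below the edge carrying the mutation
            if mutation.node in children and not tskit.is_unknown_time(mutation.time):
                oldest = max(oldest, mutation.time)
    return oldest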
Great stuff - haven't reviewed the code yet, but one question is where we get the RNG on the C side (if indeed we intend to implement this C-side).
What RNG is used in msprime on the C side? If we do want a C side implementation, I assume we just copy that. Having said which, profiling the code suggests (surprisingly) that the majority of the time is spent in …
GSL is used, which we can't use here due to licensing restrictions (it's GPL, we're MIT), hence it being an issue. One idea we had for another method was to pass in a random array - but I'm guessing here we don't know how many random numbers will be needed? In the other use case it was known.
I'd assume that most of the time is spent in the python machinery of …
The PCG family of random generators could be an option: https://www.pcg-random.org/download.html
Numpy currently uses PCG64 as its default PRNG, which is good enough for me! |
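(Editorial aside, for reference: NumPy's default generator is indeed built on the PCG64 bit generator, and can be constructed either way.)

import numpy as np

# default_rng uses the PCG64 bit generator under the hood
rng = np.random.default_rng(42)
# equivalent, naming the bit generator explicitly
rng_explicit = np.random.Generator(np.random.PCG64(42))

print(rng.integers(0, 10, size=5))
print(rng_explicit.integers(0, 10, size=5))  # same stream for the same seed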
Force-pushed from 3676c5a to 9cb5dbb
Force-pushed from 9cb5dbb to cd21969
Force-pushed from cd21969 to 489eb43
Force-pushed from 489eb43 to 24751ea
Force-pushed from 86541d5 to 7e62990
Force-pushed from 99b2362 to 5fbdd65
@benjeffery - I've revised this after discussion with Jerome, so that the code is placed in … I've added Jerome as a reviewer, as much of this was his suggestion for reworking, but I don't know how much he wants to check over it all?
I've just started reviewing. |
It basically looks good, although I haven't really grokked the algorithm. This is quite a tricky thing, which I would imagine needs quite a lot of validation and testing to make sure it has the right properties in all situations.
My inclination would be to merge this ASAP but "undocument" it ahead of the upcoming release. That way, we can use it internally for our own projects without committing ourselves to any particular API or semantics, and it also gives us a chance to consider how we'll validate stuff like this, which we can't just do simple unit tests on.
It's easy enough to get the sites and mutations - we just iterate through them all for the tree. We're already doing an O(n) traversal of the nodes, so there's no point in worrying about efficiency there.
python/tskit/combinatorics.py
implementation of Algorithm R, because we are interested in
the leaf node labellings.

The pre-fasicle text is available here, page 16:
I've never heard of a fasicle?! Looks like it has an extra c in the spelling anyway!
OK, so we'll presumably have to add that logic and space the nodes out evenly between the polytomy and the oldest mutation above a child or, if there are no mutation times, the oldest child node time?
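(Editorial sketch of what an even spread would mean, with made-up numbers; this is not code from the PR.)

import numpy as np

polytomy_time = 10.0     # assumed example value
oldest_constraint = 7.5  # e.g. max(child node times, mutation times above children)
k = 3                    # number of new internal nodes the split needs

# Take k evenly spaced times strictly between the two bounds
new_times = np.linspace(oldest_constraint, polytomy_time, num=k + 2)[1:-1]
print(new_times)  # [8.125 8.75  9.375]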
I had these comments queued up from a review I started but didn't finish. Hope the comments still show up (I don't even know what they are at this moment, haven't figured out how to view them).
python/tskit/tables.py
and details of parameters.
"""
if epsilon is None:
    epsilon = 1e-10
It seems like the advantage of this design, rather than just putting epsilon = 1e-10 as the default argument, is that otherwise we'd need to change it in trees.py as well?
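(Editorial sketch of the None-sentinel pattern being discussed; the function names are made up. The literal default lives in one place, and the wrapper just forwards epsilon=None, so only one constant ever needs changing.)

def split_in_tables(tables, *, epsilon=None):
    if epsilon is None:
        epsilon = 1e-10  # the single authoritative default
    return epsilon  # placeholder for the real table-editing logic


def split_in_trees(tree, *, epsilon=None):
    # trees.py-style wrapper: no default value duplicated here
    return split_in_tables(tree, epsilon=epsilon)


print(split_in_trees(None))  # 1e-10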
python/tskit/tables.py
existing_node_time = node_table.time

# We can save a lot of effort if we don't need to check the time of mutations
# We definitely don't need to check on the first iteration, a
Incomplete comment
    ts = tables.tree_sequence()
except tskit.LibraryError as e:
    msg = str(e)
    if msg.startswith(
If we are checking on the time of children or mutations above children, then I think it would be better to raise these errors when we know we are going to violate the time constraints, as then we can say which nodes or mutations are causing the problems, which is much more helpful to the user.
I didn't do this for mutation times because I removed the code for checking mutation times, but if we are going to reinstate it, we should raise the error there.
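(Editorial sketch of the kind of early, informative check being suggested; the helper and its message are invented, not taken from the PR.)

def check_polytomy_can_be_split(tree, polytomy_node, epsilon):
    """
    Invented helper: before splitting, verify that new internal nodes placed
    epsilon below the polytomy would still be strictly older than every child,
    naming the offending child if not.
    """
    parent_time = tree.time(polytomy_node)
    for child in tree.children(polytomy_node):
        child_time = tree.time(child)
        if child_time >= parent_time - epsilon:
            raise ValueError(
                f"Cannot split polytomy at node {polytomy_node} (time {parent_time}): "
                f"child node {child} at time {child_time} leaves no room for "
                f"epsilon={epsilon}; try a smaller epsilon"
            )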
python/tskit/combinatorics.py
if x.parent is not None:
    index = x.parent.children.index(x)
    x.parent.children[index] = internal
rng.shuffle(internal.children)
I must admit that I don't quite understand why this extra shuffle is in there? Won't the first rng.choice make the selection random anyway?
(by the way, I think this is essentially the same edge-addition algorithm that I originally implemented, but you are right that this one is neater - I think as a result of focussing on nodes rather than focussing on edges)
We choose a node x to modify randomly in the first step, adding a new internal node which will be the parent of x and the n'th leaf. We must randomise the order of [x, n] because the leaf would always be either the left or right child, which would (I think) mean we don't generate all possible trees.
I'll add a few comments to clarify, as it's a simple algorithm and worth explaining for future reference.
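(In the meantime, here is an editorial, self-contained sketch of the expansion step as I read the description above; the Node class and function are invented for illustration and are not the PR's implementation.)

import random


class Node:
    # Minimal node for the sketch: label is None for internal nodes
    def __init__(self, label=None, children=None):
        self.label = label
        self.children = children if children is not None else []
        self.parent = None


def random_binary_resolution(num_leaves, rng=None):
    """
    Grow a random leaf-labelled binary tree: start from a single leaf, then for
    each new leaf pick an existing node x uniformly, splice a new internal node
    into x's place and make it the parent of both x and the new leaf. The final
    shuffle randomises which child sits on the left.
    """
    rng = rng or random.Random()
    root = Node(label=0)
    all_nodes = [root]
    for n in range(1, num_leaves):
        x = rng.choice(all_nodes)  # focal node: any of the 2n - 1 current nodes
        leaf = Node(label=n)
        internal = Node(children=[x, leaf])
        internal.parent = x.parent
        if x.parent is not None:
            index = x.parent.children.index(x)
            x.parent.children[index] = internal
        else:
            root = internal
        x.parent = internal
        leaf.parent = internal
        rng.shuffle(internal.children)  # randomise left/right order, as discussed
        all_nodes.extend([leaf, internal])
    return root

Whether that last shuffle is actually needed for uniformity is exactly the question picked up below.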
I think it does generate all possible trees, even if the child order isn't shuffled. At least, it still did in my tests when I removed that line (only tested for polytomies up to degree 6, though). And from reading the Knuth explanation, it doesn't seem necessary (I think?)
Having read through the discussion...
But I don't think that's what the KC metric on a tree sequence does, @brianzhang01. Daniel coded up an incremental algorithm, didn't he? I'm afraid that I still think that it's a bit nuts to re-resolve the polytomy every time anything anywhere on the tree changes: it doesn't feel like a natural thing to do, because the tree boundaries really have nothing to do with the polytomy itself. I suspect that it also won't scale to large trees, where pretty much every base could cause a tree change somewhere in the TS, probably miles away from the polytomy you are trying to resolve? Having said which, I can see that for metric calculations, it's a convenient way to remove some non-independence in the calculations.
Yes, I didn't know about this. It's great! Should it be extended to multiple tree sequences, do you think?
parent of the edge on which they lie. If ``epsilon`` is not small enough,
compared to the distance between a polytomy and its oldest child (or oldest
child mutation) these requirements may not be met. In this case an error is
raised, recommending a smaller epsilon value be used.
An alternative to epsilon would be to use https://numpy.org/doc/stable/reference/generated/numpy.nextafter.html. That should give finer resolution than trying to add a small floating point value.
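(For reference, an editorial illustration of what nextafter gives, with an arbitrary value; np.nextafter is a standard NumPy function.)

import numpy as np

parent_time = 2.0
# largest representable float strictly below the parent's time
new_internal_time = np.nextafter(parent_time, -np.inf)
assert new_internal_time < parent_time
print(parent_time - new_internal_time)  # far smaller than any practical epsilon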
I didn't know about nextafter. This is nice, but I'm cautious about using it, because we have been bitten in the past by creating nodes fractionally smaller than something, and then not being able to wedge in extra stuff later on (e.g. mutations). I can't see why you might want to do this in this case, but I'm still wary. Anyway, Jerome's suggestion of defaulting to an even spread of times seems good to me.
I agree with Yan here - if folks want to have a super-fine epsilon then they can provide it as a parameter, but as a default nextafter is a bit tricky. In practice 1e-10 is a reasonable default, I think, leaving enough space in between nodes for mutations (if they are needed) while still being very small.
I've changed my mind on the auto-spacing after thinking more about it @hyanwong - it'll be quite hard to test, and actually you probably do want super short branches on the randomly resolved polytomies.
I see, I just skimmed #548 a bit. I guess I meant that the clearest mathematical definition of tree-sequence KC is as an average over the tree. But OK, if you want to use Daniel's function with polytomy breaking then I guess you would need a resolved tree-sequence.
I can't think of a clear use case off the top of my head.
Thanks for the comments @hyanwong and @brianzhang01, this is all very helpful. I'll have to leave this for a few days, so will respond to your individual comments when I can get back to it. However, I'm pretty solidly against the idea of doing resolutions tree-sequence-wide now, as I'm sure there's a bunch of complexities we won't have considered, and we don't need it right now.
Force-pushed from 3ef304f to 33816b0
Force-pushed from 33816b0 to 7423109
This is ready to go, I think. Here's some evaluation code, for the record.

import tskit
import io
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt


def num_leaf_labelled_binary_trees(n):
    """
    Returns the number of leaf labelled binary trees with n leaves.
    https://oeis.org/A005373/
    """
    return int(np.math.factorial(2 * n - 3) / (2 ** (n - 2) * np.math.factorial(n - 2)))


for n in range(2, 7):
    N = num_leaf_labelled_binary_trees(n)
    print(f"Running n = {n}, {N} topologies")
    ranks = []
    tree = tskit.Tree.generate_star(n)
    for seed in range(1000 * N):
        split_tree = tree.split_polytomies()
        ranks.append(split_tree.rank())
    df = pd.DataFrame({"rank": ranks})
    assert len(df["rank"].unique()) == N
    ax = sns.countplot(x="rank", data=df)
    plt.savefig(f"n={n}.png")
    plt.clf()
Thanks @jeromekelleher - did you remove the extra shuffle? You can probably use your tests above to convince yourself that it isn't needed.
No, I left it in, despite not being able to see any difference in having it from the tests. According to Knuth, there are 4n - 2 ways of creating a binary tree with n labelled leaves from one with n - 1. But we're only choosing from 2n - 1 nodes when we select the focal node x, so it seems to me we have to have the shuffle to get the right number. As I say, I couldn't see any difference in the tests, but maybe there's something subtle in the leaf labellings that we're missing - we're only ever testing on the ordered labels, so maybe something would show up if we spent more time on that. But there didn't seem much point to me, since the shuffle can't do any harm and there's not actually any good reason, from the algorithm description's perspective, for leaving it out.
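(Editorial aside on the counting argument, as I read it: with 2n − 1 candidate focal nodes and two possible left/right placements of the new leaf, there are

2 × (2n − 1) = 4n − 2

distinct expansion choices, so the shuffle supplies exactly the factor of two that choosing the focal node alone does not.)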
Sorry, just picking up a comment I made above @jeromekelleher - if possible it's really helpful to give some feedback about (a) what epsilon value to use, and (b) where the problem is when epsilon is too large. For node times (but not mutation times) we should be able to do that during the split algorithm, rather than right at the end of the function, when we have lost the information about where the problem is.
Can you open an issue to track this please @hyanwong?
Just to follow up: switching to splitting polytomies per tree rather than per polytomous node means that splitting polytomies now takes longer than the inference process itself (and with better tree sequences, I suspect it will end up taking up more and more of the time), so I am still pretty keen on considering a tree-sequence-wide polytomy resolution algorithm.
Fixes #809