-
Notifications
You must be signed in to change notification settings - Fork 78
Per-tree-sequence polytomy resolution method #2673
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Thanks @hyanwong! My use cases involved wanting to take a ts with polytomies, split the polytomies randomly, and then run tsdate on it to get branch lengths. My understanding is that currently you can split polytomies by tree, but this is then quite inefficient and slow for the time estimation step. Also breaking them up tree-by-tree obviously ruins the idea of edge span along the genome, which I wanted to preserve in the no-polytomies ts. |
My use case was comparing the treewise KC distance with and without random polytomy splitting. Again, doing this per-tree is inefficient when using the incremental There might be some issue here with non-independence among newly introduced (polytomy-splitting) nodes, so I don't know if the variance in total tree sequence KC distance will be biased in any way, but I think the mean will be unbiased under multiple replicates of applying a TS-wide polytomy splitting method. |
My original concerns about the lack of a model here still stand: in the single tree case we have the fairly simple and well understood case of randomly sampling a binary tree uniformly from the set of all binary trees. For the ARG case I'm not sure there is any research on this, so I'm pretty reluctant to add functionality to tskit that's "original research" and not very excited about the potential for doing that research ourselves in the process of adding this functionality. To me, this is a simulation problem --- take an ARG node that has > k children and insert a randomly generated ARG with k samples under some model. The only model I know of is the coalescent, since I don't think anyone has described a combinatorial model for ARGs. Nailing down what the properties would be needed of the parent ARG (does it need to be a "full" ARG or can it be an msprime-like one?) to enable this is also something that would need to be sorted out. These are interesting research problems - but consequently aren't something we are going to resolve in an GitHub issue. |
Thanks @jeromekelleher - that's a really useful clarification. I guess @a-ignatieva might want to comment. My take is that if it's stopping people doing useful things with our tree sequences then it's something that is worth thinking about. I implemented a version to do this because I needed to for testing the use of the KC distance as an ARG comparison metric. A halfway house (and this has come up before) would be to store the code to do this somewhere, but not in the tskit library itself. This also applies to other "useful functions" too, which I often end up re-implementing (possibly badly) having not realised that there exists an implementation online somewhere that isn't in the core tskit library. I guess this turns into an issue of discoverability (and testability) of code snippets to do particular tasks. I have been putting them into the GitHub discussions as a "show and tell" item, but at the moment that doesn't seem very discoverable. |
Thanks @jeromekelleher, all good points. I guess in my applications the sorts of things I commonly want to do involve working out some quantities under (an approximation to) the coalescent with recombination, and then taking inferred ARGs and checking them against the theory. But then the polytomies clearly need to be dealt with somehow - and for cases where I care about the whole ARG rather than each local tree independently, breaking them up tree-by-tree doesn't do the trick. But I get your point that the right method of doing this is probably an open problem - just feels like an important one to me for working with tsinfer type ARGs. |
I think it's a pretty big open question in general how you make an ARG that comes out of Relate or tsinfer correspond to the classical output-of-the-coalescent-with-recombination process object. Since dealing with polytomies is either pretty easy (just stop making assumptions about having two children in code) or very difficult (e.g., what is the likelihood of something with a polytomy under a model that assumes binary trees) I see very little value in polytomy resolution algorithms --- it's just pulling stuff out of a hat. I know this is unhelpful, sorry, but I think there are deep questions to be answered here and providing easy---but probably wrong---solutions is not something tskit should do. |
If we have code snippets that do useful things, which we want to point to in tskit but not be part of the library, then perhaps we could provide a link within the docstring (in this case, the one for tree.resolve_polytomies). For example, we could link to a GitHub discussions Q&A "How can I break polytomies efficiently over a whole tree sequence". I suggest a Q&A because then we could mark something as an answer, but if the routine is improved, we could mark an improved submission as a better answer. I don't know if linking from the docstring implies a maintenance burden, but I think we already have links within docstrings to e.g. GitHub issues, so maybe this is a reasonable halfway house which would aid discovery of these additional non-core routines? |
FWIW, I don't see tskit tree sequences as having to conform to any particular model, as long as they represent ancestry of some description. For example, we are happy to use tskit to encode ancestries produced by forwards simulations even if wildly unbiological, or to encode trees like |
Well exactly - that's why we shouldn't get into the business of doing these simulations in tskit. Generating random topologies must be done under some model, and tskit isn't a simulator (beyond the simple combinatorial ones you mention). To be clear, I regard "polytomy breaking" as equal to "simulating ancestry under some model". |
Fair point - I guess I thought of it as more in the remit of tsinfer (just a post-processing step to force the ARG to be binary), rather than tskit itself. |
I still think a Q&A answer linked from the Tree.spilt_polytomies function is a low cost way of making something useable without it being in the core library. |
Uh oh!
There was an error while loading. Please reload this page.
I was chatting with @a-ignatieva at PopGroup56 and she said she had a couple of use-cases for a method that randomly broke polytomies in an entire tree sequence, but didn't do so tree-by-tree. I had some code to do this (initial example in #809), but @jeromekelleher had reasons for wanting to make the random-polytomy-breaking routine independent from tree-to-tree (see #815 (comment))
Perhaps Ana could comment in this issue about why she wanted a function that worked over the entire TS rather than the tree? I would have found it useful too (see #815 (comment)), so it may be worth implementing if there is a wider a call for it. I suspect that my implementation was overwritten when the PR (#815) was updated to perform only a single-tree resolution, but I guess the logic in #809 (comment) is roughly what I used.
The text was updated successfully, but these errors were encountered: