Skip to content

How to skip empty regions at the start and end of a tree seq #2600

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
hyanwong opened this issue Oct 9, 2022 · 10 comments
Closed

How to skip empty regions at the start and end of a tree seq #2600

hyanwong opened this issue Oct 9, 2022 · 10 comments

Comments

@hyanwong
Copy link
Member

hyanwong commented Oct 9, 2022

We are about to make a change to tsinfer so that the regions before the last site and after the last site in a tree sequence have no topology. I imagine that this could be quite confusing for people looping through the trees, as sometimes (if the first site is not at position 0), the first tree will have num_roots == num_samples, and will not have any topology.

I suspect it would be useful to be able to skip these empty regions at the start and end of a tree sequence when using the trees() iterator. Have we any suggestions for a nice interface to do this? At the moment I'm doing something like this:

for tree in output_ts.trees():
    if tree.num_edges == 0 and (tree.index == 0 or tree.index == ts.num_trees - 1):
        continue
    ...

But I think that might be a bit advanced for normal users to have to do regularly?

@hyanwong
Copy link
Member Author

hyanwong commented Oct 9, 2022

This also makes me think that a tree.first and tree.last boolean attribute would be neat. In particular, tree.last would avoid having to do tree.index == tree.tree_sequence.num_trees - 1

@jeromekelleher
Copy link
Member

Good idea, how about

for tree in ts.trees(skip_missing=True):
    ...

we define a tree as missing if tree.num_roots == ts.num_samples. This would also take out centromeres etc, not just the teleomeres (if they've been chopped out of the ts).

@jeromekelleher
Copy link
Member

This also makes me thing that a tree.first and tree.last boolean attribute would be neat. In particular, tree.last would avoid having to do tree.index == tree.tree_sequence.num_trees - 1

This would clash with seeking methods, but I don't think we need them anyway if we use skip_missing as suggested here.

@hyanwong
Copy link
Member Author

This also makes me thing that a tree.first and tree.last boolean attribute would be neat. In particular, tree.last would avoid having to do tree.index == tree.tree_sequence.num_trees - 1

This would clash with seeking methods, but I don't think we need them anyway if we use skip_missing as suggested here.

Sure. I was simply thinking of an attribute that returned True if the tree was the last tree, as a shorthand (I don't think this would clash with the seeking methods, but it's only a syntactic nicety)

@property
def last(self)
    return self.index == self.tree_sequence.num_trees - 1

@jeromekelleher
Copy link
Member

I meant it would clash with the method called first

Anyway, like I say, I don't think we need it.

@hyanwong
Copy link
Member Author

I meant it would clash with the method called first

Ah, ISWYM - I forgot that existed.

@hyanwong
Copy link
Member Author

hyanwong commented Nov 17, 2022

we define a tree as missing if tree.num_roots == ts.num_samples

Actually, I think this particular definition is a bad idea, because of the root_threshold setting. If that's set to 2, the test will fail. Perhaps it's better to look for tree.num_edges == 0? That won't skip trees with entirely dead branches, but I don't think we would want to do that anyway.

Given that it's not obvious, I suspect a convenience function, tree.is_empty would be helpful. then we can document what we mean by an empty tree, and use a standardised definition everywhere.

@jeromekelleher
Copy link
Member

tree.num_edges = 0 is a good definition, and an is_empty function would be useful.

@hyanwong hyanwong mentioned this issue Dec 24, 2022
3 tasks
@hyanwong
Copy link
Member Author

We decided at a group meeting that in the absence of current user demand, we can suggest the following in documentation, which is possibly clearer than a specific parameter anyway

for tree in ts.trees():
    if tree.num_edges == 0:
        continue
    # Do stuff

@benjeffery
Copy link
Member

Note that num_edges is preferable here as it is pre-calculated, total_branch_length involves a traversal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants