Skip to content

Tree.is_empty method #2640

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
hyanwong opened this issue Nov 18, 2022 · 10 comments
Closed

Tree.is_empty method #2640

hyanwong opened this issue Nov 18, 2022 · 10 comments

Comments

@hyanwong
Copy link
Member

hyanwong commented Nov 18, 2022

As discussed in #2600 (comment), it would be good to have an standard way of checking if a tree is "empty". However, there are 2 definitions of empty:

  1. No topology, anywhere. This is what happens when we delete_intervals e.g. when we simulate a certain region, or when we remove the flanking regions in tsinfer.
  2. No topology reachable from the roots (where a root can also be subject to the root_threshold parameter, such that isolated samples may not always be considered a root).

I think that an is_empty() method would be a good place to document these two different meanings. So something like:

def is_empty(self, check_roots=None) -> bool:
    """
    Check if this tree is "empty" (i.e. has no topology). A tree is empty if it has no edges. However, it
    may also be considered empty if it contains edges which represent "dead branches" (i.e. not
    reachable from the :meth:`~Tree.roots` of the tree). To consider a tree like this as empty, specify
    ``check_roots=True``.

    :param bool check_roots: Should we also consider a tree empty if it has topology which is unconnected
        to any of the roots of the tree. Default: ``None`` treated as ``False``.
    :return: ``True`` if this tree is empty, ``False`` otherwise.
    """
    if check_roots:
        if tree.num_roots == tree.num_samples():  # A common case, when root_threshold is 1
            return True
        else:
            # More inefficient if this is an empty tree with many samples, e.g. a missing region of genome
            return set(tree.roots) <= set(tree.tree_sequence.samples())
    else:
         return self.num_edges == 0
@hyanwong
Copy link
Member Author

Hmm, my logic above is flawed: we can't simply count the number or roots and samples, because a tree could have a single edge above a each sample (not empty, but same number of roots as samples). Similarly, we can't simply look to see if the roots are all samples, as they could have edges descending from the samples, leading nowhere. So I think we are forced to be longwinded here and iterate through each root checking that it has no children:

    def is_empty(self, check_roots=None) -> bool:
        """
        Check if this tree is "empty" (i.e. has no topology). A tree is empty if it has
        no edges. However, it may also be considered empty if it contains edges which
        represent :ref:`dead branches<sec_data_model_tree_dead_leaves_and_branches>`
        (i.e. not reachable from the :meth:`~Tree.roots` of the tree). To consider such
        a tree as empty too, which is more involved, specify ``check_roots=True``.
        
        Note that this is purely a property of the topology. An "empty" tree can still
        contain sites and there may even be mutations on those sites.
    
        :param bool check_roots: Should we also consider a tree empty if it has
            topology but the topology is unconnected to any of the roots of the tree?
            Default: ``None`` treated as ``False``.
        :return: ``True`` if this tree is empty, ``False`` otherwise.
        """
        if self.num_edges == 0:
            return True

        if not check_roots:
            return False

        # Exhaustively check the roots: it's not simply enough to check that the roots
        # are all samples, as they could still have children 
        for u in self.roots:
            if self.num_children(u) != 0:
                return False
        return True

hyanwong added a commit to hyanwong/tskit that referenced this issue Dec 23, 2022
hyanwong added a commit to hyanwong/tskit that referenced this issue Dec 23, 2022
@jeromekelleher
Copy link
Member

I don't understand why we're worried about the check_roots case - what's the application for this?

I think it would be simpler to just define a tree as empty if it has zero edges. We can come up with another word to describe a tree that only contains dead branches, if we need it.

@hyanwong
Copy link
Member Author

hyanwong commented Jan 8, 2023

ISTR that the main "reason" was to document the various definitions of empty for the user. But we could indeed just define an empty tree as having no edges at all, and document somewhere else that trees can "appear" empty (e.g. when plotting or when traversing their nodes in the normal way), but may not be strictly empty by the formal definition.

I guess the thought was that having the param forces people to think about what they mean by "empty" when they are looking at a tree. But this could be done by e.g. a ..note:: in the docstring?

@hyanwong
Copy link
Member Author

hyanwong commented Jan 8, 2023

I guess the thought was that having the param forces people to think about what they mean by "empty" when they are looking at a tree. But this could be done by e.g. a ..note:: in the docstring?

Then we could also point out that an "empty" tree can still have e.g. sites defined in it, which is another weird edge case.

@jeromekelleher
Copy link
Member

jeromekelleher commented Jan 8, 2023

Or just choose another word, like branchless or something? (i.e., a tree that has zero branches)

@hyanwong
Copy link
Member Author

hyanwong commented Jan 8, 2023

Or just choose another word, like branchless or something? (i.e., a tree that has zero branches)

Hmm, not a bad idea. Would we therefore be happy with e.g. ts.trees(skip_branchless=True) rather than ts.trees(skip_empty=True)? It sounds less intuitive to me, although it's more precise, which is good. Perhaps we should ask others in the tskit community for feedback here?

@jeromekelleher
Copy link
Member

It seems to me that what we're really interested in here is missingness, so why don't we call it is_missing?

The point is to skip trees representing missing data, so it would be easier to follow if we used the word "missing" rather than adding another more-or-less equivalent term.

I guess the most consistent definition would be that the tree contains no non isolated samples or something?

@jeromekelleher
Copy link
Member

There is definitely an argument for keeping it simple and saying a tree is "missing" if it has zero edges, though.

@hyanwong
Copy link
Member Author

hyanwong commented Jan 8, 2023

It seems to me that what we're really interested in here is missingness, so why don't we call it is_missing?

Yes, but I would be worried that a missing tree and missing data might get confused. Missing data is really a property of sites, whereas you can have missing tree topology even in a tree sequence with no mutations. So I think we need to tread carefully here. Maybe something to ask for opinions on during a tskit.dev meeting?

@hyanwong
Copy link
Member Author

We decided that it's too soon to implement this since it's just as easy to do tree.num_edges == 0 or tree.total_branch_length==0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants