Skip to content

Multiple chromosomes #176

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
jeromekelleher opened this issue Apr 16, 2019 · 32 comments
Open

Multiple chromosomes #176

jeromekelleher opened this issue Apr 16, 2019 · 32 comments
Labels
enhancement New feature or request

Comments

@jeromekelleher
Copy link
Member

@DomNelson and I were just discussing the possibility of adding support for multiple chromosomes to tskit. One possibility which seems like it might be a reasonably smooth path forward is the following:

  1. Add a chromosome table with an ID, name, length and metadata.
  2. Add a chromosome column to the edge table. Within a chromosome, coordinates must be from 0 to the chromosome's length.
  3. Either deprecate the sequence_length property (probably best), or make it equal to the sum of the length of all chromosomes.

For things like trees() we could add an optional chromosome argument. If the tree sequence contains multiple chromosomes which would raise an error if it's not specified. For tree sequences with a single chromosome, things would continue to work as now.

Any thoughts @petrelharp, @hyanwong, @bhaller, @molpopgen?

@molpopgen
Copy link
Member

molpopgen commented Apr 16, 2019 via email

@molpopgen
Copy link
Member

molpopgen commented Apr 16, 2019 via email

@bhaller
Copy link

bhaller commented Apr 16, 2019

SLiM allows simulation of multiple chromosomes, in practice, by letting the user define a recombination map with a recombination rate of 0.5 between specific pairs of bases that constitute the end of one chromosome and the start of the next. With this method you can simulate as many chromosomes as you want and it works quite nicely, but it's not terribly aesthetically pleasing. We may put a layer on top of this machinery that hides this implementation detail from the user and instead lets them formally define multiple chromosomes; but I imagine it will keep working the same way under the hood, more or less. So the upshot is: like fwdpy, we're already doing this and haven't seen any particular need for a new chromosome table, etc. But if those features were to appear in tskit, we could certainly adapt to using them.

My main question is: would this just be essentially semantic information, or would some real, substantial benefit accrue as a result? If it's just semantics, then so far SLiM users don't seem to care, and I would lean toward "why complicate our lives further?" But if there were a substantial benefit, then that's another matter. Apropos of @molpopgen's question about simplification, for example: if knowing about chromosome structure would allow a 5-chromosome model to simplify much faster, or with a much lower peak memory usage, than a 1-chromosome model of the same total genome length, that would suddenly look very interesting indeed, I think.

@molpopgen
Copy link
Member

molpopgen commented Apr 16, 2019 via email

@hyanwong
Copy link
Member

A trivial comment, but as per #29 I would like to deprecate the word "length" to refer to genome length, and replace it with "span", or something less easily confused with branch lengths. So I would vote for deprecating sequence_length (and if you go down the route of a new chromosome table, I would have ID, name, span and metadata as the columns).

I like the original idea proposed by @jeromekelleher, and don't see it as problematic to create a new table, after all, if @molpopgen recommends...

keeping the individual
chromosome lengths as a global variable in a python script

...then that's essentially the same as a small table, but the table idea has the advantage of keeping names and other metadata associated with the chromosomes. The burden would be in the extra column in the edge table.

Also, there is a major evolutionary distinction between the autosomes and the X chromosome, which makes me think that simply keeping track of boundaries along a continuous span of all chromosomes concatenated together might not be a very good idea. As far as I know, all organisms (including bacteria) have genomes organised into chromosomes, so the structure would be pretty general - although it might be worth a flag in the chromosomes table to keep track of whether the chromosome in question is circular.

Much down the line, I wonder if it would ever be possible for sites to move between chromosomes via translocation (but this is a graph-genomes question, and way off-topic)

@molpopgen
Copy link
Member

I like the original idea proposed by @jeromekelleher, and don't see it as problematic to create a new table, after all, if @molpopgen recommends...

keeping the individual
chromosome lengths as a global variable in a python script

Kind of -- it is really just the genetic map. It is needed to parameterize the forward simulation anyways.

...then that's essentially the same as a small table, but the table idea has the advantage of keeping names and other metadata associated with the chromosomes. The burden would be in the extra column in the edge table.

The burden is that the existing algorithm breaks. If an individual inherits the following two edges, which are (chrom, left, right) tuples, and L_i is the length of chromosome i:

0 L_0 - x L_0
1 0 y,

then the edges are "sorted", but not in a way that is supported by the current code. Further, the notion of scanning with respect to the "current left position" breaks, and must be replaced by the tuple (chrom, left), with comparison operators written for nested ordering ,etc..

So, if chromosome tables are going to be optional, then you end up with two code paths for sorting and simplification. Alternately, you need to manually add a dummy chromosome id for the case where no chromosome table is present.

Also, there is a major evolutionary distinction between the autosomes and the X chromosome, which makes me think that simply keeping track of boundaries along a continuous span of all chromosomes concatenated together might not be a very good idea.

I've mocked this up in Python for sex chromosomes. It isn't that hard--there is some extra external book-keeping, but that's it. The bigger problem is downstream, running a "variant iterator" over an individual that is heterogametic.

As far as I know, all organisms (including bacteria) have genomes organised into chromosomes, so the structure would be pretty general - although it might be worth a flag in the chromosomes table to keep track of whether the chromosome in question is circular.

Circular chromosomes should be easy for forward sims--it is the responsibility of the simulator to generate proper edges when a GC event spans the "0" point of a circle.

Much down the line, I wonder if it would ever be possible for sites to move between chromosomes via translocation (but this is a graph-genomes question, and way off-topic)

@jeromekelleher
Copy link
Member Author

One question: if you add chromosome labels, and edges on chromosome i are in [0,i-1), then do you plan to simplify separately by chromosome or change how edges are sorted to include the chromosome label?

I haven't thought this through in detail, but I imagine that we have another int32_t column on the edge table, which is -1 by default (== no chromosome), similar to the population and individual references in the Node table. We would then add this chromosome ID as the first key in the sorting order for edges, so that all edges for a given chromosome are adjacent. If there are no chromosomes, then nothing changes. The site table would also need an extra chromosome column.

In terms of simplification then, each chromosome would be simplified separately, but with respect to the same set of nodes. I'm not certain of this, but I think this should lead to some performance gains/memory reduction over having all the chromosomes in the same coordinate space.

I completely agree with you here @bhaller and @molpopgen that there's no point in doing this unless there's real gains in expressivity (i.e, being able to encode real biology that otherwise very awkward/impossible) or performance gains (e.g., being able to split up simplify). This came out of discussions with @DomNelson, who is interested in these whole-genome simulations, which are currently pretty ugly in msprime.

@hyanwong, can you think of examples in tsinfer where we'd want to be able to do this? For example, could we infer a recent ancestor across multiple chromosomes, in principle?

@molpopgen
Copy link
Member

molpopgen commented Apr 17, 2019 via email

@hyanwong
Copy link
Member

@hyanwong, can you think of examples in tsinfer where we'd want to be able to do this? For example, could we infer a recent ancestor across multiple chromosomes, in principle?

Certainly I think that we want to be able to restrict recent events such that multiple discontiguous regions (on different chromosomes) pass through a single individual ancestor. Essentially this would appear as "trapped material" if we treat all chromosomes as one long concatenated sequence. But I don't see how this is solved by creating a chromosomes table?

@jeromekelleher
Copy link
Member Author

From a forward simulation perspective, an alternative that seems more natural is to keep an array of table collections, and simplify each independently. Further, I then know immediately how to parallelize the simplification of each chromosome, and it would be lock-free.

It's definitely attractive (and where we started when talking about this), but we'd have a lot of duplication for, say, Individuals and it would get tricky to keep the different table collection in sync. Since the whole point is to allow us to track where different chromosomes go through the same individual, I think it would cause problems...

@jeromekelleher
Copy link
Member Author

jeromekelleher commented Apr 17, 2019

Certainly I think that we want to be able to restrict recent events such that multiple discontiguous regions (on different chromosomes) pass through a single individual ancestor. Essentially this would appear as "trapped material" if we treat all chromosomes as one long concatenated sequence. But I don't see how this is solved by creating a chromosomes table?

Well, I guess it's not so much 'solving' as giving a framework where this can be expressed naturally and the results interpreted in a user-friendly way. Certainly it seems more user-friendly to have something like

for tree in ts.trees(chromosome=22):
     for site in tree.sites():
          # Do something with site which has coordinate == standard reference

vs

for tree in ts.trees():
     for site in tree.sites():
          if chrs[22].start <= site.position < chrs[22].end:
               pos = site.pos - chrs[22].start
               # Pos should now be mapped into standard coordinates

Plus, the first is clearly going to be much more efficient.

@jeromekelleher
Copy link
Member Author

Hmm, last point was wrong. The alternative would be:

for tree in ts.trees(start=chrs[22].start, end=chrs[22].end):
     for site in tree.sites():        
           pos = site.pos - chrs[22].start
           # Pos should now be mapped into standard coordinates

so the efficiency point doesn't matter. Still, seems easier/less error prone than having to subtract offsets from coordinates, doesn't it?

@molpopgen
Copy link
Member

From a forward simulation perspective, an alternative that seems more natural is to keep an array of table collections, and simplify each independently. Further, I then know immediately how to parallelize the simplification of each chromosome, and it would be lock-free.

It's definitely attractive (and where we started when talking about this), but we'd have a lot of duplication for, say, Individuals and it would get tricky to keep the different table collection in sync. Since the whole point is to allow us to track where different chromosomes go through the same individual, I think it would cause problems...

I do individuals rather differently, so there is less overhead. For me, a bigger problem with this approach is how to deal with mutations, etc., as this would break existing data models.

I do like the idea of simplifying in parallel, though!

@molpopgen
Copy link
Member

I haven't thought this through in detail, but I imagine that we have another int32_t column on the edge table, which is -1 by default (== no chromosome), similar to the population and individual references in the Node table. We would then add this chromosome ID as the first key in the sorting order for edges, so that all edges for a given chromosome are adjacent. If there are no chromosomes, then nothing changes. The site table would also need an extra chromosome column.

In terms of simplification then, each chromosome would be simplified separately, but with respect to the same set of nodes. I'm not certain of this, but I think this should lead to some performance gains/memory reduction over having all the chromosomes in the same coordinate space.

In essence, this changes the definition of a position from a double to a tuple/struct, which is relatively easy to accommodate.

@jeromekelleher
Copy link
Member Author

I do like the idea of simplifying in parallel, though!

Nothing stopping you doing it within fwdpp! Whichever way we end up encoding things for the canonical representation, internally it could make a lot of sense to have an array of table collections and do a bit of bookkeeping at export time.

@molpopgen
Copy link
Member

Nothing stopping you doing it within fwdpp! Whichever way we end up encoding things for the canonical representation, internally it could make a lot of sense to have an array of table collections and do a bit of bookkeeping at export time.

It has been a TODO item on the kind of TODO list that you secretly never want to get around to.

@molpopgen
Copy link
Member

In terms of simplification then, each chromosome would be simplified separately, but with respect to the same set of nodes. I'm not certain of this, but I think this should lead to some performance gains/memory reduction over having all the chromosomes in the same coordinate space.

Just thought a bit more about this: if we are just redefining what a position means, then simplification proceeds exactly as it does now, just with a slightly more fiddly comparison function. Thus, the only way savings can come about is if we simplify into multiple table collections one chromosome at a time, and the savings must at most be len(total number of edges)/(num chromosomes)*sizeof(edge) in terms of extra RAM needed to simplify. Does that sound right?

@jeromekelleher
Copy link
Member Author

I was thinking of doing it as a for-loop over chromosomes, but maybe this wouldn't work at all actually and we do need to process all edges for a given parent over all chromosomes at the same time. In either case, there's not going to be any real performance difference, so I take that one back.

@molpopgen
Copy link
Member

molpopgen commented Apr 17, 2019

I think that this is right, at least with the way that we have been thinking about the problem.

EDIT: if you give up some things, specifically non-sample nodes meaning the same thing across chromosomes, then you can do it per chromosome, I think, but you won't know until you try!

@bhaller
Copy link

bhaller commented Apr 17, 2019

A question regarding this comment from @jeromekelleher:

I haven't thought this through in detail, but I imagine that we have another int32_t column on the edge table, which is -1 by default (== no chromosome), similar to the population and individual references in the Node table. We would then add this chromosome ID as the first key in the sorting order for edges, so that all edges for a given chromosome are adjacent. If there are no chromosomes, then nothing changes. The site table would also need an extra chromosome column.

The process of simplification seems, in SLiM, to spend a great deal of time sorting; I think it's usually around 50% of the total simplification time, pre-sorting the tables, IIRC. If we add another key to the sorting comparison function, and it's the first key compared, that will presumably add overhead, and it'll be completely wasted overhead for models that involve just a single chromosome. And presumably quite a bit of the time that is spent sorting is spent in the comparison function. So do we have a sense of how much this would be expected to slow down the base case of simulating a single chromosome? I would hate to see tree-seq recording get slower for all users because of a "fringe" feature. Would it be easy to use different comparison functions depending upon whether multiple chromosomes are being simulated or not?

As this discussion continues, I personally feel like I'm leaning further toward "no thanks", or at least "not now, and not until all the details have been sorted out quite concretely". Is that how others are feeling, or not? To me it just feels like a bunch of work and disruption (a change to the table formats and thus a change to the file format, which is a headache, to begin with), and possible negative performance implications for most simulations, for unclear/fringe benefit. I would also like to hear what @petrelharp thinks.

@molpopgen
Copy link
Member

@bhaller -- can you distinguish time in qsort from time in sort_tables? The latter is actually a copy-qsort-copy.

@bhaller
Copy link

bhaller commented Apr 17, 2019

I believe the "around 50% of total" is in qsort(), yes. I don't think the copies take long. I can take a profile if this seems surprising, it's been a long time since I looked.

@molpopgen
Copy link
Member

no, that makes sense. thanks.

@molpopgen
Copy link
Member

molpopgen commented Apr 17, 2019

Different comparison functions are probably not too hard to do in the sorting. One challenge is that you'd need to introduce another one for all the left/right tracking business. That would add a fair bit of additional function pointer de-referencing which is part of what bogs qsort down (compared to a typical application of std::sort).

@jeromekelleher
Copy link
Member Author

I'm with @bhaller here on the 'not right now, thanks' assessment. The idea came up during a discussion and I wanted to get some feedback on what would be involved and how much hassle it would be. This isn't a small change, so we definitely wouldn't make it lightly, without considering performance implications and taking the impact on downstream users into account.

Let's keep the issue open for discussion anyway though.

@petrelharp
Copy link
Contributor

In terms of simplification then, each chromosome would be simplified separately,

I think you all are thinking about this, but if you simplify each chromosome separately, we'll have to do an extra "merge" step at the end to ensure that the node tables agree across chromosomes. It would be nice to be able to parallelize, though!

I also lean towards 'no thanks'. An extra benefit of sticking all chromosomes end-to-end is that it makes plotting easier.

@molpopgen
Copy link
Member

I think you all are thinking about this, but if you simplify each chromosome separately, we'll have to do an extra "merge" step at the end to ensure that the node tables agree across chromosomes.

I think this is one of the stickier bits, actually.

@petrelharp
Copy link
Contributor

I think you all are thinking about this, but if you simplify each chromosome separately, we'll have to do an extra "merge" step at the end to ensure that the node tables agree across chromosomes.
I think this is one of the stickier bits, actually.

Not that bad, I think, since we would have the (old nodes) -> (new nodes) map for each chromosome.

@molpopgen
Copy link
Member

I think you all are thinking about this, but if you simplify each chromosome separately, we'll have to do an extra "merge" step at the end to ensure that the node tables agree across chromosomes.
I think this is one of the stickier bits, actually.

Not that bad, I think, since we would have the (old nodes) -> (new nodes) map for each chromosome.

I think its a little fiddly when one node has different output IDs on different chromosomes?

@petrelharp
Copy link
Contributor

I think its a little fiddly when one node has different output IDs on different chromosomes?

Fiddly, maybe - you'd need a "merge" step where you step through all the node tables and their translation tables, but it'd just take one pass by sortedness.

@benjeffery
Copy link
Member

@hyanwong and I were just talking about this issue and came up with two proposals where the chromosomes are stored as one continuous tree sequence, but where there is API help to access what you need. Both require storing the chromosome mapping, but this can just be a simple array and doesn't have to be a full table:

  1. Add a position mapper that is used wherever you want to translate coords - e.g.
    tree = ts.at(position=ts.chromosome_map(chromosome=3)(postion=12345))

or

  1. Add a method to TreeSequence that makes a new ts where all positions are translated on API calls: e.g.:
ts_chrom3 = ts.chromosome(chromosome=3)
tree = ts_chrom3.at(position=12345)

I don't see us getting to this feature soon, but wanted to record the discussion.

@hyanwong
Copy link
Member

I like 2. and with the new zero copy stuff this should be low cost, I think.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

6 participants