-
Notifications
You must be signed in to change notification settings - Fork 77
Multiple chromosomes #176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I have implemented discrete chromosomes in fwdpy11. Length refers to the
total genome length, and the recombination functions work to ensure correct
recording of transmitted edges. In practice, keeping the individual
chromosome lengths as a global variable in a python script has been
sufficient for any downstream analysis. I'm in favor of not adding more
tables, although I do see the utility for considering "real" data.
…On Tue, Apr 16, 2019, 1:21 PM Jerome Kelleher ***@***.***> wrote:
@DomNelson <https://github.com/DomNelson> and I were just discussing the
possibility of adding support for multiple chromosomes to tskit. One
possibility which seems like it might be a reasonably smooth path forward
is the following:
1. Add a chromosome table with an ID, name, length and metadata.
2. Add a chromosome column to the edge table. Within a chromosome,
coordinates must be from 0 to the chromosome's length.
3. Either deprecate the sequence_length property (probably best), or
make it equal to the sum of the length of all chromosomes.
For things like trees() we could add an optional chromosome argument. If
the tree sequence contains multiple chromosomes which would raise an error
if it's not specified. For tree sequences with a single chromosome, things
would continue to work as now.
Any thoughts @petrelharp <https://github.com/petrelharp>, @hyanwong
<https://github.com/hyanwong>, @bhaller <https://github.com/bhaller>,
@molpopgen <https://github.com/molpopgen>?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#176>, or mute the thread
<https://github.com/notifications/unsubscribe-auth/AGHnHyBN448cJH8LuU53ehq2rUFjU1-yks5vhjCqgaJpZM4czoSO>
.
|
One question: if you add chromosome labels, and edges on chromosome i are
in [0,i-1), then do you plan to simplify separately by chromosome or change
how edges are sorted to include the chromosome label?
I've let edges cross "chromosome" boundaries, which occurs at 1 minus the
crossover rates between chromosomes. The nice feature of this approach is
that none of the sorting and simplifying code needed touching.
…On Tue, Apr 16, 2019, 1:21 PM Jerome Kelleher ***@***.***> wrote:
@DomNelson <https://github.com/DomNelson> and I were just discussing the
possibility of adding support for multiple chromosomes to tskit. One
possibility which seems like it might be a reasonably smooth path forward
is the following:
1. Add a chromosome table with an ID, name, length and metadata.
2. Add a chromosome column to the edge table. Within a chromosome,
coordinates must be from 0 to the chromosome's length.
3. Either deprecate the sequence_length property (probably best), or
make it equal to the sum of the length of all chromosomes.
For things like trees() we could add an optional chromosome argument. If
the tree sequence contains multiple chromosomes which would raise an error
if it's not specified. For tree sequences with a single chromosome, things
would continue to work as now.
Any thoughts @petrelharp <https://github.com/petrelharp>, @hyanwong
<https://github.com/hyanwong>, @bhaller <https://github.com/bhaller>,
@molpopgen <https://github.com/molpopgen>?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#176>, or mute the thread
<https://github.com/notifications/unsubscribe-auth/AGHnHyBN448cJH8LuU53ehq2rUFjU1-yks5vhjCqgaJpZM4czoSO>
.
|
SLiM allows simulation of multiple chromosomes, in practice, by letting the user define a recombination map with a recombination rate of 0.5 between specific pairs of bases that constitute the end of one chromosome and the start of the next. With this method you can simulate as many chromosomes as you want and it works quite nicely, but it's not terribly aesthetically pleasing. We may put a layer on top of this machinery that hides this implementation detail from the user and instead lets them formally define multiple chromosomes; but I imagine it will keep working the same way under the hood, more or less. So the upshot is: like fwdpy, we're already doing this and haven't seen any particular need for a new chromosome table, etc. But if those features were to appear in tskit, we could certainly adapt to using them. My main question is: would this just be essentially semantic information, or would some real, substantial benefit accrue as a result? If it's just semantics, then so far SLiM users don't seem to care, and I would lean toward "why complicate our lives further?" But if there were a substantial benefit, then that's another matter. Apropos of @molpopgen's question about simplification, for example: if knowing about chromosome structure would allow a 5-chromosome model to simplify much faster, or with a much lower peak memory usage, than a 1-chromosome model of the same total genome length, that would suddenly look very interesting indeed, I think. |
I'd probably populate the tables when "dumping" to tskit's format. But I'm
practice, we keep everything in the fwdpp format as the metadata are much
smaller.
I may be thinking sloppily, but I'm writing from a courthouse awaiting a
jury call, but simplifying separately may "get weird", unless you track
separate node tables per chromosome? If you fully simplify chromosome
zero, and then chromosome 1 has a node specific to it and more recent than
most ancient node on chr0, you have a book keeping violation. If you build
the "global" node table to avoid this, then you've basically run the entire
algorithm, right?
…On Tue, Apr 16, 2019, 1:56 PM Ben Haller ***@***.***> wrote:
SLiM allows simulation of multiple chromosomes, in practice, by letting
the user define a recombination map with a recombination rate of 0.5
between specific pairs of bases that constitute the end of one chromosome
and the start of the next. With this method you can simulate as many
chromosomes as you want and it works quite nicely, but it's not terribly
aesthetically pleasing. We may put a layer on top of this machinery that
hides this implementation detail from the user and instead lets them
formally define multiple chromosomes; but I imagine it will keep working
the same way under the hood, more or less. So the upshot is: like fwdpy,
we're already doing this and haven't seen any particular need for a new
chromosome table, etc. But if those features were to appear in tskit, we
could certainly adapt to using them.
My main question is: would this just be essentially semantic information,
or would some real, substantial benefit accrue as a result? If it's just
semantics, then so far SLiM users don't seem to care, and I would lean
toward "why complicate our lives further?" But if there were a substantial
benefit, then that's another matter. Apropos of @molpopgen
<https://github.com/molpopgen>'s question about simplification, for
example: if knowing about chromosome structure would allow a 5-chromosome
model to simplify much faster, or with a much lower peak memory usage, than
a 1-chromosome model of the same total genome length, that would suddenly
look very interesting indeed, I think.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#176 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AGHnH94UubwWDbz9_XM_Km-PzCQtDUiMks5vhjjCgaJpZM4czoSO>
.
|
A trivial comment, but as per #29 I would like to deprecate the word "length" to refer to genome length, and replace it with "span", or something less easily confused with branch lengths. So I would vote for deprecating I like the original idea proposed by @jeromekelleher, and don't see it as problematic to create a new table, after all, if @molpopgen recommends...
...then that's essentially the same as a small table, but the table idea has the advantage of keeping names and other metadata associated with the chromosomes. The burden would be in the extra column in the edge table. Also, there is a major evolutionary distinction between the autosomes and the X chromosome, which makes me think that simply keeping track of boundaries along a continuous span of all chromosomes concatenated together might not be a very good idea. As far as I know, all organisms (including bacteria) have genomes organised into chromosomes, so the structure would be pretty general - although it might be worth a flag in the chromosomes table to keep track of whether the chromosome in question is circular. Much down the line, I wonder if it would ever be possible for sites to move between chromosomes via translocation (but this is a graph-genomes question, and way off-topic) |
Kind of -- it is really just the genetic map. It is needed to parameterize the forward simulation anyways.
The burden is that the existing algorithm breaks. If an individual inherits the following two edges, which are (chrom, left, right) tuples, and L_i is the length of chromosome i: 0 L_0 - x L_0 then the edges are "sorted", but not in a way that is supported by the current code. Further, the notion of scanning with respect to the "current left position" breaks, and must be replaced by the tuple (chrom, left), with comparison operators written for nested ordering ,etc.. So, if chromosome tables are going to be optional, then you end up with two code paths for sorting and simplification. Alternately, you need to manually add a dummy chromosome id for the case where no chromosome table is present.
I've mocked this up in Python for sex chromosomes. It isn't that hard--there is some extra external book-keeping, but that's it. The bigger problem is downstream, running a "variant iterator" over an individual that is heterogametic. As far as I know, all organisms (including bacteria) have genomes organised into chromosomes, so the structure would be pretty general - although it might be worth a flag in the chromosomes table to keep track of whether the chromosome in question is circular. Circular chromosomes should be easy for forward sims--it is the responsibility of the simulator to generate proper edges when a GC event spans the "0" point of a circle.
|
I haven't thought this through in detail, but I imagine that we have another In terms of simplification then, each chromosome would be simplified separately, but with respect to the same set of nodes. I'm not certain of this, but I think this should lead to some performance gains/memory reduction over having all the chromosomes in the same coordinate space. I completely agree with you here @bhaller and @molpopgen that there's no point in doing this unless there's real gains in expressivity (i.e, being able to encode real biology that otherwise very awkward/impossible) or performance gains (e.g., being able to split up simplify). This came out of discussions with @DomNelson, who is interested in these whole-genome simulations, which are currently pretty ugly in msprime. @hyanwong, can you think of examples in tsinfer where we'd want to be able to do this? For example, could we infer a recent ancestor across multiple chromosomes, in principle? |
From a forward simulation perspective, an alternative that seems more
natural is to keep an array of table collections, and simplify each
independently. Further, I then know immediately how to parallelize the
simplification of each chromosome, and it would be lock-free.
This doesn't solve the problem for a coalescent simulation, though.
…On Wed, Apr 17, 2019, 7:57 AM Jerome Kelleher ***@***.***> wrote:
One question: if you add chromosome labels, and edges on chromosome i are
in [0,i-1), then do you plan to simplify separately by chromosome or change
how edges are sorted to include the chromosome label?
I haven't thought this through in detail, but I imagine that we have
another int32_t column on the edge table, which is -1 by default (== no
chromosome), similar to the population and individual references in the
Node table. We would then add this chromosome ID as the first key in the
sorting order for edges, so that all edges for a given chromosome are
adjacent. If there are no chromosomes, then nothing changes. The site table
would also need an extra chromosome column.
In terms of simplification then, each chromosome would be simplified
separately, but with respect to the same set of nodes. I'm not certain of
this, but I think this *should* lead to some performance gains/memory
reduction over having all the chromosomes in the same coordinate space.
I completely agree with you here @bhaller <https://github.com/bhaller>
and @molpopgen <https://github.com/molpopgen> that there's no point in
doing this unless there's real gains in expressivity (i.e, being able to
encode real biology that otherwise very awkward/impossible) or performance
gains (e.g., being able to split up simplify). This came out of discussions
with @DomNelson <https://github.com/DomNelson>, who is interested in
these whole-genome simulations, which are currently pretty ugly in msprime.
@hyanwong <https://github.com/hyanwong>, can you think of examples in
tsinfer where we'd want to be able to do this? For example, could we infer
a recent ancestor across multiple chromosomes, in principle?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#176 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AGHnHx0zzrSIAaI6j2IssKAPYuognQPJks5vhzYkgaJpZM4czoSO>
.
|
Certainly I think that we want to be able to restrict recent events such that multiple discontiguous regions (on different chromosomes) pass through a single individual ancestor. Essentially this would appear as "trapped material" if we treat all chromosomes as one long concatenated sequence. But I don't see how this is solved by creating a chromosomes table? |
It's definitely attractive (and where we started when talking about this), but we'd have a lot of duplication for, say, Individuals and it would get tricky to keep the different table collection in sync. Since the whole point is to allow us to track where different chromosomes go through the same individual, I think it would cause problems... |
Well, I guess it's not so much 'solving' as giving a framework where this can be expressed naturally and the results interpreted in a user-friendly way. Certainly it seems more user-friendly to have something like for tree in ts.trees(chromosome=22):
for site in tree.sites():
# Do something with site which has coordinate == standard reference vs for tree in ts.trees():
for site in tree.sites():
if chrs[22].start <= site.position < chrs[22].end:
pos = site.pos - chrs[22].start
# Pos should now be mapped into standard coordinates Plus, the first is clearly going to be much more efficient. |
Hmm, last point was wrong. The alternative would be: for tree in ts.trees(start=chrs[22].start, end=chrs[22].end):
for site in tree.sites():
pos = site.pos - chrs[22].start
# Pos should now be mapped into standard coordinates so the efficiency point doesn't matter. Still, seems easier/less error prone than having to subtract offsets from coordinates, doesn't it? |
I do individuals rather differently, so there is less overhead. For me, a bigger problem with this approach is how to deal with mutations, etc., as this would break existing data models. I do like the idea of simplifying in parallel, though! |
In essence, this changes the definition of a position from a double to a tuple/struct, which is relatively easy to accommodate. |
Nothing stopping you doing it within fwdpp! Whichever way we end up encoding things for the canonical representation, internally it could make a lot of sense to have an array of table collections and do a bit of bookkeeping at export time. |
It has been a TODO item on the kind of TODO list that you secretly never want to get around to. |
Just thought a bit more about this: if we are just redefining what a position means, then simplification proceeds exactly as it does now, just with a slightly more fiddly comparison function. Thus, the only way savings can come about is if we simplify into multiple table collections one chromosome at a time, and the savings must at most be len(total number of edges)/(num chromosomes)*sizeof(edge) in terms of extra RAM needed to simplify. Does that sound right? |
I was thinking of doing it as a for-loop over chromosomes, but maybe this wouldn't work at all actually and we do need to process all edges for a given parent over all chromosomes at the same time. In either case, there's not going to be any real performance difference, so I take that one back. |
I think that this is right, at least with the way that we have been thinking about the problem. EDIT: if you give up some things, specifically non-sample nodes meaning the same thing across chromosomes, then you can do it per chromosome, I think, but you won't know until you try! |
A question regarding this comment from @jeromekelleher:
The process of simplification seems, in SLiM, to spend a great deal of time sorting; I think it's usually around 50% of the total simplification time, pre-sorting the tables, IIRC. If we add another key to the sorting comparison function, and it's the first key compared, that will presumably add overhead, and it'll be completely wasted overhead for models that involve just a single chromosome. And presumably quite a bit of the time that is spent sorting is spent in the comparison function. So do we have a sense of how much this would be expected to slow down the base case of simulating a single chromosome? I would hate to see tree-seq recording get slower for all users because of a "fringe" feature. Would it be easy to use different comparison functions depending upon whether multiple chromosomes are being simulated or not? As this discussion continues, I personally feel like I'm leaning further toward "no thanks", or at least "not now, and not until all the details have been sorted out quite concretely". Is that how others are feeling, or not? To me it just feels like a bunch of work and disruption (a change to the table formats and thus a change to the file format, which is a headache, to begin with), and possible negative performance implications for most simulations, for unclear/fringe benefit. I would also like to hear what @petrelharp thinks. |
@bhaller -- can you distinguish time in |
I believe the "around 50% of total" is in qsort(), yes. I don't think the copies take long. I can take a profile if this seems surprising, it's been a long time since I looked. |
no, that makes sense. thanks. |
Different comparison functions are probably not too hard to do in the sorting. One challenge is that you'd need to introduce another one for all the left/right tracking business. That would add a fair bit of additional function pointer de-referencing which is part of what bogs |
I'm with @bhaller here on the 'not right now, thanks' assessment. The idea came up during a discussion and I wanted to get some feedback on what would be involved and how much hassle it would be. This isn't a small change, so we definitely wouldn't make it lightly, without considering performance implications and taking the impact on downstream users into account. Let's keep the issue open for discussion anyway though. |
I think you all are thinking about this, but if you simplify each chromosome separately, we'll have to do an extra "merge" step at the end to ensure that the node tables agree across chromosomes. It would be nice to be able to parallelize, though! I also lean towards 'no thanks'. An extra benefit of sticking all chromosomes end-to-end is that it makes plotting easier. |
I think this is one of the stickier bits, actually. |
Not that bad, I think, since we would have the (old nodes) -> (new nodes) map for each chromosome. |
I think its a little fiddly when one node has different output IDs on different chromosomes? |
Fiddly, maybe - you'd need a "merge" step where you step through all the node tables and their translation tables, but it'd just take one pass by sortedness. |
@hyanwong and I were just talking about this issue and came up with two proposals where the chromosomes are stored as one continuous tree sequence, but where there is API help to access what you need. Both require storing the chromosome mapping, but this can just be a simple array and doesn't have to be a full table:
or
ts_chrom3 = ts.chromosome(chromosome=3)
tree = ts_chrom3.at(position=12345) I don't see us getting to this feature soon, but wanted to record the discussion. |
I like 2. and with the new zero copy stuff this should be low cost, I think. |
@DomNelson and I were just discussing the possibility of adding support for multiple chromosomes to tskit. One possibility which seems like it might be a reasonably smooth path forward is the following:
chromosome
table with an ID,name
,length
andmetadata
.chromosome
column to the edge table. Within a chromosome, coordinates must be from 0 to the chromosome's length.sequence_length
property (probably best), or make it equal to the sum of the length of all chromosomes.For things like
trees()
we could add an optionalchromosome
argument. If the tree sequence contains multiple chromosomes which would raise an error if it's not specified. For tree sequences with a single chromosome, things would continue to work as now.Any thoughts @petrelharp, @hyanwong, @bhaller, @molpopgen?
The text was updated successfully, but these errors were encountered: