Skip to content

Chromosome data in VCF #2728

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
MleGhis opened this issue Mar 30, 2023 · 3 comments
Open

Chromosome data in VCF #2728

MleGhis opened this issue Mar 30, 2023 · 3 comments

Comments

@MleGhis
Copy link

MleGhis commented Mar 30, 2023

Hi,
My SLiM simulations include multiple chromosomes (recombination rates are set at 0.5 between them, as described here ) and I'd like to create a VCF that reflects this structure. Sadly, the write_vcf method currently only accepts strings as input for the "contig_id" parameter, thus forcing every value in the CHROM column of the output VCF to be the same. I can easily create a list of the values I want in this column (since I know the end position of each chromosome), which makes it possible to parse the VCF myself, and I guess that creating a VCF for each chromosome and fusing them afterward would also be an option, but VCF files are rather heavy and many simulations are gonna run, meaning that I'd rather directly give write_vcf the informations needed to create the right file. Do you plan to give the possibility of creating VCF with multiple chromosomes in the future ?
Thank you for reading, have a nice day.

@benjeffery
Copy link
Member

Hi @MleGhis!
Thanks for your suggestion, it is normal practice to have a single chromosome per tree sequence, so the API was initially designed to support this. But as the comment you linked suggests there may be reasons for doing this.
It seems write_vcf could be adapted to take an array, but this doesn't seem ideal as the contig id should come from site metadata ideally. However, it might make sense as a solution for now as we don't have full multi-chrom support.

Would you like to prepare a PR of how this would work for the community to review? There is a development guide in the docs.

@MleGhis
Copy link
Author

MleGhis commented Mar 31, 2023

Thanks @benjeffery for your prompt reply.
Unfortunately, as an intern, I might not have the time to do this right now, and I'll use another (unoptimised) solution for now. But I'll keep that in mind, and I might try later.

@jeromekelleher
Copy link
Member

There's a few ways we could enable this I think:

  1. Add options to specify the coordinate ranges required for output in the VCF, and an option to suppress header generation. You could then write the different chunks to the same file, efficiently. I guess there might be some trouble with exact VCF conformance, if you want the contigs to be defined in the header, though.
  2. We add support for contig_id being some kind of mapping from interval -> string. I guess we should follow the conventions in RateMap to do this. Then we could put in the header in the correct form.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants