sgkit.io.vcf.vcf_to_zarr() fails to convert VCFs with INFO/CSQ and other unbounded annotations #1059
Hi @tnguyengel, I'm not familiar with INFO/CSQ annotations, but you can specify a maximum bound for a VCF field using the `field_defs` argument:

```python
from sgkit.io.vcf import vcf_to_zarr

vcf_to_zarr(
    "example.vcf.gz",
    "example.zarr",
    fields=["INFO/CSQ"],
    field_defs={"INFO/CSQ": {"Number": 100}},  # or whatever a reasonable maximum is
)
```

Essentially, the `field_defs` entry overrides the `Number=.` declaration in the VCF header, giving the field a fixed size.
Thanks! That sounds like a quick workaround! From brief googling, a human gene has ~4 transcripts on average, but can have up to ~1000 transcripts. If this is correct, then to ensure that the INFO/CSQ field is not truncated in the zarr, I would need to set `field_defs={"INFO/CSQ": {"Number": 1000}}`.

Assuming that a typical variant allele does indeed have ~4 INFO/CSQ entries, but can have up to ~1000, would I run the risk of bloating the zarr file size due to the INFO/CSQ field? Or does sgkit use a sparse representation to handle fields with many zero/null/empty elements?
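For a rough sense of the bloat risk, here is a back-of-envelope sketch. All numbers are hypothetical illustrations (cohort size, bound, and average entry length are not from this thread):

```python
# Back-of-envelope estimate of a fixed-size INFO/CSQ array.
# All numbers below are hypothetical, not sgkit defaults.
n_variants = 10_000_000      # variants in the cohort
max_entries = 1000           # fixed Number bound chosen for CSQ
bytes_per_entry = 100        # rough average CSQ string length

# Dense allocation: every variant reserves space for max_entries strings.
dense_bytes = n_variants * max_entries * bytes_per_entry

# Space actually carrying data if a typical variant has ~4 entries.
typical_bytes = n_variants * 4 * bytes_per_entry

print(f"dense allocation: {dense_bytes / 1e12:.1f} TB")
print(f"typically used:   {typical_bytes / 1e9:.1f} GB")
```

Note that Zarr compresses chunks by default, so long runs of empty fill values should shrink considerably on disk; the decompressed in-memory arrays, however, stay dense.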
Good questions @tnguyengel! Any thoughts @tomwhite?
This is a case where fixed-size arrays struggle, I'm afraid. The problem is that in Zarr you can have ragged arrays, or variable-length strings, but not both. There's more discussion about this in #634 and the linked xarray issue at pydata/xarray#4285 (which seems to have some newer discussion). The good news is that there are lots of people who want to solve this issue.

In the meantime, you can either store just the first n entries (which is lossy), or store all of them (which wastes space). Another trick is to use the undocumented `zarr_array_sizes` function, which computes suitable sizes from the VCF itself:

```python
from sgkit.io.vcf.vcf_reader import zarr_array_sizes

kwargs = zarr_array_sizes(path)
vcf_to_zarr(path, ..., **kwargs)
```
Looks like we could improve the error message here?
I've opened #1064 for this. Also, @benjeffery reminded me that the …
@tnguyengel I noticed that VEP allows you to export annotations as a JSON file, so as another way to approach the problem I wondered if you could use that to generate a list of variant IDs (using Pandas, or DuckDB, or ...), then use that list to subset the sgkit dataset?
BTW to filter the dataset, you can do something like:

```python
variant_ids = ["rs6054257", "rs6040355"]
ds_filtered = ds.isel(variants=(ds.variant_id.isin(variant_ids)))
```
Thanks! The annotations were run a while ago and only output to VCF. JSON is admittedly a much easier format to query than VCF. At the moment, we're playing around with split-vep output. If it's not too computationally intensive, reannotating to JSON might be more attractive. We'll give the `zarr_array_sizes` approach a shot too!
The parallel version of …
@tnguyengel Adding a username ping here in case the last message didn't reach your notifications. |
`sgkit.io.vcf.vcf_to_zarr()` fails to convert VCFs with INFO/CSQ annotations, with the error:

```
ValueError: INFO field 'CSQ' is defined as Number '.', which is not supported.
```

as tested on sgkit v0.6.0.

Presumably, the method will also fail for any VCFs containing annotations with unbounded size. INFO/CSQ contains variant effect predictions from VEP. There can be multiple predictions for each allele, one for every transcript that an allele overlaps. Each prediction is separated by a comma. The number of predictions per allele is not known in advance, and so the INFO/CSQ field is defined with unbounded size in the header, i.e. `Number=.`:
For example:

```
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|...>
```
It would be very useful to be able to filter a zarr for variants that are deemed clinically relevant according to annotation, such as loss of function variants.
Do you suggest any workarounds in the meantime?