sgkit.io.vcf.vcf_to_zarr() fails to convert VCFs with INFO/CSQ and other unbounded annotations #1059
Hi @tnguyengel, I'm not familiar with INFO/CSQ annotations, but you can specify a maximum bound for a VCF field using the `field_defs` argument:

```python
from sgkit.io.vcf import vcf_to_zarr

vcf_to_zarr(
    "example.vcf.gz",
    "example.zarr",
    fields=["INFO/CSQ"],
    field_defs={"INFO/CSQ": {"Number": 100}},  # or whatever a reasonable maximum is
)
```

Essentially, the `field_defs` entry overrides the `Number=.` declaration in the VCF header, giving the field a fixed size.
Thanks! That sounds like a quick workaround! From brief googling, a human gene has ~4 transcripts on average, but can have up to ~1000 transcripts. If this is correct, then to ensure that the INFO/CSQ field is not truncated in the zarr, I would need to set `field_defs={"INFO/CSQ": {"Number": 1000}}`.

Assuming that a typical variant allele does indeed have ~4 INFO/CSQ entries, but can have up to ~1000, would I run the risk of bloating the zarr file size due to the INFO/CSQ field? Or does sgkit use a sparse representation to handle fields with many zero/null/empty elements?
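For a rough sense of the bloat risk, here is a back-of-envelope sketch. All numbers are hypothetical illustrations (cohort size, bound, and average entry length are not from this thread):

```python
# Back-of-envelope estimate of a fixed-size INFO/CSQ array.
# All numbers below are hypothetical, not sgkit defaults.
n_variants = 10_000_000      # variants in the cohort
max_entries = 1000           # fixed Number bound chosen for CSQ
bytes_per_entry = 100        # rough average CSQ string length

# Dense allocation: every variant reserves space for max_entries strings.
dense_bytes = n_variants * max_entries * bytes_per_entry

# Space actually carrying data if a typical variant has ~4 entries.
typical_bytes = n_variants * 4 * bytes_per_entry

print(f"dense allocation: {dense_bytes / 1e12:.1f} TB")
print(f"typically used:   {typical_bytes / 1e9:.1f} GB")
```

Note that Zarr compresses chunks by default, so long runs of empty fill values should shrink considerably on disk; the decompressed in-memory arrays, however, stay dense.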
Good questions @tnguyengel! Any thoughts @tomwhite?
This is a case where fixed-size arrays struggle, I'm afraid. The problem is that in Zarr you can have ragged arrays, or variable-length strings, but not both. There's more discussion about this in #634 and the linked xarray issue at pydata/xarray#4285 (which seems to have some newer discussion). The good news is that there are lots of people who want to solve this issue.

In the meantime, you can either store just the first n entries (which is lossy), or store all of them (which wastes space). Another trick is to use the undocumented `zarr_array_sizes` function, which computes suitable sizes from the VCF itself:

```python
from sgkit.io.vcf.vcf_reader import zarr_array_sizes

kwargs = zarr_array_sizes(path)
vcf_to_zarr(path, ..., **kwargs)
```
Looks like we could improve the error message here?
I've opened #1064 for this. Also, @benjeffery reminded me that the …
@tnguyengel I noticed that VEP allows you to export annotations as a JSON file, so as another way to approach the problem I wondered if you could use that to generate a list of variant IDs (using Pandas, or DuckDB, or ...), then use that list to subset the sgkit dataset?
BTW to filter the dataset, you can do something like:

```python
variant_ids = ["rs6054257", "rs6040355"]
ds_filtered = ds.isel(variants=(ds.variant_id.isin(variant_ids)))
```
Thanks! The annotations were run a while ago and only output to VCF. JSON is admittedly a much easier format to query than VCF. At the moment, we're playing around with split-vep output. If it's not too computationally intensive, reannotating to JSON might be more attractive. We'll give the `zarr_array_sizes` approach a shot too!
The parallel version of …
@tnguyengel Adding a username ping here in case the last message didn't reach your notifications. |
`sgkit.io.vcf.vcf_to_zarr()` fails to convert VCFs with INFO/CSQ annotations, with the error:

```
ValueError: INFO field 'CSQ' is defined as Number '.', which is not supported.
```

as tested on sgkit v0.6.0.

Presumably, the method will also fail for any VCFs containing annotations with unbounded size. INFO/CSQ contains variant effect predictions from VEP. There can be multiple predictions for each allele, one for every transcript that an allele overlaps. Each prediction is separated by a comma. The number of predictions per allele is not known in advance, and so the INFO/CSQ field is defined with unbounded size in the header, i.e. `Number=.`:
For example:

```
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|...>
```
It would be very useful to be able to filter a zarr for variants that are deemed clinically relevant according to annotation, such as loss of function variants.
Do you suggest any workarounds in the meantime?