Skip to content

VCF lossless conversion tests are failing for Zarr 2.11.0 #828

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
tomwhite opened this issue Mar 3, 2022 · 6 comments
Closed

VCF lossless conversion tests are failing for Zarr 2.11.0 #828

tomwhite opened this issue Mar 3, 2022 · 6 comments
Labels
upstream Used when our build breaks due to upstream changes

Comments

@tomwhite
Copy link
Collaborator

tomwhite commented Mar 3, 2022

From https://github.com/pystatgen/sgkit/runs/5401218690?check_suite_focus=true (which is truncated due to lots of output):

2022-03-03T15:45:05.4820016Z =========================== short test summary info ============================
2022-03-03T15:45:05.4820284Z FAILED sgkit/tests/io/vcf/test_vcf_lossless_conversion.py::test_lossless_conversion[CEUTrio.20.21.gatk3.4.g.vcf.bgz]
2022-03-03T15:45:05.4820484Z FAILED sgkit/tests/io/vcf/test_vcf_lossless_conversion.py::test_lossless_conversion[all_fields.vcf]
2022-03-03T15:45:05.4820638Z = 2 failed, 813 passed, 32 skipped, 2 xfailed, 1 xpassed in 259.62s (0:04:19) ==

I haven't been able to reproduce this, but it looks like some fields are nan (e.g. QUAL) which is odd:

2022-03-03T15:45:05.4783854Z #CHROM     POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  s1      s2
2022-03-03T15:45:05.4783952Z 1  1       .       G       A,C     nan     PASS    IB0     .       .       .
@tomwhite
Copy link
Collaborator Author

tomwhite commented Mar 4, 2022

I looked into this a bit more and found that it's due to changes in Zarr 2.11.0. In that release, chunks with no data are not written to disk. However, we use two types of NaN value to encode missing vs fill (see https://github.com/pystatgen/vcf-zarr-spec/blob/main/vcf_zarr_spec.md#missing-and-fill-values) both of which are different to regular NaN. So the QUAL field, which is all missing in this example, is not written to disk and then is read as a regular NaN when read back.

It's possible to use the old Zarr behaviour, by setting write_empty_chunks=True when creating the Zarr file. However, Xarray creates the Zarr file for us, and I couldn't see a way of passing this through to Xarray at first glance.

For the moment we should probably pin to zarr<2.11.0 while we fix this. It needs a bit more investigation but may need changes in both Zarr and Xarray.

@tomwhite tomwhite changed the title VCF lossless conversion tests are failing VCF lossless conversion tests are failing for Zarr 2.11.0 Mar 4, 2022
tomwhite added a commit to tomwhite/sgkit that referenced this issue Mar 4, 2022
tomwhite added a commit to tomwhite/sgkit that referenced this issue Mar 4, 2022
@jeromekelleher
Copy link
Collaborator

@tomwhite
Copy link
Collaborator Author

tomwhite commented Mar 4, 2022

Thanks for the references @jeromekelleher!

mergify bot pushed a commit that referenced this issue Mar 4, 2022
@hammer hammer added the upstream Used when our build breaks due to upstream changes label Mar 4, 2022
@tomwhite
Copy link
Collaborator Author

Opened pydata/xarray#6347

@tomwhite
Copy link
Collaborator Author

tomwhite commented Apr 1, 2022

Opened zarr-developers/zarr-specs#194

tomwhite added a commit to tomwhite/sgkit that referenced this issue Apr 7, 2022
mergify bot pushed a commit that referenced this issue Apr 7, 2022
@tomwhite
Copy link
Collaborator Author

tomwhite commented Apr 7, 2022

Fixed in zarr-developers/zarr-python#834

@tomwhite tomwhite closed this as completed Apr 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
upstream Used when our build breaks due to upstream changes
Projects
None yet
Development

No branches or pull requests

3 participants