Add method to convert genotype probabilities to calls #419

Merged
merged 2 commits into from
Dec 22, 2020

Conversation

eric-czech
Collaborator

Collaborator

@tomwhite left a comment

Looks good - some minor feedback.

GP = da.asarray(ds[call_genotype_probability])
# Remove chunking in genotypes dimension, if present
if len(GP.chunks[2]) > 1:
    GP = GP.rechunk((None, None, -1))
Collaborator

In general I'm wary of rechunk operations that are hidden from the user. But perhaps this one is OK and you have run it at scale?

Alternatively, we could fail if the dataset is chunked in this dimension - and get the user to explicitly rechunk.
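
The fail-fast alternative might look roughly like the sketch below. `require_single_chunk` is a hypothetical helper, not part of sgkit; it only inspects a dask-style `chunks` tuple (one tuple of block sizes per axis), so it runs without dask installed.

```python
def require_single_chunk(chunks, axis, name="array"):
    """Raise if a dask-style `chunks` tuple splits `axis` into several blocks.

    Hypothetical helper for illustration only; not an sgkit function.
    `chunks` has the shape of `dask.array.Array.chunks`, e.g. ((2, 2), (5,), (3,)).
    """
    n_chunks = len(chunks[axis])
    if n_chunks > 1:
        raise ValueError(
            f"{name} has {n_chunks} chunks along axis {axis}; "
            "rechunk it to a single chunk along that axis first"
        )


# Genotypes axis (axis 2) held in one block: passes silently.
require_single_chunk(((2, 2), (5,), (3,)), axis=2)

# Genotypes axis split into blocks of 2 and 1: rejected with a clear message.
try:
    require_single_chunk(((2, 2), (5,), (2, 1)), axis=2)
except ValueError as err:
    print(err)
```

With a real dask array this would be called as `require_single_chunk(GP.chunks, axis=2)`, leaving the user to run something like `GP.rechunk({2: -1})` themselves.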

Collaborator Author

I think it's OK here since the rechunk is across the genotypes dimension which can only be of size 3 currently. We may want to rethink that in supporting non-diploid data, if/when it becomes necessary.

out[:] = 1


def convert_probability_to_call(
Collaborator

Do you want to add this to the public API docs?

Collaborator Author

Collaborator

That should do it (you can check by generating the docs).

ds[variables.call_genotype_probability] = ds[  # type: ignore[no-untyped-call]
    variables.call_genotype_probability
].astype(dtype)
ds = convert_probability_to_call(ds)
Collaborator

A test with a different threshold (and 0) would be good.
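
To make the threshold cases concrete, here is a plain-Python sketch of the kind of behavior such a test could pin down. It assumes (the thread does not confirm this) that the three probabilities are ordered homozygous reference, heterozygous, homozygous alternate, that a call is made only when the largest probability is unique and at least `threshold`, and that missing calls are encoded as -1. `call_from_probabilities` is a hypothetical name and is not the sgkit implementation.

```python
import math

MISSING = (-1, -1)  # assumed encoding for "no call"


def call_from_probabilities(gp, threshold=0.9):
    """Map [P(hom-ref), P(het), P(hom-alt)] to a diploid call (illustrative only)."""
    if len(gp) != 3:
        raise ValueError("expected exactly 3 genotype probabilities")
    # No call if any probability is absent.
    if any(math.isnan(p) for p in gp):
        return MISSING
    best = max(gp)
    # No call if the most likely genotype does not reach the threshold.
    if best < threshold:
        return MISSING
    # No call if the maximum is tied between genotypes.
    if sum(1 for p in gp if p == best) > 1:
        return MISSING
    # Assumed allele encoding for the three diploid genotypes.
    return [(0, 0), (1, 0), (1, 1)][gp.index(best)]


# The same probabilities under three thresholds, including 0.
gp = [0.1, 0.8, 0.1]
for t in (0.9, 0.5, 0):
    print(t, call_from_probabilities(gp, threshold=t))
```

Under these assumptions a threshold of 0 disables the cutoff entirely, so only ties and missing probabilities still produce a missing call.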

Collaborator Author

@eric-czech
Collaborator Author

FYI I also added this in the last commit, which is unrelated to any comments:
pystatgen/sgkit@b42859c#diff-f93614eb0d28b19a2127ac73dd623269a9c047454c3b66a89f6e022ec93e69e5R106-R107

@tomwhite
Collaborator

> FYI I also added this in the last commit, which is unrelated to any comments:
> b42859c#diff-f93614eb0d28b19a2127ac73dd623269a9c047454c3b66a89f6e022ec93e69e5R106-R107

Do you have a test for this to satisfy coverage criteria?

@tomwhite
Collaborator

I fixed the BGEN doc in pystatgen/sgkit@b0a6b38. The build is now failing since coverage is not 100%. Do you want to look at that @eric-czech?

@eric-czech
Collaborator Author

Thanks @tomwhite, it's clearing now except for an error in test_vcfzarr_reader. Could this be intermittent? https://github.com/pystatgen/sgkit/pull/419/checks?check_run_id=1589577224

Oddly though, the Windows build has failed twice in a row now with:

CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/win-64/repodata.json>
Elapsed: -

An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.
'https://conda.anaconda.org/conda-forge/win-64'

I'll try again later on that one.

@codecov-io

codecov-io commented Dec 21, 2020

Codecov Report

Merging #419 (363aa26) into master (81ffc0a) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master      #419   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           33        34    +1     
  Lines         2302      2315   +13     
=========================================
+ Hits          2302      2315   +13     

| Impacted Files | Coverage Δ |
| --- | --- |
| sgkit/io/bgen/bgen_reader.py | 100.00% <ø> (ø) |
| sgkit/__init__.py | 100.00% <100.00%> (ø) |
| sgkit/stats/conversion.py | 100.00% <100.00%> (ø) |
| sgkit/variables.py | 100.00% <0.00%> (ø) |
| sgkit/io/vcf/csi.py | 100.00% <0.00%> (ø) |
| sgkit/io/vcf/tbi.py | 100.00% <0.00%> (ø) |
| sgkit/stats/association.py | 100.00% <0.00%> (ø) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 81ffc0a...363aa26. Read the comment docs.

@eric-czech
Collaborator Author

FYI that same test (test_vcfzarr_to_zarr[None-True-False-True]) passed in https://github.com/pystatgen/sgkit/pull/419/checks?check_run_id=1589647095.

Collaborator

@tomwhite left a comment

LGTM. (Not sure if it's worth squashing some of these commits? Feel free to merge either way.)

@eric-czech added the auto-merge label (Auto merge label for mergify test flight) on Dec 22, 2020
@mergify (bot) merged commit e7d979b into sgkit-dev:master on Dec 22, 2020
Labels
auto-merge Auto merge label for mergify test flight
Development

Successfully merging this pull request may close these issues.

Add function to create genotype calls from genotype probabilities
3 participants