-
Notifications
You must be signed in to change notification settings - Fork 35
Add method to convert genotype probabilities to calls #419
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
8030f40
to
a249886
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good - some minor feedback.
GP = da.asarray(ds[call_genotype_probability]) | ||
# Remove chunking in genotypes dimension, if present | ||
if len(GP.chunks[2]) > 1: | ||
GP = GP.rechunk((None, None, -1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I'm wary of rechunk operations that are hidden from the user. But perhaps this one is OK and you have run at scale?
Alternatively, we could fail if the dataset in chunked in this dimension - and get the user to explicitly rechunk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's OK here since the rechunk is across the genotypes dimension which can only be of size 3 currently. We may want to rethink that in supporting non-diploid data, if/when it becomes necessary.
out[:] = 1 | ||
|
||
|
||
def convert_probability_to_call( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to add this to the public API docs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pystatgen/sgkit@61176be#diff-0ee295c29def16f570f86378eff8ba37315e464d415e731d5d8ea2e67ce16f68R45
Is there any more to it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should do it (you can check by generating the docs).
ds[variables.call_genotype_probability] = ds[ # type: ignore[no-untyped-call] | ||
variables.call_genotype_probability | ||
].astype(dtype) | ||
ds = convert_probability_to_call(ds) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A test with a different threshold (and 0) would be good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI I also added this in the last commit, which is unrelated to any comments: |
Do you have a test for this to satisfy coverage criteria? |
I fixed the BGEN doc in pystatgen/sgkit@b0a6b38. The build is now failing since coverage is not 100%. Do you want to look at that @eric-czech? |
Thanks @tomwhite, it's clearing now except for an error in test_vcfzarr_reader. Could this be intermittent? https://github.com/pystatgen/sgkit/pull/419/checks?check_run_id=1589577224 Oddly though, the Windows build has failed twice in a row now with:
I'll try again later on that one. |
Codecov Report
@@ Coverage Diff @@
## master #419 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 33 34 +1
Lines 2302 2315 +13
=========================================
+ Hits 2302 2315 +13
Continue to review full report at Codecov.
|
FYI that same test ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. (Not sure if it's worth squashing some of these commits? Feel free to merge either way.)
Closes https://github.com/pystatgen/sgkit/issues/346