sgkit-bgen merger #314
Conversation
Force-pushed from d9a173e to ba0cea3
Codecov Report
@@ Coverage Diff @@
## master #314 +/- ##
==========================================
+ Coverage 97.46% 97.61% +0.14%
==========================================
Files 23 26 +3
Lines 1618 1843 +225
==========================================
+ Hits 1577 1799 +222
- Misses 41 44 +3
Continue to review full report at Codecov.
Force-pushed from 78fc8de to 01db67a
Force-pushed from 01db67a to 53ca999
Force-pushed from 582caf0 to 0dc57af
Force-pushed from 0dc57af to ccb2fb7
🔥
Overarching request for the future: it would likely make things easier to review if PRs like this were broken down into multiple steps, for example:
- just the move of the BGEN repo over to the sgkit main repo
- separate PRs for improvements over the current BGEN repo

Otherwise we kind of lose that history in this squashed PR, and there is more to review in a single code dump.
@@ -57,6 +57,9 @@ vcf =
     cyvcf2
     fsspec
     yarl
+bgen =
+    rechunker
Why is this different from the dev-requirements? Can't it be git-based?
I originally had it that way, but it doesn't work when you pip install sgkit from GitHub, which then tries to pip install rechunker from GitHub. I'm not sure why, but I'm instead treating it as an "abstract" dependency for now (e.g. pyscaffold/pyscaffold#261 (comment)). See https://github.com/pystatgen/sgkit/issues/321 for when the required changes are released.
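For illustration, the difference is roughly the following (the git URL here is illustrative, not the actual pin that was removed):

# setup.cfg -- "abstract" dependency: name only, no source or version pinned
[options.extras_require]
bgen =
    rechunker

# dev-requirements-style "concrete" git pin (illustrative URL); this works for
# local development, but breaks when sgkit itself is pip-installed from GitHub
# and pip then tries to resolve rechunker from a git URL as well
rechunker @ git+https://github.com/pangeo-data/rechunker.git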
try:
    from .bgen_reader import bgen_to_zarr, read_bgen, rechunk_bgen  # noqa: F401

    __all__ = ["read_bgen", "bgen_to_zarr", "rechunk_bgen"]
I don't know if we actually need those __all__ variables? In this case, for example, I think this doesn't change anything?
I just did what you did with plink and Tom did with vcf. I plan on waiting until https://github.com/pystatgen/sgkit/pull/294 is done before making a decision on that.
sgkit/io/bgen/bgen_reader.py
Outdated
"call_dosage_mask", | ||
] | ||
|
||
METAFILE_FIELDS = [ |
This variable seems to be used only to create the dict on the next line; why not create that dict in place?
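For illustration, the suggestion amounts to something like this (field names are hypothetical, not copied from the actual module):

# before: a list used only once, to build the mapping
METAFILE_FIELDS = ["id", "rsid", "chrom", "pos"]
METAFILE_DTYPE = dict(zip(METAFILE_FIELDS, [str, str, str, "int32"]))

# after: build the dict in place
METAFILE_DTYPE = {"id": str, "rsid": str, "chrom": str, "pos": "int32"}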
sgkit/io/bgen/bgen_reader.py
Outdated
    ) -> None:
        self.path = Path(path)
        self.metafile_path = (
            Path(metafile_path) if metafile_path else path.with_suffix(".metafile")  # type: ignore[union-attr]
Should it be else self.path.with_suffix(".metafile")? Otherwise you might call with_suffix on a string, afaiu.
Ah good call -- I should have listened to mypy: pystatgen/sgkit@8899b57#diff-845c788aeac5e31bc9b2635b6fd903f0R67
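That is, the fix presumably ends up as (a sketch of the suggested change):

self.metafile_path = (
    # self.path has already been normalized to a Path above, so with_suffix
    # is never called on a raw string argument
    Path(metafile_path) if metafile_path else self.path.with_suffix(".metafile")
)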
sgkit/io/bgen/bgen_reader.py
Outdated
f"Generating BGEN metafile for '{self.path}' (this may take a while)" | ||
) | ||
bgen.create_metafile(self.metafile_path, verbose=False) | ||
logger.info("BGEN metafile generation complete") |
Should we also add time elapsed here in the log msg?
Diff'ing the timestamps in the logs is what I've been doing rather than adding more code. Is there some precedent/good reason to start adding timers?
Yea, sometimes this becomes useful when you have tons of logs coming from different places in a production (or even test) job (say, in debug mode), and you don't want to have to search for two corresponding log entries.
Hm, well, this code won't run on distributed workers (for the same bgen file) and I doubt that it ever will, but 🤷: pystatgen/sgkit@586aa9d#diff-845c788aeac5e31bc9b2635b6fd903f0R82
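For reference, a minimal sketch of what the linked commit presumably adds (not verified against the actual diff):

import time

# wrap metafile creation with a timer and include the elapsed time in the log
start = time.time()
bgen.create_metafile(self.metafile_path, verbose=False)
logger.info(
    f"BGEN metafile generation complete ({time.time() - start:.0f} seconds elapsed)"
)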
sgkit/io/bgen/bgen_reader.py
Outdated
    metafile_path
        Path to companion index file used to determine bgen byte offsets.
        Defaults to ``path`` + ".metafile" if not provided.
        This file is necessary for reading bgen genotype probabilities and it will be
nit: this codebase is not very consistent with capitalisation of BGEN :) (there are other places in this codebase, but I'm just commenting here).
Makes sense, I capitalized it in the docs: pystatgen/sgkit@8899b57#diff-845c788aeac5e31bc9b2635b6fd903f0R219
    df = read_metafile(bgen_reader.metafile_path)
    if persist:
        df = df.persist()
    arrs = dataframe_to_dict(df, METAFILE_DTYPE)
I'm likely missing something here: why not use xarray here instead of the bespoke dict of arrays?
I would probably use pydata/xarray#3929 if it existed (with default dimension names), but I'm not using xarray here, to avoid repeating so much of what create_genotype_dosage_dataset does (i.e. the dimension labels don't help in this case). It's also a reasonably convenient place to put the special string conversion handling.
        for var in encoding
    }
    with tempfile.TemporaryDirectory(
        prefix="bgen_to_zarr_", suffix=".zarr", dir=tempdir
So the tempdir must be a local FS? Does rechunker require that?
I added https://github.com/pystatgen/sgkit/issues/317 yesterday and thought about doing it, but it's not really that helpful if the source data can't also be cloud-native. It shouldn't be that hard to add later, though.
@eric-czech oh, I thought the input supported an fsspec mapper, and assumed this reader supports all fsspec filesystems; I guess that was a wrong assumption.
That would be great if it did, but unfortunately there's no way for cbgen / bed-reader to use those the way htslib does.
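(For contrast, this is the kind of fsspec usage that works for zarr stores but that cbgen / bed-reader can't take advantage of; the bucket path is hypothetical:)

import fsspec

# a key-value mapper over any fsspec-supported filesystem, usable as a zarr store
store = fsspec.get_mapper("gs://my-bucket/genotypes.zarr")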
    return ds


def rechunk_bgen(
Is this really BGEN-specific, or could this function be used for a generic rechunk on an sgkit dataset (+/- extra handling of variables, or failing if we don't know how to efficiently encode variables present in the dataset)?
I can't say it does much of use over .rechunk other than the BGEN-specific variable encoding. I'd expect anything else, like https://github.com/pystatgen/sgkit/issues/309, to probably call rechunker directly. Can you see a need for something in between?
Yea, I guess pointing at #309, I wonder if there is some common function that could be reused, but we can work on that later, maybe even creating a separate extra, sgkit[rechunker], to bring in some useful methods.
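For context, calling rechunker directly looks roughly like this (paths, chunk sizes, and the choice of variable are illustrative):

import zarr
from rechunker import rechunk

source = zarr.open("input.zarr")["call_genotype_probability"]
plan = rechunk(
    source,
    target_chunks=(10_000, 1_000, 3),  # illustrative (variants, samples, genotypes) chunking
    max_mem="1GB",
    target_store="output.zarr",
    temp_store="temp.zarr",  # intermediate store; safe to delete after execution
)
plan.execute()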
sgkit/utils.py
Outdated
    ndim = a.ndim

    def fn(x: np.ndarray) -> np.ndarray:
        max_len = np.asarray(np.frompyfunc(len, 1, 1)(x)).max()
        return np.expand_dims(max_len, list(range(ndim)))

    return da.map_blocks(fn, a, chunks=(1,) * ndim, dtype=int).max()
Is this significantly faster than:

return da.frompyfunc(len, 1, 1)(a).max()
Doesn't look like it could be, based on https://github.com/dask/dask/blob/7a46e7b4a436f5152872e6d7fa4f5291342bdd2f/dask/array/ufunc.py#L47, since da.frompyfunc is just wrapping np.frompyfunc. It's about the same, maybe slightly slower, in this test:
import numpy as np
import dask.array as da

def max_str_len1(a):
    ndim = a.ndim

    def fn(x: np.ndarray) -> np.ndarray:
        max_len = np.asarray(np.frompyfunc(len, 1, 1)(x)).max()
        return np.expand_dims(max_len, list(range(ndim)))

    return da.map_blocks(fn, a, chunks=(1,) * ndim, dtype=int).max()

def max_str_len2(a):
    ndim = a.ndim
    str_len = da.frompyfunc(len, 1, 1)

    def fn(x: np.ndarray) -> np.ndarray:
        max_len = np.asarray(str_len(x)).max()
        return np.expand_dims(max_len, list(range(ndim)))

    return da.map_blocks(fn, a, chunks=(1,) * ndim, dtype=int).max()

x = da.asarray(["x"] * 100000, chunks=100)

%timeit max_str_len1(x).compute()
154 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit max_str_len2(x).compute()
155 ms ± 2.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Apparently it also doesn't matter whether or not the ufunc is defined per-block. I started a list in https://github.com/pystatgen/sgkit/issues/68 though, with this function on it as a fast one that could be easily tracked.
No, no, I mean the whole implementation would be:
def max_str_len(a: ArrayLike) -> ArrayLike:
    if a.size == 0:
        raise ValueError("Max string length cannot be calculated for empty array")
    if a.dtype.kind == "O":
        a = a.astype(str)
    if a.dtype.kind not in {"U", "S"}:
        raise ValueError(f"Array must have string dtype (got dtype {a.dtype})")
    lens = np.frompyfunc(len, 1, 1)(a)
    return lens if isinstance(lens, int) else lens.max()
Notice that:
- we return ArrayLike instead of forcing a Dask Array -> we use np dispatching
- we need to handle numpy scalars (e.g. np.asarray("foo"))
- for the implementation above you need to adjust tests to call compute() only on Dask Arrays, or use a Dask-aware equality check, for example:

actual = d if isinstance(d := max_str_len(x), int) else d.compute()
Ahh, I see what you mean -- yea, dask interprets them as more or less the same graph and there's essentially no difference in times between them that I can find. I don't like the numpy semantics, since dask, xarray and cupy preserve the array backend for ufunc_reduce functions on scalars, so I made that a special case: pystatgen/sgkit@ed9c070#diff-ee100e898f8291cbcc9fcecc41cef338R219

This means the function preserves the array type and can be forced to evaluate (for dask, or xarray + dask) with just int(max_str_len(x)), rather than needing to switch on anything in the calling code.

Ok, can you take another look @ravwojdyla?
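For reference, a hedged sketch (not the actual sgkit implementation) of the behavior described above, where the scalar special case lets int(max_str_len(x)) work uniformly across backends:

import numpy as np
import dask.array as da

def max_str_len(a):
    if a.size == 0:
        raise ValueError("Max string length cannot be calculated for empty array")
    if a.ndim == 0:
        a = a.reshape((1,))  # promote 0-d scalars so the ufunc/reduction preserves the backend
    if a.dtype.kind == "O":
        a = a.astype(str)
    if a.dtype.kind not in {"U", "S"}:
        raise ValueError(f"Array must have string dtype (got dtype {a.dtype})")
    return np.frompyfunc(len, 1, 1)(a).max()

int(max_str_len(np.asarray(["ab", "abcd"])))  # 4
int(max_str_len(da.asarray(["ab", "abcd"])))  # 4; stays lazy until int() forces compute
int(max_str_len(np.asarray("foo")))           # 3; the 0-d scalar case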
Force-pushed from 586aa9d to ed9c070
+1
sgkit/io/plink/plink_reader.py
Outdated
from ... import create_genotype_call_dataset
from ...model import DIM_SAMPLE
from ...utils import encode_array
from ..utils import dataframe_to_dict
Nit: Any reason to change these to relative imports? I think the absolute ones are more readable.
Closes #256
This also addresses sgkit-dev/sgkit-bgen#20 and https://github.com/pystatgen/sgkit/issues/90.
Switching to cbgen instead of bgen-reader essentially involved a rewrite of our wrapper, but the changes @horta made are fantastic, so it was for the best.
Other notes:
- the read_bgen results part of the docstring is left out until https://github.com/pystatgen/sgkit/issues/295 is done

TODO: