sgkit-bgen merger #314
Conversation
Force-pushed from d9a173e to ba0cea3
Codecov Report
@@ Coverage Diff @@
## master #314 +/- ##
==========================================
+ Coverage 97.46% 97.61% +0.14%
==========================================
Files 23 26 +3
Lines 1618 1843 +225
==========================================
+ Hits 1577 1799 +222
- Misses 41 44 +3
Continue to review full report at Codecov.
Force-pushed from 78fc8de to 01db67a
Force-pushed from 01db67a to 53ca999
Force-pushed from 582caf0 to 0dc57af
Force-pushed from 0dc57af to ccb2fb7
🔥
Overarching request for the future: it would likely make things easier to review if PRs like this were broken down into multiple steps, for example:
- just the move of the BGEN repo over to the sgkit main repo
- separate PRs for improvements over the current BGEN repo

Otherwise we kind of lose that history in this squashed PR, and there is more to review in a single code dump.
@@ -57,6 +57,9 @@ vcf =
     cyvcf2
     fsspec
     yarl
+bgen =
+    rechunker
Why is this different from the dev-requirements? Can't it be git-based?
I originally had it that way, but it doesn't work when you pip install sgkit from GitHub, which then tries to pip install rechunker from GitHub. I'm not sure why, but I'm instead treating it as an "abstract" dependency for now (e.g. pyscaffold/pyscaffold#261 (comment)). See https://github.com/pystatgen/sgkit/issues/321 for when the required changes are released.
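For illustration, the difference is roughly the following (the git URL here is illustrative, not the actual pin that was removed):

# setup.cfg -- "abstract" dependency: name only, no source or version pinned
[options.extras_require]
bgen =
    rechunker

# dev-requirements-style "concrete" git pin (illustrative URL); this works for
# local development, but breaks when sgkit itself is pip-installed from GitHub
# and pip then tries to resolve rechunker from a git URL as well
rechunker @ git+https://github.com/pangeo-data/rechunker.git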
try:
    from .bgen_reader import bgen_to_zarr, read_bgen, rechunk_bgen  # noqa: F401

    __all__ = ["read_bgen", "bgen_to_zarr", "rechunk_bgen"]
I don't know if we actually need those __all__ variables? In this case, for example, I think this doesn't change anything?
I just did what you did with plink and Tom did with vcf. I plan on waiting until https://github.com/pystatgen/sgkit/pull/294 is done before making a decision on that.
sgkit/io/bgen/bgen_reader.py
Outdated
"call_dosage_mask", | ||
] | ||
|
||
METAFILE_FIELDS = [ |
This variable seems to be used only to create the dict on the next line; why not create that dict in place?
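For illustration, the suggestion amounts to something like this (field names are hypothetical, not copied from the actual module):

# before: a list used only once, to build the mapping
METAFILE_FIELDS = ["id", "rsid", "chrom", "pos"]
METAFILE_DTYPE = dict(zip(METAFILE_FIELDS, [str, str, str, "int32"]))

# after: build the dict in place
METAFILE_DTYPE = {"id": str, "rsid": str, "chrom": str, "pos": "int32"}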
sgkit/io/bgen/bgen_reader.py
Outdated
    ) -> None:
        self.path = Path(path)
        self.metafile_path = (
            Path(metafile_path) if metafile_path else path.with_suffix(".metafile")  # type: ignore[union-attr]
Should it be else self.path.with_suffix(".metafile")? Otherwise you might call with_suffix on a string, afaiu.
Ah good call -- I should have listened to mypy: pystatgen/sgkit@8899b57#diff-845c788aeac5e31bc9b2635b6fd903f0R67
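That is, the fix presumably ends up as (a sketch of the suggested change):

self.metafile_path = (
    # self.path has already been normalized to a Path above, so with_suffix
    # is never called on a raw string argument
    Path(metafile_path) if metafile_path else self.path.with_suffix(".metafile")
)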
sgkit/io/bgen/bgen_reader.py
Outdated
f"Generating BGEN metafile for '{self.path}' (this may take a while)" | ||
) | ||
bgen.create_metafile(self.metafile_path, verbose=False) | ||
logger.info("BGEN metafile generation complete") |
Should we also add time elapsed here in the log msg?
Diff'ing the timestamps in the logs is what I've been doing rather than adding more code. Is there some precedent/good reason to start adding timers?
Yea, sometimes this becomes useful when you have tons of logs coming from different places in a production (or even test) job (say, in debug mode), and you don't want to have to search for two corresponding log entries.
Hm, well, this code won't run on distributed workers (for the same bgen file) and I doubt that it ever will, but 🤷: pystatgen/sgkit@586aa9d#diff-845c788aeac5e31bc9b2635b6fd903f0R82
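For reference, a minimal sketch of what the linked commit presumably adds (not verified against the actual diff):

import time

# wrap metafile creation with a timer and include the elapsed time in the log
start = time.time()
bgen.create_metafile(self.metafile_path, verbose=False)
logger.info(
    f"BGEN metafile generation complete ({time.time() - start:.0f} seconds elapsed)"
)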
sgkit/io/bgen/bgen_reader.py
Outdated
    metafile_path
        Path to companion index file used to determine bgen byte offsets.
        Defaults to ``path`` + ".metafile" if not provided.
        This file is necessary for reading bgen genotype probabilities and it will be
nit: this codebase is not very consistent with capitalisation of BGEN :) (there are other places in this codebase, but I'm just commenting here).
Makes sense, I capitalized it in the docs: pystatgen/sgkit@8899b57#diff-845c788aeac5e31bc9b2635b6fd903f0R219
    df = read_metafile(bgen_reader.metafile_path)
    if persist:
        df = df.persist()
    arrs = dataframe_to_dict(df, METAFILE_DTYPE)
I'm likely missing something here: why not use xarray here instead of the bespoke dict of arrays?
I would probably use pydata/xarray#3929 if it existed (with default dimension names), but I'm not using xarray here, to avoid repeating so much of what create_genotype_dosage_dataset does (i.e. the dimension labels don't help in this case). It's also a reasonably convenient place to put the special string conversion handling.
        for var in encoding
    }
    with tempfile.TemporaryDirectory(
        prefix="bgen_to_zarr_", suffix=".zarr", dir=tempdir
So the tempdir must be a local FS? Does rechunker require that?
I added https://github.com/pystatgen/sgkit/issues/317 yesterday and thought about doing it, but it's not really that helpful if the source data can't also be cloud-native. It shouldn't be that hard to add later, though.
@eric-czech oh, I thought the input supported an fsspec mapper, and assumed this reader supports all fsspec filesystems; I guess that was a wrong assumption.
That would be great if it did, but unfortunately there's no way for cbgen / bed-reader to use those the way htslib does.
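(For contrast, this is the kind of fsspec usage that works for zarr stores but that cbgen / bed-reader can't take advantage of; the bucket path is hypothetical:)

import fsspec

# a key-value mapper over any fsspec-supported filesystem, usable as a zarr store
store = fsspec.get_mapper("gs://my-bucket/genotypes.zarr")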
    return ds


def rechunk_bgen(
Is this really BGEN-specific, or could this function be used for a generic rechunk on an sgkit dataset (+/- extra handling of variables, or failing if we don't know how to efficiently encode variables present in the dataset)?
I can't say it does much of use over .rechunk other than the BGEN-specific variable encoding. I'd expect anything else, like https://github.com/pystatgen/sgkit/issues/309, to probably call rechunker directly. Can you see a need for something in between?
Yea, I guess pointing at #309, I wonder if there is some common function that could be reused, but we can work on that later, maybe even creating a separate extra, sgkit[rechunker], to bring in some useful methods.
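For context, calling rechunker directly looks roughly like this (paths, chunk sizes, and the choice of variable are illustrative):

import zarr
from rechunker import rechunk

source = zarr.open("input.zarr")["call_genotype_probability"]
plan = rechunk(
    source,
    target_chunks=(10_000, 1_000, 3),  # illustrative (variants, samples, genotypes) chunking
    max_mem="1GB",
    target_store="output.zarr",
    temp_store="temp.zarr",  # intermediate store; safe to delete after execution
)
plan.execute()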
sgkit/utils.py
Outdated
    ndim = a.ndim

    def fn(x: np.ndarray) -> np.ndarray:
        max_len = np.asarray(np.frompyfunc(len, 1, 1)(x)).max()
        return np.expand_dims(max_len, list(range(ndim)))

    return da.map_blocks(fn, a, chunks=(1,) * ndim, dtype=int).max()
Is this significantly faster than:

return da.frompyfunc(len, 1, 1)(a).max()
Doesn't look like it could be, based on https://github.com/dask/dask/blob/7a46e7b4a436f5152872e6d7fa4f5291342bdd2f/dask/array/ufunc.py#L47, since da.frompyfunc is just wrapping np.frompyfunc. It's about the same, maybe slightly slower, in this test:
import numpy as np
import dask.array as da

def max_str_len1(a):
    ndim = a.ndim

    def fn(x: np.ndarray) -> np.ndarray:
        max_len = np.asarray(np.frompyfunc(len, 1, 1)(x)).max()
        return np.expand_dims(max_len, list(range(ndim)))

    return da.map_blocks(fn, a, chunks=(1,) * ndim, dtype=int).max()

def max_str_len2(a):
    ndim = a.ndim
    str_len = da.frompyfunc(len, 1, 1)

    def fn(x: np.ndarray) -> np.ndarray:
        max_len = np.asarray(str_len(x)).max()
        return np.expand_dims(max_len, list(range(ndim)))

    return da.map_blocks(fn, a, chunks=(1,) * ndim, dtype=int).max()

x = da.asarray(["x"] * 100000, chunks=100)

%timeit max_str_len1(x).compute()
154 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit max_str_len2(x).compute()
155 ms ± 2.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Apparently it also doesn't matter whether or not the ufunc is defined per-block. I started a list in https://github.com/pystatgen/sgkit/issues/68 though, with this function on it as a fast one that could be easily tracked.
No, no, I mean the whole implementation would be:
def max_str_len(a: ArrayLike) -> ArrayLike:
    if a.size == 0:
        raise ValueError("Max string length cannot be calculated for empty array")
    if a.dtype.kind == "O":
        a = a.astype(str)
    if a.dtype.kind not in {"U", "S"}:
        raise ValueError(f"Array must have string dtype (got dtype {a.dtype})")
    lens = np.frompyfunc(len, 1, 1)(a)
    return lens if isinstance(lens, int) else lens.max()
Notice that:
- we return ArrayLike instead of forcing a Dask Array -> we use np dispatching
- we need to handle numpy scalars (e.g. np.asarray("foo"))
- for the implementation above you need to adjust tests to call compute() only on Dask Arrays, or use a Dask-aware equality check, for example:

actual = d if isinstance(d := max_str_len(x), int) else d.compute()
Ahh, I see what you mean -- yea, dask interprets them as more or less the same graph and there's essentially no difference in times between them that I can find. I don't like the numpy semantics, since dask, xarray and cupy preserve the array backend for ufunc_reduce functions on scalars, so I made that a special case: pystatgen/sgkit@ed9c070#diff-ee100e898f8291cbcc9fcecc41cef338R219

This means the function preserves the array type and can be forced to evaluate (for dask, or xarray + dask) with just int(max_str_len(x)), rather than needing to switch on anything in the calling code.

Ok, can you take another look @ravwojdyla?
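For reference, a hedged sketch (not the actual sgkit implementation) of the behavior described above, where the scalar special case lets int(max_str_len(x)) work uniformly across backends:

import numpy as np
import dask.array as da

def max_str_len(a):
    if a.size == 0:
        raise ValueError("Max string length cannot be calculated for empty array")
    if a.ndim == 0:
        a = a.reshape((1,))  # promote 0-d scalars so the ufunc/reduction preserves the backend
    if a.dtype.kind == "O":
        a = a.astype(str)
    if a.dtype.kind not in {"U", "S"}:
        raise ValueError(f"Array must have string dtype (got dtype {a.dtype})")
    return np.frompyfunc(len, 1, 1)(a).max()

int(max_str_len(np.asarray(["ab", "abcd"])))  # 4
int(max_str_len(da.asarray(["ab", "abcd"])))  # 4; stays lazy until int() forces compute
int(max_str_len(np.asarray("foo")))           # 3; the 0-d scalar case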
Force-pushed from 586aa9d to ed9c070
+1
sgkit/io/plink/plink_reader.py
Outdated
from ... import create_genotype_call_dataset
from ...model import DIM_SAMPLE
from ...utils import encode_array
from ..utils import dataframe_to_dict
Nit: Any reason to change these to relative imports? I think the absolute ones are more readable.
Closes #256
This also addresses sgkit-dev/sgkit-bgen#20 and https://github.com/pystatgen/sgkit/issues/90.
Switching to cbgen instead of bgen-reader essentially involved a rewrite of our wrapper, but the changes @horta made are fantastic, so it was for the best.
Other notes:
- the read_bgen results part of the docstring is left out until https://github.com/pystatgen/sgkit/issues/295 is done

TODO: