Speedup sc.get.obs_df #1499

fidelram · 2020-11-18T15:12:40Z

By using array slicing, this code speeds up sc.get.obs_df() roughly 10-fold.

%timeit sc.get.obs_df(adata, list(adata.var_names[:100]) + ['louvain'])

before:
40.6 ms ± 2.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

after:
4.45 ms ± 228 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
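A minimal sketch of where such a gain can come from, assuming a plain NumPy array and a pandas Index as stand-ins for adata.X and adata.var_names (this is not the PR's actual code, and all names below are illustrative). It contrasts one lookup per key with a single positional lookup followed by one array slice, and checks that both give the same values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = rng.random((50, 200))  # stand-in for adata.X (cells x genes)
var_names = pd.Index([f"gene{i}" for i in range(200)])  # stand-in for adata.var_names
keys = list(var_names[:100])

# One name lookup and one column extraction per key (the slower pattern).
per_key = pd.DataFrame({k: X[:, var_names.get_loc(k)] for k in keys})

# A single positional lookup for all keys, followed by one array slice.
idx = var_names.get_indexer(keys)
sliced = pd.DataFrame(X[:, idx], columns=keys)

# Both approaches should produce identical values in the same column order.
same = np.array_equal(per_key.to_numpy(), sliced.to_numpy())
```

Timing the two constructions with %timeit on real data is the way to confirm the difference; the sketch only shows that the results agree.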

ivirshup

Good idea! Could you show some of the benchmarks you got those time improvements from?

Could you do the same for var_df?

Also, I think this might need a few additional tests. I initially used the .{dim}_vector functions since I know they return numpy vectors. Can this be tested against AnnData objects with sparse X, as well as backed objects?
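To illustrate why sparse X deserves its own test, here is a small sketch, assuming SciPy's csr_matrix as the sparse container. Slicing a sparse matrix returns another sparse matrix rather than a NumPy array, so a helper that promises dense output has to convert explicitly. The helper name columns_as_dense is hypothetical and not part of scanpy.

```python
import numpy as np
from scipy import sparse

dense = np.array([[0.0, 1.0, 0.0], [2.0, 0.0, 3.0], [0.0, 0.0, 4.0]])
X_sparse = sparse.csr_matrix(dense)
cols = np.array([2, 0])


def columns_as_dense(X, cols):
    """Slice columns and always hand back a plain 2-D NumPy array (hypothetical helper)."""
    sub = X[:, cols]
    # Sparse slices stay sparse, so convert them before building a DataFrame.
    return sub.toarray() if sparse.issparse(sub) else np.asarray(sub)


from_dense = columns_as_dense(dense, cols)
from_sparse = columns_as_dense(X_sparse, cols)
```

A test along these lines would assert that the dense and sparse inputs yield identical arrays.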

scanpy/tests/test_get.py

scanpy/get.py

ivirshup · 2020-11-19T12:39:39Z

About use_raw with sc.get.var_df: I didn't include this because the semantics differ significantly from sc.get.obs_df, as raw can have a different number of variables. I think it makes more sense for a user to call sc.get.var_df(adata.raw, ...), since it's much more explicit that adata.raw.var and adata.raw.varm will be used.

fidelram · 2020-11-20T07:53:53Z

I like the idea of using adata.raw directly whenever raw data should be used. This is an elegant solution to a source of endless confusion.

Regarding var_df, I can remove use_raw. However, I consider that since this option is everywhere it should be here as well. I find it odd that it is not present. I don't see a problem with the size of the dataframe being different from what would be expected.

ivirshup · 2020-11-20T08:45:11Z

It's not just that the length is different, it's that sc.get.obs_df(adata, ["col"], use_raw=x)["col"] is the same regardless of the value of x, but it's different for var_df. I think it's easier to build code around functions with more orthogonal arguments.

"However, I consider that since this option is everywhere it should be here as well."

Could we add an example of sc.get.var_df(adata.raw, ...), leave out use_raw for now, and see if anyone complains?

I've been trying to leave out use_raw on functions where variable length matters anyways. For example: adata.var_vector.
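The orthogonality argument above can be sketched with a toy model, assuming plain pandas DataFrames as stand-ins for adata.obs, adata.var and adata.raw.var. The functions obs_col and var_col are hypothetical and only mimic the semantics being discussed; they are not scanpy functions.

```python
import pandas as pd

# Hypothetical stand-ins: the obs table is shared by adata and adata.raw,
# while raw may carry a different (here larger) set of variables.
obs = pd.DataFrame({"col": ["a", "b", "c"]}, index=["cell0", "cell1", "cell2"])
var = pd.DataFrame({"col": [1, 2]}, index=["g0", "g1"])
raw_var = pd.DataFrame({"col": [1, 2, 3, 4]}, index=["g0", "g1", "g2", "g3"])


def obs_col(use_raw: bool) -> pd.Series:
    # The same obs table is used either way, so use_raw cannot change the result.
    return obs["col"]


def var_col(use_raw: bool) -> pd.Series:
    # The var table itself is swapped, so the same key returns different data.
    return (raw_var if use_raw else var)["col"]


obs_unchanged = obs_col(True).equals(obs_col(False))
var_unchanged = var_col(True).equals(var_col(False))
```

In this toy model, obs_unchanged is true while var_unchanged is false, which is the asymmetry that makes use_raw awkward for var_df.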

fidelram · 2020-11-23T06:59:34Z

Ok, will remove that as soon as I can

ivirshup · 2020-11-23T09:59:00Z

scanpy/get.py

+        X = _get_obs_rep(adata, layer=layer)
+        matrix = X[adata.obs_names.get_indexer(obs_names), :]


Just had a thought, I don't think this will work if X is a backed dense array and obs_names isn't sorted. h5py.Datasets require that the indices be in order. This should probably get a test case.

An alternative is _get_obs_rep(adata[obs_names], ...).copy(), but this will have performance issues with raw.

Internally in anndata, I index in-order then reorder the array for h5py.Dataset objects.

How can we test for this? I added a test that reads a backed dataset from the included datasets. Why does this work? Or are the indices ordered by chance?

The indices look ordered, since you're getting them like this: list(adata.var_names[:10]).

I think list(adata.var_names[10::-1]) would be enough to make it fail. In AnnData, valid indices in random order are generated with functions in anndata/tests/helpers.py.
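A small sketch of why the first slice hides the problem while the reversed one exposes it, assuming a pandas Index with made-up gene names in place of adata.var_names. The helper is_increasing is hypothetical; it checks the property that sorted-only indexing relies on.

```python
import numpy as np
import pandas as pd

var_names = pd.Index([f"gene{i}" for i in range(20)])  # hypothetical names

# Positions for names taken in stored order versus reversed order.
ordered = var_names.get_indexer(list(var_names[:10]))
reverse = var_names.get_indexer(list(var_names[10::-1]))


def is_increasing(idx: np.ndarray) -> bool:
    """True when positions are strictly increasing (hypothetical helper)."""
    return bool(np.all(np.diff(idx) > 0))


ordered_ok = is_increasing(ordered)
reverse_ok = is_increasing(reverse)
```

Here ordered_ok is true and reverse_ok is false, so only the reversed key list exercises the unsorted case in a backed test.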

ivirshup · 2020-11-25T12:02:57Z

scanpy/get.py

@@ -253,7 +253,7 @@ def var_df(
    # add obs values
    if len(obs_names) > 0:
        X = _get_obs_rep(adata, layer=layer)
-        matrix = X[adata.obs_names.get_indexer(obs_names), :]
+        matrix = X[adata.obs.index.isin(obs_names), :]


Will these be in the right order?

Won't the values from X be in whatever order they appear there, while the values in obs_names are in the order they were passed?
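The ordering concern can be seen in a tiny sketch, assuming a NumPy array and a pandas Index as stand-ins for X and adata.obs_names (all names are illustrative). A boolean mask from isin returns rows in stored order, while get_indexer follows the order of the request.

```python
import numpy as np
import pandas as pd

obs_names = pd.Index(["c0", "c1", "c2", "c3"])  # stand-in for adata.obs_names
X = np.arange(8).reshape(4, 2)  # row i holds the values for obs_names[i]
requested = ["c3", "c0"]  # order in which the keys were passed

# Boolean mask: rows come back in stored order (c0, then c3).
by_mask = X[obs_names.isin(requested), :]

# Integer positions: rows follow the request (c3, then c0).
by_position = X[obs_names.get_indexer(requested), :]
```

With the mask, the result has to be reordered afterwards if the requested order matters.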

This has made me realize some stuff needs to get fixed in anndata, mostly around raw. In future, I think this should look like adata[obs_names].to_df(layer=layer), but we're not quite there yet.

Here's what you can do to make backed mode work for now:

    idxs = adata.obs_names.get_indexer(obs_names)
    idxs_order = np.argsort(idxs)
    matrix = X[idxs[idxs_order], :][np.argsort(idxs_order)]
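The sort-then-restore pattern can be checked in isolation with a sketch like the following. It assumes unique indices and uses a hypothetical IncreasingOnly class that mimics a backed dataset by rejecting unsorted row indices; it is not h5py itself, and take_rows is an illustrative name.

```python
import numpy as np


class IncreasingOnly:
    """Hypothetical stand-in for a backed dataset that rejects unsorted row indices."""

    def __init__(self, data: np.ndarray):
        self._data = data

    def __getitem__(self, key):
        rows, cols = key
        if np.any(np.diff(rows) <= 0):
            raise TypeError("indexing elements must be in increasing order")
        return self._data[rows, cols]


def take_rows(X, idxs: np.ndarray) -> np.ndarray:
    order = np.argsort(idxs)  # permutation that sorts the requested rows
    in_order = X[idxs[order], :]  # read the rows in increasing order
    return in_order[np.argsort(order)]  # undo the permutation to restore request order


data = np.arange(12).reshape(6, 2)
idxs = np.array([4, 0, 3])
result = take_rows(IncreasingOnly(data), idxs)
```

The second argsort inverts the first permutation, so result matches data[idxs] even though the read itself happened in sorted order.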

In the next major release of anndata I'm thinking we should export some of the utilities that are used for giving all these indexing operations a consistent interface.

Yes. That is why this should be safe for backed mode. Later in the code, before the result is returned, the columns are reordered to match the order of the keys.

fidelram · 2020-11-25T14:36:58Z

The previous test failed, but it is not clear to me why, as it passes the tests locally (anndata 0.7.5). It seems that on the Travis server, backed slicing requires integer indices and does not work with a boolean vector. I changed to sorted integers, hoping that this will solve the issue.
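One way to go from a boolean vector to sorted integer indices is sketched below, assuming NumPy and a pandas Index with made-up names. This is an illustration of the idea and not necessarily the exact change made in the commit.

```python
import numpy as np
import pandas as pd

obs_names = pd.Index(["c0", "c1", "c2", "c3", "c4"])  # hypothetical names
X = np.arange(10).reshape(5, 2)

mask = obs_names.isin(["c3", "c1"])  # boolean vector, one entry per observation
int_idx = np.flatnonzero(mask)  # positions of the True entries, in increasing order

same_rows = np.array_equal(X[mask, :], X[int_idx, :])
```

Because flatnonzero scans the mask from start to end, the resulting positions are always sorted, which suits containers that only accept increasing integer indices.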

scanpy/tests/test_get.py

ivirshup · 2020-12-03T04:51:42Z

My thinking on my change is that I would like all the code that handles backed mode to be cleanly separated. I think this should be handled more cleanly on the anndata side, and once that's been done it's easier to replace the backed mode specific code if it's all together.

ivirshup requested changes Nov 19, 2020

ivirshup reviewed Nov 19, 2020

ivirshup reviewed Nov 23, 2020

ivirshup reviewed Nov 25, 2020

ivirshup reviewed Dec 2, 2020

fidelram and others added 18 commits December 3, 2020 16:46

Speed up 10 fold sc.get.obs_df when querying var names.

bf7c103

fix case when no var_names or no obs_names are given.

99d59d3

add new cases to test and 'black' test_get.py

72e29c6

use _get_obs_rep method.

35c137e

use pandas assert to compare data frames. Fix issue with dtypes.

b9fe3b2

use pandas assert to compare data frames for var_df tests and add a…

7d486e6

… couple of new tests.

use df.join instead of pd.concat

fe613ee

add optimization to sc.get.var_df and add 'use_raw' to parameters.

e73c056

add test for backed mode vs memory.

ad5ec99

fix issue with sparse matrices. Add sparse matrix to tests

04562ad

fix indices for raw data. Update tests

e8ee47f

remove parameter use_raw from sc.get.var_df

8b8f05a

use ordered indices to allow slicing in backed mode. Update backed te…

d8b58bf

…sts.

fix col names. Add test to check col names order and content

a1fccc0

instead of a boolean for slicing use indices

6c5ce4a

Keep backed specific logic to backed

7b626b5

This moves any backed specific logic to only apply to backed.

small test change.

827f449

Add release note

bd666ea

ivirshup force-pushed the speedup_get_obs_df branch from 8480761 to bd666ea Compare December 3, 2020 05:52

ivirshup self-requested a review December 3, 2020 05:53

ivirshup approved these changes Dec 3, 2020

ivirshup added the "Maint – Backport needed" (Needs back porting for bugfix release) label Dec 3, 2020

ivirshup merged commit 35519eb into master Dec 4, 2020

ivirshup removed the "Maint – Backport needed" (Needs back porting for bugfix release) label Jan 24, 2021

ivirshup mentioned this pull request Jan 30, 2021

Allow plots to use adata.obs index as groupby #1583

Merged

flying-sheep deleted the speedup_get_obs_df branch October 30, 2023 13:24
