Speedup sc.get.obs_df #1499
Good idea! Could you show some of the benchmarks you got those time improvements from?
Could you do the same for var_df?
Also, I think this might need a few additional tests. I initially used the .{dim}_vector functions since I know they return numpy vectors. Can this be tested against AnnData objects with sparse X, as well as backed objects?
About |
I like the ideas of using Regarding |
It's not just that the length is different, is that
Could we add an example of I've been trying to leave out |
Ok, will remove that as soon as I can |
scanpy/get.py
Outdated
X = _get_obs_rep(adata, layer=layer)
matrix = X[adata.obs_names.get_indexer(obs_names), :]
Just had a thought, I don't think this will work if X is a backed dense array and obs_names isn't sorted. h5py.Datasets require that the indices be in order. This should probably get a test case.
An alternative is _get_obs_rep(adata[obs_names], ...).copy(), but this will have performance issues with raw.
Internally in anndata, I index in-order then reorder the array for h5py.Dataset objects.
How can we test for this? I added a test that reads a backed dataset from the included datasets. Why does this work? Or are the indices ordered by chance?
The indices look ordered, since you're getting them like this: list(adata.var_names[:10]).
I think list(adata.var_names[10::-1]) would be enough to make it fail. In AnnData, valid indices in random order are generated with functions in anndata/tests/helpers.py.
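The difference between the two slices can be sketched with a small pandas index. The twelve names below are made up for illustration and are not the dataset used in the PR; `is_increasing` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for adata.var_names; the names are invented.
var_names = pd.Index([f"gene{i}" for i in range(12)])

forward = list(var_names[:10])      # first ten names, in storage order
backward = list(var_names[10::-1])  # names at positions 10 down to 0

# Integer positions each list of names maps to in the index.
forward_pos = var_names.get_indexer(forward)
backward_pos = var_names.get_indexer(backward)

def is_increasing(pos):
    """True if the positions are strictly increasing."""
    return bool(np.all(np.diff(pos) > 0))

print(is_increasing(forward_pos), is_increasing(backward_pos))
```

Only the reversed slice yields positions that are out of order, so only it would exercise the case where a backed array rejects unsorted indices.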
scanpy/get.py
Outdated
@@ -253,7 +253,7 @@ def var_df(
# add obs values
if len(obs_names) > 0:
X = _get_obs_rep(adata, layer=layer)
matrix = X[adata.obs_names.get_indexer(obs_names), :]
matrix = X[adata.obs.index.isin(obs_names), :]
Will these be in the right order?
Won't the values from X be in whatever order they appear, while the values in obs_names are in the order they were passed?
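A toy example may make the ordering concern concrete. The index, matrix, and requested names below are invented for illustration; only `isin` and `get_indexer` come from the diff.

```python
import numpy as np
import pandas as pd

# Hypothetical data: four observations, two variables; row i is [2i, 2i + 1].
obs_index = pd.Index(["a", "b", "c", "d"])
X = np.arange(8).reshape(4, 2)
requested = ["d", "a", "c"]

# Boolean mask: rows come back in storage order (a, c, d).
mask_rows = X[obs_index.isin(requested), :]

# Integer positions: rows come back in the requested order (d, a, c).
indexer_rows = X[obs_index.get_indexer(requested), :]

print(mask_rows[:, 0].tolist(), indexer_rows[:, 0].tolist())
```

With the mask, the caller has to reorder the result afterwards to line it up with the requested names; with the integer positions the order already matches, but the positions may be unsorted.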
This has made me realize some stuff needs to get fixed in anndata, mostly around raw. In future, I think this should look like adata[obs_names].to_df(layer=layer), but we're not quite there yet.
Here's what you can do to make backed mode work for now:
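import numpy as np
idxs = adata.obs_names.get_indexer(obs_names)
idxs_order = np.argsort(idxs)  # permutation that sorts the positions
matrix = X[idxs[idxs_order], :][np.argsort(idxs_order)]  # read in order, then restore the requested order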
In the next major release of anndata I'm thinking we should export some of the utilities that are used for giving all these indexing operations a consistent interface.
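The sort-then-restore idea can be checked without a backed file. In this sketch, `SortedOnlyRows` is a hypothetical stand-in that mimics a dataset rejecting unsorted row indices; it is not h5py's or anndata's actual behavior or API, and `take_rows_any_order` is an illustrative helper name.

```python
import numpy as np

class SortedOnlyRows:
    """Hypothetical wrapper that only accepts strictly increasing row indices."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def __getitem__(self, key):
        rows, cols = key
        rows = np.asarray(rows)
        if np.any(np.diff(rows) <= 0):
            raise TypeError("row indices must be in increasing order")
        return self._data[rows, cols]

def take_rows_any_order(X, idxs):
    """Read rows in sorted order, then put them back in the requested order."""
    idxs = np.asarray(idxs)
    order = np.argsort(idxs)            # permutation that sorts idxs
    in_order = X[idxs[order], :]        # safe: indices are increasing here
    return in_order[np.argsort(order)]  # inverse permutation restores the order

data = np.arange(12).reshape(4, 3)
X = SortedOnlyRows(data)
idxs = np.array([3, 0, 2])
result = take_rows_any_order(X, idxs)
print(result)
```

Applying `np.argsort` to the sorting permutation gives its inverse, which is why the second indexing step undoes the sort. The sketch assumes the requested indices contain no duplicates.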
Yes. That is why this should be safe for backed mode. Later in the code, before the result is returned, the columns are reordered to match the order of keys.
The previous test failed, but it is not clear to me why, as it passes the tests locally (anndata 0.7.5). It seems that on the Travis server, backed slicing requires integer indices and will not work with a boolean vector. I changed to sorted integers, hoping that this will solve the issue.
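As a rough illustration of that switch, a boolean mask can be turned into sorted integer positions with `np.flatnonzero`. The arrays here are invented, and this is only a sketch of the idea, not necessarily the code in the commit.

```python
import numpy as np

# Hypothetical mask over five rows.
mask = np.array([True, False, True, True, False])
X = np.arange(10).reshape(5, 2)

# Positions of the True entries, always in ascending order.
int_idx = np.flatnonzero(mask)

print(int_idx.tolist())
```

On an in-memory array both forms select the same rows in the same order, so swapping one for the other should not change the result.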
My thinking on my change is that I would like all the code that handles backed mode to be cleanly separated. I think this should be handled more cleanly on the |
… couple of new tests.
This moves any backed specific logic to only apply to backed.
8480761
to
bd666ea
Compare
By using array slicing, this code speeds up sc.get.obs_df() roughly 10-fold.
before:
40.6 ms ± 2.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
after:
4.45 ms ± 228 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
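The kind of comparison behind these numbers can be sketched on a plain numpy array. This is not the benchmark from the PR: the array shape, the column selection, and both helper functions are assumptions, with `per_column` standing in for one vector lookup per key and `one_slice` for a single array slice.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((2000, 500))                      # hypothetical dense matrix
cols = rng.choice(500, size=50, replace=False)   # hypothetical requested columns

def per_column(X, cols):
    """Fetch one column at a time, then stack them."""
    return np.column_stack([X[:, c] for c in cols])

def one_slice(X, cols):
    """Fetch all requested columns with a single indexing call."""
    return X[:, cols]

t_loop = timeit.timeit(lambda: per_column(X, cols), number=20)
t_slice = timeit.timeit(lambda: one_slice(X, cols), number=20)
print(f"per column: {t_loop:.4f}s, one slice: {t_slice:.4f}s")
```

Timings will vary by machine and data layout, so the printed values are only indicative; the point of the sketch is that both functions return the same matrix.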