Skip to content

Speedup sc.get.obs_df #1499

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Dec 4, 2020
Merged

Speedup sc.get.obs_df #1499

merged 18 commits into from
Dec 4, 2020

Conversation

fidelram
Copy link
Collaborator

By using array slicing, this codes improves ~10 fold the speed of sc.get.obs_df().

%timeit sc.get.obs_df(adata, list(adata.var_names[:100]) + ['louvain'])

before:
40.6 ms ± 2.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

after:
4.45 ms ± 228 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Copy link
Member

@ivirshup ivirshup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! Could you show some of the benchmarks you got those time improvements from?

Could you do the same for var_df?

Also, I think might need a few additional tests. I initially used the .{dim}_vector functions since I know they return numpy vectors. Can this be tested against AnnData objects with sparse X, as well as backed objects?

@ivirshup
Copy link
Member

About use_raw with sc.get.var_df: I didn't include this because the semantics differ significantly from sc.get.obs_df, as raw can have a different number of variables. I think it makes more sense for a user to call sc.get.var_df(adata.raw, ...), since it's much more explicit that adata.raw.var and adata.raw.varm will be used.

@fidelram
Copy link
Collaborator Author

I like the ideas of using adata.raw when raw data wants to be used in general. This is an elegant solution to a source of endless confusion.

Regarding var_df I can remove the use_raw. However, I consider that since this option is everywhere it should be here as well. I find odd that is not present. I don't see a problem that the size of the dataframe will be different that would be expected.

@ivirshup
Copy link
Member

It's not just that the length is different, is that sc.get.obs_df(adata, ["col"], use_raw=x)["col"] is the same regardless of the value of x, but it's different for var_df. I think it's easier to build code around functions with more orthogonal arguments.

However, I consider that since this option is everywhere it should be here as well.

Could we add an example of sc.get.var_df(adata.raw, ...), leave out use_raw for now, and see if anyone complains?

I've been trying to leave out use_raw on functions where variable length matters anyways. For example: adata.var_vector.

@fidelram
Copy link
Collaborator Author

Ok, will remove that as soon as I can

scanpy/get.py Outdated
Comment on lines 255 to 256
X = _get_obs_rep(adata, layer=layer)
matrix = X[adata.obs_names.get_indexer(obs_names), :]
Copy link
Member

@ivirshup ivirshup Nov 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just had a thought, I don't think this will work if X is a backed dense array and obs_names isn't sorted. h5py.Datasets require that the indices be in order. This should probably get a test case.

An alternative is _get_obs_rep(adata[obs_names], ...).copy(), but this will have performance issues with raw.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Internally in anndata, I index in-order then reorder the array for h5py.Dataset objects.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we test for this? I added a test to read a backed dataset from the included datasets. Why this works? or by chance are the indices ordered?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The indices look ordered, since you're getting them like this: list(adata.var_names[:10]).

I think list(adata.var_names[10::-1]) would be enough to make it fail. In AnnDatavalid indices in random order are generated with functions inanndata/tests/helpers.py`.

scanpy/get.py Outdated
@@ -253,7 +253,7 @@ def var_df(
# add obs values
if len(obs_names) > 0:
X = _get_obs_rep(adata, layer=layer)
matrix = X[adata.obs_names.get_indexer(obs_names), :]
matrix = X[adata.obs.index.isin(obs_names), :]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will these be in the right order?

Won't the value from X be in whatever order the appear, while the values in obs_names are in the order they were passed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has made me realize some stuff needs to get fixed in anndata, mostly around raw. In future, I think this should look like adata[obs_names].to_df(layer=layer), but we're not quite there yet.

Here's what you can do to make backed mode work for now:

idxs = adata.obs_names.get_indexer(obs_names)
idxs_order = np.argsort(idxs_order)
matrix = X[idxs[idxs_order], :][np.argsort(idxs_order)]

In the next major release of anndata I'm thinking we should export some of the utilities that are used for giving all these indexing operations a consistent interface.

Copy link
Collaborator Author

@fidelram fidelram Nov 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. That is why this should be safe for backed mode. Later in the code, before the result is returned the columns are reorder to match the keys order.

@fidelram
Copy link
Collaborator Author

The previous test failed but is not clear to me why, as it passes the local tests (anndata 0.7.5). It seems that on travis server, backed slicing requires integer indices and will not work with a boolean vector. I changed to sorted integers hoping that this will solve the issue.

@ivirshup
Copy link
Member

ivirshup commented Dec 3, 2020

My thinking on my change is that I would like all the code that handles backed mode to be cleanly separated. I think this should be handled more cleanly on the anndata side, and once that's been done it's easier to replace the backed mode specific code if it's all together.

@ivirshup ivirshup self-requested a review December 3, 2020 05:53
@ivirshup ivirshup added the Maint – Backport needed Needs back porting for bugfix release label Dec 3, 2020
@ivirshup ivirshup merged commit 35519eb into master Dec 4, 2020
@ivirshup ivirshup removed the Maint – Backport needed Needs back porting for bugfix release label Jan 24, 2021
@flying-sheep flying-sheep deleted the speedup_get_obs_df branch October 30, 2023 13:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants