Embeddings search experimental API #1164

mlin · 2024-05-27T23:37:20Z

Adds two new functions to cellxgene_census.experimental:

find_nearest_obs uses TileDB-Vector-Search indexes of Census embeddings to find the nearest neighbors of given embedding vectors (stored in an AnnData obsm layer). See #1114 (Census cell similarity search: experimental Python API for searching given AnnData).
predict_obs_metadata uses those nearest neighbors to predict metadata attributes such as cell_type and tissue_general for the query cells. The initial implementation is deliberately naive and is meant as a starting point for experimentation. See #1115 (Census cell similarity search: experimental Python API for metadata prediction).

The TileDB-Vector-Search query speed seems to be very sensitive to S3 latency, even more so than typical Census queries. It's many times faster to run from within AWS us-west-2 than externally.
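
To make the "naive" prediction idea concrete, here is a minimal sketch of a majority vote over neighbor labels. It is not the code in this PR: the function name, the input shapes, and the example data are all assumptions for illustration, and it uses only the standard library rather than cellxgene_census.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def predict_by_majority_vote(
    neighbor_ids: Sequence[Sequence[int]],
    labels_by_id: Dict[int, str],
) -> List[Tuple[str, float]]:
    """For each query cell, return (most common neighbor label, fraction of neighbors agreeing).

    Hypothetical helper, not part of cellxgene_census.
    """
    predictions = []
    for ids in neighbor_ids:
        # Count how often each label occurs among this query cell's neighbors.
        votes = Counter(labels_by_id[i] for i in ids)
        label, count = votes.most_common(1)[0]
        predictions.append((label, count / len(ids)))
    return predictions


# Hypothetical data: two query cells with three neighbors each.
neighbor_ids = [[10, 11, 12], [12, 13, 14]]
cell_type_by_id = {10: "T cell", 11: "T cell", 12: "B cell", 13: "B cell", 14: "B cell"}
predictions = predict_by_majority_vote(neighbor_ids, cell_type_by_id)
```

The agreeing fraction is only a rough confidence signal; it ignores neighbor distances entirely.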

codecov · 2024-06-09T03:49:00Z

Codecov Report

Attention: Patch coverage is 96.82540% with 2 lines in your changes missing coverage. Please review.

Project coverage is 91.41%. Comparing base (eb8f449) to head (5c0668e).
Report is 2 commits behind head on main.

Files	Patch %	Lines
...cellxgene_census/experimental/_embedding_search.py	96.72%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1164      +/-   ##
==========================================
+ Coverage   91.26%   91.41%   +0.15%     
==========================================
  Files          80       82       +2     
  Lines        6329     6463     +134     
==========================================
+ Hits         5776     5908     +132     
- Misses        553      555       +2

Flag	Coverage Δ
unittests	`91.41% <96.82%> (+0.15%)`	⬆️


mlin · 2024-06-10T07:34:14Z

@ebezzi Putting this up for initial review since it's working well, but we still need to plan action on #1181 -- this still copies the approach of hard-coding the base S3 URI.

ivirshup · 2024-06-10T20:42:27Z

api/python/cellxgene_census/src/cellxgene_census/experimental/_embedding_search.py

+    """
+
+
+def find_nearest_obs(


On the API side, it would be nice if this could produce output that can be used directly with sklearn-style classes. For example, if this returned a KNeighborsTransformer subclass, that could be used directly with the KNeighborsClassifier and KNeighborsRegressor classes.

@ivirshup I like this idea very much, but I'm not quite sure it's workable (albeit I'm not as familiar with those APIs)...

Those scikit-learn classes seem oriented around the scenario where you're providing either all the points (in the "universe") or the complete distance matrix for them. Here we're working with a more limited view of the query points and their neighbor distances; we don't have or want the complete distance matrix, and actually we don't even have the coordinates of the neighbors immediately handy.

Do you think the shoe fits? I see there's some stuff about the "K neighbors graph" that might be relevant, but I'm not personally familiar enough to use them in an unconventional/advanced way like this.
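
One way to see the mismatch: a precomputed sparse neighbors graph in scikit-learn has one row per query point and one column per fitted point, so here the column space would be the entire Census. The sketch below flattens per-query neighbor lists into CSR-style arrays to show that layout. It is an illustration under assumptions: the helper name and the example IDs are hypothetical, and it uses plain lists rather than scikit-learn or SciPy.

```python
from typing import List, Sequence, Tuple


def neighbors_to_csr(
    neighbor_ids: Sequence[Sequence[int]],
    distances: Sequence[Sequence[float]],
) -> Tuple[List[float], List[int], List[int]]:
    """Flatten per-query neighbor lists into CSR arrays (data, indices, indptr).

    Hypothetical helper, not part of cellxgene_census or scikit-learn.
    """
    data: List[float] = []
    indices: List[int] = []
    indptr: List[int] = [0]
    for ids, dists in zip(neighbor_ids, distances):
        # Within each row, order entries by column index (the neighbor's global ID).
        for col, dist in sorted(zip(ids, dists)):
            indices.append(col)
            data.append(dist)
        indptr.append(len(indices))
    return data, indices, indptr


# Hypothetical data: two query cells, two neighbors each, identified by global IDs.
data, indices, indptr = neighbors_to_csr([[42, 7], [7, 99]], [[0.5, 0.1], [0.3, 0.2]])
```

The rows cover only the query cells, but the number of columns would have to equal the number of cells in the embedding, which is why the fit does not look natural.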

ivirshup · 2024-06-10T20:48:30Z

api/python/cellxgene_census/src/cellxgene_census/experimental/_embedding_search.py

+
+
+def find_nearest_obs(
+    embedding_metadata: Dict[str, Any],


Why does this use a different way to specify an embedding than get_embedding does?

get_embedding is a little low level in that it wants the full URI to the embeddings TileDB array, which isn't actually needed to find the index. embedding_metadata is the information returned from get_embedding_metadata() which seems like the appropriate level (especially in view of #1181 wherein we will actually put the relative URIs to the index arrays in there), although of course it'd be nice if it were more typesafe.

@mlin can we change it to?

embedding_name: str, organism: str, census_version: str

I think that provides an easier entry point to users and it aligns to get_embedding_metadata_by_name

This doesn't uniquely refer to an embedding, right? It refers to the newest embedding with this name?

I think we need to be able to non-ambiguously specify an embedding here since it's really important that the embedding being queried matches what we projected our data into.

At the moment the combination of the three pieces does match to a unique embedding matrix, unless there is an edge-case I'm not catching.

The documentation for get_embedding_metadata_by_name says that it can be ambiguous and returns the latest one.

I think the unique value for embeddings is in the "id" key of the embedding metadata, while "embedding_name" just says what method was used.

cellxgene-census/api/python/cellxgene_census/src/cellxgene_census/experimental/_embedding.py

Lines 171 to 172 in 7612fb3

"""Return metadata for a specific embedding. If more embeddings match the query parameters,
the most recent one will be returned.

Right, "id" is unique, but the IDs look like "cxg-czi-\d+" or "cxg-contributed-\d+", so they are not human-friendly.

The reason for "If more embeddings match the query parameters, the most recent one will be returned." is that we ended up having two versions of scVI for the 2023-12-15 Census version, the second effectively replacing the first (the Yosef lab advised that the first one needed some fixing).

Refreshing my memory, that is actually the only edge case we have, and the intent is for our users to use the latest version. I don't anticipate having multiple versions of an embedding with the same name unless one is the replacement of the other.
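
The two ways of identifying an embedding discussed in this thread can be sketched as follows. This is an assumption-laden illustration, not the library's code: only the "id" and "embedding_name" keys come from the discussion, while the other keys, the example records, and both helper names are hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical metadata records. Only "id" and "embedding_name" are named in this thread;
# the remaining keys and all values are assumptions for illustration.
RECORDS: List[Dict[str, Any]] = [
    {"id": "cxg-czi-1", "embedding_name": "scvi", "organism": "homo_sapiens",
     "census_version": "2023-12-15", "submission_date": "2023-11-15"},
    {"id": "cxg-czi-4", "embedding_name": "scvi", "organism": "homo_sapiens",
     "census_version": "2023-12-15", "submission_date": "2024-01-11"},
]


def latest_by_name(records, embedding_name: str, organism: str, census_version: str) -> Dict[str, Any]:
    """Return the most recently submitted record matching name, organism and version."""
    matches = [
        r for r in records
        if (r["embedding_name"], r["organism"], r["census_version"])
        == (embedding_name, organism, census_version)
    ]
    if not matches:
        raise ValueError("no matching embedding")
    # ISO-formatted dates sort correctly as strings.
    return max(matches, key=lambda r: r["submission_date"])


def by_id(records, embedding_id: str) -> Dict[str, Any]:
    """Return the single record with this unique ID."""
    return next(r for r in records if r["id"] == embedding_id)


latest = latest_by_name(RECORDS, "scvi", "homo_sapiens", "2023-12-15")
first = by_id(RECORDS, "cxg-czi-1")
```

With these records, the name-based lookup silently prefers the replacement embedding, while the ID-based lookup can still reach the superseded one.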

pablo-gar

Looks good to me except one comment in API signature

pablo-gar · 2024-06-24T17:38:17Z

api/python/cellxgene_census/src/cellxgene_census/experimental/_embedding_search.py

+
+
+def find_nearest_obs(
+    embedding_metadata: Dict[str, Any],


@mlin can we change it to?

embedding_name: str, organism: str, census_version: str

I think that provides an easier entry point to users and it aligns to get_embedding_metadata_by_name

mlin · 2024-06-27T17:04:20Z

api/python/cellxgene_census/src/cellxgene_census/experimental/_embedding_search.py

+def _resolve_embedding_index(
+    embedding_metadata: Dict[str, Any],
+    mirror: Optional[str] = None,
+) -> Optional[Tuple[str, str]]:


@ebezzi new index resolution method here
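
As a rough illustration of resolving an index location through a mirrors document, consider the sketch below. It is not the implementation of _resolve_embedding_index: the shape of the mirrors dictionary, the "index_relative_uri" key, the bucket name, and the reading of the returned tuple as (URI, region) are all assumptions.

```python
from typing import Any, Dict, Optional, Tuple

# Hypothetical mirrors document; the real JSON layout may differ.
MIRRORS: Dict[str, Any] = {
    "default": "mirror-a",
    "mirror-a": {"base_uri": "s3://example-bucket-a/", "region": "us-west-2"},
}


def resolve_index_sketch(
    embedding_metadata: Dict[str, Any],
    mirrors: Dict[str, Any],
    mirror: Optional[str] = None,
) -> Optional[Tuple[str, str]]:
    """Return (index URI, region), or None if the metadata lists no index.

    Hypothetical helper; "index_relative_uri" is an assumed key.
    """
    relative_uri = embedding_metadata.get("index_relative_uri")
    if relative_uri is None:
        return None
    # Fall back to the default mirror when the caller does not name one.
    info = mirrors[mirror or mirrors["default"]]
    uri = info["base_uri"].rstrip("/") + "/" + relative_uri.lstrip("/")
    return uri, info["region"]


resolved = resolve_index_sketch({"index_relative_uri": "indexes/cxg-czi-1"}, MIRRORS)
missing = resolve_index_sketch({}, MIRRORS)
```

Keeping only relative URIs in the metadata means the base location lives in one place instead of being hard-coded by each caller.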

mlin · 2024-06-27T17:06:48Z

@ebezzi @pablo-gar @ivirshup Updated this to resolve indexes through the mirrors/contributions JSON and to remove the need for the caller to use get_embedding_metadata_by_name() on their own. Please take another pass, including the prior discussion. Unfortunately we currently have known CI issues, but I've run the new test cases locally. 🙏

mlin · 2024-08-08T21:49:58Z

@ivirshup I split out the perf optimization to #1257 since I was still getting an error, will write more there -- hope you don't mind, it's only because I need to triage desperately right now!

mlin mentioned this pull request May 30, 2024

Census cell similarity search: experimental Python API for searching given AnnData #1114

Closed

squash for PR

639e64c

mlin force-pushed the mlin/similarity-search-api branch from 6bf1181 to 639e64c Compare June 10, 2024 01:18

mlin marked this pull request as ready for review June 10, 2024 07:32

mlin requested review from ebezzi, ivirshup and pablo-gar June 10, 2024 07:33

mlin requested a review from prathapsridharan June 10, 2024 17:43

ivirshup reviewed Jun 10, 2024

View reviewed changes

mlin added 4 commits June 16, 2024 22:27

use DEFAULT_TILEDB_CONFIGURATION

8088f9e

workaround

fc91d2d

workaround

c8bb01a

fix

e73102b

mlin mentioned this pull request Jun 18, 2024

[python] fix mypy complaint in experimental/_embedding.py #1197

Merged

Merge remote-tracking branch 'origin/main' into mlin/similarity-searc…

fcc05b4

…h-api

pablo-gar requested changes Jun 24, 2024

View reviewed changes

mlin added 3 commits June 26, 2024 13:07

resolve indexes through JSONs

ca8d44e

lint

ee6c184

API refactoring

a1e4daa

mlin commented Jun 27, 2024

View reviewed changes

pablo-gar approved these changes Jun 27, 2024

View reviewed changes

mlin added 3 commits June 28, 2024 10:22

Merge remote-tracking branch 'origin/main' into mlin/similarity-searc…

874b2eb

…h-api

Merge remote-tracking branch 'origin/main' into mlin/similarity-searc…

c33df3f

…h-api

Merge remote-tracking branch 'origin/main' into mlin/similarity-searc…

b7d79f5

…h-api

ivirshup reviewed Jul 15, 2024

View reviewed changes

api/python/cellxgene_census/src/cellxgene_census/experimental/_embedding_search.py Show resolved Hide resolved

mlin and others added 2 commits July 15, 2024 16:33

Update api/python/cellxgene_census/src/cellxgene_census/experimental/…

b35048c

…_embedding_search.py Co-authored-by: Isaac Virshup <[email protected]>

fixups

eaa8868

mlin commented Jul 16, 2024

View reviewed changes

api/python/cellxgene_census/src/cellxgene_census/experimental/_embedding_search.py Outdated Show resolved Hide resolved

mlin and others added 5 commits August 5, 2024 18:01

Update api/python/cellxgene_census/src/cellxgene_census/experimental/…

b7179b8

…_embedding_search.py Co-authored-by: Isaac Virshup <[email protected]>

Merge remote-tracking branch 'origin/main' into mlin/similarity-searc…

aad1f0d

…h-api

Revert "Update api/python/cellxgene_census/src/cellxgene_census/exper…

ee2c476

…imental/_embedding_search.py" This reverts commit b7179b8.

Revert "fixups"

a66d366

This reverts commit eaa8868.

Revert "Update api/python/cellxgene_census/src/cellxgene_census/exper…

5c0668e

…imental/_embedding_search.py" This reverts commit b35048c.

mlin mentioned this pull request Aug 8, 2024

[python] similarity search API: optimize predict_obs_metadata #1257

Merged

mlin merged commit 6cf1f8a into main Aug 8, 2024

mlin deleted the mlin/similarity-search-api branch August 8, 2024 21:55