✨Add Schema based CELLxGENE Curator & ✨ enable curation of Feature typed indices #2878

Draft: wants to merge 21 commits into base: main

Zethson
Member

@Zethson Zethson commented Jul 2, 2025

Partially fixes #2585

  • Adds CELLxGENE Schema 5.3.0
  • Adds support for curating specific indices of DataFrames
  • Adds a Schema based CELLxGENE curator
  • Fixes a very nasty bug where Schemas that had already been saved were returned with a changed _aux, which caused important fields such as the index (via _index_feature_uid) to be removed
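
To illustrate the class of bug described in the last bullet, here is a minimal sketch. It does not use lamindb: `SchemaRecord`, `REGISTRY`, `save`, and both lookup functions are hypothetical stand-ins, and the flat layout of `_aux` is an assumption. Only the names `_aux` and `_index_feature_uid` come from the description above. The sketch assumes that constructing a schema that already exists returns the stored record, and that the faulty path overwrote the stored `_aux` with that of the new, incomplete instance.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaRecord:
    """Hypothetical stand-in for a saved schema; not the lamindb model."""
    hash: str
    _aux: dict = field(default_factory=dict)


# Hypothetical in-memory "database" keyed by content hash.
REGISTRY = {}


def save(record):
    REGISTRY[record.hash] = record
    return record


def get_existing_buggy(candidate):
    """Return the stored record, but clobber its _aux with the candidate's."""
    existing = REGISTRY[candidate.hash]
    existing._aux = candidate._aux  # the stored index reference is lost here
    return existing


def get_existing_fixed(candidate):
    """Return the stored record untouched."""
    return REGISTRY[candidate.hash]
```

With the fixed lookup, a record saved with `{"_index_feature_uid": ...}` in `_aux` still carries that key when it is returned for a second, identical schema; with the buggy lookup the key is gone.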

codecov bot commented Jul 2, 2025

Codecov Report

Attention: Patch coverage is 56.16438% with 32 lines in your changes missing coverage. Please review.

Project coverage is 91.51%. Comparing base (8265549) to head (18bf0ae).
Report is 56 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| lamindb/curators/core.py | 28.57% | 15 Missing ⚠️ |
| lamindb/curators/_cellxgene_schemas/__init__.py | 52.94% | 8 Missing ⚠️ |
| lamindb/models/feature.py | 68.75% | 5 Missing ⚠️ |
| lamindb/models/sqlrecord.py | 73.33% | 4 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2878      +/-   ##
==========================================
- Coverage   91.84%   91.51%   -0.34%     
==========================================
  Files          69       71       +2     
  Lines       10807    11140     +333     
==========================================
+ Hits         9926    10195     +269     
- Misses        881      945      +64     

github-actions bot commented Jul 2, 2025

Deployment URL: https://76a3a8d2.lamindb.pages.dev

Zethson added 5 commits July 2, 2025 11:07
Zethson added 2 commits July 2, 2025 16:36
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
@Zethson Zethson changed the title from "✨Add Schema based CELLxGENE Curator" to "✨Add Schema based CELLxGENE Curator & ✨ enable curation of Feature typed indices" on Jul 3, 2025
@@ -8,6 +8,7 @@
MuDataCurator
SpatialDataCurator
TiledbsomaExperimentCurator
CxGCurator
Member

I don't think we want a Curator anymore, unless I'm wrong. The idea would be to just provide a schema, no? There wouldn't be any specific curator code for CxG, would it?

Member Author

@Zethson Zethson Jul 3, 2025

No, the Curator needs custom code for missing ontology values (such as "normal"), missing defaults, handling names vs ontology_id, ...

I currently don't think that we can hack all of that into a Schema (some things we can).
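
As a rough illustration of the kind of custom handling listed above (control terms such as "normal", defaults for missing values, names vs. ontology IDs), here is a minimal sketch in plain Python. It is not the curator code in this PR. The ontology IDs are placeholders, the `"unknown"` default is an assumption, and `curate_column` is a hypothetical helper.

```python
# Placeholder ontology: ontology_id -> name. Real IDs would come from a registry.
ONTOLOGY = {"EX:0001": "disease a", "EX:0002": "disease b"}
# Values that are valid although they are not ontology terms.
CONTROL_TERMS = {"normal"}
# Assumed default for missing values.
DEFAULT = "unknown"


def curate_column(values, ontology=ONTOLOGY, control_terms=CONTROL_TERMS, default=DEFAULT):
    """Map names to ontology IDs, fill defaults, and collect invalid values."""
    name_to_id = {name: oid for oid, name in ontology.items()}
    curated, invalid = [], []
    for value in values:
        if value is None or value == "":
            curated.append(default)  # missing value: fill the default
        elif value in control_terms or value in ontology:
            curated.append(value)  # already a control term or an ontology ID
        elif value in name_to_id:
            curated.append(name_to_id[value])  # name: translate to its ID
        else:
            curated.append(value)  # keep as-is, but report it
            invalid.append(value)
    return curated, invalid
```

Calling `curate_column(["disease a", "normal", None, "typo"])` translates the name, keeps the control term, fills the default, and reports `"typo"` as invalid. A pure schema can express the set of allowed values, but the translation and default-filling steps are the parts that need code.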

I'll first add tests for the low-level functionality that I changed/fixed and then I'll open a Slack thread to discuss this topic. Yesterday, I had actually moved a CxG generator function to lamindb.examples.anndata, only to move it back. I understand what you're looking for.

Member Author

@Zethson Zethson Jul 3, 2025

The function that gets the Schema can probably do some of that. I'll revisit this ASAP but first tests!

Member

@sunnyosun sunnyosun Jul 3, 2025

(Sorry that comment was from me)
It should still not have its own Curator code, because that code just prepares the instance. We need a different strategy to add control terms, or we can just guide users to prepare the instance with a proper notebook.

Member

We want to have a few schemas available, like cxg and perturbation. Maybe even under lamindb.schemas or something.

Member

I think it is legitimate to have some logic in code and some in a schema. Code and schema are quite different. Evidently, the former can't easily be queried because it's not structured, and the latter can. Not everything has to fit into a schema. Some could -- as Sunny says -- be part of a "curation notebook", which is likely better as a curation script, which is likely better as some kind of configurable thing in lamindb (a "curator", except that this doesn't have much to do with what we currently brand as a curator, which is just a thing that validates the schema).

@sunnyosun
Member

Is it still possible to move this fix to a new PR?

"Fixes a very nasty bug where Schemas that were already saved earlier were returned with changed _aux which led to import field such as the index (via _index_feature_uid) got removed"

I think it will make both the fix discussion and the cxg schema discussion more organized in their own PRs.

You can re-point this PR to the fix PR.

@@ -41,3 +41,14 @@ schema_version,entity,organism,source,version
5.2.0,Tissue,all,uberon,2024-08-07
5.2.0,Gene,human,ensembl,release-110
5.2.0,Gene,mouse,ensembl,release-110
5.3.0,CellType,all,cl,2025-02-13
Member

I still think the ontology version information should be parsed from https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/5.3.0/schema.md directly. Otherwise, we are duplicating the information and also increasing the maintenance burden of this file.

Member Author

I don't know how to do that reliably... This is a rather free-style doc.

Member

Could you please look into it? I mean, single-cell-curation must be using this information for their validator somehow. I think this section is also a rather structured table:
Screenshot 2025-07-03 at 17 28 52

But single-cell-curation should have relevant code already.

Member Author

@Zethson Zethson Jul 3, 2025

I found https://github.com/chanzuckerberg/cellxgene-ontology-guide/blob/main/ontology-assets/ontology_info.json but it's only for 5.3.0+. It also lacks genes, just like the table you've shown. They have a more comprehensive setup to get gene versions.

I asked whether they'd be willing to add gene versions.

Allowing any version to be pulled also has a disadvantage: we have to ensure that all required versions are available in Bionty. Yes, people can technically pull any version, but not for all ontologies.
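
For reference, the pinned-version table under discussion can be queried with a few lines of standard-library Python. This is only a sketch: the rows are the ones visible in the diff hunk above, the column names come from its header, and `pinned_sources` and `missing_versions` are hypothetical helpers rather than functions in lamindb or Bionty. The `available` argument stands in for whatever set of (source, version) pairs a registry actually offers.

```python
import csv
import io

# Rows copied from the diff hunk in this thread; the real file has more entries.
PINNED_VERSIONS_CSV = """\
schema_version,entity,organism,source,version
5.2.0,Tissue,all,uberon,2024-08-07
5.2.0,Gene,human,ensembl,release-110
5.2.0,Gene,mouse,ensembl,release-110
5.3.0,CellType,all,cl,2025-02-13
"""


def pinned_sources(schema_version, csv_text=PINNED_VERSIONS_CSV):
    """Map (entity, organism) to (source, version) for one schema version."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        (row["entity"], row["organism"]): (row["source"], row["version"])
        for row in reader
        if row["schema_version"] == schema_version
    }


def missing_versions(schema_version, available):
    """Return pinned (source, version) pairs that are absent from `available`."""
    required = set(pinned_sources(schema_version).values())
    return sorted(required - set(available))
```

A check like `missing_versions("5.2.0", available)` makes the availability concern explicit: any pair it returns would have to be added to the registry before that schema version can be validated.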

Zethson added 3 commits July 3, 2025 17:09
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Signed-off-by: Lukas Heumos <[email protected]>
Zethson added 2 commits July 4, 2025 14:54
Signed-off-by: Lukas Heumos <[email protected]>
Development

Successfully merging this pull request may close these issues.

✨ Rewrite CellxgeneCatManager and PertCatManager using schemas
3 participants