✨Add Schema based CELLxGENE Curator & ✨ enable curation of Feature typed indices #2878

Draft
wants to merge 23 commits into base: main

23 commits
42e7703
✨ Current status
Zethson Jun 25, 2025
24d0cfd
🎨 Current status
Zethson Jun 25, 2025
0082beb
Merge branch 'main' of https://github.com/laminlabs/lamindb into feat…
Zethson Jul 1, 2025
b648c82
🎨 WIP
Zethson Jul 1, 2025
6686ee2
Merge branch 'main' of https://github.com/laminlabs/lamindb into feat…
Zethson Jul 2, 2025
0c58eee
🎨 Iterate Sources
Zethson Jul 2, 2025
8531645
🎨 Simplify Schema generation
Zethson Jul 2, 2025
7074712
🎨 Add minimal set to schema describe
Zethson Jul 2, 2025
562d8ab
🎨 Maybe fix
Zethson Jul 2, 2025
e96cf3e
🎨 Backwards compatibility
Zethson Jul 2, 2025
3637584
🎨 Backwards compatibility 2
Zethson Jul 2, 2025
1d468b2
🎨 Curate adata.var columns
Zethson Jul 2, 2025
c636a51
🎨 Fix test
Zethson Jul 3, 2025
51bd805
Merge branch 'main' of https://github.com/laminlabs/lamindb into feat…
Zethson Jul 3, 2025
5e1a7f2
🎨 Enable index curation & fix schema _aux bug
Zethson Jul 3, 2025
78490c7
Merge branch 'main' of https://github.com/laminlabs/lamindb into feat…
Zethson Jul 3, 2025
c1d0520
🎨 Fix test
Zethson Jul 3, 2025
b3941d0
🎨 Fix test
Zethson Jul 3, 2025
18bf0ae
🎨 Fix test
Zethson Jul 3, 2025
a6dadf9
🎨 Add feature test
Zethson Jul 4, 2025
aecf9a7
🎨 Add more cat_filters tests
Zethson Jul 4, 2025
f2d6fa0
Merge branch 'main' of https://github.com/laminlabs/lamindb into feat…
Zethson Jul 7, 2025
603c0e4
🎨 Main _aux fix
Zethson Jul 7, 2025
5 changes: 3 additions & 2 deletions lamindb/curators/__init__.py
@@ -8,6 +8,7 @@
MuDataCurator
SpatialDataCurator
TiledbsomaExperimentCurator
CxGCurator
Member

I don't think we want a Curator anymore, unless I'm wrong. The idea would be to just provide a schema, no? There wouldn't be any specific curator code for CxG, would there?

Member Author

@Zethson Zethson Jul 3, 2025

No, the Curator needs custom code for missing ontology values (such as "normal"), missing defaults, handling names vs ontology_id, ...

I currently don't think that we can hack all of that into a Schema (some things we can).

I'll first add tests for the low-level functionality that I changed/fixed, and then I'll open a Slack thread to discuss this topic. Yesterday, I had actually moved a CxG generator function to lamindb.examples.anndata, only to move it back. I understand what you're looking for.
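
To make the "missing defaults" point concrete, here is a minimal sketch of what filling defaults could look like. It is not lamindb's implementation: the function name `add_defaults`, the plain dict-of-lists stand-in for `adata.obs`, and the example values are all assumptions chosen for illustration.

```python
from typing import Optional

# Stand-in for `adata.obs`: column name -> list of values (None = missing).
Obs = dict[str, list[Optional[str]]]


def add_defaults(obs: Obs, defaults: dict[str, str], n_rows: int) -> Obs:
    """Return a copy of `obs` with missing columns and missing values filled."""
    filled = {column: list(values) for column, values in obs.items()}
    for column, default in defaults.items():
        if column not in filled:
            # The column is absent: create it with the default in every row.
            filled[column] = [default] * n_rows
        else:
            # The column exists: only replace missing (None) entries.
            filled[column] = [default if v is None else v for v in filled[column]]
    return filled


# Hypothetical example data and defaults.
obs = {"cell_type": ["T cell", None], "tissue": ["blood", "lung"]}
filled = add_defaults(obs, {"cell_type": "unknown", "disease": "normal"}, n_rows=2)
```

This kind of step mutates data rather than describing it, which is one reason it is hard to express purely as a schema.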

Member Author

@Zethson Zethson Jul 3, 2025

The function that gets the Schema can probably do some of that. I'll revisit this ASAP but first tests!

Member

@sunnyosun sunnyosun Jul 3, 2025

(Sorry that comment was from me)
It should still not have its own Curator code, because that code only prepares the instance. We need a different strategy to add control terms, or we can just guide users to prepare the instance with a proper notebook.
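
As an illustration of why control terms need to be added somewhere, the following sketch validates values against an ontology's terms plus a separate set of control terms. The function `non_validated` and the tiny term sets are hypothetical stand-ins, not lamindb or bionty APIs; only the control value "normal" comes from this PR.

```python
def non_validated(values, ontology_terms, control_terms=frozenset()):
    """Return the values that are in neither the ontology nor the control terms."""
    allowed = set(ontology_terms) | set(control_terms)
    return [value for value in values if value not in allowed]


# Toy stand-in for a disease ontology; real ontologies are far larger.
disease_terms = {"asthma", "influenza"}
observed = ["asthma", "normal"]

# Without control terms, "normal" fails validation.
without_controls = non_validated(observed, disease_terms)
# Once the control term is registered, everything validates.
with_controls = non_validated(observed, disease_terms, control_terms={"normal"})
```

Whether the control terms are registered by curator code, by a schema helper, or by a preparation notebook is exactly the open design question in this thread.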

Member

We want to have a few schemas available, like cxg and perturbation. Maybe even under lamindb.schemas or something.

Member

I think it is legitimate to have some logic in code and some in a schema. Code and schema are quite different: the former can't easily be queried because it isn't structured, whereas the latter can. Not everything has to fit into a schema. Some of it could, as Sunny says, be part of a "curation notebook", which is likely better as a curation script, which in turn is likely better as some kind of configurable thing in lamindb (a "curator", except that this doesn't have much to do with what we currently brand as a curator, which is just a thing that validates the schema).


Modules.

@@ -19,23 +20,23 @@
"""

from ._legacy import ( # backward compat
CellxGeneAnnDataCatManager,
PertAnnDataCatManager,
)
from .core import (
AnnDataCurator,
CxGCurator,
DataFrameCurator,
MuDataCurator,
SpatialDataCurator,
TiledbsomaExperimentCurator,
)

__all__ = [
"CellxGeneAnnDataCatManager",
"PertAnnDataCatManager",
"AnnDataCurator",
"DataFrameCurator",
"MuDataCurator",
"SpatialDataCurator",
"TiledbsomaExperimentCurator",
"CxGCurator",
]
86 changes: 76 additions & 10 deletions lamindb/curators/_cellxgene_schemas/__init__.py
@@ -1,11 +1,16 @@
from typing import Literal

import pandas as pd
from lamin_utils import logger
from lamindb_setup.core.upath import UPath

from lamindb.base.types import FieldAttr
-from lamindb.models import SQLRecord, ULabel
+from lamindb.models import Feature, Schema, SQLRecord, ULabel
from lamindb.models._from_values import _format_values

CELLxGENESchemaVersions = Literal["4.0.0", "5.0.0", "5.1.0", "5.2.0", "5.3.0"]

# These names are reserved by the CELLxGENE Schema and are not allowed to be used as obs columns
RESERVED_NAMES = {
"ethnicity",
"ethnicity_ontology_term_id",
@@ -35,7 +40,6 @@ def _get_cxg_categoricals() -> dict[str, FieldAttr]:
         "development_stage_ontology_term_id": bt.DevelopmentalStage.ontology_id,
         "disease": bt.Disease.name,
         "disease_ontology_term_id": bt.Disease.ontology_id,
-        # "donor_id": "str", via pandera
         "self_reported_ethnicity": bt.Ethnicity.name,
         "self_reported_ethnicity_ontology_term_id": bt.Ethnicity.ontology_id,
         "sex": bt.Phenotype.name,
@@ -46,6 +50,7 @@ def _get_cxg_categoricals() -> dict[str, FieldAttr]:
         "tissue_type": ULabel.name,
         "organism": bt.Organism.name,
         "organism_ontology_term_id": bt.Organism.ontology_id,
+        "donor_id": str,
     }


@@ -110,10 +115,16 @@ def _fetch_bionty_source(entity: str, organism: str) -> SQLRecord | None: # typ
             name=row.source,
             version=row.version,
         ).one_or_none()
+        # if the source was not found, we register it from bionty-assets
         if source is None:
-            logger.error(
-                f"Could not find source: {entity}\n"
-                " → consider running `bionty.core.sync_public_sources()`"
+            source = getattr(bt, entity).add_source(
+                bt.Source.using("laminlabs/bionty-assets")
+                .get(
+                    entity=f"bionty.{entity}",
+                    version=row.version,
+                    organism=row.organism,
+                )
+                .save()
             )
         return source

@@ -127,16 +138,23 @@ def _fetch_bionty_source(entity: str, organism: str) -> SQLRecord | None: # typ

     key_to_source: dict[str, bt.Source] = {}
     for key, field in categoricals.items():
-        if field.field.model.__get_module_name__() == "bionty":
-            entity = field.field.model.__name__
-            key_to_source[key] = _fetch_bionty_source(entity, organism)
+        if hasattr(field, "field"):
+            if field.field.model.__get_module_name__() == "bionty":
+                entity = field.field.model.__name__
+                key_to_source[key] = _fetch_bionty_source(entity, organism)
+        else:
+            key_to_source[key] = field
     key_to_source["var_index"] = _fetch_bionty_source("Gene", organism)

     return key_to_source


 def _init_categoricals_additional_values() -> None:
-    """Add additional values from CellxGene schema to the DB."""
+    """Add additional values from CellxGene schema to the instance.
+
+    CELLxGENE schemas use specific (control) values that are not available
+    in the ontologies. Therefore, we save them to the instance.
+    """
     import bionty as bt

     # Note: if you add another control below, be mindful to change the if condition that
@@ -150,7 +168,7 @@ def _init_categoricals_additional_values() -> None:
         # "normal" in Disease
         normal = bt.Phenotype.from_source(
             ontology_id="PATO:0000461",
-            source=bt.Source.get(name="pato", version="2024-03-28"),
+            source=bt.Source.get(name="pato", currently_used=True),
         )
         bt.Disease(
             uid=normal.uid,
@@ -196,3 +214,51 @@ def _init_categoricals_additional_values() -> None:
             ULabel(
                 name=name, type=suspension_type, description="From CellxGene schema."
             ).save()
+
+
+def _get_cxg_schema(
+    schema_version: CELLxGENESchemaVersions, sources: dict[str, SQLRecord]
+) -> Schema:
+    """Generates a `~lamindb.Schema` for a specific CELLxGENE schema version."""
+    import bionty as bt
+
+    categoricals = _get_cxg_categoricals()
+
+    var_schema = Schema(
+        name=f"CELLxGENE var of version {schema_version}",
+        index=Feature(
+            name="var_index",
+            dtype=bt.Gene.ensembl_gene_id,
+            cat_filters={"source": sources["var_index"]},
+        ).save(),
+        itype=Feature,
+        dtype="DataFrame",
+        minimal_set=True,
+        coerce_dtype=True,
+    ).save()
+
+    obs_features = [
+        Feature(
+            name=field, dtype=categoricals[field], cat_filters={"source": source}
+        ).save()
+        for field, source in sources.items()
+        if field != "var_index"
+    ]
+
+    obs_schema = Schema(
+        name=f"CELLxGENE obs of version {schema_version}",
+        features=obs_features,
+        otype="DataFrame",
+        minimal_set=True,
+        coerce_dtype=True,
+    ).save()
+
+    full_cxg_schema = Schema(
+        name=f"CELLxGENE AnnData schema of version {schema_version}",
+        otype="AnnData",
+        minimal_set=True,
+        coerce_dtype=True,
+        slots={"var": var_schema, "obs": obs_schema},
+    ).save()
+
+    return full_cxg_schema
11 changes: 11 additions & 0 deletions lamindb/curators/_cellxgene_schemas/schema_versions.csv
@@ -41,3 +41,14 @@ schema_version,entity,organism,source,version
5.2.0,Tissue,all,uberon,2024-08-07
5.2.0,Gene,human,ensembl,release-110
5.2.0,Gene,mouse,ensembl,release-110
5.3.0,CellType,all,cl,2025-02-13
Member

I still think the ontology version information should be parsed directly from https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/5.3.0/schema.md. Otherwise, we are duplicating the information and increasing the maintenance burden of this file.

Member Author

I don't know how to do that reliably... This is a rather free-form document.

Member

Could you please look into it? I mean, single-cell-curation must be using this information for their validator somehow. I think this section is also a rather structured table:
[Screenshot 2025-07-03 at 17 28 52]

But single-cell-curation should have relevant code already.

Member Author

@Zethson Zethson Jul 3, 2025

I found https://github.com/chanzuckerberg/cellxgene-ontology-guide/blob/main/ontology-assets/ontology_info.json but it only covers 5.3.0+. It also lacks genes, just like the table you've shown. They have a more comprehensive setup to get gene versions.

I asked whether they'd be willing to add gene versions.

A disadvantage of allowing any version to be pulled is that we have to ensure that all required versions are available in Bionty. Yes, people can technically pull any version, but not for all ontologies.

5.3.0,ExperimentalFactor,all,efo,3.75.0
5.3.0,Ethnicity,human,hancestro,3.0
5.3.0,DevelopmentalStage,human,hsapdv,2025-01-23
5.3.0,DevelopmentalStage,mouse,mmusdv,2025-01-23
5.3.0,Disease,all,mondo,2025-02-04
5.3.0,Organism,all,ncbitaxon,2024-11-25
5.3.0,Phenotype,all,pato,2025-02-01
5.3.0,Tissue,all,uberon,2025-01-15
5.3.0,Gene,human,ensembl,release-113
5.3.0,Gene,mouse,ensembl,release-113
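
For readers unfamiliar with how a pinned-versions table like this is consumed, here is a minimal sketch using only the standard library. The helper `pinned_sources` is hypothetical and not part of lamindb; the column names and the rows are copied from the CSV above (only a subset of the 5.3.0 rows is included), and the assumption is that rows with organism `all` apply to every organism.

```python
import csv
import io

# A subset of the 5.3.0 rows from schema_versions.csv.
SCHEMA_VERSIONS_CSV = """\
schema_version,entity,organism,source,version
5.3.0,CellType,all,cl,2025-02-13
5.3.0,DevelopmentalStage,human,hsapdv,2025-01-23
5.3.0,DevelopmentalStage,mouse,mmusdv,2025-01-23
5.3.0,Gene,human,ensembl,release-113
5.3.0,Gene,mouse,ensembl,release-113
"""


def pinned_sources(csv_text, schema_version, organism):
    """Map entity -> (source, version) for one schema version and organism."""
    pinned = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["schema_version"] != schema_version:
            continue
        # Assumption: rows marked "all" apply to every organism.
        if row["organism"] in ("all", organism):
            pinned[row["entity"]] = (row["source"], row["version"])
    return pinned


human = pinned_sources(SCHEMA_VERSIONS_CSV, "5.3.0", "human")
mouse = pinned_sources(SCHEMA_VERSIONS_CSV, "5.3.0", "mouse")
```

The same lookup shape would work if the rows were parsed from an upstream file instead of this CSV; the review discussion above is about where the rows should come from, not how they are queried.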
8 changes: 7 additions & 1 deletion lamindb/curators/_legacy.py
@@ -1318,7 +1318,7 @@ def save_artifact(
 class CellxGeneAnnDataCatManager(AnnDataCatManager):
     """Categorical manager for `AnnData` respecting the CELLxGENE schema.

-    This will be superceded by a schema-based curation flow.
+    This will be superseded by a schema-based curation flow.
     """

     cxg_categoricals_defaults = {
@@ -1369,6 +1369,9 @@ def __init__(
         # Filter categoricals based on what's present in adata
         if categoricals is None:
             categoricals = self._get_cxg_categoricals()
+
+        # backwards compatibility
+        categoricals.pop("donor_id", None)
         categoricals = _restrict_obs_fields(adata.obs, categoricals)

         # Configure sources
@@ -1703,6 +1706,9 @@ def _configure_categoricals(self, adata: ad.AnnData):
             "pert_target": "unknown",
         }

+        # backwards compatibility
+        categoricals.pop("donor_id", None)
+
         return categoricals, categoricals_defaults

     def _configure_sources(self, adata: ad.AnnData):
98 changes: 93 additions & 5 deletions lamindb/curators/core.py
@@ -16,7 +16,7 @@
 import copy
 import re
 from collections.abc import Iterable
-from typing import TYPE_CHECKING, Any, Callable
+from typing import TYPE_CHECKING, Any, Callable, Literal

 import lamindb_setup as ln_setup
 import numpy as np
@@ -26,6 +26,11 @@
 from lamindb_setup.core._docs import doc_args

 from lamindb.base.types import FieldAttr  # noqa
+from lamindb.curators._cellxgene_schemas import (
+    CELLxGENESchemaVersions,
+    _get_cxg_categoricals,
+    _get_cxg_schema,
+)
 from lamindb.models import (
     Artifact,
     Feature,
@@ -463,9 +468,11 @@ def __init__(
         slot: str | None = None,
     ) -> None:
         super().__init__(dataset=dataset, schema=schema)
+
         categoricals = []
         features = []
         feature_ids: set[int] = set()
+
         if schema.flexible:
             features += Feature.filter(name__in=self._dataset.keys()).list()
             feature_ids = {feature.id for feature in features}
@@ -488,6 +495,7 @@ def __init__(
                 features.extend(schema_features)
         else:
             assert schema.itype is not None  # noqa: S101
+
         pandera_columns = {}
         if features or schema._index_feature_uid is not None:
             # populate features
@@ -540,18 +548,23 @@ def __init__(
                     "list[cat["
                 ):
                     # validate categoricals if the column is required or if the column is present
-                    if required or feature.name in self._dataset.keys():
+                    # but exclude the index feature from column categoricals
+                    if (required or feature.name in self._dataset.keys()) and (
+                        schema._index_feature_uid is None
+                        or feature.uid != schema._index_feature_uid
+                    ):
                         categoricals.append(feature)
-            if schema._index_feature_uid is not None:
-                # in almost no case, an index should have a pandas.CategoricalDtype in a DataFrame
-                # so, we're typing it as `str` here
+            # in almost no case, an index should have a pandas.CategoricalDtype in a DataFrame
+            # so, we're typing it as `str` here
+            if schema.index is not None:
                 index = pandera.Index(
                     schema.index.dtype
                     if not schema.index.dtype.startswith("cat")
                     else str
                 )
             else:
                 index = None
+
             self._pandera_schema = pandera.DataFrameSchema(
                 pandera_columns,
                 coerce=schema.coerce_dtype,
@@ -986,6 +999,81 @@ def __init__(
         self._columns_field = self._var_fields


+class CxGCurator(SlotsCurator):
+    """Curator for `AnnData` objects that should adhere to a specific CELLxGENE Schema version.
+
+    Args:
+        dataset: The AnnData-like object to validate & annotate.
+        schema_version: A CELLxGENE Schema version that defines the validation constraints.
+        organism: The organism of the Schema.
+        defaults: Default values that are set if columns or column values are missing.
+        extra_sources: A dictionary mapping ``.obs.columns`` to Source records.
+            These extra sources are joined with the CELLxGENE fixed sources.
+            Use this parameter when subclassing.
+
+    Example:
+
+        .. literalinclude:: scripts/curate_cxg.py
+            :language: python
+            :caption: curate_cxg.py
+    """
+
+    def __init__(
+        self,
+        dataset: AnnData | Artifact,
+        schema_version: CELLxGENESchemaVersions,
+        *,
+        organism: Literal["human", "mouse"] = "human",
+        defaults: dict[str, str] = None,
+        extra_sources: dict[str, SQLRecord] = None,
+    ) -> None:
+        from ._cellxgene_schemas import (
+            _add_defaults_to_obs,
+            _create_sources,
+            _init_categoricals_additional_values,
+            _restrict_obs_fields,
+        )
+
+        # Add defaults first to ensure that we fetch valid sources
+        if defaults:
+            _add_defaults_to_obs(dataset.obs, defaults)
+
+        # Filter categoricals based on what's present in the dataset
+        present_categoricals = _restrict_obs_fields(
+            dataset.obs, _get_cxg_categoricals()
+        )
+
+        sources = _create_sources(present_categoricals, schema_version, organism)
+        # These sources are not a part of the cellxgene schema but rather passed through.
+        # This is useful when other Curators extend the CELLxGENE curator
+        if extra_sources:
+            sources = sources | extra_sources
+        cxg_schema = _get_cxg_schema(schema_version, sources=sources).save()
+        super().__init__(dataset=dataset, schema=cxg_schema)
+
+        if not data_is_scversedatastructure(self._dataset, "AnnData"):
+            raise InvalidArgument("dataset must be AnnData-like.")
+
+        self.schema_version = schema_version
+        self.schema_reference = f"https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/{schema_version}/schema.md"
+
+        self._slots = {
+            slot: DataFrameCurator(
+                (
+                    getattr(self._dataset, slot.strip(".T")).T
+                    if slot == "var.T"
+                    else getattr(self._dataset, slot)
+                ),
+                slot_schema,
+                slot=slot,
+            )
+            for slot, slot_schema in cxg_schema.slots.items()
+            if slot in {"obs", "var", "var.T", "uns"}
+        }
+
+        _init_categoricals_additional_values()
+
+
 class CatVector:
     """Vector with categorical values."""

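
A small note on the slot-name handling in `CxGCurator.__init__` above: `str.strip(".T")` removes any leading or trailing `.` and `T` characters rather than the literal suffix `.T`. For the slot names used in this diff (`"var.T"`) the result is the same, but the sketch below shows where the two behaviours diverge. The helper `base_slot` and the slot name `"Tvar.T"` are hypothetical and only serve as an illustration; `str.removesuffix` requires Python 3.9+.

```python
def base_slot(slot: str) -> str:
    """Return the attribute name for a slot such as "var.T"."""
    # removesuffix only drops the exact trailing ".T", if present.
    return slot.removesuffix(".T")


# Same result for the slot name that appears in the diff.
from_strip = "var.T".strip(".T")
from_removesuffix = base_slot("var.T")

# Hypothetical name starting with "T": strip also eats the leading character.
stripped = "Tvar.T".strip(".T")
suffix_removed = base_slot("Tvar.T")
```

Here `stripped` is `"var"` while `suffix_removed` is `"Tvar"`, which is why a suffix-based helper can be the safer choice if slot names ever grow beyond the current fixed set.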