Proposal: Stop using AnnData.raw #1304

@ivirshup

I would like data distributed with cellxgene to stop using .raw.X for counts and instead store this matrix in adata.layers["counts"]. Ideally, this schema change would land in 6.0.

For some background, raw was initially added to anndata so the user could keep both a "normalized" and a "raw" copy of the matrix. The workflow at the time also assumed you might only be interested in normalized values for a subset of selected features, largely due to memory constraints and the use of densifying normalization methods. However, people generally need all of their features normalized, since downstream methods (e.g. differential expression, plotting, enrichment) are applied across all features, and densifying normalization methods are less popular now. In addition, anndata has since added the .layers attribute, which allows storing multiple matrices.

Within scverse, we would like to eventually get rid of .raw entirely. It's confusing to users, has difficult semantics (e.g. it's assumed to be read-only, but we can't actually enforce that), and is easily replaced by existing functionality or by just using a separate object. In addition, we stopped developing features for .raw (including improved out-of-core compute support) a while ago and will not be adding more.

Because cellxgene stores a matrix in .raw.X with the same shape as .X, I don't see any barrier to moving it over. The gains are better support within the scanpy API, out-of-core support, and better usability.

Labels: conversation (Under active discussion), schema (CELLxGENE Discover dataset schema)
