
scANVI CPU memory peak during training #3419

@jan-engelmann

Description

scANVI runs out of CPU memory while training on 6.6 million cells.
The AnnData object is 22 GB with 5,000 highly variable genes, but watching htop during training I see CPU memory peak at
164 GB virtual and 145 GB resident towards the end of an epoch. I have worked around the issue by requesting 400 GB of memory, but that should not be necessary.
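For scale, a rough back-of-envelope estimate (my own arithmetic, not taken from the scvi-tools internals): densifying the full 6,558,109 × 5,000 count matrix as float32 would need about 131 GB on its own, which is in the ballpark of the observed 145 GB resident peak, consistent with some step materializing a dense copy of the sparse matrix.

```python
# Back-of-envelope: size of the count matrix if densified.
# Assumption: float32 values (4 bytes); the actual dtype may differ.
n_cells = 6_558_109
n_genes = 5_000
bytes_per_value = 4

dense_gb = n_cells * n_genes * bytes_per_value / 1e9
print(f"dense matrix: ~{dense_gb:.0f} GB")  # ~131 GB
```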

I am training from a pre-trained scVI model:

import scvi  # adata (AnnData, 6.6M cells x 5,000 HVGs) is loaded beforehand

model = scvi.model.SCVI(
    adata, n_latent=50, dropout_rate=0.2, n_layers=2, gene_likelihood="nb"
)
tparams = {
    "max_epochs": 30,
    "early_stopping": True,
    "early_stopping_patience": 5,
    "simple_progress_bar": True,
    "batch_size": 1024,
    "check_val_every_n_epoch": 1,
    "enable_model_summary": True,
    "enable_checkpointing": True,
}
model.train(**tparams)

scvi.settings.seed = 0

scanvi_model = scvi.model.SCANVI.from_scvi_model(
    model,
    adata=adata,
    labels_key="lineage_2",
    unlabeled_category="nan",
)
scanvi_tparams = {
    "batch_size": 2048,
    "early_stopping": True,
    "early_stopping_patience": 3,
    "check_val_every_n_epoch": 1,
    "early_stopping_monitor": "validation_loss",
    "max_epochs": 20,
}
scanvi_model.train(**scanvi_tparams)
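For reference, a minimal sketch of train kwargs that might help narrow down whether the validation loop drives the peak. The `train_size`/`validation_size` parameters are part of the scvi-tools `train()` API; whether a smaller validation split actually lowers the memory peak is an assumption to be verified, not a confirmed fix.

```python
# Hedged sketch: shrink the validation split to test whether the
# validation-loss loop is responsible for the memory peak.
# (train_size/validation_size exist in scvi-tools' train() API; the
# memory effect is an assumption.)
scanvi_tparams_small_val = {
    "max_epochs": 20,
    "batch_size": 2048,
    "early_stopping": True,
    "early_stopping_patience": 3,
    "early_stopping_monitor": "validation_loss",
    "check_val_every_n_epoch": 1,
    "train_size": 0.99,       # keep most cells for training
    "validation_size": 0.01,  # much smaller validation pass
}
# scanvi_model.train(**scanvi_tparams_small_val)
```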

"lineage_2" contains the following cell type counts:

T                     3,900,338
Monocyte/DC/basophil  1,329,707
NK/ILC                  770,840
B/plasma                376,316
nan                     177,093  # unlabeled_category
HSC_MPP                   3,728
Platelet/erythroid           87

6,558,109 cells in total

The error likely occurs in the validation-loss loop, but I have not pinpointed it to a specific line of code.
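One generic way to pinpoint it, using only the Python standard library (nothing scvi-tools-specific): log the process's peak resident set size before and after the suspected step, e.g. around each validation pass.

```python
import resource
import sys

def peak_rss_gb():
    """Peak resident set size of this process, in GB (Unix only)."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is in kilobytes on Linux, bytes on macOS
    scale = 1e9 if sys.platform == "darwin" else 1e6
    return rss / scale

# Example usage around a suspected step:
before = peak_rss_gb()
_ = [0] * 10_000_000  # stand-in for the validation pass
after = peak_rss_gb()
print(f"peak RSS: {before:.2f} GB -> {after:.2f} GB")
```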

Observed failure: out of CPU memory.

Thanks a lot for your help!

Versions:

scvi-tools 1.3.2
