Checkpoints cannot use CUDA #54887

@Theo-Fan

Description

I trained a model on a GPU, and everything worked fine during training. However, when I later try to load the saved checkpoint on a CPU-only machine, I get the following error:

025-07-24 22:22:56,431	ERROR actor_manager.py:873 -- Ray error (System error: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. 
If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
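For reference, the remedy the error message suggests looks roughly like this in plain PyTorch. This is a minimal sketch, assuming you are loading a torch-serialized file yourself; `load_checkpoint_portably` is a hypothetical helper name, not a Ray or PyTorch API. The report does not show where RLlib calls `torch.load` internally, so this does not by itself fix the restore path inside RLlib.

```python
import torch


def load_checkpoint_portably(path):
    """Load a torch-serialized file onto whatever device is available.

    Hypothetical helper for illustration; not part of Ray or PyTorch.
    """
    # Fall back to the CPU when no CUDA device is visible to this process.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # map_location remaps storages that were saved from a CUDA device.
    return torch.load(path, map_location=device)
```

On a CPU-only machine this returns the same objects with all tensors on the CPU; on a GPU machine they land on the default CUDA device.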

Steps to reproduce

  1. Train a model on a GPU machine
  2. Save the model or checkpoint
  3. Move the checkpoint to a CPU-only machine
  4. Try to load it (using `torch.load()`)
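One possible workaround at step 2, when you control the saving code yourself, is to move everything to host memory before writing the file so that the checkpoint no longer references a CUDA device. The sketch below is an assumption-laden illustration, not RLlib's checkpointing mechanism: `to_cpu` is a hypothetical helper, and it assumes the state is a nested structure of dicts, lists, and tuples whose leaves expose a `.cpu()` method (as `torch.Tensor` does).

```python
def to_cpu(obj):
    """Recursively move anything with a .cpu() method to host memory.

    Hypothetical helper for illustration. Works by duck typing, so it does
    not import torch; torch.Tensor objects satisfy the .cpu() contract.
    """
    # Leaves such as tensors know how to copy themselves to the CPU.
    if callable(getattr(obj, "cpu", None)):
        return obj.cpu()
    # Rebuild dicts with the same keys and converted values.
    if isinstance(obj, dict):
        return {key: to_cpu(value) for key, value in obj.items()}
    # Rebuild plain lists and tuples; other containers are left untouched.
    if type(obj) in (list, tuple):
        return type(obj)(to_cpu(value) for value in obj)
    # Numbers, strings, None, and unknown types pass through unchanged.
    return obj
```

With PyTorch this would be used as `torch.save(to_cpu(model.state_dict()), path)`, where `model` and `path` are placeholders for your own objects.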

Test:

from ray import train, tune
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("Pendulum-v1")
    
    .training(
        lr=tune.grid_search([0.001, 0.0001]),
    )
    .env_runners(
        num_env_runners=2,
        batch_mode="complete_episodes"
    )
    .learners(
        num_learners=1,
        num_gpus_per_learner=1, # gpu config
    )
    
)


tuner = tune.Tuner(
    config.algo_class,
    param_space=config,
    run_config=train.RunConfig(
        stop={
            "training_iteration": 5,
        },
        checkpoint_config=tune.CheckpointConfig(
            checkpoint_at_end=True,  # Problem
        ),
    ),
)

results = tuner.fit()

Environment:

  • RLlib version: 2.47.1 (Ray 2.47.1)
  • CUDA available during training: Yes
  • CUDA available during restore: No
  • OS: Ubuntu 22.04
  • Python version: 3.10.18

Metadata

Labels

  • P0: Issues that should be fixed in short order
  • community-backlog
  • rllib: RLlib related issues
  • rllib-checkpointing-or-recovery: An issue related to checkpointing/recovering RLlib Trainers.
  • rllib-models: An issue related to RLlib (default or custom) Models.
  • stability
