Closed
Labels
- P0: Issues that should be fixed in short order
- community-backlog
- rllib: RLlib related issues
- rllib-checkpointing-or-recovery: An issue related to checkpointing/recovering RLlib Trainers.
- rllib-models: An issue related to RLlib (default or custom) Models.
- stability
Description
I trained a model on a GPU, and training completed without problems. However, when I later try to load the saved checkpoint on a CPU-only machine, I get the following error:
025-07-24 22:22:56,431 ERROR actor_manager.py:873 -- Ray error (System error: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False.
If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
Steps to reproduce
- Train a model on a GPU machine
- Save the model or checkpoint
- Move the checkpoint to a CPU-only machine
- Try to load it (using `torch.load()`)
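For reference, the remedy named in the error message can be sketched in plain PyTorch as follows. This is a minimal illustration, not the RLlib restore path: the in-memory buffer and the small `Linear` model are stand-ins invented for the example, and it assumes the caller controls the `torch.load` call directly. In the report above the failing load happens inside a Ray actor, so this call site may not be reachable from user code.

```python
import io

import torch

# Stand-in for a checkpoint file: serialize a small state dict into memory.
# (Hypothetical model; the real checkpoint would come from the training run.)
model = torch.nn.Linear(4, 2)
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# map_location remaps every stored tensor onto the CPU during deserialization,
# so storages that were saved from a CUDA device do not need CUDA at load time.
state_dict = torch.load(buffer, map_location=torch.device("cpu"))

# Load the remapped weights into a fresh module of the same shape.
restored = torch.nn.Linear(4, 2)
restored.load_state_dict(state_dict)
```

With `map_location` set, the tensors in `state_dict` live on the CPU regardless of the device they were saved from.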
Test:
from ray import train, tune
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("Pendulum-v1")
    .training(
        lr=tune.grid_search([0.001, 0.0001]),
    )
    .env_runners(
        num_env_runners=2,
        batch_mode="complete_episodes",
    )
    .learners(
        num_learners=1,
        num_gpus_per_learner=1,  # gpu config
    )
)

tuner = tune.Tuner(
    config.algo_class,
    param_space=config,
    run_config=train.RunConfig(
        stop={
            "training_iteration": 5,
        },
        checkpoint_config=tune.CheckpointConfig(
            checkpoint_at_end=True,  # Problem
        ),
    ),
)

results = tuner.fit()

Environment:
- RLlib version: 2.47.1 (Ray 2.47.1)
- CUDA available during training: Yes
- CUDA available during restore: No
- OS: Ubuntu 22.04
- Python version: 3.10.18