
CUDA Memory Error After Saving Checkpoint #70

@howard-yen

Description


Hello, I'm following the GSM8K training script, and I'm hitting a CUDA out-of-memory error in the training step that immediately follows a checkpoint save.
The run completes every training step up to the one that saves the checkpoint. The checkpoint itself appears to be written correctly, and I can restart the run and resume training from that step.
I'm using Qwen3-8B as the model and decreased the mini-batch size accordingly. This is on an 8 x H100 node, so I also changed the number of GPUs to 8.

The error log is below; I'd be happy to provide more information as needed. Thanks!
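To make the failure mode concrete, here is a toy model of the sleep/wake allocator pattern that vLLM's cumem allocator uses. Everything in it is illustrative (the class, tags, and GiB figures are made up, not vLLM's real API or my actual memory numbers): memory unmapped while the engine sleeps can be reclaimed by another consumer in the meantime, so re-mapping a same-size region on wake-up fails.

```python
# Toy sketch of vLLM's sleep/wake (cumem) pattern. Purely illustrative:
# names and GiB figures are hypothetical, not vLLM's real API.

class ToyGPU:
    def __init__(self, capacity_gib: float):
        self.capacity = capacity_gib
        self.used = 0.0

    def alloc(self, gib: float, tag: str) -> None:
        # create_and_map analogue: fails when no free memory remains
        if self.used + gib > self.capacity:
            raise MemoryError(f"CUDA Error: out of memory mapping {tag!r}")
        self.used += gib

    def free(self, gib: float) -> None:
        self.used -= gib

gpu = ToyGPU(capacity_gib=80.0)        # one H100
gpu.alloc(16.0, "policy model")        # training-side weights
gpu.alloc(30.0, "vllm weights")
gpu.alloc(30.0, "kv_cache")

# Engine sleeps: 'weights' and 'kv_cache' are unmapped, handles kept.
gpu.free(30.0)
gpu.free(30.0)

# Something grabs memory while the engine sleeps (checkpoint-save
# buffers, fragmentation, another process, ...).
gpu.alloc(20.0, "checkpoint buffers")

# wake_up(tags=["weights"]) still fits; wake_up(tags=["kv_cache"]) does not.
gpu.alloc(30.0, "vllm weights")
oom = None
try:
    gpu.alloc(30.0, "kv_cache")
except MemoryError as e:
    oom = str(e)
print(oom)  # → CUDA Error: out of memory mapping 'kv_cache'
```

This matches the log below: the `['weights']` wake-up succeeds, and the OOM only fires when re-mapping `kv_cache`.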

(skyrl_entrypoint pid=716206) 2025-07-08 19:50:10.227 | INFO     | skyrl_train.trainer:train:310 - Finished: 'save_checkpoints', time cost: 29.84s
Training Step Progress: 100%|██████████| 20/20 [1:58:31<00:00, 722.48s/it]
(skyrl_entrypoint pid=716206) 2025-07-08 19:50:34.677 | INFO     | skyrl_train.trainer:train:232 - Started: 'step'
(skyrl_entrypoint pid=716206) 2025-07-08 19:50:35.931 | INFO     | skyrl_train.weights_manager:__enter__:56 - Started: 'sync_weights_to_inference_engines'
(AsyncVLLMInferenceEngine pid=719401) INFO 07-08 19:50:35 [executor_base.py:226] It took 1.185601 seconds to wake up tags ['weights'].
(skyrl_entrypoint pid=716206) 2025-07-08 19:50:39.509 | INFO     | skyrl_train.weights_manager:__enter__:56 - Finished: 'sync_weights_to_inference_engines', time cost: 3.58s
(skyrl_entrypoint pid=716206) 2025-07-08 19:50:39.509 | INFO     | skyrl_train.weights_manager:__enter__:60 - Started: 'offload_policy_model_to_cpu'
(AsyncVLLMInferenceEngine pid=719401) INFO 07-08 19:50:39 [block_pool.py:264] Successfully reset prefix cache


(skyrl_entrypoint pid=716206) 2025-07-08 19:50:40.385 | INFO     | skyrl_train.weights_manager:__enter__:60 - Finished: 'offload_policy_model_to_cpu', time cost: 0.88s
(skyrl_entrypoint pid=716206) 2025-07-08 19:50:40.452 | INFO     | skyrl_train.trainer:train:232 - Finished: 'step', time cost: 5.78s
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459] Invocation of wake_up method failed
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459] Traceback (most recent call last):
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459]   File "/home/hyen/.cache/uv/builds-v0/.tmpxIcVHA/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 456, in _handle_client_request
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459]     output.result = method(
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459]                     ^^^^^^^
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459]   File "/home/hyen/.cache/uv/builds-v0/.tmpxIcVHA/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 271, in wake_up
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459]     self.model_executor.wake_up(tags)
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459]   File "/home/hyen/.cache/uv/builds-v0/.tmpxIcVHA/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 224, in wake_up
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459]     self.collective_rpc("wake_up", kwargs=dict(tags=tags))
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459]   File "/home/hyen/.cache/uv/builds-v0/.tmpxIcVHA/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459]     answer = run_method(self.driver_worker, method, args, kwargs)
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459]   File "/home/hyen/.cache/uv/builds-v0/.tmpxIcVHA/lib/python3.12/site-packages/vllm/utils.py", line 2456, in run_method
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459]     return func(*args, **kwargs)
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459]            ^^^^^^^^^^^^^^^^^^^^^
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459]   File "/home/hyen/.cache/uv/builds-v0/.tmpxIcVHA/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 102, in wake_up
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459]     allocator.wake_up(tags)
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459]   File "/home/hyen/.cache/uv/builds-v0/.tmpxIcVHA/lib/python3.12/site-packages/vllm/device_allocator/cumem.py", line 224, in wake_up
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459]     create_and_map(handle)
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459]   File "/home/hyen/.cache/uv/builds-v0/.tmpxIcVHA/lib/python3.12/site-packages/vllm/device_allocator/cumem.py", line 77, in create_and_map
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459]     python_create_and_map(*allocation_handle)
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459] RuntimeError: CUDA Error: out of memory at /workspace/csrc/cumem_allocator.cpp:62
(AsyncVLLMInferenceEngine pid=719401) CUDA Error: out of memory at /workspace/csrc/cumem_allocator.cpp:62
(AsyncVLLMInferenceEngine pid=719404) INFO 07-08 19:50:40 [executor_base.py:226] It took 0.058697 seconds to wake up tags ['kv_cache'].
Traceback (most recent call last):
  File "/home/hyen/project/SkyRL/skyrl-train/examples/hle/main_hle.py", line 32, in main
    ray.get(skyrl_entrypoint.remote(cfg))
  File "/home/hyen/.cache/uv/builds-v0/.tmpfLf4tO/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/hyen/.cache/uv/builds-v0/.tmpfLf4tO/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/hyen/.cache/uv/builds-v0/.tmpfLf4tO/lib/python3.12/site-packages/ray/_private/worker.py", line 2782, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hyen/.cache/uv/builds-v0/.tmpfLf4tO/lib/python3.12/site-packages/ray/_private/worker.py", line 929, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::skyrl_entrypoint() (pid=716206, ip=10.128.0.45)
  File "/home/hyen/project/SkyRL/skyrl-train/examples/hle/main_hle.py", line 23, in skyrl_entrypoint
    exp.run()
  File "/var/tmp/ray/session_2025-07-08_17-45-19_603653_703632/runtime_resources/working_dir_files/_ray_pkg_b42fd2b2209700a7/skyrl_train/entrypoints/main_base.py", line 273, in run
    trainer.train()
  File "/var/tmp/ray/session_2025-07-08_17-45-19_603653_703632/runtime_resources/working_dir_files/_ray_pkg_b42fd2b2209700a7/skyrl_train/trainer.py", line 243, in train
    with self.weights_manager:
         ^^^^^^^^^^^^^^^^^^^^
  File "/var/tmp/ray/session_2025-07-08_17-45-19_603653_703632/runtime_resources/working_dir_files/_ray_pkg_b42fd2b2209700a7/skyrl_train/weights_manager.py", line 62, in __enter__
    asyncio.run(self.inference_engine_client.wake_up(tags=["kv_cache"]))
  File "/home/hyen/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/home/hyen/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/var/tmp/ray/session_2025-07-08_17-45-19_603653_703632/runtime_resources/working_dir_files/_ray_pkg_b42fd2b2209700a7/skyrl_train/inference_engines/inference_engine_client.py", line 126, in wake_up
    return await self._run_on_all_engines("wake_up", *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/tmp/ray/session_2025-07-08_17-45-19_603653_703632/runtime_resources/working_dir_files/_ray_pkg_b42fd2b2209700a7/skyrl_train/inference_engines/inference_engine_client.py", line 29, in _run_on_all_engines
    return await asyncio.gather(*awaitables)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/tmp/ray/session_2025-07-08_17-45-19_603653_703632/runtime_resources/working_dir_files/_ray_pkg_b42fd2b2209700a7/skyrl_train/inference_engines/ray_wrapped_inference_engine.py", line 31, in wake_up
    return await self.inference_engine_actor.wake_up.remote(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError: ray::AsyncVLLMInferenceEngine.wake_up() (pid=719407, ip=10.128.0.45, actor_id=e7736d19d18e98710b54bbde01000000, repr=<skyrl_train.inference_engines.vllm.vllm_engine.AsyncVLLMInferenceEngine object at 0x7fac380a2390>)
  File "/home/hyen/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/home/hyen/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/tmp/ray/session_2025-07-08_17-45-19_603653_703632/runtime_resources/working_dir_files/_ray_pkg_b42fd2b2209700a7/skyrl_train/inference_engines/vllm/vllm_engine.py", line 309, in wake_up
    await self.llm.wake_up(tags=kwargs.get("tags", None))
  File "/home/hyen/.cache/uv/builds-v0/.tmpZOhw0L/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 485, in wake_up
    await self.engine_core.wake_up_async(tags)
  File "/home/hyen/.cache/uv/builds-v0/.tmpZOhw0L/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 789, in wake_up_async
    await self.call_utility_async("wake_up", tags)
  File "/home/hyen/.cache/uv/builds-v0/.tmpZOhw0L/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 756, in call_utility_async
    return await self._call_utility_async(method,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hyen/.cache/uv/builds-v0/.tmpZOhw0L/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 769, in _call_utility_async
    return await future
           ^^^^^^^^^^^^
Exception: Call to wake_up method failed: CUDA Error: out of memory at /workspace/csrc/cumem_allocator.cpp:62

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(AsyncVLLMInferenceEngine pid=719406) INFO 07-08 19:50:35 [executor_base.py:226] It took 1.188242 seconds to wake up tags ['weights']. [repeated 7x across cluster]
(AsyncVLLMInferenceEngine pid=719406) INFO 07-08 19:50:39 [block_pool.py:264] Successfully reset prefix cache [repeated 7x across cluster]
(AsyncVLLMInferenceEngine pid=719406) ERROR 07-08 19:50:40 [core.py:459] Invocation of wake_up method failed [repeated 6x across cluster]
(AsyncVLLMInferenceEngine pid=719406) ERROR 07-08 19:50:40 [core.py:459] Traceback (most recent call last): [repeated 6x across cluster]
(AsyncVLLMInferenceEngine pid=719406) ERROR 07-08 19:50:40 [core.py:459]   File "/home/hyen/.cache/uv/builds-v0/.tmpXspYIZ/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 456, in _handle_client_request [repeated 6x across cluster]
(AsyncVLLMInferenceEngine pid=719406) ERROR 07-08 19:50:40 [core.py:459]     output.result = method( [repeated 6x across cluster]
(AsyncVLLMInferenceEngine pid=719406) ERROR 07-08 19:50:40 [core.py:459]                     ^^^^^^^ [repeated 6x across cluster]
(AsyncVLLMInferenceEngine pid=719406) ERROR 07-08 19:50:40 [core.py:459]   File "/home/hyen/.cache/uv/builds-v0/.tmpXspYIZ/lib/python3.12/site-packages/vllm/device_allocator/cumem.py", line 224, in wake_up [repeated 24x across cluster]
(AsyncVLLMInferenceEngine pid=719406) ERROR 07-08 19:50:40 [core.py:459]     self.model_executor.wake_up(tags) [repeated 6x across cluster]
(AsyncVLLMInferenceEngine pid=719406) ERROR 07-08 19:50:40 [core.py:459]     self.collective_rpc("wake_up", kwargs=dict(tags=tags)) [repeated 6x across cluster]
(AsyncVLLMInferenceEngine pid=719406) ERROR 07-08 19:50:40 [core.py:459]   File "/home/hyen/.cache/uv/builds-v0/.tmpXspYIZ/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc [repeated 6x across cluster]
(AsyncVLLMInferenceEngine pid=719406) ERROR 07-08 19:50:40 [core.py:459]     answer = run_method(self.driver_worker, method, args, kwargs) [repeated 6x across cluster]
(AsyncVLLMInferenceEngine pid=719406) ERROR 07-08 19:50:40 [core.py:459]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(AsyncVLLMInferenceEngine pid=719406) ERROR 07-08 19:50:40 [core.py:459]   File "/home/hyen/.cache/uv/builds-v0/.tmpXspYIZ/lib/python3.12/site-packages/vllm/utils.py", line 2456, in run_method [repeated 6x across cluster]
(AsyncVLLMInferenceEngine pid=719406) ERROR 07-08 19:50:40 [core.py:459]     return func(*args, **kwargs) [repeated 6x across cluster]
(AsyncVLLMInferenceEngine pid=719406) ERROR 07-08 19:50:40 [core.py:459]            ^^^^^^^^^^^^^^^^^^^^^ [repeated 6x across cluster]
(AsyncVLLMInferenceEngine pid=719406) ERROR 07-08 19:50:40 [core.py:459]     allocator.wake_up(tags) [repeated 6x across cluster]
(AsyncVLLMInferenceEngine pid=719406) ERROR 07-08 19:50:40 [core.py:459]     create_and_map(handle) [repeated 6x across cluster]
(AsyncVLLMInferenceEngine pid=719406) ERROR 07-08 19:50:40 [core.py:459]   File "/home/hyen/.cache/uv/builds-v0/.tmpXspYIZ/lib/python3.12/site-packages/vllm/device_allocator/cumem.py", line 77, in create_and_map [repeated 6x across cluster]
(AsyncVLLMInferenceEngine pid=719406) ERROR 07-08 19:50:40 [core.py:459]     python_create_and_map(*allocation_handle) [repeated 6x across cluster]
(AsyncVLLMInferenceEngine pid=719406) ERROR 07-08 19:50:40 [core.py:459] RuntimeError: CUDA Error: out of memory at /workspace/csrc/cumem_allocator.cpp:62 [repeated 6x across cluster]
(AsyncVLLMInferenceEngine pid=719406) CUDA Error: out of memory at /workspace/csrc/cumem_allocator.cpp:62 [repeated 6x across cluster]
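In case it helps anyone hitting this before a proper fix lands: one possible stopgap is to retry the wake-up a few times, since memory held by the checkpoint writer may be released shortly after the save. This is a generic sketch only (`wake_up_with_retry` and the zero-arg `wake_up` coroutine are hypothetical, not part of SkyRL's API):

```python
import asyncio

async def wake_up_with_retry(wake_up, retries=3, delay_s=2.0):
    """Call a zero-arg wake_up coroutine, retrying on OOM-style failures.

    Hypothetical helper: a real fix likely belongs in weights_manager,
    but the retry pattern itself is generic.
    """
    for attempt in range(1, retries + 1):
        try:
            return await wake_up()
        except (RuntimeError, MemoryError):
            if attempt == retries:
                raise
            await asyncio.sleep(delay_s)  # give the saver time to release memory

# Demo with a fake engine that fails twice before memory frees up.
calls = 0
async def fake_wake_up():
    global calls
    calls += 1
    if calls < 3:
        raise RuntimeError("CUDA Error: out of memory")
    return "awake"

result = asyncio.run(wake_up_with_retry(fake_wake_up, retries=3, delay_s=0.01))
print(result, calls)  # → awake 3
```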
