Hello, I'm following the GSM8K training script, and I'm hitting a CUDA out-of-memory error in the training step that runs right after a checkpoint is saved.
The run completes every training step up to the one where the checkpoint is saved. The checkpoint itself appears to be written correctly, and I'm able to restart the run and resume training from that step.
I'm using Qwen3-8B as the model and decreased the mini batch size accordingly. This is on an 8 x H100 node, so I also set the number of GPUs to 8.
The error log is below, and I'd be happy to provide more information as needed. Thanks!
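For what it's worth, a quick way to see how much VRAM headroom the engines actually have at the failure point is to log free memory per GPU right before the wake-up. This is just a diagnostic sketch using standard PyTorch APIs (`torch.cuda.mem_get_info`), not anything from SkyRL; the function name is made up.

```python
# Hypothetical diagnostic: report free VRAM on every visible GPU, e.g. right
# before the inference engines are woken, to see how much headroom wake_up has.
import torch

def free_vram_report():
    """Return (device_index, free_gib, total_gib) tuples for each visible GPU."""
    report = []
    if not torch.cuda.is_available():
        return report  # CPU-only environment: nothing to inspect
    for dev in range(torch.cuda.device_count()):
        free_b, total_b = torch.cuda.mem_get_info(dev)
        report.append((dev, free_b / 2**30, total_b / 2**30))
    return report

if __name__ == "__main__":
    for dev, free_gib, total_gib in free_vram_report():
        print(f"cuda:{dev}: {free_gib:.1f} GiB free of {total_gib:.1f} GiB")
```

If the free numbers drop sharply only after a checkpoint step, that would point at memory left behind by the save rather than the rollout itself.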
(skyrl_entrypoint pid=716206) 2025-07-08 19:50:10.227 | INFO | skyrl_train.trainer:train:310 - Finished: 'save_checkpoints', time cost: 29.84s
Training Step Progress: 100%|██████████| 20/20 [1:58:31<00:00, 722.48s/it]
(skyrl_entrypoint pid=716206) 2025-07-08 19:50:34.677 | INFO | skyrl_train.trainer:train:232 - Started: 'step'
(skyrl_entrypoint pid=716206) 2025-07-08 19:50:35.931 | INFO | skyrl_train.weights_manager:__enter__:56 - Started: 'sync_weights_to_inference_engines'
(AsyncVLLMInferenceEngine pid=719401) INFO 07-08 19:50:35 [executor_base.py:226] It took 1.185601 seconds to wake up tags ['weights'].
(skyrl_entrypoint pid=716206) 2025-07-08 19:50:39.509 | INFO | skyrl_train.weights_manager:__enter__:56 - Finished: 'sync_weights_to_inference_engines', time cost: 3.58s
(skyrl_entrypoint pid=716206) 2025-07-08 19:50:39.509 | INFO | skyrl_train.weights_manager:__enter__:60 - Started: 'offload_policy_model_to_cpu'
(AsyncVLLMInferenceEngine pid=719401) INFO 07-08 19:50:39 [block_pool.py:264] Successfully reset prefix cache
(skyrl_entrypoint pid=716206) 2025-07-08 19:50:40.385 | INFO | skyrl_train.weights_manager:__enter__:60 - Finished: 'offload_policy_model_to_cpu', time cost: 0.88s
(skyrl_entrypoint pid=716206) 2025-07-08 19:50:40.452 | INFO | skyrl_train.trainer:train:232 - Finished: 'step', time cost: 5.78s
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459] Invocation of wake_up method failed
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459] Traceback (most recent call last):
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459] File "/home/hyen/.cache/uv/builds-v0/.tmpxIcVHA/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 456, in _handle_client_request
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459] output.result = method(
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459] ^^^^^^^
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459] File "/home/hyen/.cache/uv/builds-v0/.tmpxIcVHA/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 271, in wake_up
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459] self.model_executor.wake_up(tags)
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459] File "/home/hyen/.cache/uv/builds-v0/.tmpxIcVHA/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 224, in wake_up
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459] self.collective_rpc("wake_up", kwargs=dict(tags=tags))
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459] File "/home/hyen/.cache/uv/builds-v0/.tmpxIcVHA/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459] answer = run_method(self.driver_worker, method, args, kwargs)
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459] File "/home/hyen/.cache/uv/builds-v0/.tmpxIcVHA/lib/python3.12/site-packages/vllm/utils.py", line 2456, in run_method
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459] return func(*args, **kwargs)
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459] ^^^^^^^^^^^^^^^^^^^^^
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459] File "/home/hyen/.cache/uv/builds-v0/.tmpxIcVHA/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 102, in wake_up
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459] allocator.wake_up(tags)
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459] File "/home/hyen/.cache/uv/builds-v0/.tmpxIcVHA/lib/python3.12/site-packages/vllm/device_allocator/cumem.py", line 224, in wake_up
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459] create_and_map(handle)
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459] File "/home/hyen/.cache/uv/builds-v0/.tmpxIcVHA/lib/python3.12/site-packages/vllm/device_allocator/cumem.py", line 77, in create_and_map
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459] python_create_and_map(*allocation_handle)
(AsyncVLLMInferenceEngine pid=719401) ERROR 07-08 19:50:40 [core.py:459] RuntimeError: CUDA Error: out of memory at /workspace/csrc/cumem_allocator.cpp:62
(AsyncVLLMInferenceEngine pid=719401) CUDA Error: out of memory at /workspace/csrc/cumem_allocator.cpp:62
(AsyncVLLMInferenceEngine pid=719404) INFO 07-08 19:50:40 [executor_base.py:226] It took 0.058697 seconds to wake up tags ['kv_cache'].
Traceback (most recent call last):
File "/home/hyen/project/SkyRL/skyrl-train/examples/hle/main_hle.py", line 32, in main
ray.get(skyrl_entrypoint.remote(cfg))
File "/home/hyen/.cache/uv/builds-v0/.tmpfLf4tO/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/hyen/.cache/uv/builds-v0/.tmpfLf4tO/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/hyen/.cache/uv/builds-v0/.tmpfLf4tO/lib/python3.12/site-packages/ray/_private/worker.py", line 2782, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hyen/.cache/uv/builds-v0/.tmpfLf4tO/lib/python3.12/site-packages/ray/_private/worker.py", line 929, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::skyrl_entrypoint() (pid=716206, ip=10.128.0.45)
File "/home/hyen/project/SkyRL/skyrl-train/examples/hle/main_hle.py", line 23, in skyrl_entrypoint
exp.run()
File "/var/tmp/ray/session_2025-07-08_17-45-19_603653_703632/runtime_resources/working_dir_files/_ray_pkg_b42fd2b2209700a7/skyrl_train/entrypoints/main_base.py", line 273, in run
trainer.train()
File "/var/tmp/ray/session_2025-07-08_17-45-19_603653_703632/runtime_resources/working_dir_files/_ray_pkg_b42fd2b2209700a7/skyrl_train/trainer.py", line 243, in train
with self.weights_manager:
^^^^^^^^^^^^^^^^^^^^
File "/var/tmp/ray/session_2025-07-08_17-45-19_603653_703632/runtime_resources/working_dir_files/_ray_pkg_b42fd2b2209700a7/skyrl_train/weights_manager.py", line 62, in __enter__
asyncio.run(self.inference_engine_client.wake_up(tags=["kv_cache"]))
File "/home/hyen/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/home/hyen/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/var/tmp/ray/session_2025-07-08_17-45-19_603653_703632/runtime_resources/working_dir_files/_ray_pkg_b42fd2b2209700a7/skyrl_train/inference_engines/inference_engine_client.py", line 126, in wake_up
return await self._run_on_all_engines("wake_up", *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/tmp/ray/session_2025-07-08_17-45-19_603653_703632/runtime_resources/working_dir_files/_ray_pkg_b42fd2b2209700a7/skyrl_train/inference_engines/inference_engine_client.py", line 29, in _run_on_all_engines
return await asyncio.gather(*awaitables)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/tmp/ray/session_2025-07-08_17-45-19_603653_703632/runtime_resources/working_dir_files/_ray_pkg_b42fd2b2209700a7/skyrl_train/inference_engines/ray_wrapped_inference_engine.py", line 31, in wake_up
return await self.inference_engine_actor.wake_up.remote(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError: ray::AsyncVLLMInferenceEngine.wake_up() (pid=719407, ip=10.128.0.45, actor_id=e7736d19d18e98710b54bbde01000000, repr=<skyrl_train.inference_engines.vllm.vllm_engine.AsyncVLLMInferenceEngine object at 0x7fac380a2390>)
File "/home/hyen/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/home/hyen/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/tmp/ray/session_2025-07-08_17-45-19_603653_703632/runtime_resources/working_dir_files/_ray_pkg_b42fd2b2209700a7/skyrl_train/inference_engines/vllm/vllm_engine.py", line 309, in wake_up
await self.llm.wake_up(tags=kwargs.get("tags", None))
File "/home/hyen/.cache/uv/builds-v0/.tmpZOhw0L/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 485, in wake_up
await self.engine_core.wake_up_async(tags)
File "/home/hyen/.cache/uv/builds-v0/.tmpZOhw0L/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 789, in wake_up_async
await self.call_utility_async("wake_up", tags)
File "/home/hyen/.cache/uv/builds-v0/.tmpZOhw0L/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 756, in call_utility_async
return await self._call_utility_async(method,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hyen/.cache/uv/builds-v0/.tmpZOhw0L/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 769, in _call_utility_async
return await future
^^^^^^^^^^^^
Exception: Call to wake_up method failed: CUDA Error: out of memory at /workspace/csrc/cumem_allocator.cpp:62
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
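My reading of the traceback: `wake_up` fails while vLLM's `CuMemAllocator` tries to remap the previously released `kv_cache` allocation (`create_and_map` in `cumem.py`), which suggests some other allocation is now occupying the VRAM that was free before the checkpoint save. One guess is that the trainer's caching allocator is still holding freed blocks after `save_checkpoints`. A minimal sketch of a workaround along those lines, assuming it can be called on each training worker between the save and the wake-up (the function name and call site are my invention, not SkyRL APIs):

```python
# Hypothetical workaround sketch: release blocks the PyTorch caching allocator
# is holding after a checkpoint save, so vLLM's wake_up has room to remap the
# KV cache. Uses only standard torch APIs; where to hook this in is a guess.
import gc
import torch

def release_cached_vram():
    gc.collect()                      # drop dead Python references first
    if torch.cuda.is_available():
        torch.cuda.synchronize()      # wait for pending kernels to finish
        torch.cuda.empty_cache()      # return cached blocks to the driver

# e.g. call release_cached_vram() on every training worker right after
# 'save_checkpoints' and before wake_up(tags=["kv_cache"])
```

This is safe to call even when it isn't the culprit; `empty_cache()` only frees unused cached blocks, so it shouldn't disturb live tensors.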
+ ilization=0.8