[FIX] Garbage collect temp buffers after checkpoint by tyler-griggs · Pull Request #94 · NovaSky-AI/SkyRL

tyler-griggs · 2025-07-16T23:34:35Z

What does this PR do?

Resolves an issue where offloading optimizers failed after checkpointing.

As reported in #70, OOMs can occur during inference engine wake-up after checkpointing. The root cause was that the state_dict materialization in save_ckpt created temporary buffers that were not garbage collected before we tried to wake up the inference engine's KV cache, causing an OOM.

This PR runs garbage collection after checkpointing, which resolves the OOM issue.
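The actual diff is not shown in this thread, so the following is only a minimal sketch of the general pattern, assuming the fix amounts to dropping the materialized state dict and forcing a collection before anything else claims GPU memory. The helper names `release_temp_buffers` and `save_ckpt_and_release` are hypothetical and are not SkyRL functions; `gc.collect()`, `torch.cuda.is_available()`, and `torch.cuda.empty_cache()` are the standard Python and PyTorch calls.

```python
import gc

try:
    import torch
except ImportError:  # keep the sketch importable on machines without PyTorch
    torch = None


def release_temp_buffers():
    """Free unreachable Python objects, then release cached CUDA blocks.

    Hypothetical helper, not part of SkyRL. Returns the number of
    unreachable objects that gc.collect() found.
    """
    collected = gc.collect()
    if torch is not None and torch.cuda.is_available():
        # Return unused cached blocks from PyTorch's allocator to the driver
        # so another process (e.g. an inference engine) can allocate them.
        torch.cuda.empty_cache()
    return collected


def save_ckpt_and_release(state_dict_fn, save_fn, path):
    """Materialize a state dict, save it, then release the temporaries.

    state_dict_fn and save_fn are caller-supplied callables (assumptions
    for illustration), e.g. model.state_dict and torch.save.
    """
    state = state_dict_fn()
    try:
        save_fn(state, path)
    finally:
        # Drop our reference first; otherwise the collector cannot free it.
        del state
        release_temp_buffers()
```

The `del` before the collection matters: as long as a local name still references the materialized state dict, neither the garbage collector nor the CUDA caching allocator can reclaim the underlying buffers.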

Tests

Added a GPU test to check for successful offloading after checkpointing. It fails before this PR.

What's next?

We should switch to PyTorch's distributed checkpointing APIs for checkpointing, which would be much simpler.

Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com>

SumanthRH · 2025-07-17T00:41:25Z

skyrl-train/tests/gpu/test_worker_offload.py

+        "fsdp2",
+    ],
+)
+def test_offload_after_ckpt(strategy):


QQ: does the test fail before this PR?

Yes indeed! Should have mentioned that

SumanthRH

Left a minor comment. Thanks!

tyler-griggs and others added 19 commits July 15, 2025 01:59

init commit

2d70b1e

add testing, update names

a066dd7

fix

280a7c7

Update skyrl-train/tests/cpu/algorithms/test_losses.py

607ad9d

Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com>

fix

09ef2ac

Merge branch 'tgriggs/token-loss' of https://github.com/NovaSky-AI/SkyRL into tgriggs/token-loss

daaaa33

x

4a80e63

x

7e1dfdc

got it

e962830

x

621e7a5

x

1103fe1

small fixes

10c85ee

Merge branch 'main' into tgriggs/ckpt-debug

1cbdde1

x

cf013ee

ㅌ

2c2db1a

x

fa976ec

x

fa5b1fd

x

131f808

x

bdca535

tyler-griggs marked this pull request as ready for review July 17, 2025 00:37

SumanthRH reviewed Jul 17, 2025

SumanthRH approved these changes Jul 17, 2025

tyler-griggs changed the title ~~Tgriggs/ckpt debug~~ [FIX] Garbage collect temp buffers after checkpoint Jul 17, 2025

x

51adb64

tyler-griggs merged commit f556802 into main Jul 17, 2025
3 checks passed

tyler-griggs mentioned this pull request Jul 17, 2025

Cuda Memory Error After Saving Checkpoint #70

Closed

SumanthRH deleted the tgriggs/ckpt-debug branch July 23, 2025 08:25

tyler-griggs mentioned this pull request Sep 7, 2025

[fix] Bring back pretty log formatting #250

Merged

tyler-griggs merged 20 commits into main from tgriggs/ckpt-debug
