[FIX] Garbage collect temp buffers after checkpoint #94
Merged
tyler-griggs merged 20 commits into main on Jul 17, 2025
Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com>
SumanthRH
reviewed
Jul 17, 2025
        "fsdp2",
    ],
)
def test_offload_after_ckpt(strategy):
Member
QQ: does the test fail before this PR?
Member
Author
Yes indeed! Should have mentioned that
SumanthRH
approved these changes
Jul 17, 2025
Member
SumanthRH
left a comment
Left a minor comment. Thanks!
fannie1208
pushed a commit
to vinid/SkyRL
that referenced
this pull request
Aug 19, 2025
What does this PR do?
Resolves an issue where offloading optimizers failed after checkpointing.
As reported in #70, OOMs can occur during inference engine wake-up after checkpointing. The root cause was that `state_dict` materialization in `save_ckpt` created temporary buffers that were not garbage collected before the inference engine's KV cache was woken up, causing an OOM. This PR runs garbage collection after checkpointing, which resolves the OOM.
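The cleanup described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `release_temp_buffers` and `save_ckpt_and_release` are hypothetical names, and the two callables stand in for whatever materializes and writes the state dict. Only `gc.collect()`, `torch.cuda.is_available()`, and `torch.cuda.empty_cache()` are real library calls.

```python
import gc

try:
    import torch
except ImportError:  # keeps the sketch importable on machines without PyTorch
    torch = None


def release_temp_buffers() -> int:
    """Collect unreachable Python objects, then release cached CUDA memory.

    Hypothetical helper. Returns the number of objects gc.collect() found.
    """
    collected = gc.collect()
    if torch is not None and torch.cuda.is_available():
        # Hands unused cached blocks back to the driver so that another
        # consumer (for example an inference engine) can allocate them.
        torch.cuda.empty_cache()
    return collected


def save_ckpt_and_release(materialize_state, write_state) -> None:
    """Hypothetical wrapper: materialize the state, write it, free temporaries."""
    state = materialize_state()
    try:
        write_state(state)
    finally:
        # Drop the last local reference before collecting, otherwise the
        # materialized buffers are still reachable from this frame.
        del state
        release_temp_buffers()
```

The order matters in this sketch: the local reference is deleted first, Python garbage collection runs second, and the CUDA cache is emptied last, since cached blocks can only be released once no tensor still owns them.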
Tests
Added a GPU test to check for successful offloading after checkpointing. It fails before this PR.
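The GPU test itself is not shown in this conversation. As a CPU-only analogue of the idea, the sketch below checks that a temporary buffer is actually gone after a checkpoint-like step. The PR does not say why the buffers outlived `save_ckpt`; a reference cycle is assumed here purely for illustration, and `TempBuffer`, `checkpoint_step`, and `test_buffer_released_after_ckpt` are hypothetical names.

```python
import gc
import weakref


class TempBuffer:
    """Stand-in for a temporary tensor created during state materialization."""

    def __init__(self):
        self.owner = None


def checkpoint_step(collect: bool) -> weakref.ref:
    """Create a buffer caught in a reference cycle and optionally collect it."""
    buf = TempBuffer()
    holder = {"buf": buf}
    buf.owner = holder  # cycle: buf -> holder -> buf
    ref = weakref.ref(buf)
    del buf, holder
    if collect:
        gc.collect()
    return ref


def test_buffer_released_after_ckpt():
    was_enabled = gc.isenabled()
    gc.disable()  # make the outcome independent of automatic collection
    try:
        # Without an explicit collection the cycle keeps the buffer alive.
        assert checkpoint_step(collect=False)() is not None
        # With an explicit collection the buffer is freed.
        assert checkpoint_step(collect=True)() is None
    finally:
        if was_enabled:
            gc.enable()
```

Like the test added in this PR, the first assertion models the behavior before the fix and the second models the behavior after it.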
What's next?
We should switch to PyTorch's distributed checkpointing APIs for checkpointing, which would be much simpler.