
[Megatron] Add checkpointing support #298

Merged
erictang000 merged 16 commits into NovaSky-AI:main from tyler-griggs:tgriggs/megatron_ckpt
Sep 18, 2025

Conversation

tyler-griggs (Member) commented Sep 15, 2025:

What does this PR do?

This PR implements support for `save_checkpoint` and `load_checkpoint` for the Megatron training backend. We use Megatron's `dist_checkpointing` library to perform checkpointing in parallel across ranks, which also allows for reloading the checkpoints in a different parallelism scheme.

Other minor changes:

  • Renamed `save/load_ckpt` to `save/load_checkpoint`
  • Removed unused arguments to `save/load_checkpoint`, primarily non-backend-specific state (`tag`, `client_state`, `global_step`). This change keeps training backend checkpointing logic focused on the training backend's state.
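The key property that motivates `dist_checkpointing` — each rank saving its own shard with enough metadata that a different parallelism layout can reload it — can be illustrated with a small self-contained simulation. This is a toy sketch with hypothetical helper names, not the Megatron API:

```python
# Toy illustration of resharded checkpoint reload (not the Megatron API).
# Each "rank" saves its shard with a global offset; on load, a different
# number of ranks can reassemble their shards from that metadata.

def save_sharded(params, world_size):
    """Split a flat parameter list into per-rank shards with offsets."""
    shard_size = len(params) // world_size
    checkpoint = {}
    for rank in range(world_size):
        start = rank * shard_size
        checkpoint[rank] = {"offset": start, "data": params[start:start + shard_size]}
    return checkpoint

def load_resharded(checkpoint, new_world_size):
    """Reassemble the full parameter from offsets, then re-slice for a new layout."""
    full = []
    for shard in sorted(checkpoint.values(), key=lambda s: s["offset"]):
        full.extend(shard["data"])
    shard_size = len(full) // new_world_size
    return [full[r * shard_size:(r + 1) * shard_size] for r in range(new_world_size)]

params = list(range(8))
ckpt = save_sharded(params, world_size=4)          # saved under a 4-way layout
shards = load_resharded(ckpt, new_world_size=2)    # reloaded under a 2-way layout
print(shards)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

The real library does this per-tensor with proper sharding specs; the point here is only that offsets, not rank count, define the checkpoint layout.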

Testing

  • Extended two GPU checkpointing tests to cover Megatron
    • Also moves `test_save_load_checkpoint.py` into `gpu_ci`. Note, however, that Megatron tests are disabled in CI because they currently require a different `flash-attn` install.
  • Manually saved and resumed several times:
    (screenshot: Screenshot 2025-09-15 at 3 30 17 PM)

What's next?

  • Test multi-node checkpointing
  • Implement `save_hf_model`

tyler-griggs (Member Author) commented Sep 15, 2025:
Note: These trainer updates may need to be changed after #297 is merged

```python
self.init_weight_sync_state()

# Load policy model to GPU before loading checkpoint.
if self.cfg.trainer.placement.colocate_all:
    ...
```
tyler-griggs (Member Author):
Note: Policy model needs to be on GPU for Megatron load_checkpoint as required by Megatron's dist_checkpoint library

```python
print("Phase 3: Verify state consistency")

# Compare captured states
for key in state_before:
    ...
```
tyler-griggs (Member Author):
Note: this was only confirming global_step, which we already do
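A fuller consistency check than comparing `global_step` alone would capture state before saving, perturb it, reload, and compare every key. A minimal sketch of that pattern, with a plain dict standing in for backend state (hypothetical names, not the repo's test code):

```python
# Minimal save/load round-trip check; a dict stands in for trainer state
# (illustrative only -- the real test compares backend-specific state).
import copy

def save_checkpoint(state):
    return copy.deepcopy(state)   # stand-in for writing shards to disk

def load_checkpoint(ckpt):
    return copy.deepcopy(ckpt)    # stand-in for reading shards from disk

state = {"global_step": 10, "weights": [0.1, 0.2], "lr": 1e-4}
state_before = copy.deepcopy(state)

ckpt = save_checkpoint(state)
state["weights"] = [9.9, 9.9]     # perturb, as if training continued

state = load_checkpoint(ckpt)
for key in state_before:          # compare every captured key,
    assert state[key] == state_before[key]  # not just global_step
print("state consistent after save/load")
```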

@tyler-griggs tyler-griggs marked this pull request as ready for review September 15, 2025 23:48
erictang000 (Collaborator) left a comment:
Pretty much LGTM, thanks! Let's just merge main after #297 is merged to make the trainer changes consistent.

For multi-node checkpointing and `save_hf_model`: we'll want both, but I can help test multi-node as I get model training running, and we can rely on external scripts for converting Megatron checkpoints to HF for a little while.

```python
# All ranks wait for the checkpoint directory to be created before saving.
dist.barrier()

# Collect the sharded state dicts for model and optimizer, and the full state dict for the scheduler.
```
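The "rank 0 creates the directory, everyone barriers, then each rank writes its own shard" pattern behind that snippet can be sketched with threads standing in for `torch.distributed` ranks. This is an illustration of the synchronization pattern, not the PR's actual code:

```python
# Sketch of: rank 0 creates the checkpoint dir, all ranks barrier, then
# every rank writes its own shard. Threads simulate distributed ranks.
import os
import tempfile
import threading

WORLD_SIZE = 4
barrier = threading.Barrier(WORLD_SIZE)   # analogous to dist.barrier()
ckpt_dir = os.path.join(tempfile.mkdtemp(), "ckpt_step_10")

def rank_worker(rank):
    if rank == 0:
        os.makedirs(ckpt_dir, exist_ok=True)   # only rank 0 creates the dir
    barrier.wait()                             # all ranks wait until it exists
    with open(os.path.join(ckpt_dir, f"shard_{rank}.pt"), "w") as f:
        f.write(f"state for rank {rank}")      # each rank writes its own shard

threads = [threading.Thread(target=rank_worker, args=(r,)) for r in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(os.listdir(ckpt_dir)))
```

Without the barrier, a non-zero rank could race rank 0 and fail to open a file in a directory that does not exist yet.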
erictang000 (Collaborator):
so other than the scheduler there's no other additional memory load or communication here?

tyler-griggs (Member Author):
If I understand your question correctly, no! We should just be loading the sharded model and optimizer state dicts.
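A rough back-of-envelope shows why loading sharded rather than full state dicts matters for memory. The figures below assume fp32 Adam state (parameter copy plus two moment buffers, about 12 bytes per parameter); the exact numbers depend on the optimizer and precision config:

```python
# Back-of-envelope: full vs. per-rank optimizer state memory.
# Assumes fp32 Adam: param + exp_avg + exp_avg_sq = 12 bytes/param
# (actual figures depend on optimizer and precision settings).
params = 7_000_000_000
bytes_per_param = 12
full_gb = params * bytes_per_param / 1e9
per_rank_gb = full_gb / 8   # e.g. state sharded across 8 ranks
print(f"full: {full_gb:.0f} GB, per rank: {per_rank_gb:.1f} GB")
# -> full: 84 GB, per rank: 10.5 GB
```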

@erictang000 erictang000 merged commit 9f4dcb9 into NovaSky-AI:main Sep 18, 2025
3 checks passed
SumanthRH added a commit that referenced this pull request Sep 19, 2025
# What does this PR do?

#298 broke GPU CI a bit:
1. Megatron-related dependencies have not been resolved properly on
`main` yet, so this test should be skipped.
2. We use a simple 4xL4 instance, but the test was modified in #298
to request 8 GPUs (non-colocated training).

---------

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
SumanthRH added a commit that referenced this pull request Sep 21, 2025
# What does this PR do?


Fixes async trainer example after #298. We renamed
`setup_policy_and_generator` to `init_weight_sync_state` but missed the update in some places. 

---------

Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
dzorlu pushed commits to fleet-ai/SkyRL that referenced this pull request Feb 4, 2026