
[Megatron] Add checkpointing support #298

Merged
erictang000 merged 16 commits into NovaSky-AI:main from tyler-griggs:tgriggs/megatron_ckpt
Sep 18, 2025

Conversation

tyler-griggs (Member) commented Sep 15, 2025:

What does this PR do?

This PR implements support for `save_checkpoint` and `load_checkpoint` for the Megatron training backend. We use Megatron's `dist_checkpointing` library to perform checkpointing in parallel across ranks, which also allows for reloading the checkpoints in a different parallelism scheme.

Other minor changes:

  • Renamed `save/load_ckpt` to `save/load_checkpoint`
  • Removed unused arguments to `save/load_checkpoint`, primarily non-backend-specific state (`tag`, `client_state`, `global_step`). This change keeps training backend checkpointing logic focused on the training backend's state.
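The key property that motivates `dist_checkpointing` — each rank saving its own shard with enough metadata that a different parallelism layout can reload it — can be illustrated with a small self-contained simulation. This is a toy sketch with hypothetical helper names, not the Megatron API:

```python
# Toy illustration of resharded checkpoint reload (not the Megatron API).
# Each "rank" saves its shard with a global offset; on load, a different
# number of ranks can reassemble their shards from that metadata.

def save_sharded(params, world_size):
    """Split a flat parameter list into per-rank shards with offsets."""
    shard_size = len(params) // world_size
    checkpoint = {}
    for rank in range(world_size):
        start = rank * shard_size
        checkpoint[rank] = {"offset": start, "data": params[start:start + shard_size]}
    return checkpoint

def load_resharded(checkpoint, new_world_size):
    """Reassemble the full parameter from offsets, then re-slice for a new layout."""
    full = []
    for shard in sorted(checkpoint.values(), key=lambda s: s["offset"]):
        full.extend(shard["data"])
    shard_size = len(full) // new_world_size
    return [full[r * shard_size:(r + 1) * shard_size] for r in range(new_world_size)]

params = list(range(8))
ckpt = save_sharded(params, world_size=4)          # saved under a 4-way layout
shards = load_resharded(ckpt, new_world_size=2)    # reloaded under a 2-way layout
print(shards)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

The real library does this per-tensor with proper sharding specs; the point here is only that offsets, not rank count, define the checkpoint layout.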

Testing

  • Extended two GPU checkpointing tests to cover Megatron
    • Also moves `test_save_load_checkpoint.py` into `gpu_ci`. Note, however, that Megatron tests are disabled in CI because they currently require a different `flash-attn` install.
  • Manually saved and resumed several times:
    (screenshot: Screenshot 2025-09-15 at 3 30 17 PM)

What's next?

  • Test multi-node checkpointing
  • Implement `save_hf_model`

tyler-griggs (Member Author) commented Sep 15, 2025:
Note: These trainer updates may need to be changed after #297 is merged

```python
self.init_weight_sync_state()

# Load policy model to GPU before loading checkpoint.
if self.cfg.trainer.placement.colocate_all:
    ...
```
tyler-griggs (Member Author):
Note: Policy model needs to be on GPU for Megatron load_checkpoint as required by Megatron's dist_checkpoint library

```python
print("Phase 3: Verify state consistency")

# Compare captured states
for key in state_before:
    ...
```
tyler-griggs (Member Author):
Note: this was only confirming global_step, which we already do
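A fuller consistency check than comparing `global_step` alone would capture state before saving, perturb it, reload, and compare every key. A minimal sketch of that pattern, with a plain dict standing in for backend state (hypothetical names, not the repo's test code):

```python
# Minimal save/load round-trip check; a dict stands in for trainer state
# (illustrative only -- the real test compares backend-specific state).
import copy

def save_checkpoint(state):
    return copy.deepcopy(state)   # stand-in for writing shards to disk

def load_checkpoint(ckpt):
    return copy.deepcopy(ckpt)    # stand-in for reading shards from disk

state = {"global_step": 10, "weights": [0.1, 0.2], "lr": 1e-4}
state_before = copy.deepcopy(state)

ckpt = save_checkpoint(state)
state["weights"] = [9.9, 9.9]     # perturb, as if training continued

state = load_checkpoint(ckpt)
for key in state_before:          # compare every captured key,
    assert state[key] == state_before[key]  # not just global_step
print("state consistent after save/load")
```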

@tyler-griggs tyler-griggs marked this pull request as ready for review September 15, 2025 23:48
erictang000 (Collaborator) left a comment:
Pretty much LGTM, thanks! Let's just merge main after #297 is merged to make the trainer changes consistent.

For multi-node checkpointing and `save_hf_model`: we'll want both, but I can help test multi-node as I get model training running, and we can rely on external scripts for converting Megatron checkpoints to HF for a little while.

```python
# All ranks wait for the checkpoint directory to be created before saving.
dist.barrier()

# Collect the sharded state dicts for model and optimizer, and the full state dict for the scheduler.
```
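The "rank 0 creates the directory, everyone barriers, then each rank writes its own shard" pattern behind that snippet can be sketched with threads standing in for `torch.distributed` ranks. This is an illustration of the synchronization pattern, not the PR's actual code:

```python
# Sketch of: rank 0 creates the checkpoint dir, all ranks barrier, then
# every rank writes its own shard. Threads simulate distributed ranks.
import os
import tempfile
import threading

WORLD_SIZE = 4
barrier = threading.Barrier(WORLD_SIZE)   # analogous to dist.barrier()
ckpt_dir = os.path.join(tempfile.mkdtemp(), "ckpt_step_10")

def rank_worker(rank):
    if rank == 0:
        os.makedirs(ckpt_dir, exist_ok=True)   # only rank 0 creates the dir
    barrier.wait()                             # all ranks wait until it exists
    with open(os.path.join(ckpt_dir, f"shard_{rank}.pt"), "w") as f:
        f.write(f"state for rank {rank}")      # each rank writes its own shard

threads = [threading.Thread(target=rank_worker, args=(r,)) for r in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(os.listdir(ckpt_dir)))
```

Without the barrier, a non-zero rank could race rank 0 and fail to open a file in a directory that does not exist yet.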
erictang000 (Collaborator):
so other than the scheduler there's no other additional memory load or communication here?

tyler-griggs (Member Author):
If I understand your question correctly, no! We should just be loading the sharded model and optimizer state dicts.
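A rough back-of-envelope shows why loading sharded rather than full state dicts matters for memory. The figures below assume fp32 Adam state (parameter copy plus two moment buffers, about 12 bytes per parameter); the exact numbers depend on the optimizer and precision config:

```python
# Back-of-envelope: full vs. per-rank optimizer state memory.
# Assumes fp32 Adam: param + exp_avg + exp_avg_sq = 12 bytes/param
# (actual figures depend on optimizer and precision settings).
params = 7_000_000_000
bytes_per_param = 12
full_gb = params * bytes_per_param / 1e9
per_rank_gb = full_gb / 8   # e.g. state sharded across 8 ranks
print(f"full: {full_gb:.0f} GB, per rank: {per_rank_gb:.1f} GB")
# -> full: 84 GB, per rank: 10.5 GB
```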

@erictang000 erictang000 merged commit 9f4dcb9 into NovaSky-AI:main Sep 18, 2025
3 checks passed
SumanthRH added a commit that referenced this pull request Sep 19, 2025
# What does this PR do?

#298 broke GPU CI a bit:
1. Megatron-related dependencies have not been resolved properly on
`main` yet, so this test should be skipped.
2. We use a simple 4xL4 instance, but the test was modified in #298
to request 8 GPUs (non-colocated training).

---------

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
SumanthRH added a commit that referenced this pull request Sep 21, 2025
# What does this PR do?


Fixes async trainer example after #298. We renamed
`setup_policy_and_generator` to `init_weight_sync_state` but missed the update in some places. 

---------

Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
dzorlu pushed commits to fleet-ai/SkyRL that referenced this pull request Feb 4, 2026