To support large-scale multi-node MoE training, we will integrate the Megatron training backend.
Tasks:
- Initial Megatron Support for GRPO on TP + PP for dense models: [trainer] Initial Megatron TP + PP Support #223
- Sequence packing + context parallel support: [trainer][megatron] Sequence packing + Context Parallel for Megatron #274
- Checkpointing support: [Megatron] Add checkpointing support #298
- EP support for MoE models: [trainer][megatron] Enable expert model and expert tensor parallel for MoE models #285
- Migration to Megatron-Bridge: [megatron] upgrade from mbridge -> Megatron-Bridge (breaking change) #453
- LoRA support via Megatron-Bridge
- Unify the gradient checkpointing flag
- Add a guide for debugging and working around OOM errors
- Enable Megatron FSDP
- Disaggregated Training
- PPO support (just needs a Critic and testing)
- Add Megatron optimizer/scheduler pass-through config options
- Virtual Pipeline Parallel Support
- Dynamic batch sizing
- Efficiency optimizations (optimized kernels, bucketed weight updates): [megatron] Added non cuda ipc wt sync to megatron workers #635
- Support FlashRL + Megatron integration
- Documentation
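
For background on the parallelism dimensions referenced above (TP, PP, CP, EP), here is a minimal sketch of how Megatron-Core's parallel state is typically initialized in recent versions. The helper name and the sizes are illustrative assumptions, not this repo's actual configuration or defaults:

```python
# Illustrative only: typical Megatron-Core parallel-state initialization.
# Assumes Megatron-Core is installed and the process was launched with
# torchrun (so torch.distributed env vars are set). Sizes are hypothetical.
import torch
import torch.distributed as dist
from megatron.core import parallel_state


def init_megatron_parallelism(tp: int = 2, pp: int = 2, cp: int = 1, ep: int = 1):
    """Set up TP / PP / CP / EP process groups for one training job.

    world_size must be divisible by tp * pp * cp; the remaining factor
    becomes the data-parallel size.
    """
    if not dist.is_initialized():
        dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    parallel_state.initialize_model_parallel(
        tensor_model_parallel_size=tp,    # TP: shard each layer's weights
        pipeline_model_parallel_size=pp,  # PP: split layers into pipeline stages
        context_parallel_size=cp,         # CP: shard the sequence dimension
        expert_model_parallel_size=ep,    # EP: shard MoE experts across ranks
    )


if __name__ == "__main__":
    init_megatron_parallelism()
```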