[trainer][megatron] Sequence packing + Context Parallel for Megatron #274
Merged
SumanthRH merged 3 commits into NovaSky-AI:main on Sep 10, 2025
Conversation
SumanthRH (Member) reviewed on Sep 9, 2025:
Looks good! As a sanity check, let's just benchmark seq packing time with the current packing impl.
ztcanddota added a commit to ztcanddota/skyagent that referenced this pull request on Sep 28, 2025. Commit message:

…(#274)

# Overview
Adds sequence packing + context parallel support for the Megatron backend. Note that context parallel without sequence packing is not supported.

## Correctness Check
### CP + TP + PP
<img width="368" height="275" alt="image" src="https://github.com/user-attachments/assets/53fdd009-3af9-4352-8e63-7604b2dfdeee" />
### Just Sequence Packing
<img width="366" height="278" alt="image" src="https://github.com/user-attachments/assets/9a40dfdf-af8c-44e8-bc54-78e13d187daa" />
### Just CP + Sequence Packing
<img width="364" height="281" alt="image" src="https://github.com/user-attachments/assets/c69522e8-52b1-4581-8a66-a579b29bbb0d" />
### Timing
Adding CP is slower, as expected; adding just sequence packing is also slightly slower for tp=2, pp=2.
<img width="362" height="286" alt="image" src="https://github.com/user-attachments/assets/9109ce98-0740-46ce-8a92-de5cd8cf2ec2" />
This seems to be due to the overhead of computing rotary positional embeddings: without sequence packing it is a single batched call on a well-formed tensor, while with sequence packing it iterates over the sequences one by one: NovaSky-AI/SkyRL#274 (comment)
SungjunlaLee added a commit to SungjunlaLee/SkyRL that referenced this pull request on Jan 3, 2026 (same commit message).
dzorlu pushed a commit to fleet-ai/SkyRL that referenced this pull request on Feb 4, 2026 (same commit message).





Overview
Adds sequence packing + context parallel support for the Megatron backend. Note that context parallel without sequence packing is not supported.
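As background, sequence packing typically concatenates variable-length sequences into one flat token stream and tracks boundaries with a cumulative-lengths array (often called `cu_seqlens`), which varlen attention kernels use to keep sequences from attending across boundaries. A minimal sketch of that bookkeeping in plain Python (illustrative only; `pack_sequences` is a hypothetical helper, not this PR's implementation):

```python
from itertools import accumulate

def pack_sequences(seqs):
    """Concatenate variable-length token sequences into one flat stream and
    record cumulative boundary offsets (cu_seqlens). Varlen attention kernels
    use these offsets to prevent cross-sequence attention. Sketch only."""
    packed = [tok for seq in seqs for tok in seq]
    cu_seqlens = [0] + list(accumulate(len(s) for s in seqs))
    return packed, cu_seqlens

packed, cu_seqlens = pack_sequences([[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]])
print(len(packed))   # 10
print(cu_seqlens)    # [0, 3, 8, 10]
```

Context parallel then shards this single packed stream across CP ranks, which is one reason CP here is only supported together with packing.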
Correctness Check
CP + TP + PP
Just Sequence Packing
Just CP + Sequence Packing
Timing
Adding CP is slower, as expected; adding just sequence packing is also slightly slower for tp=2, pp=2 (but the micro batch size is only 4, so this might just be the overhead of handling packing).
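The linked comment attributes the packing overhead to rotary positional embedding computation: without packing, the rotary angles are computed in one batched call over a well-formed tensor, whereas a naive packed path loops over the `cu_seqlens` boundaries one sequence at a time because positions restart at each boundary. A toy sketch of the two paths (`rope_angles` and the boundary values are hypothetical, for illustration only):

```python
import math

def rope_angles(positions, dim=8, base=10000.0):
    """Rotary angles for a list of token positions (one row per position,
    dim // 2 frequency pairs per row). Toy illustration, not Megatron code."""
    inv_freq = [base ** (-2 * i / dim) for i in range(dim // 2)]
    return [[p * f for f in inv_freq] for p in positions]

# Unpacked case: one vectorized call for positions 0..L-1, reused for
# every row of the (B, L) batch.
batched = rope_angles(list(range(6)))

# Packed case: positions restart at every sequence boundary, so a naive
# implementation iterates over the cu_seqlens segments one by one.
cu_seqlens = [0, 3, 8, 10]
per_seq = []
for start, end in zip(cu_seqlens, cu_seqlens[1:]):
    per_seq.append(rope_angles(list(range(end - start))))
```

The per-sequence Python-level loop is the kind of overhead that would show up as a small slowdown at micro batch size 4, consistent with the timing table above.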
