[trainer][megatron] Sequence packing + Context Parallel for Megatron #274
Merged
SumanthRH merged 3 commits into NovaSky-AI:main on Sep 10, 2025
Conversation
SumanthRH (Member) reviewed on Sep 9, 2025:
Looks good! As a sanity check, let's just benchmark seq packing time with the current packing impl.
ztcanddota added a commit to ztcanddota/skyagent that referenced this pull request on Sep 28, 2025. Commit message:

…(#274)

# Overview
Adds sequence packing + context parallel support for the Megatron backend. Note that context parallel without sequence packing is not supported.

## Correctness Check
### CP + TP + PP
<img width="368" height="275" alt="image" src="https://github.com/user-attachments/assets/53fdd009-3af9-4352-8e63-7604b2dfdeee" />
### Just Sequence Packing
<img width="366" height="278" alt="image" src="https://github.com/user-attachments/assets/9a40dfdf-af8c-44e8-bc54-78e13d187daa" />
### Just CP + Sequence Packing
<img width="364" height="281" alt="image" src="https://github.com/user-attachments/assets/c69522e8-52b1-4581-8a66-a579b29bbb0d" />
### Timing
Adding CP is slower, as expected; adding just sequence packing is also slightly slower for tp=2, pp=2.
<img width="362" height="286" alt="image" src="https://github.com/user-attachments/assets/9109ce98-0740-46ce-8a92-de5cd8cf2ec2" />
This seems to be due to the overhead of computing rotary positional embeddings: without sequence packing it is a single batched call on a well-formed tensor, while with sequence packing it iterates over the sequences one by one: NovaSky-AI/SkyRL#274 (comment)
SungjunlaLee added a commit to SungjunlaLee/SkyRL that referenced this pull request on Jan 3, 2026 (same commit message).
dzorlu pushed a commit to fleet-ai/SkyRL that referenced this pull request on Feb 4, 2026 (same commit message).





Overview
Adds sequence packing + context parallel support for the Megatron backend. Note that context parallel without sequence packing is not supported.
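As background, sequence packing typically concatenates variable-length sequences into one flat token stream and tracks boundaries with a cumulative-lengths array (often called `cu_seqlens`), which varlen attention kernels use to keep sequences from attending across boundaries. A minimal sketch of that bookkeeping in plain Python (illustrative only; `pack_sequences` is a hypothetical helper, not this PR's implementation):

```python
from itertools import accumulate

def pack_sequences(seqs):
    """Concatenate variable-length token sequences into one flat stream and
    record cumulative boundary offsets (cu_seqlens). Varlen attention kernels
    use these offsets to prevent cross-sequence attention. Sketch only."""
    packed = [tok for seq in seqs for tok in seq]
    cu_seqlens = [0] + list(accumulate(len(s) for s in seqs))
    return packed, cu_seqlens

packed, cu_seqlens = pack_sequences([[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]])
print(len(packed))   # 10
print(cu_seqlens)    # [0, 3, 8, 10]
```

Context parallel then shards this single packed stream across CP ranks, which is one reason CP here is only supported together with packing.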
Correctness Check
CP + TP + PP
Just Sequence Packing
Just CP + Sequence Packing
Timing
Adding CP is slower, as expected; adding just sequence packing is also slightly slower for tp=2, pp=2 (but the micro batch size is only 4, so this might just be the overhead of handling packing).
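The linked comment attributes the packing overhead to rotary positional embedding computation: without packing, the rotary angles are computed in one batched call over a well-formed tensor, whereas a naive packed path loops over the `cu_seqlens` boundaries one sequence at a time because positions restart at each boundary. A toy sketch of the two paths (`rope_angles` and the boundary values are hypothetical, for illustration only):

```python
import math

def rope_angles(positions, dim=8, base=10000.0):
    """Rotary angles for a list of token positions (one row per position,
    dim // 2 frequency pairs per row). Toy illustration, not Megatron code."""
    inv_freq = [base ** (-2 * i / dim) for i in range(dim // 2)]
    return [[p * f for f in inv_freq] for p in positions]

# Unpacked case: one vectorized call for positions 0..L-1, reused for
# every row of the (B, L) batch.
batched = rope_angles(list(range(6)))

# Packed case: positions restart at every sequence boundary, so a naive
# implementation iterates over the cu_seqlens segments one by one.
cu_seqlens = [0, 3, 8, 10]
per_seq = []
for start, end in zip(cu_seqlens, cu_seqlens[1:]):
    per_seq.append(rope_angles(list(range(end - start))))
```

The per-sequence Python-level loop is the kind of overhead that would show up as a small slowdown at micro batch size 4, consistent with the timing table above.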
