
[trainer][megatron] Sequence packing + Context Parallel for Megatron#274

Merged
SumanthRH merged 3 commits into NovaSky-AI:main from erictang000:pack_and_context_parallel
Sep 10, 2025

Conversation

@erictang000 (Collaborator) commented Sep 9, 2025

Overview

Adds sequence packing and context parallel (CP) support for the Megatron backend. Note that context parallel without sequence packing is not supported.
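For context, sequence packing concatenates variable-length sequences along a single token dimension and tracks sequence boundaries with cumulative sequence lengths (the `cu_seqlens` convention used by Megatron/FlashAttention-style kernels). A minimal sketch of the idea, with illustrative names rather than this PR's actual code:

```python
import torch

def pack_sequences(seqs):
    """Concatenate variable-length 1-D token tensors into one packed
    tensor, returning cumulative sequence lengths (cu_seqlens)."""
    lengths = torch.tensor([s.numel() for s in seqs])
    cu_seqlens = torch.zeros(len(seqs) + 1, dtype=torch.int32)
    cu_seqlens[1:] = torch.cumsum(lengths, dim=0)
    packed = torch.cat(seqs)
    return packed, cu_seqlens

# Example: three sequences of lengths 3, 5, and 2
seqs = [torch.arange(3), torch.arange(5), torch.arange(2)]
packed, cu_seqlens = pack_sequences(seqs)
# packed has 10 tokens; cu_seqlens marks boundaries [0, 3, 8, 10]
```

Attention kernels then use `cu_seqlens` so tokens only attend within their own sequence, avoiding padding waste.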

Correctness Check

CP + TP + PP

image

Just Sequence Packing

image

Just CP + Sequence Packing

image

Timing

Adding CP is slower, as expected. Adding just sequence packing is also slightly slower for tp=2, pp=2 (but the micro batch size is only 4, so this might just be the overhead of handling packing).
image
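For reference, context parallel shards the (packed) sequence dimension across GPUs, which is why it needs the packed layout. A toy sketch of the per-rank slicing (illustrative only; Megatron additionally load-balances chunks across ranks for causal attention):

```python
import torch

def cp_shard(packed, cp_size):
    """Split a packed token tensor into cp_size contiguous shards,
    one per context-parallel rank (toy version; real implementations
    balance causal-attention work across ranks)."""
    assert packed.shape[0] % cp_size == 0, "pad packed length to a multiple of cp_size"
    return torch.chunk(packed, cp_size, dim=0)

packed = torch.randn(16, 8)        # 16 packed tokens, hidden size 8
shards = cp_shard(packed, cp_size=4)
# each of the 4 ranks processes a (4, 8) shard of the sequence dimension
```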

@erictang000 erictang000 marked this pull request as ready for review September 9, 2025 21:46
@SumanthRH (Member) left a comment


Looks good! As a sanity check, let's benchmark sequence packing time with the current packing implementation.

@erictang000 (Collaborator, Author)

The forward pass seems to simply be faster without sequence packing at this micro batch size:

without seq pack (~78 ms)
image

with seq pack (~105 ms)
image

This is consistent with the ~25% slowdown in overall time:
image

@erictang000 (Collaborator, Author)

Narrowing it down further, the gap seems to be in the application of rotary positional embeddings:

with seq pack:
image

without seq pack:
image
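The difference can be illustrated with a simplified RoPE application (a toy sketch of the batched vs. per-sequence code paths, not Megatron's implementation): without packing, RoPE is one batched call over a well-formed tensor; with packing, positions restart at each boundary, so it loops over sequence slices.

```python
import torch

def apply_rope(x, freqs):
    """Toy rotary embedding: rotate (even, odd) channel pairs of x
    by per-position angles freqs of shape (seq, dim // 2)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = freqs.cos(), freqs.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

dim = 8
freqs = torch.rand(16, dim // 2)   # angles for up to 16 positions

# Unpacked path: a single batched call over a (batch, seq, dim) tensor
batch = torch.randn(4, 16, dim)
batched_out = apply_rope(batch, freqs)

# Packed path: positions restart at every sequence boundary, so RoPE
# is applied per sequence slice -- many small ops instead of one big one
packed = torch.randn(10, dim)       # sequences of lengths 3, 5, 2
cu_seqlens = [0, 3, 8, 10]
parts = [apply_rope(packed[s:e], freqs[: e - s])
         for s, e in zip(cu_seqlens[:-1], cu_seqlens[1:])]
packed_out = torch.cat(parts)
```

The per-slice loop launches one small kernel per sequence, which matches the profiled slowdown pattern at small micro batch sizes.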

@SumanthRH SumanthRH merged commit b9cc7e8 into NovaSky-AI:main Sep 10, 2025
3 checks passed
ztcanddota added a commit to ztcanddota/skyagent that referenced this pull request Sep 28, 2025
…(#274)

# Overview
Adds sequence packing + context parallel support for megatron backend.
Note that context parallel without sequence packing is not supported.


## Correctness Check
### CP + TP + PP
<img width="368" height="275" alt="image"
src="https://github.com/user-attachments/assets/53fdd009-3af9-4352-8e63-7604b2dfdeee"
/>

### Just Sequence Packing
<img width="366" height="278" alt="image"
src="https://github.com/user-attachments/assets/9a40dfdf-af8c-44e8-bc54-78e13d187daa"
/>

### Just CP + Sequence Packing
<img width="364" height="281" alt="image"
src="https://github.com/user-attachments/assets/c69522e8-52b1-4581-8a66-a579b29bbb0d"
/>

### Timing
Adding CP is slower, as expected; adding just sequence packing is also
slightly slower for tp=2, pp=2.

<img width="362" height="286" alt="image"
src="https://github.com/user-attachments/assets/9109ce98-0740-46ce-8a92-de5cd8cf2ec2"
/>

This seems to be because of overhead in computing rotary positional embeddings: without sequence packing, it's a batched call over a well-formed tensor, while with sequence packing, it iterates over sequences one by one: NovaSky-AI/SkyRL#274 (comment)
SungjunlaLee added a commit to SungjunlaLee/SkyRL that referenced this pull request Jan 3, 2026
dzorlu pushed a commit to fleet-ai/SkyRL that referenced this pull request Feb 4, 2026