[Executor] Fixed the issue of CUDA graph execution failure caused by different branches during decoding #3223
Merged: gongshaotian merged 5 commits into PaddlePaddle:develop from littledgg:long_seq_cudagraph on Aug 8, 2025 (+47 −47).
What is the reason for this change? If it is to keep num_chunk fixed, I suggest using max_seq_len.
Won't this introduce resource redundancy when launching the kernel? Does performance drop as a result?
Regarding resource redundancy: the allocated GPU memory will definitely be somewhat redundant, and that is hard to avoid. As for compute resources, multi_query_append_attention_warp1_4_kernel uses num_chunks (computed from the longest sequence in the current batch) in its launch configuration, so some extra blocks are indeed launched. However, the original design already had to handle batches where requests have different num_chunks_this_seq (computed per sequence), so it already exits early to avoid wasting compute.
In merge_multi_chunks_decoder_kernel, the launch configuration is independent of num_chunks, which is fairly CUDA-graph friendly; internally, num_chunks_this_seq is handled with a loop. The only use of num_chunks there is to compute some offsets, so this kernel's resource utilization is effectively identical before and after the change.
As for performance, the test results posted earlier show that decode speed does drop somewhat, and enabling CUDA graph does not fully make up for it. Latency still went down, though, because the achievable concurrency increased; why concurrency increased remains to be analyzed.
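For illustration, here is a minimal CUDA sketch of the early-exit pattern described above; the kernel name, signature, and chunking arithmetic are hypothetical stand-ins, not the actual FastDeploy kernels. The grid's chunk dimension is sized from the batch-wide num_chunks, and each block compares it against its own sequence's num_chunks_this_seq:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel illustrating the early-exit pattern: the grid is
// launched with a fixed, batch-wide chunk dimension (from the longest
// sequence, or from max_seq_len), and blocks beyond a given sequence's
// own chunk count return immediately.
__global__ void chunked_decode_kernel(const int* seq_lens,  // per-sequence lengths
                                      int chunk_size,
                                      float* out) {
    const int seq_id = blockIdx.y;    // one row of blocks per sequence
    const int chunk_id = blockIdx.x;  // fixed, batch-wide chunk dimension

    // Chunks actually needed by *this* sequence (num_chunks_this_seq).
    const int num_chunks_this_seq =
        (seq_lens[seq_id] + chunk_size - 1) / chunk_size;

    // Early exit: redundant blocks launched for shorter sequences do no
    // work, so the fixed launch shape wastes little compute.
    if (chunk_id >= num_chunks_this_seq) return;

    // ... attention work for (seq_id, chunk_id) would go here ...
    out[seq_id * gridDim.x + chunk_id] = 0.0f;  // placeholder write
}
```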
Yes, it is indeed to keep num_chunk fixed. encoder_max_partition_size is currently assigned from max_seq_len, so for now they can be regarded as the same thing, but the intended meaning of encoder_max_partition_size may change later, and max_seq_len is easier to understand, so max_seq_len should be used.
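As a hedged sketch of that suggestion (the chunk-size constant and helper name below are assumptions, not FastDeploy APIs): deriving num_chunks from the static max_seq_len makes the launch shape depend only on server configuration, so every decode step replays with the same grid that CUDA graph capture recorded:

```cuda
#include <cuda_runtime.h>

constexpr int kChunkSize = 1024;  // assumed partition size

// Same value every step: depends only on the model/server config,
// never on the sequences currently in the batch.
inline int fixed_num_chunks(int max_seq_len) {
    return (max_seq_len + kChunkSize - 1) / kChunkSize;
}

// Usage during graph capture and replay (launch shape never changes):
//   dim3 grid(fixed_num_chunks(max_seq_len), batch_size);
//   chunked_decode_kernel<<<grid, 128, 0, stream>>>(seq_lens, kChunkSize, out);
```

Had num_chunks instead been computed from the batch's longest sequence, the grid dimensions would differ between capture and replay, which is exactly the branch divergence the PR title describes.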