Merged
4 changes: 2 additions & 2 deletions custom_ops/gpu_ops/append_attn/append_attention_c16_impl.cuh
@@ -1061,12 +1061,12 @@ void MultiQueryAppendAttention(
if (!is_decoder) {
chunk_size = static_cast<uint32_t>(encoder_max_partition_size);
}
-  const int num_chunks = div_up(max_dec_len, chunk_size);
+  const int num_chunks = div_up(encoder_max_partition_size, chunk_size);
Collaborator
What is the reason for this change? If the goal is to fix num_chunks, I suggest using max_seq_len instead.

Collaborator
@yuanlehome yuanlehome Aug 7, 2025
Won't this cause resource redundancy in the kernel launch? Does performance drop as a result?

Contributor Author
Regarding resource redundancy: the allocated GPU memory will certainly be redundant, and that is hard to avoid. As for compute resources, in multi_query_append_attention_warp1_4_kernel the launch parameters are derived from num_chunks (computed from the longest seq in the current batch), so some extra blocks are indeed launched. However, the original design already had to handle batches containing requests with different num_chunks_this_seq (computed from each seq itself), so it already relied on an early exit

if (chunk_idx >= num_chunks_this_seq) {
  return;
}

to avoid wasting compute resources.
In merge_multi_chunks_decoder_kernel, the launch parameters are independent of num_chunks, which is CUDA-graph friendly; num_chunks_this_seq is handled internally with a loop. The only thing there that depends on num_chunks is computing some offsets, so that kernel's resource utilization is identical before and after this change.
As for performance: the test results posted earlier do show that decode speed drops, and enabling CUDA graph does not fully recover it. However, latency decreased because concurrency went up; why concurrency increased is still under analysis.

Contributor Author
@littledgg littledgg Aug 7, 2025
Yes, the change is indeed to fix num_chunks. encoder_max_partition_size is currently assigned from max_seq_len, so for now they can be treated as the same thing, but the meaning of encoder_max_partition_size may change later, and max_seq_len is easier to understand, so max_seq_len should be used.
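The motivation for fixing num_chunks can be shown with a small sketch. A captured CUDA graph bakes grid dimensions into the graph, so a `gridDim.y` derived from the batch-dependent max_dec_len would invalidate replay; deriving it from a fixed upper bound keeps it constant. The numeric values below are illustrative assumptions, not the project's actual configuration.

```cpp
// Integer ceiling division, matching div_up in the diff.
inline int div_up(int a, int b) { return (a + b - 1) / b; }

// Hypothetical fixed model limit and partition/chunk size (illustrative only).
constexpr int kMaxSeqLen = 8192;
constexpr int kChunkSize = 1024;

// Before the change: num_chunks depends on max_dec_len, which varies per
// batch, so the launch grid changes between batches.
inline int num_chunks_old(int max_dec_len) {
  return div_up(max_dec_len, kChunkSize);
}

// After the change: num_chunks is derived from a fixed upper bound, so the
// grid dimension is identical for every batch and a captured graph can be
// replayed without re-capture.
inline int num_chunks_new() { return div_up(kMaxSeqLen, kChunkSize); }
```

Two batches with max_dec_len 3000 and 5000 would have produced grids of 3 and 5 chunks under the old formula, while the new formula always yields 8 for these assumed constants.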


dim3 grids(num_blocks_x_cpu, num_chunks, kv_num_heads);
dim3 blocks(32, num_warps);

-  if (num_chunks <= 1) {
+  if (num_chunks <= 0) {
gongshaotian marked this conversation as resolved.
auto nosplit_kv_kernel =
multi_query_append_attention_warp1_4_kernel<NV_TYPE,
false,
4 changes: 2 additions & 2 deletions custom_ops/gpu_ops/append_attn/append_attention_c4_impl.cuh
@@ -1285,10 +1285,10 @@ void MultiQueryAppendC4Attention(
if (!is_decoder) {
chunk_size = static_cast<uint32_t>(encoder_max_partition_size);
}
-  const int num_chunks = div_up(max_dec_len, chunk_size);
+  const int num_chunks = div_up(encoder_max_partition_size, chunk_size);
dim3 grids(num_blocks_x_cpu, num_chunks, kv_num_heads);
dim3 blocks(32, num_warps);
-  if (num_chunks <= 1) {
+  if (num_chunks <= 0) {
auto nosplit_kv_kernel =
multi_query_append_attention_c4_warp1_4_kernel<NV_TYPE,
uint8_t,
4 changes: 2 additions & 2 deletions custom_ops/gpu_ops/append_attn/append_attention_c8_impl.cuh
@@ -1254,10 +1254,10 @@ void MultiQueryAppendC8Attention(
chunk_size = static_cast<uint32_t>(encoder_max_partition_size);
}

-  const int num_chunks = div_up(max_dec_len, chunk_size);
+  const int num_chunks = div_up(encoder_max_partition_size, chunk_size);
dim3 grids(num_blocks_x_cpu, num_chunks, kv_num_heads);
dim3 blocks(32, num_warps);
-  if (num_chunks <= 1) {
+  if (num_chunks <= 0) {
auto nosplit_kv_kernel =
multi_query_append_attention_c8_warp1_4_kernel<NV_TYPE,
uint8_t,