[Feat] Support FlashMLA backend with MTP and FP8 KV cache #6109


Merged: 61 commits, May 15, 2025

Conversation

quinnrong94
Contributor

@quinnrong94 quinnrong94 commented May 8, 2025

Motivation

This PR improves the FlashMLA backend by accelerating the decode stage with MTP. The implementation exploits FlashMLA's ability to handle seq_len_q > 1. To use FlashMLA with MTP, example server args are: --attention-backend flashmla --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2.

This PR also supports the FlashMLA backend with an FP8 KV cache, which halves KV cache usage and enables higher concurrency with longer input sequences when memory is limited. To enable the FP8 KV cache, one additional server arg needs to be added: --kv-cache-dtype fp8_e4m3.
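Combining both features, a full launch command might look like the following sketch. The flags are taken from this PR's description; $MODEL_PATH and the port are placeholders:

```shell
# Hypothetical launch combining FlashMLA + MTP (steps=1, topk=1, draft=2) + FP8 KV cache.
# $MODEL_PATH and --port are placeholders; adjust to your deployment.
python3 -m sglang.launch_server \
  --model-path $MODEL_PATH \
  --trust-remote-code \
  --attention-backend flashmla \
  --speculative-algorithm NEXTN \
  --speculative-num-steps 1 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 2 \
  --kv-cache-dtype fp8_e4m3 \
  --port 30000
```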

The speedup from MTP + FP8 KV cache is about 30%, with KV cache usage reduced by 50%:
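The 50% KV-cache reduction follows directly from element width: fp8_e4m3 stores one byte per element versus two for bf16. A rough sizing sketch, assuming a DeepSeek-style MLA cache of 576 values per token per layer (512 kv_lora_rank + 64 rope dim) and 61 layers; these numbers are illustrative, not taken from this PR:

```python
# Rough KV-cache sizing: MLA stores one compressed KV vector per token per layer.
# 576 = 512 (kv_lora_rank) + 64 (rope head dim) -- illustrative DeepSeek-V3-like numbers.
ELEMS_PER_TOKEN_PER_LAYER = 576
NUM_LAYERS = 61

def kv_bytes_per_token(bytes_per_elem: int) -> int:
    """Total KV-cache bytes consumed by one token across all layers."""
    return ELEMS_PER_TOKEN_PER_LAYER * NUM_LAYERS * bytes_per_elem

bf16 = kv_bytes_per_token(2)  # bf16: 2 bytes per element
fp8 = kv_bytes_per_token(1)   # fp8_e4m3: 1 byte per element
print(bf16, fp8, bf16 / fp8)  # halving bytes per element doubles token capacity
```

Since max_total_num_tokens is derived from the memory left for the KV pool, halving bytes per token roughly doubles the number of cacheable tokens at the same memory budget.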

[Screenshot: benchmark results, 2025-05-14]

Modifications

  • FlashMLA backend supports MTP, compatible with CUDA graph
  • FlashMLA backend supports FP8 KV cache

Checklist

@quinnrong94 quinnrong94 requested a review from merrymercy as a code owner May 8, 2025 03:57
@PopSoda2002
Contributor

PopSoda2002 commented May 14, 2025

Hi @quinnrong94, can you take a look at this CI failure? https://github.com/sgl-project/sglang/actions/runs/14996032913/job/42130798605?pr=6109

Hi @Fridge003, I saw the FlashMLA test fail in CI; I wonder if it's due to the same reason as #5587?

@zhaochenyang20
Collaborator

@quinnrong94 Let us rerun the CI for you; no need to rebase. Thanks!

@Fridge003
Collaborator

It seems to be a bug caused by this PR. It should be fixed by now.

@Fridge003
Collaborator

For Future PRs:

  • Do some profiling and check whether there are any bubbles caused by CPU/GPU synchronization
  • Support speculative-num-steps > 1
  • Support topk > 1

@zhyncs zhyncs merged commit 2e4babd into sgl-project:main May 15, 2025
104 of 115 checks passed
@mahaocong90

mahaocong90 commented May 19, 2025

Hi @quinnrong94, I have a question: can FlashMLA be used with MTP 3-1-4? I tested on an H20 141G, and the benchmark reported a memory leak at the end:

Scheduler hit an exception: Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 2270, in run_scheduler_process
scheduler.event_loop_normal()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 643, in event_loop_normal
self.check_memory()
File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1235, in check_memory
raise ValueError(msg)
ValueError: token_to_kv_pool_allocator memory leak detected! available_size=1058048, protected_size=0, self.max_total_num_tokens=1100224
self.token_to_kv_pool_allocator.available_size()=1058048
self.tree_cache.evictable_size()=0

My sglang launch command is: python3 -m sglang.launch_server --model-path $R1_MODEL_PATH --tp $TP --trust-remote-code --port $PORT --host 0.0.0.0 --mem-fraction-static 0.85 --max-running-requests $max_running_requests --disable-radix-cache --attention-backend flashmla --speculative-algorithm NEXTN --speculative-draft $NextN_MODEL_PATH --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --kv-cache-dtype fp8_e4m3
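For context, the check that raises in the traceback above enforces a token-accounting invariant: every KV-cache slot should be free, evictable, or protected. A simplified sketch of that invariant (argument names assumed for illustration; this is not the actual sglang implementation):

```python
def check_memory(available_size: int, evictable_size: int,
                 protected_size: int, max_total_num_tokens: int) -> None:
    """Simplified version of the invariant behind the traceback above:
    after all requests finish, free + evictable + protected slots
    must add back up to the configured total."""
    accounted = available_size + evictable_size + protected_size
    if accounted < max_total_num_tokens:
        leaked = max_total_num_tokens - accounted
        raise ValueError(
            f"token_to_kv_pool_allocator memory leak detected! "
            f"{leaked} token slots unaccounted for"
        )

# Plugging in the values from the traceback:
# 1100224 - (1058048 + 0 + 0) = 42176 slots unaccounted for, hence the error.
```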

@quinnrong94
Contributor Author

speculative-num-steps > 1 is under development; please stay tuned!

@neiltian-tencent
Contributor

neiltian-tencent commented May 20, 2025

@mahaocong90 MTP 3-1-4 is fixed by neiltian-tencent@0987047.
The memory leak is caused by the page size of 64; we are working on a fix. FlashInfer MLA has the same problem.
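One plausible way a 64-token page size produces such an accounting mismatch (illustrative only; this is not the actual sglang allocator code): allocation is page-granular, so each request's last page carries unused slack, and if frees are counted in tokens rather than whole pages, that slack is never returned.

```python
import math

PAGE_SIZE = 64  # FlashMLA's paged KV cache uses 64-token blocks

def pages_allocated(num_tokens: int) -> int:
    # Allocation is page-granular: round the request up to whole pages.
    return math.ceil(num_tokens / PAGE_SIZE)

def slack_tokens(num_tokens: int) -> int:
    # Token slots reserved but unused in the request's last page.
    return pages_allocated(num_tokens) * PAGE_SIZE - num_tokens

# If free() credits back num_tokens instead of pages * PAGE_SIZE,
# the slack accumulates as "leaked" slots in the accounting check.
print(slack_tokens(100))  # 2 pages reserved, 28 slots of slack
```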

@mahaocong90

mahaocong90 commented May 20, 2025

@neiltian-tencent Does it work with FlashInfer MLA? I merged it into the main branch and tested with FlashMLA, but it still reports a memory leak.

@neiltian-tencent
Contributor

@mahaocong90 The MTP 3-1-4 bugs have been fixed; the memory leak will be fixed later.

Layssy pushed a commit to Layssy/sglang-iaas that referenced this pull request Jun 9, 2025
…t#6109)

Co-authored-by: Yingyi <[email protected]>
Co-authored-by: neiltian <[email protected]>
Co-authored-by: lukec <[email protected]>
Co-authored-by: kexueyu <[email protected]>
Co-authored-by: vincentmeng <[email protected]>
Co-authored-by: pengmeng <[email protected]>
xwu-intel pushed a commit to xwu-intel/sglang that referenced this pull request Jun 17, 2025
…t#6109)

Co-authored-by: Yingyi <[email protected]>
Co-authored-by: neiltian <[email protected]>
Co-authored-by: lukec <[email protected]>
Co-authored-by: kexueyu <[email protected]>
Co-authored-by: vincentmeng <[email protected]>
Co-authored-by: pengmeng <[email protected]>