[Feat] Support FlashMLA backend with MTP and FP8 KV cache #6109
Conversation
Force-pushed from 9550a0a to f7227c9
Hi @quinnrong94, can you take a look at this CI failure?
Hi @Fridge003, I saw the FlashMLA test failed in CI; I wonder if it's due to the same reason as #5587?
@quinnrong94 Let us rerun the CI for you, no need to rebase. Thanks.
It seems to be due to a bug caused by this PR. It should be fixed by now.
For Future PRs:
Hi @quinnrong94, I have a question: can FlashMLA be used with MTP 3-1-4? I tested on H20 141G, and the benchmark reported a memory leak at the end: "Scheduler hit an exception: Traceback (most recent call last):". My sglang start parameters are:
python3 -m sglang.launch_server --model-path $R1_MODEL_PATH --tp $TP --trust-remote-code --port $PORT --host 0.0.0.0 --mem-fraction-static 0.85 --max-running-requests $max_running_requests --disable-radix-cache --attention-backend flashmla --speculative-algorithm NEXTN --speculative-draft $NextN_MODEL_PATH --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --kv-cache-dtype fp8_e4m3
@mahaocong90 MTP 3,1,4 is fixed by neiltian-tencent@0987047 |
@neiltian-tencent Does it work with flashinfer MLA? I merged it into the main branch and tested using flashmla, but it still reports a memory leak.
@mahaocong90 The bugs of MTP 3, 1, 4 have been fixed; the memory leak will be fixed later.
…t#6109) Co-authored-by: Yingyi <[email protected]> Co-authored-by: neiltian <[email protected]> Co-authored-by: lukec <[email protected]> Co-authored-by: kexueyu <[email protected]> Co-authored-by: vincentmeng <[email protected]> Co-authored-by: pengmeng <[email protected]>
Motivation
This PR improves the FlashMLA backend by accelerating the decode stage with MTP (multi-token prediction). The implementation exploits FlashMLA's ability to handle seq_len_q > 1. To use FlashMLA with MTP, example server args are:
--attention-backend flashmla --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2
This PR also adds FP8 KV cache support to the FlashMLA backend, which halves KV cache usage and enables higher concurrency with longer input sequences when memory is limited. To enable the FP8 KV cache, add one additional server arg:
--kv-cache-dtype fp8_e4m3
The combined speedup of MTP + FP8 KV cache is about 30%, with KV cache usage reduced by 50%.
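Putting the two features together, a full launch command might look like the sketch below. The model paths, TP size, and port are placeholders (MODEL_PATH, DRAFT_PATH, TP, and PORT are not from this PR); only the attention-backend, speculative, and kv-cache-dtype flags come from the description above.

```shell
# Sketch of a combined MTP + FP8 KV cache launch, assuming a DeepSeek-style
# target model plus a NextN draft model; adjust paths and TP to your setup.
python3 -m sglang.launch_server \
  --model-path "$MODEL_PATH" \
  --speculative-draft "$DRAFT_PATH" \
  --tp "$TP" \
  --trust-remote-code \
  --port "$PORT" \
  --attention-backend flashmla \
  --speculative-algorithm NEXTN \
  --speculative-num-steps 1 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 2 \
  --kv-cache-dtype fp8_e4m3
```

Note that this sketch uses the 1-1-2 speculative configuration from the PR description; per the conversation above, the 3-1-4 configuration required a follow-up fix.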
Modifications
Checklist