[Attention] Flash MLA for V1 #13867
This PR is co-authored with Lucas Wilkinson.

```
VLLM_USE_V1="1" lm_eval --model vllm --model_args pretrained=deepseek-ai/DeepSeek-V2-Lite-Chat,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True,max_model_len=16384,enforce_eager=True --task gsm8k --num_fewshot=5 --limit 100
...
vllm (pretrained=deepseek-ai/DeepSeek-V2-Lite-Chat,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.9,trust_remote_code=True,max_model_len=16384,enforce_eager=True), gen_kwargs: (None), limit: 100.0, num_fewshot: 5, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.66|±  |0.0476|
|     |       |strict-match    |     5|exact_match|↑  | 0.66|±  |0.0476|
```

Signed-off-by: Yang Chen <yangche@fb.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
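As a sanity check on the GSM8K table in the PR description, the reported Stderr can be reproduced from the accuracy and the sample limit alone. The sketch below assumes the harness reports the standard error of the mean of per-example 0/1 scores using the sample (n-1) variance; that is an assumption about the harness, and `binary_mean_stderr` is a hypothetical helper name, not part of any library.

```python
import math


def binary_mean_stderr(accuracy: float, n: int) -> float:
    """Standard error of the mean for n binary (0/1) scores.

    Assumption: sample variance (divide by n - 1), then divide by n
    to get the variance of the mean.
    """
    correct = round(accuracy * n)
    scores = [1.0] * correct + [0.0] * (n - correct)
    mean = sum(scores) / n
    sample_variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return math.sqrt(sample_variance / n)


# exact_match = 0.66 with --limit 100, as in the table
print(round(binary_mean_stderr(0.66, 100), 4))  # 0.0476
```

With only 100 examples the interval is wide (roughly 0.66 ± 0.05 at one standard error), so the run is best read as a smoke test rather than a precise accuracy measurement.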
QQ: Just out of curiosity, what's the main reason V1 is slower than V0 (assuming both use Flash MLA and we look at ITL)? Is it because of chunked prefill?
It's most likely because V1 does not use CUDA graphs for attention; we need to keep optimizing away a lot of the small operations in MLA. That applies to the low-QPS case; for the throughput case it may be chunked prefill.
mgoin
left a comment
This looks clean to me, nice work! Have you run an accuracy smoke test?
Signed-off-by: Yang Chen <yangche@fb.com> Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Yang Chen <yangche@fb.com>
@LucasWilkinson Hello! Which GPU hardware did these tests run on? Out-of-memory (OOM) errors seem to occur on most NVIDIA GPUs when running the "Long context" test.
Hi @LucasWilkinson, I got errors while running DeepSeek R1 AWQ with FlashMLA and the V1 engine on 8xH100: command:
@zuozi2810 Can you please provide the full log and a Hugging Face link to a failing model? That would be really helpful for debugging.
8xH200. What setup are you seeing the OOM on?
8xH200 has enough GPU memory. My setup is 8xH20. Thank you for your reply.
Signed-off-by: Yang Chen <yangche@fb.com> Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Yang Chen <yangche@fb.com> Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>


Use via:
VLLM_ATTENTION_BACKEND=FLASHMLA VLLM_USE_V1=1
Results:
https://docs.google.com/spreadsheets/d/1toxQVaA7UPhmY57kv08Wdq0xtIcU8oBbqb9RE3Wy2_E/edit?usp=sharing
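For scripted runs, the two environment variables from the "use via" line can be set from Python before launching the evaluation. This is a minimal sketch: the helper names `flashmla_env` and `lm_eval_command` are hypothetical, the variables and arguments are copied from this PR, and combining `VLLM_ATTENTION_BACKEND=FLASHMLA` with the `lm_eval` command from the description is an assumption (the description itself sets only `VLLM_USE_V1`).

```python
import os
import shlex


def flashmla_env(base=None):
    """Return a copy of the environment with the two variables from this PR set."""
    env = dict(os.environ if base is None else base)
    env["VLLM_ATTENTION_BACKEND"] = "FLASHMLA"
    env["VLLM_USE_V1"] = "1"
    return env


def lm_eval_command():
    """Rebuild the lm_eval invocation quoted in the PR description."""
    model_args = ",".join([
        "pretrained=deepseek-ai/DeepSeek-V2-Lite-Chat",
        "tensor_parallel_size=2",
        "dtype=auto",
        "gpu_memory_utilization=0.9",
        "trust_remote_code=True",
        "max_model_len=16384",
        "enforce_eager=True",
    ])
    return [
        "lm_eval", "--model", "vllm", "--model_args", model_args,
        "--task", "gsm8k", "--num_fewshot=5", "--limit", "100",
    ]


# Print the command instead of running it, so the sketch works without vLLM.
# To launch for real: subprocess.run(lm_eval_command(), env=flashmla_env(), check=True)
print(shlex.join(lm_eval_command()))
```

Passing the variables through `env=` rather than mutating `os.environ` keeps the backend selection scoped to the child process.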