Description
Checklist
- 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- 2. Please use English, otherwise it will be closed.
Motivation
According to #2653, the Deepseek-vl2 models are supported, but in my usage not all models in the series actually work. The Deepseek-vl2 series consists of three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated parameters respectively. The tiny model has a different structure from the small and normal models: MLA (Multi-Head Latent Attention) is disabled. If `DeepseekV2ForCausalLM` is used as the language model, `qk_head_dim` (the sum of `qk_nope_head_dim` and `qk_rope_head_dim`) is 0, so the scaling calculation `self.scaling = self.qk_head_dim**-0.5` raises a ZeroDivisionError.
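A minimal sketch of the failure mode and one possible guard (the fallback and the illustrative config values are assumptions about how a fix could look, not sglang's actual implementation):

```python
# Minimal sketch of the failure, assuming the tiny config disables MLA and
# therefore reports both qk head dims as 0 (or omits them entirely).
hidden_size = 1280            # illustrative values, not the real tiny config
num_attention_heads = 10

qk_nope_head_dim = 0          # MLA disabled -> both dims end up 0 (assumption)
qk_rope_head_dim = 0
qk_head_dim = qk_nope_head_dim + qk_rope_head_dim

try:
    scaling = qk_head_dim ** -0.5          # the expression sglang evaluates today
except ZeroDivisionError:
    # One possible fix: fall back to the plain per-head dim when MLA is off.
    scaling = (hidden_size // num_attention_heads) ** -0.5

print(scaling)  # 1 / sqrt(128) with the fallback
```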
Besides, the chat template in deepseek-vl2 is also not aligned with vllm: an `<image>` token is appended to the end of the prompt automatically during chat conversation generation. If the user passes `<image>` in their prompt to denote the image, there will be a mismatch between the real image count and the number of `<image>` tokens in the prompt.
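A hedged sketch of the alignment check described above (the helper name and its prepend-when-missing policy are illustrative choices, not sglang's or vllm's actual API):

```python
IMAGE_TOKEN = "<image>"

def ensure_image_tokens(prompt: str, num_images: int) -> str:
    """Keep the number of <image> placeholders equal to the number of images.

    If the user already wrote the placeholders, the prompt is returned
    unchanged instead of getting an extra token appended; if none were
    written, placeholders are added for them.
    """
    found = prompt.count(IMAGE_TOKEN)
    if found == num_images:
        return prompt
    if found == 0:
        return IMAGE_TOKEN * num_images + "\n" + prompt
    raise ValueError(
        f"prompt has {found} {IMAGE_TOKEN} tokens but {num_images} images were given"
    )

# The user-supplied placeholder is respected; nothing extra is appended.
print(ensure_image_tokens("<image>\nDescribe this picture.", 1))
```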
I have already validated the deepseek-vl2-tiny model locally, ensuring that the output results are consistent with those from vllm. Additionally, I've observed some performance improvements, with speeds 5% to 20% faster depending on the number of decoding steps (thanks to the excellent sglang backend). I'm wondering if I can contribute this feature; it only needs a little more work to reorganize the code cleanly and add some tests. I'm really looking forward to feedback from the community. Thanks!
Related resources
Differences between the tiny and normal-size models (the snippet after these links prints the relevant fields):
[deepseek-vl2] https://huggingface.co/deepseek-ai/deepseek-vl2/blob/main/config.json
[deepseek-vl2-tiny] https://huggingface.co/deepseek-ai/deepseek-vl2-tiny/blob/main/config.json
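A quick way to see the structural difference is to diff the attention-related fields of the two configs. A sketch (the `language_config` key and the field list are assumptions based on the usual DeepseekV2 config layout; whichever fields turn out absent or zero in the tiny config is exactly what this prints):

```python
import json
from urllib.request import urlopen

CONFIGS = {
    "deepseek-vl2": "https://huggingface.co/deepseek-ai/deepseek-vl2/raw/main/config.json",
    "deepseek-vl2-tiny": "https://huggingface.co/deepseek-ai/deepseek-vl2-tiny/raw/main/config.json",
}

for name, url in CONFIGS.items():
    lang = json.load(urlopen(url)).get("language_config", {})
    for key in ("use_mla", "qk_nope_head_dim", "qk_rope_head_dim",
                "kv_lora_rank", "q_lora_rank"):
        print(f"{name:20s} {key:18s} {lang.get(key, '<absent>')}")
```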
Special token maps:
https://huggingface.co/deepseek-ai/deepseek-vl2-small/blob/main/special_tokens_map.json
Deepseek-vl2 chat examples (with `<image>` input):
https://github.com/deepseek-ai/DeepSeek-VL2?tab=readme-ov-file#simple-inference-example-with-one-image
Some code suggestions and examples from vllm:
https://github.com/vllm-project/vllm/blob/686623c5e7a0ee0c7679c052ced565dd83055709/vllm/model_executor/models/deepseek_vl2.py#L355
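For reference, the linked vllm file selects the language backbone from the text config. A paraphrased sketch of that pattern (the attribute and class names are recalled from the link and should be double-checked there):

```python
# Sketch of the dispatch in the linked vllm file; `use_mla` is an assumption.
def select_language_architecture(text_config) -> str:
    if not getattr(text_config, "use_mla", True):
        # deepseek-vl2-tiny: MLA disabled, plain DeepSeek attention.
        return "DeepseekForCausalLM"
    # deepseek-vl2-small / deepseek-vl2: MLA enabled.
    return "DeepseekV2ForCausalLM"
```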