
Commit 0651687

clean up chat-template for VLMs
Signed-off-by: Xinyuan Tong <[email protected]>
1 parent 79961af commit 0651687

14 files changed (+16, -50 lines)

benchmark/mmmu/README.md

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
 Host the VLM:
 
 ```
-python -m sglang.launch_server --model-path Qwen/Qwen2-VL-7B-Instruct --chat-template qwen2-vl --port 30000
+python -m sglang.launch_server --model-path Qwen/Qwen2-VL-7B-Instruct --port 30000
 ```
 
 It's recommended to reduce the memory usage by appending something like `--mem-fraction-static 0.6` to the command above.
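A quick way to confirm the hosted VLM is ready before running the MMMU benchmark is to poll the server's info and health routes. A minimal sketch, assuming the default port 30000 from the command above and sglang's `/get_model_info` and `/health` endpoints:

```python
import requests

BASE_URL = "http://localhost:30000"  # port from the launch command above

# /get_model_info reports which model the server actually loaded.
info = requests.get(f"{BASE_URL}/get_model_info", timeout=10).json()
print(info)

# /health returns HTTP 200 once the server can accept requests.
assert requests.get(f"{BASE_URL}/health", timeout=10).status_code == 200
```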

benchmark/mmmu/bench_sglang.py

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 Bench the sglang-hosted vLM with benchmark MMMU
 
 Usage:
-    Host the VLM: python -m sglang.launch_server --model-path Qwen/Qwen2-VL-7B-Instruct --chat-template qwen2-vl --port 30000
+    Host the VLM: python -m sglang.launch_server --model-path Qwen/Qwen2-VL-7B-Instruct --port 30000
 
     Benchmark: python benchmark/mmmu/bench_sglang.py --port 30000 --concurrency 16
 

docs/backend/openai_api_vision.ipynb

Lines changed: 4 additions & 9 deletions
@@ -27,11 +27,7 @@
 "source": [
 "## Launch A Server\n",
 "\n",
-"Launch the server in your terminal and wait for it to initialize.\n",
-"\n",
-"**Remember to add** `--chat-template` **for example** `--chat-template=qwen2-vl` **to specify the [vision chat template](https://docs.sglang.ai/backend/openai_api_vision.html#Chat-Template), otherwise, the server will only support text (images won’t be passed in), which can lead to degraded performance.**\n",
-"\n",
-"We need to specify `--chat-template` for vision language models because the chat template provided in Hugging Face tokenizer only supports text."
+"Launch the server in your terminal and wait for it to initialize."
 ]
 },
 {
@@ -51,8 +47,7 @@
 "\n",
 "vision_process, port = launch_server_cmd(\n",
 " \"\"\"\n",
-"python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct \\\n",
-" --chat-template=qwen2-vl\n",
+"python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct\n",
 "\"\"\"\n",
 ")\n",
 "\n",
@@ -255,9 +250,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Chat Template\n",
+"## Chat Template (for sglang version < 0.4.6.post2)\n",
 "\n",
-"As mentioned before, if you do not specify a vision model's `--chat-template`, the server uses Hugging Face's default template, which only supports text.\n",
+"If you do not specify a vision model's `--chat-template`, the server uses Hugging Face's default template, which only supports text, and may lead to degraded performance.\n",
 "\n",
 "We list popular vision models with their chat templates:\n",
 "\n",

docs/backend/sampling_params.md

Lines changed: 1 addition & 1 deletion
@@ -135,7 +135,7 @@ Detailed example in [openai compatible api](https://docs.sglang.ai/backend/opena
 Launch a server:
 
 ```bash
-python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --chat-template chatml-llava
+python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov
 ```
 
 Download an image:
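The surrounding section of sampling_params.md exercises the native `/generate` route with per-request sampling parameters. A hedged sketch of such a request against the server above on port 30000; the `<image>` placeholder and exact prompt format are model-specific assumptions, not taken from this diff:

```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        # Prompt format is model-specific; "<image>" is an assumed placeholder.
        "text": "<image>\nDescribe this image in one sentence.",
        "image_data": "example_image.png",  # path or URL of the downloaded image
        "sampling_params": {"temperature": 0.2, "max_new_tokens": 64},
    },
)
print(response.json()["text"])
```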

docs/supported_models/embedding_models.md

Lines changed: 1 addition & 2 deletions
@@ -3,7 +3,7 @@
 SGLang provides robust support for embedding models by integrating efficient serving mechanisms with its flexible programming interface. This integration allows for streamlined handling of embedding tasks, facilitating faster and more accurate retrieval and semantic search operations. SGLang's architecture enables better resource utilization and reduced latency in embedding model deployment.
 
 ```{important}
-They are executed with `--is-embedding` and some may require `--trust-remote-code` and/or `--chat-template`
+They are executed with `--is-embedding` and some may require `--trust-remote-code`
 ```
 
 ## Example launch Command
@@ -13,7 +13,6 @@ python3 -m sglang.launch_server \
   --model-path Alibaba-NLP/gme-Qwen2-VL-2B-Instruct \ # example HF/local path
   --is-embedding \
   --host 0.0.0.0 \
-  --chat-template gme-qwen2-vl \ # set chat template
   --port 30000 \
 ```
 
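To show what the launched embedding server accepts, here is a minimal sketch against the OpenAI-compatible `/v1/embeddings` route. Text-only input is shown because the multimodal payload shape is not part of this diff:

```python
import requests

response = requests.post(
    "http://localhost:30000/v1/embeddings",
    json={
        "model": "Alibaba-NLP/gme-Qwen2-VL-2B-Instruct",
        "input": "a sample sentence to embed",
    },
)
vector = response.json()["data"][0]["embedding"]
print(len(vector))  # dimensionality of the embedding
```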

docs/supported_models/vision_language_models.md

Lines changed: 0 additions & 5 deletions
@@ -2,16 +2,11 @@
 
 These models accept multi-modal inputs (e.g., images and text) and generate text output. They augment language models with visual encoders and require a specific chat template for handling vision prompts.
 
-```{important}
-We need to specify `--chat-template` for VLMs because the chat template provided in HuggingFace tokenizer only supports text. If you do not specify a vision model’s `--chat-template`, the server uses HuggingFace’s default template, which only supports text and the images won’t be passed in.
-```
-
 ## Example launch Command
 
 ```shell
 python3 -m sglang.launch_server \
   --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \ # example HF/local path
-  --chat-template llama_3_vision \ # required chat template
   --host 0.0.0.0 \
   --port 30000 \
 ```

examples/runtime/engine/offline_batch_inference_vlm.py

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 """
 Usage:
-python offline_batch_inference_vlm.py --model-path Qwen/Qwen2-VL-7B-Instruct --chat-template=qwen2-vl
+python offline_batch_inference_vlm.py --model-path Qwen/Qwen2-VL-7B-Instruct
 """
 
 import argparse
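For context, this script drives sglang's offline engine rather than an HTTP server. A rough sketch of that flow, with argument values assumed for illustration rather than taken from the diff:

```python
import sglang as sgl

# model_path is normally supplied via argparse; hard-coded here for illustration.
llm = sgl.Engine(model_path="Qwen/Qwen2-VL-7B-Instruct")

prompts = ["Describe this image in one sentence."]
# "example_image.png" is a placeholder; image_data takes paths or URLs.
outputs = llm.generate(prompts, image_data=["example_image.png"])
for out in outputs:
    print(out["text"])

llm.shutdown()
```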

examples/runtime/llava_onevision/http_llava_onevision_test.py

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 """
 Usage:
 
-python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-72b-ov --port=30000 --tp-size=8 --chat-template=chatml-llava
+python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-72b-ov --port=30000 --tp-size=8
 
 python3 http_llava_onevision_test.py
 """

examples/runtime/multimodal_embedding.py

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 # launch server
-# python -m sglang.launch_server --model-path Alibaba-NLP/gme-Qwen2-VL-2B-Instruct --is-embedding --chat-template gme-qwen2-vl
+# python -m sglang.launch_server --model-path Alibaba-NLP/gme-Qwen2-VL-2B-Instruct --is-embedding
 
 import requests
 

test/srt/models/test_vlm_models.py

Lines changed: 4 additions & 16 deletions
@@ -19,17 +19,12 @@
 
 # VLM models for testing
 MODELS = [
-    SimpleNamespace(
-        model="google/gemma-3-27b-it", chat_template="gemma-it", mmmu_accuracy=0.45
-    ),
+    SimpleNamespace(model="google/gemma-3-27b-it", mmmu_accuracy=0.45),
     SimpleNamespace(
         model="Qwen/Qwen2.5-VL-3B-Instruct",
-        chat_template="qwen2-vl",
         mmmu_accuracy=0.4,
     ),
-    SimpleNamespace(
-        model="openbmb/MiniCPM-V-2_6", chat_template="minicpmv", mmmu_accuracy=0.4
-    ),
+    SimpleNamespace(model="openbmb/MiniCPM-V-2_6", mmmu_accuracy=0.4),
 ]
 
 
@@ -50,7 +45,6 @@ def setUpClass(cls):
     def run_mmmu_eval(
         self,
         model_version: str,
-        chat_template: str,
         output_path: str,
         *,
         env: dict | None = None,
@@ -69,11 +63,7 @@ def run_mmmu_eval(
         os.makedirs(output_path, exist_ok=True)
 
         # -------- compose --model_args --------
-        model_args = (
-            f'model_version="{model_version}",'
-            f'chat_template="{chat_template}",'
-            f"tp={tp}"
-        )
+        model_args = f'model_version="{model_version}",' f"tp={tp}"
 
         # -------- build command list --------
         cmd = [
@@ -122,8 +112,6 @@ def test_vlm_mmmu_benchmark(self):
             timeout=self.time_out,
             api_key=self.api_key,
             other_args=[
-                "--chat-template",
-                model.chat_template,
                 "--trust-remote-code",
                 "--cuda-graph-max-bs",
                 "32",
@@ -134,7 +122,7 @@
         )
 
         # Run evaluation
-        self.run_mmmu_eval(model.model, model.chat_template, "./logs")
+        self.run_mmmu_eval(model.model, "./logs")
 
         # Get the result file
         result_file_path = glob.glob("./logs/*.json")[0]
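One note on the new one-liner in `run_mmmu_eval`: it relies on Python's implicit concatenation of adjacent string literals, so the two f-strings fuse into a single comma-separated `--model_args` value. A quick illustration with made-up values:

```python
model_version = "Qwen/Qwen2.5-VL-3B-Instruct"  # made-up example value
tp = 1

# Adjacent (f-)string literals concatenate at compile time.
model_args = f'model_version="{model_version}",' f"tp={tp}"
print(model_args)  # -> model_version="Qwen/Qwen2.5-VL-3B-Instruct",tp=1
```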
