PaddlePaddle · EmmonsCurse · Sep 8, 2025
diff --git a/README.md b/README.md
@@ -57,8 +57,9 @@ FastDeploy supports inference deployment on **NVIDIA GPUs**, **Kunlunxin XPUs**,
 - [Iluvatar GPU](./docs/get_started/installation/iluvatar_gpu.md)
 - [Enflame GCU](./docs/get_started/installation/Enflame_gcu.md)
 - [Hygon DCU](./docs/get_started/installation/hygon_dcu.md)
+- [MetaX GPU](./docs/get_started/installation/metax_gpu.md.md)
 
-**Note:** We are actively working on expanding hardware support. Additional hardware platforms including Ascend NPU and MetaX GPU are currently under development and testing. Stay tuned for updates!
+**Note:** We are actively working on expanding hardware support. Additional hardware platforms including Ascend NPU are currently under development and testing. Stay tuned for updates!
 
 ## Get Started
 
@@ -68,20 +69,12 @@ Learn how to use FastDeploy through our documentation:
 - [ERNIE-4.5-VL Multimodal Model Deployment](./docs/get_started/ernie-4.5-vl.md)
 - [Offline Inference Development](./docs/offline_inference.md)
 - [Online Service Deployment](./docs/online_serving/README.md)
-- [Full Supported Models List](./docs/supported_models.md)
 - [Best Practices](./docs/best_practices/README.md)
 
 ## Supported Models
 
-| Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching |  MTP | CUDA Graph | Maximum Context Length |
-|:--- | :------- | :---------- | :-------- | :-------- | :----- | :----- | :----- |
-|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | ✅| ✅ | ✅|✅| ✅ |128K |
-|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 | ✅| ✅ | ✅|❌| ✅ | 128K |
-|ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP | ✅ | WIP | ❌ | WIP |128K |
-|ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 | ❌ | ✅ | WIP | ❌ | WIP |128K |
-|ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8  |  ❌ |  ✅ |  ✅ | ✅ | ✅|128K |
-|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8  |  ✅ |  ✅ |  ✅ | ❌  | ✅|128K |
-|ERNIE-4.5-0.3B | BF16/WINT8/FP8  |  ✅ |  ✅ |  ✅ | ❌ | ✅| 128K |
+Learn how to download models, enable support for Torch weights, calculate minimum resource requirements, and more:
+- [Full Supported Models List](./docs/supported_models.md)
 
 ## Advanced Usage
 

diff --git a/README_CN.md b/README_CN.md
@@ -55,8 +55,9 @@ FastDeploy supports **NVIDIA GPU**, **Kunlunxin XPU
 - [Iluvatar CoreX](./docs/zh/get_started/installation/iluvatar_gpu.md)
 - [Enflame S60](./docs/zh/get_started/installation/Enflame_gcu.md)
 - [Hygon DCU](./docs/zh/get_started/installation/hygon_dcu.md)
+- [MetaX GPU](./docs/zh/get_started/installation/metax_gpu.md.md)
 
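-**Note:** We are actively expanding hardware support. Other hardware platforms, including Ascend NPU and MetaX GPU, are currently under development and testing. Stay tuned for updates!
+**Note:** We are actively expanding hardware support. Other hardware platforms, including Ascend NPU, are currently under development and testing. Stay tuned for updates!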
 
 ## Get Started
 
@@ -66,20 +67,12 @@ FastDeploy supports **NVIDIA GPU**, **Kunlunxin XPU
 - [ERNIE-4.5-VL Deployment](./docs/zh/get_started/ernie-4.5-vl.md)
 - [Offline Inference](./docs/zh/offline_inference.md)
 - [Online Serving](./docs/zh/online_serving/README.md)
-- [Supported Models List](./docs/zh/supported_models.md)
 - [Best Practices](./docs/zh/best_practices/README.md)
 
 ## Supported Models
 
-| Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching |  MTP | CUDA Graph | Maximum Context Length |
-|:--- | :------- | :---------- | :-------- | :-------- | :----- | :----- | :----- |
-|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | ✅| ✅ | ✅|✅| ✅ |128K |
-|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 | ✅| ✅ | ✅|❌| ✅ | 128K |
-|ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP | ✅ | WIP | ❌ | WIP |128K |
-|ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 | ❌ | ✅ | WIP | ❌ | WIP |128K |
-|ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8  |  ❌ |  ✅ |  ✅ | ✅ | ✅|128K |
-|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8  |  ✅ |  ✅ |  ✅ | ❌  | ✅|128K |
-|ERNIE-4.5-0.3B | BF16/WINT8/FP8  |  ✅ |  ✅ |  ✅ | ❌ | ✅| 128K |
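+Learn from our documentation how to download models, enable support for Torch weights, calculate the minimum resources required for deployment, and more:
+- [Supported Models List](./docs/zh/supported_models.md)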
 
 ## Advanced Usage
 

diff --git a/docs/assets/images/favicon.ico b/docs/assets/images/favicon.ico
diff --git a/docs/assets/images/logo.jpg b/docs/assets/images/logo.jpg
diff --git a/docs/get_started/quick_start_qwen.md b/docs/get_started/quick_start_qwen.md
@@ -0,0 +1,99 @@
+# Deploy QWEN3-0.6b in 10 Minutes
+
+Before deployment, ensure your environment meets the following requirements:
+
+- GPU Driver ≥ 535
+- CUDA ≥ 12.3
+- cuDNN ≥ 9.5
+- Linux X86_64
+- Python ≥ 3.10
+
+This guide uses the lightweight QWEN3-0.6b model for demonstration, which can be deployed on most hardware configurations. Docker deployment is recommended.
+
+For more information about how to install FastDeploy, refer to the [installation document](installation/README.md).
+
+## 1. Launch Service
+After installing FastDeploy, run the following command in a terminal to start the service. For details on configuring the startup command, see [Parameter Description](../parameters.md).
+
+> ⚠️ **Note:**
+> When using HuggingFace models (torch format), you need to enable `--load_choices "default_v1"`.
+
+```shell
+export ENABLE_V1_KVCACHE_SCHEDULER=1
+python -m fastdeploy.entrypoints.openai.api_server \
+       --model Qwen/QWEN3-0.6b \
+       --port 8180 \
+       --metrics-port 8181 \
+       --engine-worker-queue-port 8182 \
+       --max-model-len 32768 \
+       --max-num-seqs 32 \
+       --load_choices "default_v1"
+```
+
+> 💡 **Note:** If the path given to ```--model``` does not exist as a subdirectory of the current directory, FastDeploy checks whether AIStudio provides a preset model matching the specified name (such as ```Qwen/QWEN3-0.6b```). If one exists, the download starts automatically. The default download path is ```~/xx```. For instructions and configuration options for automatic model download, see [Model Download](../supported_models.md).
+>
+> ```--max-model-len``` sets the maximum number of tokens supported by the deployed service.
+> ```--max-num-seqs``` sets the maximum number of requests the deployed service processes concurrently.
+
+**Related Documents**
+- [Service Deployment](../online_serving/README.md)
+- [Service Monitoring](../online_serving/metrics.md)
+
+## 2. Request the Service
+After starting the service, the following output indicates successful initialization:
+
+```shell
+api_server.py[line:91] Launching metrics service at http://0.0.0.0:8181/metrics
+api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions
+api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions
+INFO:     Started server process [13909]
+INFO:     Waiting for application startup.
+INFO:     Application startup complete.
+INFO:     Uvicorn running on http://0.0.0.0:8180 (Press CTRL+C to quit)
+```
+
+### Health Check
+
+Verify service status (HTTP 200 indicates success):
+
+```shell
+curl -i http://0.0.0.0:8180/health
+```
+
+### cURL Request
+
+Send requests to the service with the following command:
+
+```shell
+curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \
+-H "Content-Type: application/json" \
+-d '{
+  "messages": [
+    {"role": "user", "content": "Write me a poem about large language model."}
+  ],
+  "stream": true
+}'
+```
+
+### Python Client (OpenAI-compatible API)
+
+FastDeploy's API is OpenAI-compatible. You can also use Python for requests:
+
+```python
+import openai
+host = "0.0.0.0"
+port = "8180"
+client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null")
+
+response = client.chat.completions.create(
+    model="null",
+    messages=[
+        {"role": "system", "content": "I'm a helpful AI assistant."},
+        {"role": "user", "content": "Write me a poem about large language model."},
+    ],
+    stream=True,
+)
+for chunk in response:
+    # Skip chunks without text: a streamed delta's content can be None
+    if chunk.choices and chunk.choices[0].delta.content:
+        print(chunk.choices[0].delta.content, end='')
+print('\n')
+```
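
Note on the quick-start examples above: the cURL and Python requests both target the chat completions endpoint on the `--port` value (8180), and the streaming loop has to tolerate chunks that carry no text. The following is a minimal offline sketch of those two pieces, not part of this change. The helper names `build_chat_request`, `collect_stream_text`, and `fake_chunk` are hypothetical, and `fake_chunk` only imitates the attribute layout of a streamed chunk (`choices[0].delta.content`) so the logic can be checked without a running server.

```python
import json
from types import SimpleNamespace


def build_chat_request(prompt, host="0.0.0.0", port=8180, stream=True):
    """Return (url, body) for the OpenAI-compatible chat completions endpoint.

    The default port matches the --port value used in the launch command.
    """
    url = f"http://{host}:{port}/v1/chat/completions"
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    })
    return url, body


def collect_stream_text(chunks):
    """Join the text deltas of a streamed response, skipping empty ones."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue
        content = chunk.choices[0].delta.content
        if content:  # some chunks may carry no text (content is None)
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Hypothetical stand-in for a streamed chunk, for offline checks only."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


if __name__ == "__main__":
    url, body = build_chat_request("Write me a poem about large language model.")
    print(url)
    print(collect_stream_text([fake_chunk("Hello"), fake_chunk(None), fake_chunk(" world")]))
```

With a real server, `collect_stream_text` could be passed the iterator returned by `client.chat.completions.create(..., stream=True)` in place of the fake chunks.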
diff --git a/docs/index.md b/docs/index.md
@@ -11,15 +11,39 @@
 
 ## Supported Models
 
-| Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching |  MTP | CUDA Graph | Maximum Context Length |
+| Model | Data Type |[PD Disaggregation](./features/disaggregated.md) | [Chunked Prefill](./features/chunked_prefill.md) | [Prefix Caching](./features/prefix_caching.md) |  [MTP](./features/speculative_decoding.md) | [CUDA Graph](./features/graph_optimization.md) | Maximum Context Length |
 |:--- | :------- | :---------- | :-------- | :-------- | :----- | :----- | :----- |
-|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | ✅| ✅ | ✅|✅| WIP |128K |
-|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 | ✅| ✅ | ✅|❌| WIP | 128K |
-|ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP | ✅ | WIP | ❌ | WIP |128K |
-|ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 | ❌ | ✅ | WIP | ❌ | WIP |128K |
-|ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8  |  ❌ |  ✅ |  ✅ | ✅ | ✅|128K |
-|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8  |  ❌ |  ✅ |  ✅ | ❌ | ✅|128K |
-|ERNIE-4.5-0.3B | BF16/WINT8/FP8  |  ❌ |  ✅ |  ✅ | ❌ | ✅| 128K |
+|ERNIE-4.5-300B-A47B|BF16/WINT4/WINT8/W4A8C8/WINT2/FP8|✅|✅|✅|✅|✅|128K|
+|ERNIE-4.5-300B-A47B-Base|BF16/WINT4/WINT8|✅|✅|✅|⛔|✅|128K|
+|ERNIE-4.5-VL-424B-A47B|BF16/WINT4/WINT8|🚧|✅|🚧|⛔|🚧|128K|
+|ERNIE-4.5-VL-28B-A3B|BF16/WINT4/WINT8|⛔|✅|🚧|⛔|🚧|128K|
+|ERNIE-4.5-21B-A3B|BF16/WINT4/WINT8/FP8|⛔|✅|✅|✅|✅|128K|
+|ERNIE-4.5-21B-A3B-Base|BF16/WINT4/WINT8/FP8|⛔|✅|✅|⛔|✅|128K|
+|ERNIE-4.5-0.3B|BF16/WINT8/FP8|⛔|✅|✅|⛔|✅|128K|
+|QWEN3-MOE|BF16/WINT4/WINT8/FP8|⛔|✅|✅|🚧|✅|128K|
+|QWEN3|BF16/WINT8/FP8|⛔|✅|✅|🚧|✅|128K|
+|QWEN-VL|BF16/WINT8/FP8|⛔|✅|✅|🚧|⛔|128K|
+|QWEN2|BF16/WINT8/FP8|⛔|✅|✅|🚧|✅|128K|
+|DEEPSEEK-V3|BF16/WINT4|⛔|✅|✅|🚧|✅|128K|
+|DEEPSEEK-R1|BF16/WINT4|⛔|✅|✅|🚧|✅|128K|
+
+```
+✅ Supported 🚧 In Progress ⛔ No Plan
+```
+
+## Supported Hardware
+
+| Model | [NVIDIA GPU](./get_started/installation/nvidia_gpu.md) |[Kunlunxin XPU](./get_started/installation/kunlunxin_xpu.md) | Ascend NPU | [Hygon DCU](./get_started/installation/hygon_dcu.md) | [Iluvatar GPU](./get_started/installation/iluvatar_gpu.md) | [MetaX GPU](./get_started/installation/metax_gpu.md.md) | [Enflame GCU](./get_started/installation/Enflame_gcu.md) |
+|:------|---------|------------|----------|-------------|-----------|-------------|-------------|
+| ERNIE-4.5-VL-424B-A47B | ✅ | 🚧 | 🚧 | ⛔ | ⛔ | ⛔ | ⛔ |
+| ERNIE-4.5-300B-A47B | ✅ | ✅ | 🚧 | ✅ | ✅ | 🚧 | ✅ |
+| ERNIE-4.5-VL-28B-A3B | ✅ | 🚧 | 🚧 | ⛔ | 🚧 | ⛔ | ⛔ |
+| ERNIE-4.5-21B-A3B | ✅ | ✅ | 🚧 | ✅ | ✅ | ✅ | ✅ |
+| ERNIE-4.5-0.3B | ✅ | ✅ | 🚧 | ✅ | ✅ | ✅ | ✅ |
+
+```
+✅ Supported 🚧 In Progress ⛔ No Plan
+```
 
 ## Documentation
 

diff --git a/docs/parameters.md b/docs/parameters.md
@@ -34,10 +34,10 @@ When using FastDeploy to deploy models (including offline inference and service
 | ```max_long_partial_prefills``` | `int` | When Chunked Prefill is enabled, maximum number of long requests in concurrent partial prefill batches, default: 1 |
 | ```long_prefill_token_threshold``` | `int` | When Chunked Prefill is enabled, requests with token count exceeding this value are considered long requests, default: max_model_len*0.04 |
 | ```static_decode_blocks``` | `int` | During inference, each request is forced to allocate corresponding number of blocks from Prefill's KVCache for Decode use, default: 2 |
-| ```reasoning_parser``` | `str` | Specify the reasoning parser to extract reasoning content from model output, refer [reasoning output](features/reasoning_output.md) for more details |
+| ```reasoning_parser``` | `str` | Specify the reasoning parser to extract reasoning content from model output |
 | ```use_cudagraph```                | `bool`      | Whether to use CUDA Graph, default: False. Read [graph_optimization.md](./features/graph_optimization.md) carefully before enabling it. In multi-card scenarios, custom all-reduce must be enabled at the same time. |
 | ```graph_optimization_config```    | `dict[str]`       | Configures parameters related to computation graph optimization, default: '{"use_cudagraph":false, "graph_opt_level":0, "cudagraph_capture_sizes": null }'. For details, see [graph_optimization.md](./features/graph_optimization.md). |
-| ```disable_custom_all_reduce``` | `bool` | Disable Custom all-reduce, default: False |
+| ```enable_custom_all_reduce``` | `bool` | Enable Custom all-reduce, default: False |
 | ```splitwise_role``` | `str` | Whether to enable splitwise inference, default value: mixed, supported parameters: ["mixed", "decode", "prefill"] |
 | ```innode_prefill_ports``` | `str` | Internal engine startup ports for prefill instances (only required for single-machine PD separation), default: None |
 | ```guided_decoding_backend``` | `str` | Specify the guided decoding backend to use, supports `auto`, `xgrammar`, `off`, default: `off` |
@@ -51,7 +51,7 @@ When using FastDeploy to deploy models (including offline inference and service
 | ```chat_template``` | `str` | Specify the template used for model concatenation, It supports both string input and file path input. The default value is None. If not specified, the model's default template will be used. |
 | ```tool_call_parser``` | `str` | Specify the function call parser to be used for extracting function call content from the model's output. |
 | ```tool_parser_plugin``` | `str` | Specify the file path of the tool parser to be registered, so as to register parsers that are not in the code repository. The code format within these parsers must adhere to the format used in the code repository. |
-| ```lm_head_fp32```       | `bool`      | Specify the dtype of the lm_head layer as FP32. |
+| ```load_choices```       | `str`      | Specify the loader used for weight loading, default: "default". To load Torch weights or enable weight acceleration, "default_v1" must be used. |
 
 ## 1. Relationship between KVCache allocation, ```num_gpu_blocks_override``` and ```block_size```?