diff --git a/README.md b/README.md index 0c20629ffc4..cc78f6fd250 100644 --- a/README.md +++ b/README.md @@ -57,8 +57,9 @@ FastDeploy supports inference deployment on **NVIDIA GPUs**, **Kunlunxin XPUs**, - [Iluvatar GPU](./docs/get_started/installation/iluvatar_gpu.md) - [Enflame GCU](./docs/get_started/installation/Enflame_gcu.md) - [Hygon DCU](./docs/get_started/installation/hygon_dcu.md) +- [MetaX GPU](./docs/get_started/installation/metax_gpu.md) -**Note:** We are actively working on expanding hardware support. Additional hardware platforms including Ascend NPU and MetaX GPU are currently under development and testing. Stay tuned for updates! +**Note:** We are actively working on expanding hardware support. Additional hardware platforms including Ascend NPU are currently under development and testing. Stay tuned for updates! ## Get Started @@ -68,20 +69,12 @@ Learn how to use FastDeploy through our documentation: - [ERNIE-4.5-VL Multimodal Model Deployment](./docs/get_started/ernie-4.5-vl.md) - [Offline Inference Development](./docs/offline_inference.md) - [Online Service Deployment](./docs/online_serving/README.md) -- [Full Supported Models List](./docs/supported_models.md) - [Best Practices](./docs/best_practices/README.md) ## Supported Models -| Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching | MTP | CUDA Graph | Maximum Context Length | -|:--- | :------- | :---------- | :-------- | :-------- | :----- | :----- | :----- | -|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | ✅| ✅ | ✅|✅| ✅ |128K | -|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 | ✅| ✅ | ✅|❌| ✅ | 128K | -|ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP | ✅ | WIP | ❌ | WIP |128K | -|ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 | ❌ | ✅ | WIP | ❌ | WIP |128K | -|ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | ✅ | ✅|128K | -|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 | ✅ | ✅ | ✅ | ❌ | ✅|128K | -|ERNIE-4.5-0.3B | BF16/WINT8/FP8 | ✅ | ✅ | ✅ | ❌ | ✅| 128K | +Learn 
how to download models, enable support for Torch weights, calculate minimum resource requirements, and more: +- [Full Supported Models List](./docs/supported_models.md) ## Advanced Usage diff --git a/README_CN.md b/README_CN.md index 6cebc527a2a..944ece19d7c 100644 --- a/README_CN.md +++ b/README_CN.md @@ -55,8 +55,9 @@ FastDeploy 支持在**英伟达(NVIDIA)GPU**、**昆仑芯(Kunlunxin)XPU - [天数 CoreX](./docs/zh/get_started/installation/iluvatar_gpu.md) - [燧原 S60](./docs/zh/get_started/installation/Enflame_gcu.md) - [海光 DCU](./docs/zh/get_started/installation/hygon_dcu.md) +- [沐曦 GPU](./docs/zh/get_started/installation/metax_gpu.md) -**注意:** 我们正在积极拓展硬件支持范围。目前,包括昇腾(Ascend)NPU 和 沐曦(MetaX)GPU 在内的其他硬件平台正在开发测试中。敬请关注更新! +**注意:** 我们正在积极拓展硬件支持范围。目前,包括昇腾(Ascend)NPU 等其他硬件平台正在开发测试中。敬请关注更新! ## 入门指南 @@ -66,20 +67,12 @@ FastDeploy 支持在**英伟达(NVIDIA)GPU**、**昆仑芯(Kunlunxin)XPU - [ERNIE-4.5-VL 部署](./docs/zh/get_started/ernie-4.5-vl.md) - [离线推理](./docs/zh/offline_inference.md) - [在线服务](./docs/zh/online_serving/README.md) -- [模型支持列表](./docs/zh/supported_models.md) - [最佳实践](./docs/zh/best_practices/README.md) ## 支持模型列表 -| Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching | MTP | CUDA Graph | Maximum Context Length | -|:--- | :------- | :---------- | :-------- | :-------- | :----- | :----- | :----- | -|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | ✅| ✅ | ✅|✅| ✅ |128K | -|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 | ✅| ✅ | ✅|❌| ✅ | 128K | -|ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP | ✅ | WIP | ❌ | WIP |128K | -|ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 | ❌ | ✅ | WIP | ❌ | WIP |128K | -|ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | ✅ | ✅|128K | -|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 | ✅ | ✅ | ✅ | ❌ | ✅|128K | -|ERNIE-4.5-0.3B | BF16/WINT8/FP8 | ✅ | ✅ | ✅ | ❌ | ✅| 128K | +通过我们的文档了解如何下载模型,如何支持Torch 权重,如何计算最小资源部署等: +- [模型支持列表](./docs/zh/supported_models.md) ## 进阶用法 diff --git a/docs/assets/images/favicon.ico b/docs/assets/images/favicon.ico new 
file mode 100644 index 00000000000..80cc3b06d43 Binary files /dev/null and b/docs/assets/images/favicon.ico differ diff --git a/docs/assets/images/logo.jpg b/docs/assets/images/logo.jpg new file mode 100644 index 00000000000..0a05d685b8b Binary files /dev/null and b/docs/assets/images/logo.jpg differ diff --git a/docs/get_started/quick_start_qwen.md b/docs/get_started/quick_start_qwen.md new file mode 100644 index 00000000000..95a31d71696 --- /dev/null +++ b/docs/get_started/quick_start_qwen.md @@ -0,0 +1,99 @@ +# Deploy Qwen3-0.6B in 10 Minutes + +Before deployment, ensure your environment meets the following requirements: + +- GPU Driver ≥ 535 +- CUDA ≥ 12.3 +- cuDNN ≥ 9.5 +- Linux X86_64 +- Python ≥ 3.10 + +This guide uses the lightweight Qwen3-0.6B model for demonstration, which can be deployed on most hardware configurations. Docker deployment is recommended. + +For more information about how to install FastDeploy, refer to the [installation document](installation/README.md). + +## 1. Launch Service +After installing FastDeploy, execute the following command in the terminal to start the service. For the configuration method of the startup command, refer to [Parameter Description](../parameters.md). + +> ⚠️ **Note:** +> When using HuggingFace models (torch format), you need to enable `--load_choices "default_v1"`. + +``` +export ENABLE_V1_KVCACHE_SCHEDULER=1 +python -m fastdeploy.entrypoints.openai.api_server \ + --model Qwen/Qwen3-0.6B \ + --port 8180 \ + --metrics-port 8181 \ + --engine-worker-queue-port 8182 \ + --max-model-len 32768 \ + --max-num-seqs 32 \ + --load_choices "default_v1" +``` + +> 💡 Note: In the path specified by ```--model```, if the subdirectory corresponding to the path does not exist in the current directory, it will try to query whether AIStudio has a preset model based on the specified model name (such as ```Qwen/Qwen3-0.6B```). If it exists, it will automatically start downloading. The default download path is: ```~/xx```. 
For instructions and configuration on automatic model download, see [Model Download](../supported_models.md). +```--max-model-len``` indicates the maximum number of tokens supported by the currently deployed service. +```--max-num-seqs``` indicates the maximum number of concurrent requests supported by the currently deployed service. + +**Related Documents** +- [Service Deployment](../online_serving/README.md) +- [Service Monitoring](../online_serving/metrics.md) + +## 2. Request the Service +After starting the service, the following output indicates successful initialization: + +```shell +api_server.py[line:91] Launching metrics service at http://0.0.0.0:8181/metrics +api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions +api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions +INFO: Started server process [13909] +INFO: Waiting for application startup. +INFO: Application startup complete. +INFO: Uvicorn running on http://0.0.0.0:8180 (Press CTRL+C to quit) +``` + +### Health Check + +Verify service status (HTTP 200 indicates success): + +```shell +curl -i http://0.0.0.0:8180/health +``` + +### cURL Request + +Send requests to the service with the following command: + +```shell +curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \ +-H "Content-Type: application/json" \ +-d '{ + "messages": [ + {"role": "user", "content": "Write me a poem about large language model."} + ], + "stream": true +}' +``` + +### Python Client (OpenAI-compatible API) + +FastDeploy's API is OpenAI-compatible. 
You can also use Python for requests: + +```python +import openai +host = "0.0.0.0" +port = "8180" +client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null") + +response = client.chat.completions.create( + model="null", + messages=[ + {"role": "system", "content": "I'm a helpful AI assistant."}, + {"role": "user", "content": "Write me a poem about large language model."}, + ], + stream=True, +) +for chunk in response: + if chunk.choices[0].delta: + print(chunk.choices[0].delta.content, end='') +print('\n') +``` diff --git a/docs/index.md b/docs/index.md index b1e3c336fd2..bff311362e5 100644 --- a/docs/index.md +++ b/docs/index.md @@ -11,15 +11,39 @@ ## Supported Models -| Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching | MTP | CUDA Graph | Maximum Context Length | +| Model | Data Type |[PD Disaggregation](./features/disaggregated.md) | [Chunked Prefill](./features/chunked_prefill.md) | [Prefix Caching](./features/prefix_caching.md) | [MTP](./features/speculative_decoding.md) | [CUDA Graph](./features/graph_optimization.md) | Maximum Context Length | |:--- | :------- | :---------- | :-------- | :-------- | :----- | :----- | :----- | -|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | ✅| ✅ | ✅|✅| WIP |128K | -|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 | ✅| ✅ | ✅|❌| WIP | 128K | -|ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP | ✅ | WIP | ❌ | WIP |128K | -|ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 | ❌ | ✅ | WIP | ❌ | WIP |128K | -|ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | ✅ | ✅|128K | -|ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | ❌ | ✅|128K | -|ERNIE-4.5-0.3B | BF16/WINT8/FP8 | ❌ | ✅ | ✅ | ❌ | ✅| 128K | +|ERNIE-4.5-300B-A47B|BF16/WINT4/WINT8/W4A8C8/WINT2/FP8|✅|✅|✅|✅|✅|128K| +|ERNIE-4.5-300B-A47B-Base|BF16/WINT4/WINT8|✅|✅|✅|⛔|✅|128K| +|ERNIE-4.5-VL-424B-A47B|BF16/WINT4/WINT8|🚧|✅|🚧|⛔|🚧|128K| +|ERNIE-4.5-VL-28B-A3B|BF16/WINT4/WINT8|⛔|✅|🚧|⛔|🚧|128K| 
+|ERNIE-4.5-21B-A3B|BF16/WINT4/WINT8/FP8|⛔|✅|✅|✅|✅|128K| +|ERNIE-4.5-21B-A3B-Base|BF16/WINT4/WINT8/FP8|⛔|✅|✅|⛔|✅|128K| +|ERNIE-4.5-0.3B|BF16/WINT8/FP8|⛔|✅|✅|⛔|✅|128K| +|QWEN3-MOE|BF16/WINT4/WINT8/FP8|⛔|✅|✅|🚧|✅|128K| +|QWEN3|BF16/WINT8/FP8|⛔|✅|✅|🚧|✅|128K| +|QWEN-VL|BF16/WINT8/FP8|⛔|✅|✅|🚧|⛔|128K| +|QWEN2|BF16/WINT8/FP8|⛔|✅|✅|🚧|✅|128K| +|DEEPSEEK-V3|BF16/WINT4|⛔|✅|✅|🚧|✅|128K| +|DEEPSEEK-R1|BF16/WINT4|⛔|✅|✅|🚧|✅|128K| + +``` +✅ Supported 🚧 In Progress ⛔ No Plan +``` + +## Supported Hardware + +| Model | [NVIDIA GPU](./get_started/installation/nvidia_gpu.md) |[Kunlunxin XPU](./get_started/installation/kunlunxin_xpu.md) | Ascend NPU | [Hygon DCU](./get_started/installation/hygon_dcu.md) | [Iluvatar GPU](./get_started/installation/iluvatar_gpu.md) | [MetaX GPU](./get_started/installation/metax_gpu.md) | [Enflame GCU](./get_started/installation/Enflame_gcu.md) | +|:------|---------|------------|----------|-------------|-----------|-------------|-------------| +| ERNIE4.5-VL-424B-A47B | ✅ | 🚧 | 🚧 | ⛔ | ⛔ | ⛔ | ⛔ | +| ERNIE4.5-300B-A47B | ✅ | ✅ | 🚧 | ✅ | ✅ | 🚧 | ✅ | +| ERNIE4.5-VL-28B-A3B | ✅ | 🚧 | 🚧 | ⛔ | 🚧 | ⛔ | ⛔ | +| ERNIE4.5-21B-A3B | ✅ | ✅ | 🚧 | ✅ | ✅ | ✅ | ✅ | +| ERNIE4.5-0.3B | ✅ | ✅ | 🚧 | ✅ | ✅ | ✅ | ✅ | + +``` +✅ Supported 🚧 In Progress ⛔ No Plan +``` ## Documentation diff --git a/docs/parameters.md b/docs/parameters.md index 52327780e2e..7070c8fd741 100644 --- a/docs/parameters.md +++ b/docs/parameters.md @@ -34,10 +34,10 @@ When using FastDeploy to deploy models (including offline inference and service | ```max_long_partial_prefills``` | `int` | When Chunked Prefill is enabled, maximum number of long requests in concurrent partial prefill batches, default: 1 | | ```long_prefill_token_threshold``` | `int` | When Chunked Prefill is enabled, requests with token count exceeding this value are considered long requests, default: max_model_len*0.04 | | ```static_decode_blocks``` | `int` | During inference, each request is forced to allocate corresponding number of 
blocks from Prefill's KVCache for Decode use, default: 2 | -| ```reasoning_parser``` | `str` | Specify the reasoning parser to extract reasoning content from model output, refer [reasoning output](features/reasoning_output.md) for more details | +| ```reasoning_parser``` | `str` | Specify the reasoning parser to extract reasoning content from model output | | ```use_cudagraph``` | `bool` | Whether to use cuda graph, default False. It is recommended to read [graph_optimization.md](./features/graph_optimization.md) carefully before opening. Custom all-reduce needs to be enabled at the same time in multi-card scenarios. | | ```graph_optimization_config``` | `dict[str]` | Can configure parameters related to calculation graph optimization, the default value is'{"use_cudagraph":false, "graph_opt_level":0, "cudagraph_capture_sizes": null }',Detailed description reference [graph_optimization.md](./features/graph_optimization.md)| -| ```disable_custom_all_reduce``` | `bool` | Disable Custom all-reduce, default: False | +| ```enable_custom_all_reduce``` | `bool` | Enable Custom all-reduce, default: False | | ```splitwise_role``` | `str` | Whether to enable splitwise inference, default value: mixed, supported parameters: ["mixed", "decode", "prefill"] | | ```innode_prefill_ports``` | `str` | Internal engine startup ports for prefill instances (only required for single-machine PD separation), default: None | | ```guided_decoding_backend``` | `str` | Specify the guided decoding backend to use, supports `auto`, `xgrammar`, `off`, default: `off` | @@ -51,7 +51,7 @@ When using FastDeploy to deploy models (including offline inference and service | ```chat_template``` | `str` | Specify the template used for model concatenation, It supports both string input and file path input. The default value is None. If not specified, the model's default template will be used. 
| | ```tool_call_parser``` | `str` | Specify the function call parser to be used for extracting function call content from the model's output. | | ```tool_parser_plugin``` | `str` | Specify the file path of the tool parser to be registered, so as to register parsers that are not in the code repository. The code format within these parsers must adhere to the format used in the code repository. | -| ```lm_head_fp32``` | `bool` | Specify the dtype of the lm_head layer as FP32. | +| ```load_choices``` | `str` | By default, the "default" loader is used for weight loading. To load Torch weights or enable weight acceleration, "default_v1" must be used.| ## 1. Relationship between KVCache allocation, ```num_gpu_blocks_override``` and ```block_size```? diff --git a/docs/supported_models.md b/docs/supported_models.md index c6bb969ae16..37548225340 100644 --- a/docs/supported_models.md +++ b/docs/supported_models.md @@ -2,9 +2,9 @@ FastDeploy currently supports the following models, which can be downloaded automatically during FastDeploy deployment.Specify the ``model`` parameter as the model name in the table below to automatically download model weights (all supports resumable downloads). The following three download sources are supported: -- 1. Search for corresponding Paddle-version ERNIE models on [AIStudio/PaddlePaddle](https://aistudio.baidu.com/modelsoverview), e.g., `ERNIE-4.5-0.3B-Paddle` -- 2. Download Paddle-version ERNIE models from [HuggingFace/baidu/models](https://huggingface.co/baidu/models), e.g., `baidu/ERNIE-4.5-0.3B-Paddle` -- 3. Search for corresponding Paddle-version ERNIE models on [ModelScope/PaddlePaddle](https://www.modelscope.cn/models?name=PaddlePaddle&page=1&tabKey=task), e.g., `ERNIE-4.5-0.3B-Paddle` +- [AIStudio](https://aistudio.baidu.com/modelsoverview) +- [ModelScope](https://www.modelscope.cn/models) +- [HuggingFace](https://huggingface.co/models) When using automatic download, the default download source is AIStudio. 
Users can modify the default download source by setting the ``FD_MODEL_SOURCE`` environment variable, which can be set to “AISTUDIO”, ‘MODELSCOPE’ or “HUGGINGFACE”. The default download path is ``~/`` (i.e., the user's home directory). Users can modify the default download path by setting the ``FD_MODEL_CACHE`` environment variable, e.g.: @@ -13,25 +13,61 @@ export FD_MODEL_SOURCE=AISTUDIO # "AISTUDIO", "MODELSCOPE" or "HUGGINGFACE" export FD_MODEL_CACHE=/ssd1/download_models ``` -| Model Name | Context Length | Quantization | Minimum Deployment Resources | Notes | -| :------------------------------------------ | :------------- | :----------- | :--------------------------- | :----------------------------------------------------------------------------------------- | -| baidu/ERNIE-4.5-VL-424B-A47B-Paddle | 32K/128K | WINT4 | 4*80G GPU VRAM/1T RAM | Chunked Prefill required for 128K | -| baidu/ERNIE-4.5-VL-424B-A47B-Paddle | 32K/128K | WINT8 | 8*80G GPU VRAM/1T RAM | Chunked Prefill required for 128K | -| baidu/ERNIE-4.5-300B-A47B-Paddle | 32K/128K | WINT4 | 4*64G GPU VRAM/600G RAM | Chunked Prefill required for 128K | -| baidu/ERNIE-4.5-300B-A47B-Paddle | 32K/128K | WINT8 | 8*64G GPU VRAM/600G RAM | Chunked Prefill required for 128K | -| baidu/ERNIE-4.5-300B-A47B-2Bits-Paddle | 32K/128K | WINT2 | 1*141G GPU VRAM/600G RAM | Chunked Prefill required for 128K | -| baidu/ERNIE-4.5-300B-A47B-W4A8C8-TP4-Paddle | 32K/128K | W4A8C8 | 4*64G GPU VRAM/160G RAM | Fixed 4-GPU setup, Chunked Prefill recommended | -| baidu/ERNIE-4.5-300B-A47B-FP8-Paddle | 32K/128K | FP8 | 8*64G GPU VRAM/600G RAM | Chunked Prefill recommended, only supports PD Disaggragated Deployment with EP parallelism | -| baidu/ERNIE-4.5-300B-A47B-Base-Paddle | 32K/128K | WINT4 | 4*64G GPU VRAM/600G RAM | Chunked Prefill recommended | -| baidu/ERNIE-4.5-300B-A47B-Base-Paddle | 32K/128K | WINT8 | 8*64G GPU VRAM/600G RAM | Chunked Prefill recommended | -| baidu/ERNIE-4.5-VL-28B-A3B-Paddle | 32K | WINT4 | 1*24G 
GPU VRAM/128G RAM | Chunked Prefill required | -| baidu/ERNIE-4.5-VL-28B-A3B-Paddle | 128K | WINT4 | 1*48G GPU VRAM/128G RAM | Chunked Prefill required | -| baidu/ERNIE-4.5-VL-28B-A3B-Paddle | 32K/128K | WINT8 | 1*48G GPU VRAM/128G RAM | Chunked Prefill required | -| baidu/ERNIE-4.5-21B-A3B-Paddle | 32K/128K | WINT4 | 1*24G GPU VRAM/128G RAM | Chunked Prefill required for 128K | -| baidu/ERNIE-4.5-21B-A3B-Paddle | 32K/128K | WINT8 | 1*48G GPU VRAM/128G RAM | Chunked Prefill required for 128K | -| baidu/ERNIE-4.5-21B-A3B-Base-Paddle | 32K/128K | WINT4 | 1*24G GPU VRAM/128G RAM | Chunked Prefill required for 128K | -| baidu/ERNIE-4.5-21B-A3B-Base-Paddle | 32K/128K | WINT8 | 1*48G GPU VRAM/128G RAM | Chunked Prefill required for 128K | -| baidu/ERNIE-4.5-0.3B-Paddle | 32K/128K | BF16 | 1*6G/12G GPU VRAM/2G RAM | | -| baidu/ERNIE-4.5-0.3B-Base-Paddle | 32K/128K | BF16 | 1*6G/12G GPU VRAM/2G RAM | | +> ⭐ **Note**: Models marked with a star (⭐) can directly use **HuggingFace Torch weights** and support **FP8/WINT8/WINT4** as well as **BF16**. When running inference, you need to enable **`--load_choices "default_v1"`**. + +> Example launch command using baidu/ERNIE-4.5-0.3B-PT: +``` +python -m fastdeploy.entrypoints.openai.api_server \ + --model baidu/ERNIE-4.5-0.3B-PT \ + --port 8180 \ + --metrics-port 8181 \ + --engine-worker-queue-port 8182 \ + --max-model-len 32768 \ + --max-num-seqs 32 \ + --load_choices "default_v1" +``` + +## Large Language Models + +These models accept text input. + +|Models|DataType|Example HF Model| +|-|-|-| +|⭐ERNIE|BF16/WINT4/WINT8/W4A8C8/WINT2/FP8|baidu/ERNIE-4.5-VL-424B-A47B-Paddle;<br>
baidu/ERNIE-4.5-300B-A47B-Paddle
 [quick start](./get_started/ernie-4.5.md)   [best practice](./best_practices/ERNIE-4.5-300B-A47B-Paddle.md);
baidu/ERNIE-4.5-300B-A47B-2Bits-Paddle;
baidu/ERNIE-4.5-300B-A47B-W4A8C8-TP4-Paddle;
baidu/ERNIE-4.5-300B-A47B-FP8-Paddle;
baidu/ERNIE-4.5-300B-A47B-Base-Paddle;
[baidu/ERNIE-4.5-21B-A3B-Paddle](./best_practices/ERNIE-4.5-21B-A3B-Paddle.md);
baidu/ERNIE-4.5-21B-A3B-Base-Paddle;
baidu/ERNIE-4.5-0.3B-Paddle
 [quick start](./get_started/quick_start.md)   [best practice](./best_practices/ERNIE-4.5-0.3B-Paddle.md);
baidu/ERNIE-4.5-0.3B-Base-Paddle, etc.| +|⭐QWEN3-MOE|BF16/WINT4/WINT8/FP8|Qwen/Qwen3-235B-A22B;
Qwen/Qwen3-30B-A3B, etc.| +|⭐QWEN3|BF16/WINT8/FP8|Qwen/Qwen3-32B;<br>
Qwen/Qwen3-14B;<br>
Qwen/Qwen3-8B;<br>
Qwen/Qwen3-4B;<br>
Qwen/Qwen3-1.7B;<br>
[Qwen/Qwen3-0.6B](./get_started/quick_start_qwen.md), etc.| +|⭐QWEN2.5|BF16/WINT8/FP8|Qwen/Qwen2.5-72B;<br>
Qwen/Qwen2.5-32B;<br>
Qwen/Qwen2.5-14B;<br>
Qwen/Qwen2.5-7B;<br>
Qwen/Qwen2.5-3B;<br>
Qwen/Qwen2.5-1.5B;<br>
Qwen/Qwen2.5-0.5B, etc.| +|⭐QWEN2|BF16/WINT8/FP8|Qwen/Qwen2-72B;<br>
Qwen/Qwen2-7B;<br>
Qwen/Qwen2-1.5B;<br>
Qwen/Qwen2-0.5B;<br>
Qwen/QwQ-32B, etc.| +|DEEPSEEK|BF16/WINT4|unsloth/DeepSeek-V3.1-BF16;<br>
unsloth/DeepSeek-V3-0324-BF16;
unsloth/DeepSeek-R1-BF16, etc.| + +## Multimodal Language Models + +These models accept multi-modal inputs (e.g., images and text). + +|Models|DataType|Example HF Model| +|-|-|-| +| ERNIE-VL |BF16/WINT4/WINT8| baidu/ERNIE-4.5-VL-424B-A47B-Paddle
 [quick start](./get_started/ernie-4.5-vl.md)   [best practice](./best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md) ;
baidu/ERNIE-4.5-VL-28B-A3B-Paddle
 [quick start](./get_started/quick_start_vl.md)   [best practice](./best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md) ;| +| QWEN-VL |BF16/WINT4/FP8| Qwen/Qwen2.5-VL-72B-Instruct;
Qwen/Qwen2.5-VL-32B-Instruct;
Qwen/Qwen2.5-VL-7B-Instruct;
Qwen/Qwen2.5-VL-3B-Instruct| + +## Minimum Resource Deployment Instructions + +There is no universal formula for minimum deployment resources; it depends on both context length and quantization method. We recommend estimating the required GPU memory using the following formula: +``` +Required GPU Memory = Number of Parameters × Quantization Byte factor +``` +> (The factor list is provided below.) + +And the final number of GPUs depends on: +``` +Number of GPUs = Total Memory Requirement ÷ Memory per GPU +``` + +| Quantization Method | Bytes per Parameter factor | +| :--- | :--- | +|BF16 |2 | +|FP8 |1 | +|WINT8 |1 | +|WINT4 |0.5 | +|W4A8C8 |0.5 | More models are being supported. You can submit requests for new model support via [Github Issues](https://github.com/PaddlePaddle/FastDeploy/issues). diff --git a/docs/zh/get_started/quick_start_qwen.md b/docs/zh/get_started/quick_start_qwen.md new file mode 100644 index 00000000000..70127b52e04 --- /dev/null +++ b/docs/zh/get_started/quick_start_qwen.md @@ -0,0 +1,97 @@ +# 10分钟完成 Qwen3-0.6B 模型部署 + +本文档讲解如何部署Qwen3-0.6B模型,在开始部署前,请确保你的硬件环境满足如下条件: + +- GPU驱动 >= 535 +- CUDA >= 12.3 +- CUDNN >= 9.5 +- Linux X86_64 +- Python >= 3.10 +- 运行模型满足最低硬件配置要求,参考[支持模型列表文档](supported_models.md) + +为了快速在各类硬件上部署,本文档采用 ```Qwen3-0.6B``` 模型作为示例,可在大部分硬件上完成部署。 + +安装FastDeploy方式参考[安装文档](./installation/README.md)。 +## 1. 
启动服务 +安装FastDeploy后,在终端执行如下命令,启动服务,其中启动命令配置方式参考[参数说明](parameters.md) + +> ⚠️ **注意:** +> 当使用 HuggingFace 模型(torch 格式)时,需要开启 `--load_choices "default_v1"` + +```shell +export ENABLE_V1_KVCACHE_SCHEDULER=1 +python -m fastdeploy.entrypoints.openai.api_server \ + --model Qwen/Qwen3-0.6B \ + --port 8180 \ + --metrics-port 8181 \ + --engine-worker-queue-port 8182 \ + --max-model-len 32768 \ + --max-num-seqs 32 \ + --load_choices "default_v1" +``` + +>💡 注意:在 ```--model``` 指定的路径中,若当前目录下不存在该路径对应的子目录,则会尝试根据指定的模型名称(如 ```Qwen/Qwen3-0.6B```)查询AIStudio是否存在预置模型,若存在,则自动启动下载。默认的下载路径为:```~/xx```。关于模型自动下载的说明和配置参阅[模型下载](supported_models.md)。 +```--max-model-len``` 表示当前部署的服务所支持的最长Token数量。 +```--max-num-seqs``` 表示当前部署的服务所支持的最大并发处理数量。 + +**相关文档** + +- [服务部署配置](online_serving/README.md) +- [服务监控metrics](online_serving/metrics.md) + +## 2. 用户发起服务请求 + +执行启动服务指令后,当终端打印如下信息,说明服务已经启动成功。 + +``` +api_server.py[line:91] Launching metrics service at http://0.0.0.0:8181/metrics +api_server.py[line:94] Launching chat completion service at http://0.0.0.0:8180/v1/chat/completions +api_server.py[line:97] Launching completion service at http://0.0.0.0:8180/v1/completions +INFO: Started server process [13909] +INFO: Waiting for application startup. +INFO: Application startup complete. 
+INFO: Uvicorn running on http://0.0.0.0:8180 (Press CTRL+C to quit) +``` + +FastDeploy提供服务探活接口,用以判断服务的启动状态,执行如下命令返回 ```HTTP/1.1 200 OK``` 即表示服务启动成功。 + +```shell +curl -i http://0.0.0.0:8180/health +``` + +通过如下命令发起服务请求 + +```shell +curl -X POST "http://0.0.0.0:8180/v1/chat/completions" \ +-H "Content-Type: application/json" \ +-d '{ + "messages": [ + {"role": "user", "content": "把李白的静夜思改写为现代诗"} + ] +}' +``` + +FastDeploy服务接口兼容OpenAI协议,可以通过如下Python代码发起服务请求。 + +```python +import openai +host = "0.0.0.0" +port = "8180" +client = openai.Client(base_url=f"http://{host}:{port}/v1", api_key="null") + +response = client.chat.completions.create( + model="null", + messages=[ + {"role": "system", "content": "I'm a helpful AI assistant."}, + {"role": "user", "content": "把李白的静夜思改写为现代诗"}, + ], + stream=True, +) +for chunk in response: + if chunk.choices[0].delta: + print(chunk.choices[0].delta.content, end='') +print('\n') +``` diff --git a/docs/zh/index.md b/docs/zh/index.md index 73bf10fa962..73721bbc0b5 100644 --- a/docs/zh/index.md +++ b/docs/zh/index.md @@ -11,15 +11,39 @@ ## 支持模型 -| Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching | MTP | CUDA Graph | Maximum Context Length | +| Model | Data Type |[PD Disaggregation](./features/disaggregated.md) | [Chunked Prefill](./features/chunked_prefill.md) | [Prefix Caching](./features/prefix_caching.md) | [MTP](./features/speculative_decoding.md) | [CUDA Graph](./features/graph_optimization.md) | Maximum Context Length | |:--- | :------- | :---------- | :-------- | :-------- | :----- | :----- | :----- | -|ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | ✅| ✅ | ✅|✅| WIP |128K | -|ERNIE-4.5-300B-A47B-Base| BF16/WINT4/WINT8 | ✅| ✅ | ✅|❌| WIP | 128K | -|ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP | ✅ | WIP | ❌ | WIP |128K | -|ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 | ❌ | ✅ | WIP | ❌ | WIP |128K | -|ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | ✅ | ✅|128K | -|ERNIE-4.5-21B-A3B-Base 
| BF16/WINT4/WINT8/FP8 | ❌ | ✅ | ✅ | ❌ | ✅|128K | -|ERNIE-4.5-0.3B | BF16/WINT8/FP8 | ❌ | ✅ | ✅ | ❌ | ✅| 128K | +|ERNIE-4.5-300B-A47B|BF16/WINT4/WINT8/W4A8C8/WINT2/FP8|✅|✅|✅|✅|✅|128K| +|ERNIE-4.5-300B-A47B-Base|BF16/WINT4/WINT8|✅|✅|✅|⛔|✅|128K| +|ERNIE-4.5-VL-424B-A47B|BF16/WINT4/WINT8|🚧|✅|🚧|⛔|🚧|128K| +|ERNIE-4.5-VL-28B-A3B|BF16/WINT4/WINT8|⛔|✅|🚧|⛔|🚧|128K| +|ERNIE-4.5-21B-A3B|BF16/WINT4/WINT8/FP8|⛔|✅|✅|✅|✅|128K| +|ERNIE-4.5-21B-A3B-Base|BF16/WINT4/WINT8/FP8|⛔|✅|✅|⛔|✅|128K| +|ERNIE-4.5-0.3B|BF16/WINT8/FP8|⛔|✅|✅|⛔|✅|128K| +|QWEN3-MOE|BF16/WINT4/WINT8/FP8|⛔|✅|✅|🚧|✅|128K| +|QWEN3|BF16/WINT8/FP8|⛔|✅|✅|🚧|✅|128K| +|QWEN-VL|BF16/WINT8/FP8|⛔|✅|✅|🚧|⛔|128K| +|QWEN2|BF16/WINT8/FP8|⛔|✅|✅|🚧|✅|128K| +|DEEPSEEK-V3|BF16/WINT4|⛔|✅|✅|🚧|✅|128K| +|DEEPSEEK-R1|BF16/WINT4|⛔|✅|✅|🚧|✅|128K| + +``` +✅ 已支持 🚧 适配中 ⛔ 暂无计划 +``` + +## 支持硬件 + +| 模型 | [英伟达GPU](./get_started/installation/nvidia_gpu.md) |[昆仑芯P800](./get_started/installation/kunlunxin_xpu.md) | 昇腾910B | [海光K100-AI](./get_started/installation/hygon_dcu.md) | [天数天垓150](./get_started/installation/iluvatar_gpu.md) | [沐曦曦云C550](./get_started/installation/metax_gpu.md) | [燧原S60/L600](./get_started/installation/Enflame_gcu.md) | +|:------|---------|------------|----------|-------------|-----------|-------------|-------------| +| ERNIE4.5-VL-424B-A47B | ✅ | 🚧 | 🚧 | ⛔ | ⛔ | ⛔ | ⛔ | +| ERNIE4.5-300B-A47B | ✅ | ✅ | 🚧 | ✅ | ✅ | 🚧 | ✅ | +| ERNIE4.5-VL-28B-A3B | ✅ | 🚧 | 🚧 | ⛔ | 🚧 | ⛔ | ⛔ | +| ERNIE4.5-21B-A3B | ✅ | ✅ | 🚧 | ✅ | ✅ | ✅ | ✅ | +| ERNIE4.5-0.3B | ✅ | ✅ | 🚧 | ✅ | ✅ | ✅ | ✅ | + +``` +✅ 已支持 🚧 适配中 ⛔ 暂无计划 +``` ## 文档说明 diff --git a/docs/zh/parameters.md b/docs/zh/parameters.md index ce5e5d89f2e..72300638adc 100644 --- a/docs/zh/parameters.md +++ b/docs/zh/parameters.md @@ -32,10 +32,10 @@ | ```max_long_partial_prefills``` | `int` | 开启Chunked Prefill时,Prefill阶段并发中包启的最多长请求数,默认1 | | ```long_prefill_token_threshold``` | `int` | 开启Chunked Prefill时,请求Token数超过此值的请求被视为长请求,默认为max_model_len*0.04 | | ```static_decode_blocks``` | `int` | 
推理过程中,每条请求强制从Prefill的KVCache分配对应块数给Decode使用,默认2| -| ```reasoning_parser``` | `str` | 指定要使用的推理解析器,以便从模型输出中提取推理内容,详见[思考链输出](features/reasoning_output.md) | +| ```reasoning_parser``` | `str` | 指定要使用的推理解析器,以便从模型输出中提取推理内容 | | ```use_cudagraph``` | `bool` | 是否使用cuda graph,默认False。开启前建议仔细阅读 [graph_optimization.md](./features/graph_optimization.md),在多卡场景需要同时开启 Custom all-reduce。 | | ```graph_optimization_config``` | `dict[str]` | 可以配置计算图优化相关的参数,默认值为'{"use_cudagraph":false, "graph_opt_level":0, "cudagraph_capture_sizes": null }',详细说明参考 [graph_optimization.md](./features/graph_optimization.md)| -| ```disable_custom_all_reduce``` | `bool` | 关闭Custom all-reduce,默认False | +| ```enable_custom_all_reduce``` | `bool` | 开启Custom all-reduce,默认False | | ```splitwise_role``` | `str` | 是否开启splitwise推理,默认值mixed, 支持参数为["mixed", "decode", "prefill"] | | ```innode_prefill_ports``` | `str` | prefill 实例内部引擎启动端口 (仅单机PD分离需要),默认值None | | ```guided_decoding_backend``` | `str` | 指定要使用的guided decoding后端,支持 `auto`、`xgrammar`、`off`, 默认为 `off` | @@ -49,7 +49,7 @@ | ```chat_template``` | `str` | 指定模型拼接使用的模板,支持字符串与文件路径,默认为None,如未指定,则使用模型默认模板 | | ```tool_call_parser``` | `str` | 指定要使用的function call解析器,以便从模型输出中抽取 function call内容| | ```tool_parser_plugin``` | `str` | 指定要注册的tool parser文件路径,以便注册不在代码库中的parser,parser中代码格式需遵循代码库中格式| -| ```lm_head_fp32``` | `bool` | 指定lm_head层的类型为 FP32 | +| ```load_choices``` | `str` | 默认使用"default" loader进行权重加载,加载torch权重/权重加速需开启 "default_v1"| ## 1. KVCache分配与```num_gpu_blocks_override```、```block_size```的关系? diff --git a/docs/zh/supported_models.md b/docs/zh/supported_models.md index f7b95541f95..61f353b334f 100644 --- a/docs/zh/supported_models.md +++ b/docs/zh/supported_models.md @@ -2,9 +2,9 @@ FastDeploy目前支持模型列表如下,在FastDeploy部署时,指定 ``model``参数为如下表格中的模型名,即可自动下载模型权重(均支持断点续传),支持如下3种下载源, -- 1. [AIStudio/PaddlePaddle](https://aistudio.baidu.com/modelsoverview) 搜索相应Paddle后缀ERNIE模型,如ERNIE-4.5-0.3B-Paddle -- 2. 
[ModelScope/PaddlePaddle](https://www.modelscope.cn/models?name=PaddlePaddle&page=1&tabKey=task) 搜索相应Paddle后缀ERNIE模型,如ERNIE-4.5-0.3B-Paddle -- 3. [HuggingFace/baidu/models](https://huggingface.co/baidu/models) 下载Paddle后缀ERNIE模型,如baidu/ERNIE-4.5-0.3B-Paddle +- [AIStudio](https://aistudio.baidu.com/modelsoverview) +- [ModelScope](https://www.modelscope.cn/models) +- [HuggingFace](https://huggingface.co/models) 使用自动下载时,默认从AIStudio下载,用户可以通过配置环境变量 ``FD_MODEL_SOURCE``修改默认下载来源,可取值"AISTUDIO","MODELSCOPE"或"HUGGINGFACE";默认下载路径为 ``~/``(即用户主目录),用户可以通过配置环境变量 ``FD_MODEL_CACHE``修改默认下载的路径,例如 @@ -13,25 +13,51 @@ export FD_MODEL_SOURCE=AISTUDIO # "AISTUDIO", "MODELSCOPE" or "HUGGINGFACE" export FD_MODEL_CACHE=/ssd1/download_models ``` -| 模型名 | 上下文长度 | 量化方式 | 最小部署资源 | 说明 | -| :------------------------------------------ | :--------- | :------- | :-------------------- | :---------------------------------------------- | -| baidu/ERNIE-4.5-VL-424B-A47B-Paddle | 32K/128K | WINT4 | 4卡*80G显存/1T内存 | 128K需要开启Chunked Prefill | -| baidu/ERNIE-4.5-VL-424B-A47B-Paddle | 32K/128K | WINT8 | 8卡*80G显存/1T内存 | 128K需要开启Chunked Prefill | -| baidu/ERNIE-4.5-300B-A47B-Paddle | 32K/128K | WINT4 | 4卡*64G显存/600G内存 | 128K需要开启Chunked Prefill | -| baidu/ERNIE-4.5-300B-A47B-Paddle | 32K/128K | WINT8 | 8卡*64G显存/600G内存 | 128K需要开启Chunked Prefill | -| baidu/ERNIE-4.5-300B-A47B-2Bits-Paddle | 32K/128K | WINT2 | 1卡*141G显存/600G内存 | 128K需要开启Chunked Prefill | -| baidu/ERNIE-4.5-300B-A47B-W4A8C8-TP4-Paddle | 32K/128K | W4A8C8 | 4卡*64G显存/160G内存 | 限定4卡,建议开启Chunked Prefill | -| baidu/ERNIE-4.5-300B-A47B-FP8-Paddle | 32K/128K | FP8 | 8卡*64G显存/600G内存 | 建议开启Chunked Prefill,仅在PD分离EP并行下支持 | -| baidu/ERNIE-4.5-300B-A47B-Base-Paddle | 32K/128K | WINT4 | 4卡*64G显存/600G内存 | 建议开启Chunked Prefill | -| baidu/ERNIE-4.5-300B-A47B-Base-Paddle | 32K/128K | WINT8 | 8卡*64G显存/600G内存 | 建议开启Chunked Prefill | -| baidu/ERNIE-4.5-VL-28B-A3B-Paddle | 32K | WINT4 | 1卡*24G/128G内存 | 需要开启Chunked Prefill | -| baidu/ERNIE-4.5-VL-28B-A3B-Paddle | 128K | WINT4 
| 1卡*48G/128G内存 | 需要开启Chunked Prefill | -| baidu/ERNIE-4.5-VL-28B-A3B-Paddle | 32K/128K | WINT8 | 1卡*48G/128G内存 | 需要开启Chunked Prefill | -| baidu/ERNIE-4.5-21B-A3B-Paddle | 32K/128K | WINT4 | 1卡*24G/128G内存 | 128K需要开启Chunked Prefill | -| baidu/ERNIE-4.5-21B-A3B-Paddle | 32K/128K | WINT8 | 1卡*48G/128G内存 | 128K需要开启Chunked Prefill | -| baidu/ERNIE-4.5-21B-A3B-Base-Paddle | 32K/128K | WINT4 | 1卡*24G/128G内存 | 128K需要开启Chunked Prefill | -| baidu/ERNIE-4.5-21B-A3B-Base-Paddle | 32K/128K | WINT8 | 1卡*48G/128G内存 | 128K需要开启Chunked Prefill | -| baidu/ERNIE-4.5-0.3B-Paddle | 32K/128K | BF16 | 1卡*6G/12G显存/2G内存 | | -| baidu/ERNIE-4.5-0.3B-Base-Paddle | 32K/128K | BF16 | 1卡*6G/12G显存/2G内存 | | +> ⭐ **说明**:带 ⭐ 标记的模型可直接使用 **HuggingFace Torch 权重**,支持 **FP8/WINT8/WINT4 动态量化** 和 **BF16 精度** 推理,推理时需启用 **`--load_choices "default_v1"`**。 + +> 以baidu/ERNIE-4.5-0.3B-PT为例,启动命令如下: +``` +python -m fastdeploy.entrypoints.openai.api_server \ + --model baidu/ERNIE-4.5-0.3B-PT \ + --port 8180 \ + --metrics-port 8181 \ + --engine-worker-queue-port 8182 \ + --max-model-len 32768 \ + --max-num-seqs 32 \ + --load_choices "default_v1" +``` + +## 纯文本模型列表 + +|模型|DataType|模型案例| +|-|-|-| +|⭐ERNIE|BF16/WINT4/WINT8/W4A8C8/WINT2/FP8|baidu/ERNIE-4.5-VL-424B-A47B-Paddle;<br>
baidu/ERNIE-4.5-300B-A47B-Paddle
 [快速部署](./get_started/ernie-4.5.md)   [最佳实践](./best_practices/ERNIE-4.5-300B-A47B-Paddle.md);
baidu/ERNIE-4.5-300B-A47B-2Bits-Paddle;
baidu/ERNIE-4.5-300B-A47B-W4A8C8-TP4-Paddle;
baidu/ERNIE-4.5-300B-A47B-FP8-Paddle;
baidu/ERNIE-4.5-300B-A47B-Base-Paddle;
[baidu/ERNIE-4.5-21B-A3B-Paddle](./best_practices/ERNIE-4.5-21B-A3B-Paddle.md);
baidu/ERNIE-4.5-21B-A3B-Base-Paddle;
baidu/ERNIE-4.5-0.3B-Paddle
 [快速部署](./get_started/quick_start.md)   [最佳实践](./best_practices/ERNIE-4.5-0.3B-Paddle.md);
baidu/ERNIE-4.5-0.3B-Base-Paddle, etc.| +|⭐QWEN3-MOE|BF16/WINT4/WINT8/FP8|Qwen/Qwen3-235B-A22B;
Qwen/Qwen3-30B-A3B, etc.| +|⭐QWEN3|BF16/WINT8/FP8|Qwen/qwen3-32B;
Qwen/qwen3-14B;
Qwen/qwen3-8B;
Qwen/qwen3-4B;
Qwen/qwen3-1.7B;
[Qwen/qwen3-0.6B](./get_started/quick_start_qwen.md), etc.| +|⭐QWEN2.5|BF16/WINT8/FP8|Qwen/qwen2.5-72B;
Qwen/qwen2.5-32B;
Qwen/qwen2.5-14B;
Qwen/qwen2.5-7B;
Qwen/qwen2.5-3B;
Qwen/qwen2.5-1.5B;
Qwen/qwen2.5-0.5B, etc.| +|⭐QWEN2|BF16/WINT8/FP8|Qwen/Qwen/qwen2-72B;
Qwen/Qwen/qwen2-7B;
Qwen/qwen2-1.5B;
Qwen/qwen2-0.5B;
Qwen/QwQ-32, etc.| +|DEEPSEEK|BF16/WINT4|unsloth/DeepSeek-V3.1-BF16;
unsloth/DeepSeek-V3-0324-BF16;
unsloth/DeepSeek-R1-BF16, etc.| + +## 多模态语言模型列表 + +根据模型不同,支持多种模态(文本、图像等)组合: + +|模型|DataType|模型案例| +|-|-|-| +| ERNIE-VL |BF16/WINT4/WINT8| baidu/ERNIE-4.5-VL-424B-A47B-Paddle
 [快速部署](./get_started/ernie-4.5-vl.md)   [最佳实践](./best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md) ;
baidu/ERNIE-4.5-VL-28B-A3B-Paddle
 [快速部署](./get_started/quick_start_vl.md)   [最佳实践](./best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md) ;| +| QWEN-VL |BF16/WINT4/FP8| Qwen/Qwen2.5-VL-72B-Instruct;
Qwen/Qwen2.5-VL-32B-Instruct;
Qwen/Qwen2.5-VL-7B-Instruct;
Qwen/Qwen2.5-VL-3B-Instruct| + +## 最小资源部署说明 + +最小部署资源没有普适公式,需要根据上下文长度 和 量化方式 +我们推荐计算显存需求 = 参数量 × 量化方式字节系数(系数列表如下),最终 GPU 数量取决于 总显存需求 ÷ 单卡显存 + +|量化方式 |对应每参数字节系数 | +| :--- | :--- | +|BF16 |2 | +|FP8 |1 | +|WINT8 |1 | +|WINT4 |0.5 | +|W4A8C8 |0.5 | 更多模型同步支持中,你可以通过[Github Issues](https://github.com/PaddlePaddle/FastDeploy/issues)向我们提交新模型的支持需求。 diff --git a/mkdocs.yml b/mkdocs.yml index a0a0a2446b1..16237f04d58 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -2,11 +2,13 @@ site_name: 'FastDeploy : Large Language Model Deployement' repo_url: https://github.com/PaddlePaddle/FastDeploy repo_name: FastDeploy +copyright: Copyright © 2025 Maintained by FastDeploy + theme: name: material highlightjs: true - icon: - repo: fontawesome/brands/github + favicon: assets/images/favicon.ico + logo: assets/images/logo.jpg palette: - media: "(prefers-color-scheme: light)" # 浅色 scheme: default @@ -50,10 +52,12 @@ plugins: HYGON DCU: 海光 DCU Enflame S60: 燧原 S60 Iluvatar CoreX: 天数 CoreX + Metax C550: 沐曦 C550 Quick Deployment For ERNIE-4.5-0.3B: ERNIE-4.5-0.3B快速部署 Quick Deployment for ERNIE-4.5-VL-28B-A3B: ERNIE-4.5-VL-28B-A3B快速部署 ERNIE-4.5-300B-A47B: ERNIE-4.5-300B-A47B快速部署 ERNIE-4.5-VL-424B-A47B: ERNIE-4.5-VL-424B-A47B快速部署 + Quick Deployment For QWEN: Qwen3-0.6b快速部署 Online Serving: 在线服务 OpenAI-Compitable API Server: 兼容 OpenAI 协议的服务化部署 Monitor Metrics: 监控Metrics @@ -85,6 +89,7 @@ plugins: MultiNode Deployment: 多机部署 Graph Optimization: 图优化 Data Parallelism: 数据并行 + PLAS: PLAS Supported Models: 支持模型列表 Benchmark: 基准测试 Usage: 用法 @@ -93,24 +98,26 @@ plugins: Environment Variables: 环境变量 nav: - - 'FastDeploy': index.md - - 'Quick Start': + - FastDeploy: index.md + - Quick Start: - Installation: - - 'Nvidia GPU': get_started/installation/nvidia_gpu.md - - 'KunlunXin XPU': get_started/installation/kunlunxin_xpu.md - - 'HYGON DCU': get_started/installation/hygon_dcu.md - - 'Enflame S60': get_started/installation/Enflame_gcu.md - - 'Iluvatar CoreX': 
get_started/installation/iluvatar_gpu.md - - 'Quick Deployment For ERNIE-4.5-0.3B': get_started/quick_start.md - - 'Quick Deployment for ERNIE-4.5-VL-28B-A3B': get_started/quick_start_vl.md - - 'ERNIE-4.5-300B-A47B': get_started/ernie-4.5.md - - 'ERNIE-4.5-VL-424B-A47B': get_started/ernie-4.5-vl.md - - 'Online Serving': - - 'OpenAI-Compitable API Server': online_serving/README.md - - 'Monitor Metrics': online_serving/metrics.md - - 'Scheduler': online_serving/scheduler.md - - 'Graceful Shutdown': online_serving/graceful_shutdown_service.md - - 'Offline Inference': offline_inference.md + - Nvidia GPU: get_started/installation/nvidia_gpu.md + - KunlunXin XPU: get_started/installation/kunlunxin_xpu.md + - HYGON DCU: get_started/installation/hygon_dcu.md + - Enflame S60: get_started/installation/Enflame_gcu.md + - Iluvatar CoreX: get_started/installation/iluvatar_gpu.md + - Metax C550: get_started/installation/metax_gpu.md + - Quick Deployment For ERNIE-4.5-0.3B: get_started/quick_start.md + - Quick Deployment for ERNIE-4.5-VL-28B-A3B: get_started/quick_start_vl.md + - ERNIE-4.5-300B-A47B: get_started/ernie-4.5.md + - ERNIE-4.5-VL-424B-A47B: get_started/ernie-4.5-vl.md + - Quick Deployment For QWEN: get_started/quick_start_qwen.md + - Online Serving: + - OpenAI-Compitable API Server: online_serving/README.md + - Monitor Metrics: online_serving/metrics.md + - Scheduler: online_serving/scheduler.md + - Graceful Shutdown: online_serving/graceful_shutdown_service.md + - Offline Inference: offline_inference.md - Best Practices: - ERNIE-4.5-0.3B: best_practices/ERNIE-4.5-0.3B-Paddle.md - ERNIE-4.5-21B-A3B: best_practices/ERNIE-4.5-21B-A3B-Paddle.md @@ -119,27 +126,27 @@ nav: - ERNIE-4.5-VL-424B-A47B: best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md - FAQ: best_practices/FAQ.md - Quantization: - - 'Overview': quantization/README.md - - 'Online Quantization': quantization/online_quantization.md - - 'WINT2 Quantization': quantization/wint2.md + - Overview: quantization/README.md 
+ - Online Quantization: quantization/online_quantization.md + - WINT2 Quantization: quantization/wint2.md - Features: - - 'Prefix Caching': features/prefix_caching.md - - 'Disaggregation': features/disaggregated.md - - 'Chunked Prefill': features/chunked_prefill.md - - 'Load Balance': features/load_balance.md - - 'Speculative Decoding': features/speculative_decoding.md - - 'Structured Outputs': features/structured_outputs.md - - 'Reasoning Output': features/reasoning_output.md - - 'Early Stop': features/early_stop.md - - 'Plugins': features/plugins.md - - 'Sampling': features/sampling.md - - 'MultiNode Deployment': features/multi-node_deployment.md - - 'Graph Optimization': features/graph_optimization.md - - 'Data Parallelism': features/data_parallel_service.md - - 'Supported Models': supported_models.md + - Prefix Caching: features/prefix_caching.md + - Disaggregation: features/disaggregated.md + - Chunked Prefill: features/chunked_prefill.md + - Load Balance: features/load_balance.md + - Speculative Decoding: features/speculative_decoding.md + - Structured Outputs: features/structured_outputs.md + - Reasoning Output: features/reasoning_output.md + - Early Stop: features/early_stop.md + - Plugins: features/plugins.md + - Sampling: features/sampling.md + - MultiNode Deployment: features/multi-node_deployment.md + - Graph Optimization: features/graph_optimization.md + - Data Parallelism: features/data_parallel_service.md + - PLAS: features/plas_attention.md + - Supported Models: supported_models.md - Benchmark: benchmark.md - Usage: - - 'Log Description': usage/log.md - - 'Code Overview': usage/code_overview.md - - 'Environment Variables': usage/environment_variables.md - - 'FastDeploy Unit Test Guide': usage/fastdeploy_unit_test_guide.md + - Log Description: usage/log.md + - Code Overview: usage/code_overview.md + - Environment Variables: usage/environment_variables.md
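The sizing rule of thumb from the minimum-resource section above (VRAM ≈ parameter count × bytes-per-parameter, GPUs = total VRAM ÷ per-GPU VRAM) can be sketched as a small helper. The 20% `overhead` factor for KV cache and activations is an assumption for illustration, not a FastDeploy constant, so treat the result as a lower bound; long contexts typically need more headroom than this estimate gives.

```python
import math

# Bytes-per-parameter coefficients from the quantization table above.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "WINT8": 1.0, "WINT4": 0.5, "W4A8C8": 0.5}

def min_gpus(params_billions: float, quant: str, gpu_mem_gb: float, overhead: float = 1.2) -> int:
    """Estimate the GPU count: weight VRAM in GB (1B params * coefficient = 1GB * coefficient),
    scaled by an assumed overhead factor, divided by per-GPU memory, rounded up."""
    weights_gb = params_billions * BYTES_PER_PARAM[quant]
    return max(1, math.ceil(weights_gb * overhead / gpu_mem_gb))

# e.g. a 300B-parameter model in WINT4 on 64GB GPUs
print(min_gpus(300, "WINT4", 64))  # -> 3
```

With `overhead=1.2` this gives 3 GPUs for the 300B/WINT4 case; deployment tables may list more (e.g. 4) because long-context serving reserves extra KV-cache memory.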
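Once a server is started with the example launch command above, it exposes an OpenAI-compatible endpoint (see the "OpenAI-Compitable API Server" page). A minimal chat-completions request body looks like the sketch below; the host, port 8180, and model name are assumptions carried over from that example, and any HTTP client can send it.

```python
import json

# Standard OpenAI chat-completions payload; model/port follow the example launch
# command above and should be adjusted to match your deployment.
payload = {
    "model": "baidu/ERNIE-4.5-0.3B-PT",
    "messages": [{"role": "user", "content": "Introduce FastDeploy in one sentence."}],
    "max_tokens": 64,
}
body = json.dumps(payload)
# Send with any HTTP client, e.g.:
#   curl http://localhost:8180/v1/chat/completions \
#        -H "Content-Type: application/json" -d @body.json
```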