3 changes: 2 additions & 1 deletion README.md
@@ -43,7 +43,7 @@ English | [简体中文](README_CN.md)
- 🤝 **OpenAI API Server and vLLM Compatible**: One-command deployment with [vLLM](https://github.com/vllm-project/vllm/) interface compatibility.
- 🧮 **Comprehensive Quantization Format Support**: W8A16, W8A8, W4A16, W4A8, W2A16, FP8, and more.
- ⏩ **Advanced Acceleration Techniques**: Speculative decoding, Multi-Token Prediction (MTP) and Chunked Prefill.
- 🖥️ **Multi-Hardware Support**: NVIDIA GPU, Kunlunxin XPU, Hygon DCU, Ascend NPU, Iluvatar GPU, Enflame GCU, MetaX GPU etc.
- 🖥️ **Multi-Hardware Support**: NVIDIA GPU, Kunlunxin XPU, Hygon DCU, Ascend NPU, Iluvatar GPU, Enflame GCU, MetaX GPU, Intel Gaudi etc.

## Requirements

@@ -60,6 +60,7 @@ FastDeploy supports inference deployment on **NVIDIA GPUs**, **Kunlunxin XPUs**,
- [Enflame GCU](./docs/get_started/installation/Enflame_gcu.md)
- [Hygon DCU](./docs/get_started/installation/hygon_dcu.md)
- [MetaX GPU](./docs/get_started/installation/metax_gpu.md)
- [Intel Gaudi](./docs/get_started/installation/intel_gaudi.md)

**Note:** We are actively working on expanding hardware support. Additional hardware platforms including Ascend NPU are currently under development and testing. Stay tuned for updates!

3 changes: 2 additions & 1 deletion README_CN.md
@@ -41,7 +41,7 @@
- 🤝 **OpenAI API服务与vLLM兼容**:单命令部署,兼容[vLLM](https://github.com/vllm-project/vllm/)接口
- 🧮 **全量化格式支持**:W8A16、W8A8、W4A16、W4A8、W2A16、FP8等
- ⏩ **高级加速技术**:推测解码、多令牌预测(MTP)及分块预填充
- 🖥️ **多硬件支持**:NVIDIA GPU、昆仑芯XPU、海光DCU、昇腾NPU、天数智芯GPU、燧原GCU、沐曦GPU等
- 🖥️ **多硬件支持**:NVIDIA GPU、昆仑芯XPU、海光DCU、昇腾NPU、天数智芯GPU、燧原GCU、沐曦GPU、英特尔Gaudi等

## 要求

@@ -58,6 +58,7 @@ FastDeploy 支持在**英伟达(NVIDIA)GPU**、**昆仑芯(Kunlunxin)XPU
- [燧原 S60](./docs/zh/get_started/installation/Enflame_gcu.md)
- [海光 DCU](./docs/zh/get_started/installation/hygon_dcu.md)
- [沐曦 GPU](./docs/zh/get_started/installation/metax_gpu.md)
- [英特尔 Gaudi](./docs/zh/get_started/installation/intel_gaudi.md)

**注意:** 我们正在积极拓展硬件支持范围。目前,包括昇腾(Ascend)NPU 等其他硬件平台正在开发测试中。敬请关注更新!

10 changes: 9 additions & 1 deletion build.sh
@@ -128,6 +128,12 @@ function copy_ops(){
echo -e "MACA ops have been copied to fastdeploy"
return
fi
is_intel_hpu=`$python -c "import paddle; print(paddle.is_compiled_with_custom_device('intel_hpu'))"`
if [ "$is_intel_hpu" = "True" ]; then
DEVICE_TYPE="intel-hpu"
echo -e "intel_hpu ops have been copied to fastdeploy"
return
fi
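
The probe above asks Paddle whether it was compiled with the `intel_hpu` custom device and short-circuits on the first match. A minimal, paddle-free sketch of that selection logic (the dict argument is a stand-in for the real `paddle.is_compiled_with_custom_device` calls):

```python
# Host-side sketch of build.sh's backend probe: the first backend Paddle was
# compiled with wins; otherwise fall back to "cpu". The dict stands in for
# calls such as paddle.is_compiled_with_custom_device("intel_hpu").
def select_device_type(compiled_backends):
    if compiled_backends.get("intel_hpu", False):
        return "intel-hpu"
    return "cpu"

print(select_device_type({"intel_hpu": True}))  # intel-hpu
print(select_device_type({}))                   # cpu
```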

DEVICE_TYPE="cpu"
cd ../../../../
@@ -159,7 +165,9 @@ function build_and_install_ops() {
else
FD_BUILDING_ARCS=${FD_BUILDING_ARCS} ${python} setup_ops.py install --install-lib ${OPS_TMP_DIR}
fi
find ${OPS_TMP_DIR} -type f -name "*.o" -exec rm -f {} \;
if [ -d "${OPS_TMP_DIR}" ]; then
find ${OPS_TMP_DIR} -type f -name "*.o" -exec rm -f {} \;
fi
else
echo "Error: Invalid parameter '$FD_CPU_USE_BF16'. Please use true or false."
exit 1
2 changes: 2 additions & 0 deletions custom_ops/setup_ops.py
@@ -623,6 +623,8 @@ def find_end_files(directory, end_str):
],
),
)
elif paddle.is_compiled_with_custom_device("intel_hpu"):
    pass  # no custom ops are compiled for the intel_hpu backend here
else:
use_bf16 = envs.FD_CPU_USE_BF16 == "True"

1 change: 1 addition & 0 deletions docs/get_started/installation/README.md
@@ -7,3 +7,4 @@ FastDeploy currently supports installation on the following hardware platforms:
- [Enflame S60 GCU Installation](Enflame_gcu.md)
- [Iluvatar GPU Installation](iluvatar_gpu.md)
- [Hygon DCU Installation](hygon_dcu.md)
- [Intel Gaudi Installation](intel_gaudi.md)
75 changes: 75 additions & 0 deletions docs/get_started/installation/intel_gaudi.md
@@ -0,0 +1,75 @@
# Intel Gaudi Installation for Running ERNIE 4.5 Series Models

The following installation methods are available when your environment meets these requirements:

- Python 3.10
- Intel Gaudi 2
- Intel Gaudi software version 1.22.0
- Linux X86_64

## 1. Run Docker Container

Use the following commands to run a Docker container. Make sure the versions you use match those listed in the [Support Matrix](https://docs.habana.ai/en/latest/Support_Matrix/Support_Matrix.html):

```{.console}
$ docker pull vault.habana.ai/gaudi-docker/1.22.0/ubuntu22.04/habanalabs/pytorch-installer-2.7.1:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.22.0/ubuntu22.04/habanalabs/pytorch-installer-2.7.1:latest
```

## 2. Install PaddlePaddle

```bash
python -m pip install paddlepaddle==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
```

## 3. Install PaddleCustomDevice
```shell
git clone https://github.com/PaddlePaddle/PaddleCustomDevice
cd PaddleCustomDevice/backends/intel_hpu/
mkdir -p build
cd build
cmake ..
make -j
pip install --force-reinstall dist/paddle_intel_hpu*.whl
cd ../custom_ops  # from build/ back into the sibling custom_ops directory
python setup.py install
```

## 4. Install FastDeploy

```shell
git clone https://github.com/PaddlePaddle/FastDeploy
cd FastDeploy
bash build.sh
```

## Prepare the inference demo

### 1. Start inference service
```shell
export GC_KERNEL_PATH=/usr/lib/habanalabs/libtpc_kernels.so
export GC_KERNEL_PATH=/usr/local/lib/python3.10/dist-packages/paddle_custom_device/intel_hpu/libcustom_tpc_perf_lib.so:$GC_KERNEL_PATH
export INTEL_HPU_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PADDLE_DISTRI_BACKEND=xccl
export PADDLE_XCCL_BACKEND=intel_hpu
export HABANA_PROFILE=0
export HPU_VISIBLE_DEVICES=0

HPU_WARMUP_BUCKET=1 HPU_WARMUP_MODEL_LEN=4096 FD_ATTENTION_BACKEND=HPU_ATTN python -m fastdeploy.entrypoints.openai.api_server --model ERNIE-4.5-21B-A3B-Paddle --tensor-parallel-size 1 --max-model-len 32768 --max-num-seqs 128
```

### 2. Send a request
```bash
curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "What is AI?"}
], "max_tokens": 24
}'
```
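
The same request can be issued from Python using only the standard library; a minimal sketch, assuming the service from step 1 is listening on port 8188:

```python
import json
from urllib import request

# Build the same chat-completion request as the curl command above.
payload = {
    "messages": [{"role": "user", "content": "What is AI?"}],
    "max_tokens": 24,
}
req = request.Request(
    "http://0.0.0.0:8188/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With the server running, uncomment to send it:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```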

### 3. Example of a successful response
```json
{"id":"chatcmpl-3bd98ae2-fafe-46ae-a552-d653a8526503","object":"chat.completion","created":1757653575,"model":"ERNIE-4.5-21B-A3B-Paddle","choices":[{"index":0,"message":{"role":"assistant","content":"**AI (Artificial Intelligence)** refers to the development of computer systems that can perform tasks typically requiring human intelligence.","multimodal_content":null,"reasoning_content":null,"tool_calls":null,"prompt_token_ids":null,"completion_token_ids":null,"text_after_process":null,"raw_prediction":null,"prompt_tokens":null,"completion_tokens":null},"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":11,"total_tokens":35,"completion_tokens":24,"prompt_tokens_details":{"cached_tokens":0}}}
```
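
The `usage` block in the response can be checked programmatically; a minimal sketch over a trimmed copy of the payload above:

```python
import json

# Sanity-check the response from step 3: token accounting must balance, and
# finish_reason "length" confirms generation stopped at max_tokens (24).
response = json.loads(
    '{"object": "chat.completion",'
    ' "choices": [{"index": 0, "finish_reason": "length"}],'
    ' "usage": {"prompt_tokens": 11, "total_tokens": 35, "completion_tokens": 24}}'
)
usage = response["usage"]
assert usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]
print(response["choices"][0]["finish_reason"])  # length
```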
1 change: 1 addition & 0 deletions docs/zh/get_started/installation/README.md
@@ -7,3 +7,4 @@ FastDeploy支持如下硬件平台:
- [Enflame S60 GCU Installation](Enflame_gcu.md)
- [Iluvatar GPU Installation](iluvatar_gpu.md)
- [Hygon DCU Installation](hygon_dcu.md)
- [Intel Gaudi Installation](intel_gaudi.md)
75 changes: 75 additions & 0 deletions docs/zh/get_started/installation/intel_gaudi.md
@@ -0,0 +1,75 @@
# 使用 Intel Gaudi 运行 ERNIE 4.5 系列模型

在环境满足如下条件的前提下:

- Python 3.10
- Intel Gaudi 2
- Intel Gaudi software version 1.22.0
- Linux X86_64

## 1. 运行Docker容器

使用以下命令运行 Docker 容器,并确保所用版本与 [Support Matrix](https://docs.habana.ai/en/latest/Support_Matrix/Support_Matrix.html) 中列出的版本一致:

```{.console}
$ docker pull vault.habana.ai/gaudi-docker/1.22.0/ubuntu22.04/habanalabs/pytorch-installer-2.7.1:latest
$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.22.0/ubuntu22.04/habanalabs/pytorch-installer-2.7.1:latest
```

## 2. 安装 PaddlePaddle

```bash
python -m pip install paddlepaddle==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
```

## 3. 安装 PaddleCustomDevice
```shell
git clone https://github.com/PaddlePaddle/PaddleCustomDevice
cd PaddleCustomDevice/backends/intel_hpu/
mkdir -p build
cd build
cmake ..
make -j
pip install --force-reinstall dist/paddle_intel_hpu*.whl
cd ../custom_ops  # 从 build/ 返回并进入同级的 custom_ops 目录
python setup.py install
```

## 4. 安装 FastDeploy

```shell
git clone https://github.com/PaddlePaddle/FastDeploy
cd FastDeploy
bash build.sh
```

## 准备推理示例

### 1. 启动推理服务
```shell
export GC_KERNEL_PATH=/usr/lib/habanalabs/libtpc_kernels.so
export GC_KERNEL_PATH=/usr/local/lib/python3.10/dist-packages/paddle_custom_device/intel_hpu/libcustom_tpc_perf_lib.so:$GC_KERNEL_PATH
export INTEL_HPU_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PADDLE_DISTRI_BACKEND=xccl
export PADDLE_XCCL_BACKEND=intel_hpu
export HABANA_PROFILE=0
export HPU_VISIBLE_DEVICES=0

HPU_WARMUP_BUCKET=1 HPU_WARMUP_MODEL_LEN=4096 FD_ATTENTION_BACKEND=HPU_ATTN python -m fastdeploy.entrypoints.openai.api_server --model ERNIE-4.5-21B-A3B-Paddle --tensor-parallel-size 1 --max-model-len 32768 --max-num-seqs 128
```

### 2. 发送请求
```bash
curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "What is AI?"}
], "max_tokens": 24
}'
```

### 3. 成功返回结果
```json
{"id":"chatcmpl-3bd98ae2-fafe-46ae-a552-d653a8526503","object":"chat.completion","created":1757653575,"model":"ERNIE-4.5-21B-A3B-Paddle","choices":[{"index":0,"message":{"role":"assistant","content":"**AI (Artificial Intelligence)** refers to the development of computer systems that can perform tasks typically requiring human intelligence.","multimodal_content":null,"reasoning_content":null,"tool_calls":null,"prompt_token_ids":null,"completion_token_ids":null,"text_after_process":null,"raw_prediction":null,"prompt_tokens":null,"completion_tokens":null},"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":11,"total_tokens":35,"completion_tokens":24,"prompt_tokens_details":{"cached_tokens":0}}}
```
2 changes: 2 additions & 0 deletions fastdeploy/config.py
@@ -1498,6 +1498,8 @@ def __init__(
self.device_ids = os.getenv("CUDA_VISIBLE_DEVICES", self.device_ids)
if current_platform.is_xpu():
self.device_ids = os.getenv("XPU_VISIBLE_DEVICES", self.device_ids)
if current_platform.is_intel_hpu():
self.device_ids = os.getenv("HPU_VISIBLE_DEVICES", self.device_ids)

self.read_from_config()
self.postprocess()
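
The per-platform fallback above is just `os.getenv` with the existing value as the default; a minimal sketch (the platform string stands in for the `current_platform.is_*()` checks):

```python
import os

# Each platform reads its own *_VISIBLE_DEVICES variable, keeping the
# existing device_ids value when the variable is unset.
def resolve_device_ids(device_ids, platform, environ=None):
    environ = os.environ if environ is None else environ
    env_var = {
        "cuda": "CUDA_VISIBLE_DEVICES",
        "xpu": "XPU_VISIBLE_DEVICES",
        "intel_hpu": "HPU_VISIBLE_DEVICES",
    }.get(platform)
    if env_var is None:
        return device_ids
    return environ.get(env_var, device_ids)

print(resolve_device_ids("0,1", "intel_hpu", {"HPU_VISIBLE_DEVICES": "0"}))  # 0
print(resolve_device_ids("0,1", "intel_hpu", {}))                            # 0,1
```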
23 changes: 23 additions & 0 deletions fastdeploy/distributed/communication.py
@@ -66,3 +66,26 @@ def tensor_model_parallel_all_reduce(

except:
tensor_model_parallel_all_reduce = None

from paddle.distributed.communication import stream
from paddle.distributed.communication.reduce import ReduceOp


def all_reduce(
tensor,
op,
group,
sync_op: bool = True,
):
return stream.all_reduce(tensor, op=op, group=group, sync_op=sync_op, use_calc_stream=True)


@paddle.jit.marker.unified
def tensor_model_parallel_all_reduce_custom(input_: paddle.Tensor) -> paddle.Tensor:
"""All-reduce the input tensor across model parallel group on calc stream."""
if paddle.in_dynamic_mode():
hcg = dist.fleet.get_hybrid_communicate_group()
mp_group = hcg.get_model_parallel_group()
all_reduce(input_, op=ReduceOp.SUM, group=mp_group)
else:
dist.all_reduce(input_)
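
For readers new to the collective: a SUM all-reduce leaves every rank in the model-parallel group holding the elementwise sum of all ranks' inputs. A host-side sketch of that semantics, with plain lists in place of tensors and no process group involved:

```python
# Illustrates what the all-reduce above computes: after the collective,
# every rank holds the elementwise sum of all ranks' tensors.
def all_reduce_sum(rank_tensors):
    summed = [sum(vals) for vals in zip(*rank_tensors)]
    return [list(summed) for _ in rank_tensors]

# Two model-parallel ranks, each holding a partial result:
print(all_reduce_sum([[1.0, 2.0], [3.0, 4.0]]))  # [[4.0, 6.0], [4.0, 6.0]]
```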
117 changes: 116 additions & 1 deletion fastdeploy/model_executor/forward_meta.py
@@ -17,12 +17,14 @@
import logging
from dataclasses import dataclass
from enum import IntEnum, auto
from typing import Optional
from typing import TYPE_CHECKING, Dict, Optional

import paddle

from fastdeploy.model_executor.layers.attention import AttentionBackend

if TYPE_CHECKING:
from fastdeploy.model_executor.layers.attention import AttentionBackend_HPU
logger = logging.getLogger(__name__)


@@ -240,3 +242,116 @@ class DCUForwardMeta(ForwardMeta):

# Accumulated offset
cum_offsets: Optional[paddle.Tensor] = None


@dataclass
class ForwardMeta_HPU:
> **yuanlehome (Collaborator, Sep 23, 2025):** Could the naming stay consistent with the other hardware backends above, i.e. `HPUForwardMeta`?
>
> **Author reply:** Renamed to `HPUForwardMeta`.

"""
ForwardMeta_HPU stores the global meta information for the forward pass on Intel HPU.
"""

# Input token ids for the current batch
input_ids: paddle.Tensor

# Attention meta: execution mode of this forward pass
forward_mode: ForwardMode = ForwardMode.MIXED

# Token ids with padding removed
ids_remove_padding: paddle.Tensor = None

# Per-request sequence lengths on the encoder (prefill) side
seq_lens_encoder: Optional[paddle.Tensor] = None

# Per-request sequence lengths on the decoder side
seq_lens_decoder: Optional[paddle.Tensor] = None

# Number of tokens processed per request in this step
seq_lens_this_time: Optional[paddle.Tensor] = None

# Accumulated offsets into the padded batch
cum_offsets: Optional[paddle.Tensor] = None

# Per-request KV-cache block tables
block_tables: Optional[paddle.Tensor] = None

# Grouping of KV-cache blocks
block_groups: Optional[paddle.Tensor] = None

# Flattened list of active KV-cache blocks
block_list: Optional[paddle.Tensor] = None

# Indices of KV-cache blocks
block_indices: Optional[paddle.Tensor] = None

# Offsets within KV-cache blocks
block_offsets: Optional[paddle.Tensor] = None

# Mapping from KV-cache blocks to requests
block_mapping: Optional[paddle.Tensor] = None

# Attention mask (block bias)
attention_mask: Optional[paddle.Tensor] = None

# KV-cache block size
block_size: Optional[paddle.Tensor] = None

# Ids of the batches scheduled in this step
batch_ids: Optional[paddle.Tensor] = None

# Total number of batches
total_batch: Optional[paddle.Tensor] = None

# Whether this step is a prompt (prefill) step
is_prompt: Optional[paddle.Tensor] = None

# Attention backend used for this forward pass
attn_backend: "AttentionBackend_HPU" = None

# Rotary position embeddings
rotary_embs: Optional[paddle.Tensor] = None

# KV caches
caches: Optional[paddle.Tensor] = None

# Attention mask used by the backend
attn_mask: Optional[paddle.Tensor] = None

# Length of the pre-computed caches
pre_caches_length: int = 0

@classmethod
def init_forward_meta(cls, share_inputs: Dict, attn_backend: "AttentionBackend_HPU"):
"""init forward meta"""
# TODO(gongshaotian): delete this func
is_prompt = share_inputs["is_prompt"]
forward_mode = ForwardMode.DECODE
if is_prompt:
forward_mode = ForwardMode.EXTEND
ret = cls(
forward_mode=forward_mode,
input_ids=share_inputs["input_ids"],
ids_remove_padding=share_inputs["ids_remove_padding"],
seq_lens_encoder=share_inputs["seq_lens_encoder"],
seq_lens_decoder=share_inputs["seq_lens_decoder"],
seq_lens_this_time=share_inputs["seq_lens_this_time"],
block_tables=share_inputs["block_tables"],
block_groups=share_inputs["block_groups"],
block_list=share_inputs["block_list"],
block_indices=share_inputs["block_indices"],
block_offsets=share_inputs["block_offsets"],
block_mapping=share_inputs["block_mapping"],
attention_mask=share_inputs["block_bias"],
block_size=share_inputs["block_size"],
total_batch=share_inputs["total_batch"],
batch_ids=share_inputs["batch_ids"],
is_prompt=share_inputs["is_prompt"],
attn_backend=attn_backend,
rotary_embs=share_inputs["rotary_embs"],
caches=share_inputs["caches"],
)
return ret

def clear_caches(self):
"""safe clear caches"""
if self.caches:
del self.caches
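
The factory pattern above — a classmethod that lifts the entries it needs out of `share_inputs` and derives the forward mode from `is_prompt` — can be reduced to a paddle-free sketch; the field set here is illustrative only, not the real meta:

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class MiniForwardMeta:
    # Reduced stand-in for ForwardMeta_HPU: two fields plus the derived mode.
    input_ids: list
    is_prompt: bool
    forward_mode: str = "DECODE"

    @classmethod
    def init_forward_meta(cls, share_inputs: Dict):
        # Prompt (prefill) batches run in EXTEND mode, like the real factory.
        mode = "EXTEND" if share_inputs["is_prompt"] else "DECODE"
        return cls(
            input_ids=share_inputs["input_ids"],
            is_prompt=share_inputs["is_prompt"],
            forward_mode=mode,
        )

meta = MiniForwardMeta.init_forward_meta({"input_ids": [1, 2, 3], "is_prompt": True})
print(meta.forward_mode)  # EXTEND
```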