Merged
4 changes: 2 additions & 2 deletions dockerfiles/Dockerfile.gpu
@@ -1,6 +1,6 @@
-FROM ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.2.0
+FROM ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.2.1
 ARG PADDLE_VERSION=3.2.0
-ARG FD_VERSION=2.2.0
+ARG FD_VERSION=2.2.1

ENV DEBIAN_FRONTEND=noninteractive

4 changes: 2 additions & 2 deletions docs/features/plas_attention.md
@@ -15,7 +15,7 @@ In terms of training efficiency, the training cost is very low because only the
Following the approaches of NSA and MoBA, we partition the KV into multiple blocks. During both the prefill and decode stages, instead of performing attention computation over all KV, we dynamically select the top-K blocks with the highest attention scores for each query token, thereby enabling efficient sparse attention computation.
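The per-token top-K block selection described above can be sketched in a few lines. This is an illustrative stand-in, not FastDeploy's kernel: mean pooling replaces the learned gate (introduced below in the doc), and blocks are scored with a plain scaled dot product.

```python
import math

def topk_blocks(q, k_blocks, top_k):
    """Select the top-K KV blocks for a single query token.

    q        : query vector (list of floats).
    k_blocks : list of KV blocks, each a list of key vectors.
    Mean pooling over each block stands in for the learned compression
    module; the real implementation uses trained gate weights.
    """
    dim = len(q)
    scores = []
    for block in k_blocks:
        # Mean-pool the keys of the block into one representative vector.
        pooled = [sum(kv[d] for kv in block) / len(block) for d in range(dim)]
        # Scaled dot-product score between the query and the pooled key.
        scores.append(sum(qd * pd for qd, pd in zip(q, pooled)) / math.sqrt(dim))
    # Keep the indices of the K highest-scoring blocks, returned in block order.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:top_k])
```

Attention is then computed only over the keys and values of the returned blocks, which is where the sparsity saving comes from.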

<div align="center">
-<img src="images/plas_training_distill.png" alt="Attention Gate Module" width="60%">
+<img src="./images/plas_training_distill.png" alt="Attention Gate Module" width="60%">
</div>

* **Attention Gate Module**: As illustrated in the figure above, to estimate the importance of each block with low computational overhead, we design a lightweight attention gate module. This module first compresses each K block via an MLP layer to generate a representative low-dimensional representation: $K_c^T=W_{kp}K^T$, where $W_{kp}$ denotes the MLP layer weights. Compared to directly applying mean pooling, the learnable MLP can more effectively capture semantic relationships and importance distributions among different tokens, thereby providing a refined representation of each block. After obtaining the compressed representation $K_c$, the importance of each query token with respect to each block is estimated via $Softmax(Q\cdot K_c^T)$. To enhance the discriminative ability of the MLP layer, we use the full attention result after 1D max pooling, $1DMaxPooling(Softmax(Q \cdot K^T))$, as the ground truth. By minimizing the distribution divergence between the two, the MLP layer is guided to learn feature representations that better align with the true attention distribution.
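As a numerical illustration of the two distributions being matched during distillation — the gate's prediction $Softmax(Q\cdot K_c^T)$ and the max-pooled full-attention target — here is a sketch for one query token. It simplifies the MLP to a single linear map over the tokens of a block; the weights `w_pool` are hypothetical, not the trained $W_{kp}$.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def gate_pred_and_target(q, keys, w_pool, block_size):
    """Per-block importance for one query token.

    pred   : softmax over blocks of q . k_c, where k_c compresses each block
             of keys with the (hypothetical) weights w_pool, one per token.
    target : 1D max pooling of the full softmax(q . K^T) within each block,
             the ground truth the gate is trained to match.
    """
    n_blocks = len(keys) // block_size
    dim = len(q)
    k_c = []
    for b in range(n_blocks):
        blk = keys[b * block_size:(b + 1) * block_size]
        # Linear compression of the block's keys into one vector.
        k_c.append([sum(w * kv[d] for w, kv in zip(w_pool, blk))
                    for d in range(dim)])
    pred = softmax([sum(qd * cd for qd, cd in zip(q, c)) for c in k_c])
    full = softmax([sum(qd * kd for qd, kd in zip(q, kv)) for kv in keys])
    target = [max(full[b * block_size:(b + 1) * block_size])
              for b in range(n_blocks)]
    return pred, target
```

Training the gate then amounts to minimizing a distribution divergence (e.g. KL) between `pred` and the normalized `target`.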
@@ -27,7 +27,7 @@ Following the approaches of NSA and MoBA, we partition the KV into multiple bloc
During sparse attention computation, each query token may dynamically select different KV blocks, leading to highly irregular memory access patterns in HBM. Simply processing each query token separately is feasible, but the resulting computation is too fine-grained to fully utilize the tensor cores, significantly reducing GPU compute efficiency.

<div align="center">
-<img src="images/plas_inference_union.png" alt="Token/Head Union" width="60%">
+<img src="./images/plas_inference_union.png" alt="Token/Head Union" width="60%">
</div>

To optimize performance in both the prefill and decode stages, we design a dedicated joint strategy adapted to the characteristics of each:
2 changes: 1 addition & 1 deletion docs/get_started/installation/nvidia_gpu.md
@@ -13,7 +13,7 @@ The following installation methods are available when your environment meets the
**Notice**: The pre-built image only supports SM80/90 GPUs (e.g. H800/A800). If you are deploying on SM86/89 GPUs (e.g. L40/4090/L20), please reinstall ```fastdeploy-gpu``` after you create the container.

```shell
-docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.2.0
+docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.2.1
```

## 2. Pre-built Pip Installation
4 changes: 2 additions & 2 deletions docs/zh/features/plas_attention.md
@@ -15,7 +15,7 @@
Following NSA and MoBA, we partition the key-value (KV) pairs into multiple blocks. During both the prefill and decode stages, instead of computing attention over all KV, we dynamically select the top-K blocks with the highest attention scores for each query token, enabling efficient sparse attention computation.

<div align="center">
-<img src="images/plas_training_distill.png" alt="Attention Gate Module" width="60%">
+<img src="./images/plas_training_distill.png" alt="Attention Gate Module" width="60%">
</div>

* **Attention Gate Module**: As shown in the figure above, to estimate the importance of each block at low computational cost, we design a lightweight attention gate module. The module first compresses each K block through an MLP layer to produce a representative low-dimensional representation: $K_c^T=W_{kp}K^T$, where $W_{kp}$ denotes the MLP layer weights. Compared with directly applying mean pooling, the learnable MLP captures the semantic relationships and importance distribution among tokens more effectively, yielding a refined representation of each block. Given the compressed representation $K_c$, the importance of each query token with respect to each block is estimated as $Softmax(Q\cdot K_c^T)$. To strengthen the discriminative ability of the MLP layer, we use the full attention result after 1D max pooling, $1DMaxPooling(Softmax(Q \cdot K^T))$, as the ground truth. Minimizing the distribution divergence between the two guides the MLP layer toward feature representations that better match the true attention distribution.
@@ -29,7 +29,7 @@
During sparse attention computation, each query token may dynamically select different KV blocks, which makes the HBM memory access patterns highly irregular. Processing each query token individually is feasible, but the computation becomes too fine-grained to fully utilize the tensor cores, significantly reducing GPU compute efficiency.

<div align="center">
-<img src="images/plas_inference_union.png" alt="Token/Head Union" width="60%">
+<img src="./images/plas_inference_union.png" alt="Token/Head Union" width="60%">
</div>

To optimize performance in both the prefill and decode stages, we design a dedicated joint strategy adapted to the characteristics of each:
2 changes: 1 addition & 1 deletion docs/zh/get_started/installation/nvidia_gpu.md
@@ -15,7 +15,7 @@
**Note**: The image below only supports GPUs with SM 80/90 architectures (A800/H800, etc.). If you are deploying on GPUs with SM 86/89 architectures such as L20/L40/4090, uninstall ```fastdeploy-gpu``` after creating the container and reinstall the `fastdeploy-gpu` package built for SM 86/89 as specified in the documentation below.

``` shell
-docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.2.0
+docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.2.1
```

## 2. 预编译Pip安装