add MKL Packed design doc for RNN optimization #6636
Conversation
Somewhere in the doc, should we mention whether TensorFlow or other frameworks use (or do not use) a packed technique like this?
doc/design/mkl/mkl_packed.md
Outdated
@@ -0,0 +1,91 @@
# Intel® MKL Packed Optimization on PaddlePaddle: Design Doc
You could drop "Optimization" from the title; it currently wraps onto two lines and is a bit long.
OK.
doc/design/mkl/mkl_packed.md
Outdated
## Overview
We plan to integrate the GEMM Packed APIs\[[1](#references)\] introduced in Intel® MKL into PaddlePaddle, taking full advantage of the Intel platform to effectively improve PaddlePaddle's performance on Intel architectures.
The current optimization mainly targets the Recurrent Neural Network (RNN) related layers (`RecurrentLayer`, `GatedRecurrentLayer`, and `LstmLayer`), as well as the PaddlePaddle V1 API.
- Use full-width Chinese parentheses for (以下简称RNN).
- This should apply to both the V1 and V2 APIs, right?
Use full-width Chinese parentheses for (以下简称RNN).
OK, thx.
This should apply to both the V1 and V2 APIs, right?
It should apply to both, but the current validation work is mainly focused on V1, so we documented V1 first.
doc/design/mkl/mkl_packed.md
Outdated
## Key Points

### Background
To achieve the best performance, cblas_?gemm in Intel® MKL converts the source data into a Packed format better suited to the Intel platform before computing. This data-format conversion (Packing) is relatively time-consuming when the computation of the problem itself is small.
- Is the question mark in cblas_?gemm a typo? Same below.
- Line 25 could be rephrased as: PaddlePaddle currently uses Intel® MKL's cblas_gemm library, which converts the source data into a Packed format better suited to the Intel platform before computing. However, this data-format conversion (Packing) is relatively time-consuming when the computation of the problem itself is small (for example RNN, where matrix sizes are typically XXX).
Please reorganize the sentence.
- Shouldn't "Packed" be translated as 打包 (packing) rather than 转换 (conversion)?
Is the question mark in cblas_?gemm a typo?
This is not a typo; the official docs write it this way: https://software.intel.com/en-us/mkl-developer-reference-c-cblas-gemm
The question mark means the precision is selectable.
Line 25 could be rephrased as: PaddlePaddle currently uses Intel® MKL's cblas_gemm library, which converts the source data into a Packed format better suited to the Intel platform before computing. However, this data-format conversion (Packing) is relatively time-consuming when the computation of the problem itself is small (for example RNN, where matrix sizes are typically XXX).
cblas_gemm is a function in the MKL library; it is not itself a library.
The sentence itself can indeed be reorganized as you suggested. Thx.
Shouldn't "Packed" be translated as 打包 rather than 转换?
The official docs call the converted data format the "internal packed format", and the conversion process "Packing".
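As a side note to the reply above, the precision-placeholder naming convention can be illustrated with a trivial sketch; this is pure Python with hypothetical names, only expanding the `?` placeholder into the concrete function names from MKL's naming scheme:

```python
# The '?' in cblas_?gemm is a placeholder for one of four type/precision
# prefixes, per the MKL naming convention linked above.
PRECISIONS = {
    "s": "single-precision real",
    "d": "double-precision real",
    "c": "single-precision complex",
    "z": "double-precision complex",
}

def gemm_name(prefix):
    """Expand the '?' placeholder into a concrete function name (hypothetical helper)."""
    if prefix not in PRECISIONS:
        raise ValueError("unknown precision prefix: " + prefix)
    return "cblas_" + prefix + "gemm"

print(gemm_name("s"))  # cblas_sgemm
print(gemm_name("d"))  # cblas_dgemm
```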
doc/design/mkl/mkl_packed.md
Outdated

### Background
To achieve the best performance, cblas_?gemm in Intel® MKL converts the source data into a Packed format better suited to the Intel platform before computing. This data-format conversion (Packing) is relatively time-consuming when the computation of the problem itself is small.
In some existing cases (for example RNN), multiple calls to cblas_?gemm use the same source data, so repeatedly Packing the source data on every call becomes redundant.
Also, since RNN uses the same source data across multiple cblas_gemm calls, the repeated Packing of the source data on every call becomes redundant.
OK, I'll reorganize it, thx.
doc/design/mkl/mkl_packed.md
Outdated
By using these APIs, we can Pack the source data first, then pass the already-Packed data to the gemm_compute functions that reuse the same data, thereby avoiding redundant Packing.

### Solution
In the RNN case, all time states within a single forward/backward pass share the same weight matrix. When doing inference only, the same weight matrix is also used across forward passes, so there is no need to repeatedly pack the weights for each time state's computation in every forward pass.
Please use Chinese terms consistently:
- forward/backward: 前向、后向
- time state: 时间步
- weight: 权重
- inference: 推断
Same below.
OK, thx.
doc/design/mkl/mkl_packed.md
Outdated
### Solution
In the RNN case, all time states within a single forward/backward pass share the same weight matrix. When doing inference only, the same weight matrix is also used across forward passes, so there is no need to repeatedly pack the weights for each time state's computation in every forward pass.

By using the newly introduced GEMM Packed APIs, we pack the weights once during layer init, reuse the packed weights during forward/backward, and re-pack them after every weight update.
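The pack-at-init / reuse / re-pack-on-update lifecycle described in this snippet can be sketched as follows. This is an illustrative pure-Python model, not PaddlePaddle or MKL code, and every name in it is hypothetical:

```python
# Illustrative sketch of the lifecycle: pack the weight once at layer init,
# reuse the packed weight across all time steps of forward/backward, and
# re-pack only after a weight update.

class PackedWeight:
    pack_count = 0  # counts how many Packing operations happened in total

    def __init__(self, weight):
        self.weight = weight
        self.packed = self._pack(weight)  # packed once at layer init

    def _pack(self, weight):
        PackedWeight.pack_count += 1
        return list(weight)  # stand-in for MKL's internal packed format

    def compute(self, x):
        # stand-in for a gemm_compute call reusing the packed weight
        return [w * x for w in self.packed]

    def update(self, new_weight):
        self.weight = new_weight
        self.packed = self._pack(new_weight)  # re-pack after every update

w = PackedWeight([1.0, 2.0])
for t in range(5):            # five time steps share one packed weight
    w.compute(float(t))
w.update([0.5, 1.5])          # one more pack after the weight update
print(PackedWeight.pack_count)  # 2 packs total, instead of one per compute
```

Without the optimization, each of the five `compute` calls would have packed the weight again; here packing happens only at init and after the update.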
Do "GEMM Packed APIs" refer to the four APIs above? Where they are first listed, you could note that they are the GEMM Packed APIs.
doc/design/mkl/mkl_packed.md
Outdated

By using the newly introduced GEMM Packed APIs, we pack the weights once during layer init, reuse the packed weights during forward/backward, and re-pack them after every weight update.

* Before the optimization, for a model with sequence length = `T`, the number of Packing operations performed over `N` iterations is:
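The snippet is truncated before the actual count. Purely as an illustration, and under the assumption (not stated in the snippet) that the unoptimized path re-packs the weight once per time step in every iteration while the optimized path packs once per iteration (after each weight update), the difference would look like:

```python
def packs_before(n_iterations, seq_len):
    # assumption: one Packing per time step, every iteration
    return n_iterations * seq_len

def packs_after(n_iterations):
    # assumption: one Packing per iteration, i.e. only after each weight update
    return n_iterations

print(packs_before(100, 20))  # 2000
print(packs_after(100))       # 100
```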
- sequence_length: 序列长度
- iteration: 迭代
- Is "model" correct here? There should be no concept of a model at this point.
OK, thx.
model ==> 网络模型 (network model).
Thanks, done.
2. Conversion redundancy \
   Since in some existing cases (for example RNN) multiple calls to cblas_?gemm use the same source data, repeatedly Packing the source data on every call becomes redundant.

To minimize the time spent on Packing across multiple cblas_?gemm calls, Intel® MKL introduces the following four APIs:
You could add a reference for cblas_?gemm: https://software.intel.com/en-us/mkl-developer-reference-c-cblas-gemm and note that the "?" stands for the four precisions.
Sure, no problem. We can make it a direct link.
doc/design/mkl/mkl_packed.md
Outdated
By using these APIs, we can Pack the source data first, then pass the already-Packed data to the gemm_compute functions that reuse the same data, thereby avoiding redundant Packing.

### Solution
In the RNN case, all **time steps** within a single **forward/backward** pass share the same **weights**. When doing **inference** only, the same **weights** are also used across **forward** passes, so there is no need to repeatedly Pack the **weights** for each **time step**'s computation in every **forward** pass.
- No need to bold these keywords; they are all common terms. Same below.
- "inference" should be 推断: for example, https://github.com/PaddlePaddle/Mobile/blob/develop/Demo/iOS/AICamera/README_cn.md#目录结构 says "how to call the Paddle C API for offline inference (离线推断)".
The bold was meant for emphasis and to make the terms easy to spot. It can be removed if you think it's unnecessary.
OK.
OK, done
LGTM. When adding the Python API later, you can also add the corresponding links for the four functions such as cblas_?gemm_alloc.
Sure, no problem.
fix #6553
Click here to review the MD.
The Python API part needs to be discussed in #6612.