# Utilize Engines to Accelerate Inference

The inference phase needs to support some special hardware for acceleration, such as GPU, FPGA, and ARM. Special software powers some of these hardware devices, and its inner states are hidden. For example, TensorRT is released by NVIDIA to improve inference performance on GPUs: it takes a computation graph as input, then optimizes and executes it, while users can't directly modify its internal logic.

In other words, such software acts like a black box: the external logic prepares its inputs, executes it, and processes its outputs. In the Paddle inference module, we call such software an engine, and the inference phase will partition sub-blocks (sub-graphs) of the model and execute them on the engines to improve performance.

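To make the black-box contract concrete, here is a minimal sketch of the interface such an engine wrapper might expose. All names here (`Engine`, `build`, `set_input`, `execute`, `get_output`) are illustrative assumptions for this document, not an existing Paddle or TensorRT API.

```python
import abc


class Engine(abc.ABC):
    """A black-box engine: it consumes a computation graph, then optimizes
    and executes it internally. Callers only prepare inputs, trigger
    execution, and read outputs. (Hypothetical interface, for illustration.)
    """

    @abc.abstractmethod
    def build(self, graph_desc):
        """Consume a (sub-)graph description and build the hidden network."""

    @abc.abstractmethod
    def set_input(self, name, tensor):
        """Feed one input, converting it to the engine's own format."""

    @abc.abstractmethod
    def execute(self):
        """Run the engine-optimized network."""

    @abc.abstractmethod
    def get_output(self, name):
        """Fetch one output, converting it back to the caller's format."""
```
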
## Use Engines to Execute Sub-blocks

Compared to Paddle Fluid, the engines cover a limited number of operators and can only power several kinds of models; in other words, an engine can only support a part of Fluid. That is why we cannot simply hand a whole model to an engine such as TensorRT, and instead offload only the supported parts as sub-blocks.

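As a hedged illustration of "supporting a part of Fluid", each engine could declare the set of Fluid operator types it covers. The registry below is a hypothetical sketch; the operator names and the set itself are assumptions, not a real engine's coverage list.

```python
# Hypothetical coverage table: which Fluid operator types one engine
# fully supports. Purely illustrative, not real TensorRT coverage.
TENSORRT_SUPPORTED_OPS = {"conv2d", "relu", "pool2d", "mul", "softmax"}


def fully_supported(op_type, supported_ops=TENSORRT_SUPPORTED_OPS):
    """True if the engine has an equivalent for this Fluid operator."""
    return op_type in supported_ops
```
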
A Block in Fluid acts like a computation graph, so it is natural to partition the Block into several sub-blocks which are powered by different engines.

<p align="center">
<img src="./images/inference engine.jpg"/>
</p>

It is easy to parallelize the computation by scheduling several engines on different devices; for example, the CPU and GPU engines can be dispatched at the same time.

<p align="center">
<img src="./images/parallel engine.png"/>
</p>

## Partition the Sub-blocks Supported by a Specific Engine

As mentioned above, one engine supports only a subset of Fluid operators, so each dispatched sub-block should be composed of operators the engine fully supports.

The inference framework needs a mechanism to mark a sub-block and deliver it to an engine. We use a `with` statement to mark the sub-block as follows.

```python
import paddle.inference as infer

with infer.power_by_engine('tensorrt'):
    # some_op has no practical meaning; it just shows that several
    # operators sit inside the marked block.
    o = some_op()
    o = some_op()
    ...
```

The operators inside the `infer.power_by_engine` code block will be combined into a sub-block and transferred to a TensorRT engine. We call this API-driven way of marking sub-blocks **manual sub-block marking**: the users directly decide which operators run on which engine.

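One plausible implementation of `power_by_engine` is a context manager that tags every operator created inside the block with an engine label, which the optimizer reads later. The sketch below is an assumption about the mechanism, not the actual Fluid implementation; `OpDesc` and `append_op` are toy stand-ins.

```python
import contextlib

_engine_stack = [None]  # innermost active engine label


@contextlib.contextmanager
def power_by_engine(engine_name):
    """Tag every operator created inside the block with an engine label."""
    _engine_stack.append(engine_name)
    try:
        yield
    finally:
        _engine_stack.pop()


class OpDesc:
    """Toy stand-in for a Fluid operator description."""

    def __init__(self, op_type):
        self.type = op_type
        self.attrs = {}


def append_op(block_ops, op_type):
    """Record the active engine label on each newly created operator."""
    op = OpDesc(op_type)
    if _engine_stack[-1] is not None:
        op.attrs["engine"] = _engine_stack[-1]
    block_ops.append(op)
    return op


# Usage: both operators below end up with attrs["engine"] == "tensorrt".
ops = []
with power_by_engine("tensorrt"):
    append_op(ops, "conv2d")
    append_op(ops, "relu")
```
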
For large models, it is tedious to mark all the sub-blocks by hand, so a heuristic method is introduced to make the marking automatic; we call this the **sub-block automatic detection mode**.

```python
# If min_ops is set, the sub-block automatic detection mode is turned on:
# when at least min_ops adjacent operators are supported by some engine,
# they are combined into a sub-block and transmitted to that engine.
infer.init_subblock_optimizer(min_ops=2)

o = some_op()
o = some_op()

# One can still mark a single operator or a code block to use a specific engine.
with infer.power_by_engine('X'):
    o = op1()

o = some_op()

# Several different engines can be utilized in one model. The heuristic
# will greedily absorb adjacent operators that the specified engine also
# powers, partitioning them into a larger sub-block.
with infer.power_by_engine('Y'):
    o = op2()

o = some_op()
```

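The detection itself can be read as a greedy scan over a block's operator list: every run of at least `min_ops` adjacent operators that one engine supports becomes a sub-block. The function below is a minimal sketch under that assumption, reusing the hypothetical coverage set from earlier.

```python
def detect_subblocks(op_types, supported_ops, min_ops=2):
    """Return (start, end) index ranges of operator runs to offload."""
    subblocks, run_start = [], None
    # The trailing sentinel flushes a run that reaches the end of the block.
    for i, op_type in enumerate(list(op_types) + [None]):
        if op_type in supported_ops:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_ops:
                subblocks.append((run_start, i))
            run_start = None
    return subblocks


# Example: the conv2d/relu pair in the middle forms one engine sub-block.
print(detect_subblocks(["while", "conv2d", "relu", "lod_reset"],
                       supported_ops={"conv2d", "relu"}))  # -> [(1, 3)]
```
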
## Transmit the Sub-blocks to an Engine

The marked code blocks will be written into a `BlockDesc`. To make it clearer how the inference phase supports engine execution, we break the whole architecture into three layers:

- Frontend: the Python syntax, which generates the basic Fluid model description with some inference-specific configurations.
- Optimizer: rewrites the Fluid description, for example pruning unused operators and reusing variable memory.
- Backend: simply executes the Fluid description.

<p align="center">
<img src="images/inference architecture.png"/>
</p>

To support engines, there are the following phases (a sketch of the rewrite pass in step 2 follows the list):

1. The user uses the APIs supplied by the **Frontend** to generate a Fluid model description, with some adjacent operators marked with an engine label.
2. The **Optimizer** finds the operators marked with engine labels:
   1. Extract these operators and combine them into sub-blocks.
   2. Delete them from the original Fluid model description.
   3. Insert a new `EngineOp` into the original Fluid model description, with the sub-block set as an attribute.
3. The **Backend** gets the optimized Fluid description:
   - the `EngineOp` is treated as a normal operator;
   - the Backend executes each operator one by one in sync mode;
   - the operators (especially different `EngineOp`s on different devices) can execute in parallel in async mode.

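The following is a minimal sketch of that rewrite pass (steps 2.1-2.3): it takes a block's operator list and returns one where each engine-labelled run of operators is replaced by a single `EngineOp` carrying the extracted sub-block as an attribute. The `OpDesc` shape, the `"engine_op"` type name, and the attribute keys are illustrative assumptions.

```python
class OpDesc:
    """Toy stand-in for a Fluid operator description."""

    def __init__(self, op_type, attrs=None):
        self.type = op_type
        self.attrs = attrs or {}


def insert_engine_ops(block_ops):
    """Replace each engine-labelled run of operators with one EngineOp."""
    new_ops, pending, pending_engine = [], [], None

    def flush():
        # Steps 2.1-2.3: the extracted run becomes the sub-block attribute
        # of a newly inserted EngineOp; the run itself is dropped.
        if pending:
            new_ops.append(OpDesc("engine_op", {
                "engine": pending_engine,
                "sub_block": list(pending),
            }))
            pending.clear()

    for op in block_ops:
        engine = op.attrs.get("engine")
        if engine is None:              # plain Fluid operator: keep as-is
            flush()
            new_ops.append(op)
        elif engine == pending_engine:  # extend the current sub-block
            pending.append(op)
        else:                           # a new engine label starts a new run
            flush()
            pending_engine = engine
            pending.append(op)
    flush()
    return new_ops
```
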
## How the Engine Works with the Fluid Framework

The `EngineOp` described above is the key. It acts like a normal Fluid operator but embeds an engine: when an `EngineOp` is created, the engine inside builds a network that is functionally equivalent to what the Fluid sub-block describes.

When the whole Fluid description is executed by the Backend, the engine inside each `EngineOp` runs its own runtime.

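Conceptually, the `EngineOp` lifecycle might look like the sketch below (written in Python for brevity; the real operator would live in C++). It reuses the hypothetical `Engine` interface sketched earlier; the `scope` fields are likewise assumptions.

```python
class EngineOp:
    """Sketch: a Fluid-like operator that wraps a black-box engine."""

    def __init__(self, sub_block_desc, engine):
        # Build once: translate the Fluid sub-block into the engine's own,
        # functionally equivalent network.
        self.engine = engine
        self.engine.build(sub_block_desc)

    def run(self, scope):
        # Run many times: feed inputs, execute the hidden network, and
        # write the outputs back for downstream Fluid operators.
        for name, tensor in scope.inputs.items():
            self.engine.set_input(name, tensor)
        self.engine.execute()
        for name in scope.output_names:
            scope.outputs[name] = self.engine.get_output(name)
```
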
There is a trade-off between the sub-block size and the number of `EngineOp`s: each `EngineOp` needs a pair of input and output data-format converters, and those converters result in additional latency.

So a bigger sub-block with fewer `EngineOp`s is better, but some Fluid operators that have no counterpart in the engine will break a big block into small sub-blocks. Whether to execute these small sub-blocks on engines or just on Fluid needs more consideration.

To help convert the input/output data format between Fluid operators and an `EngineOp`, a pair of `EngineInputConvertOp` and `EngineOutputConvertOp` needs to be inserted into the Fluid description. The reasons why these converters are operators, not methods, are as follows (a sketch follows the list):

- The converter works between Fluid operators and `EngineOp`s, whose data formats might differ and therefore need to be specified explicitly, for example:
  - `RNNOp -> xEngineOp`: the input is converted from a `LoDTensor` to an `xTensor`;
  - `MulOp -> xEngineOp`: the input is converted from a `Tensor` to an `xTensor`;
  - and for `RNNOp -> MulOp -> xEngineOp`, the input is converted from a `LoDTensor` to an `xTensor`.
- The `EngineOp` cannot see the external operators that link to it (the `RNNOp` and `MulOp` above), so it is impossible for the engine itself to deduce the data formats used to interact with the external Fluid framework.
- The converter results in additional overhead; making it an operator is clearer and more flexible for further optimization.
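
A hedged sketch of what an `EngineInputConvertOp` might do: it dispatches on the actual type of the incoming Fluid variable, which only the external framework can see. The tensor classes and the elided LoD handling are assumptions for illustration only.

```python
class Tensor:
    def __init__(self, data):
        self.data = data


class LoDTensor(Tensor):
    """A tensor carrying level-of-detail (sequence) information."""

    def __init__(self, data, lod):
        super().__init__(data)
        self.lod = lod


class XTensor:
    """Stand-in for the engine's own tensor format."""

    def __init__(self, data):
        self.data = data


def engine_input_convert(var):
    """Convert a Fluid variable into the engine's xTensor format."""
    if isinstance(var, LoDTensor):
        # e.g. the RNNOp -> xEngineOp edge: keep the LoD aside so the
        # output converter can restore the sequence layout later.
        return XTensor(var.data), var.lod
    # e.g. the MulOp -> xEngineOp edge: a dense tensor converts directly.
    return XTensor(var.data), None
```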