Binary file added doc/fluid/design/images/inference engine.jpg
Binary file added doc/fluid/design/images/parallel engine.png
254 changes: 254 additions & 0 deletions doc/fluid/design/inference_engine.md
@@ -0,0 +1,254 @@
# Utilize Engines to Accelerate Inference
> **Collaborator:** What do the "engines" here refer to?

> **Collaborator:** It looks like it refers to TensorRT? I see a base class proposed later, and I also saw this base class in another code PR. Is this so that classes for "engines" other than TensorRT can be derived in the future?

> **Contributor Author:** TensorRT, Anakin, or other similar libraries that ship with complete built-in optimization.


The inference phase needs to support some special hardware for acceleration,
> **Collaborator:** "The inference phase need to support some special hardware" => "We want to utilize DL chips to accelerate the inference of Fluid models."

such as GPU, FPGA, and ARM.
Special software powers some of these hardwares and its inner state is hidden. For example, TensorRT is released by NVIDIA to improve the inference performance on GPUs: it takes a computation graph as input,
optimizes and executes it, but users can't directly modify its internal logic.
> **Contributor:** Suggested rewording: "Special softwares power some of these hardwares and the inner states are hidden. For example, TensorRT is released by NVIDIA to improve the inference performance on GPUs. It takes a computation graph as input, optimizes and executes it, while users can't directly modify its internal logic."


In other words, such software acts like a black box: the external logic prepares its inputs, executes it, and processes its outputs.
In the Paddle inference module, we call such software an engine, and the inference phase will partition sub-blocks (sub-graphs) of the model and execute them on the engines to improve performance.

## Use Engines to Execute Sub-blocks

Compared to Paddle Fluid, an engine covers a limited number of operators and can only power several kinds of models. In other words, the engines can only support a part of Fluid.
> **Contributor:** Suggest structuring this as two parts:
>
> *Motivation of the sub-blocks method*: line 13, plus some information from tensorflow/models#4028, in order to tell people why we use the sub-blocks method rather than using TensorRT directly.
>
> *Use Engines to Execute Sub-blocks*: line 14 ...


The Block in Fluid acts like a computation graph, and it is natural to partition the Block into several sub-blocks, each powered by a different engine.

<p align="center">

<img src="./images/inference engine.jpg"/>

</p>

It is easy to parallelize the computation by scheduling several engines on different devices; for example, the CPU and GPU engines can be dispatched at the same time.
> **Contributor:** Add a `.` after "meantime".


<p align="center">

<img src="./images/parallel engine.png"/>

</p>



## Partition the Sub-blocks supported by a Specific Engine

As mentioned above, one engine can only support a subset of Fluid operators, so the sub-block dispatched to it should be composed of operators that the engine fully supports.

The Inference framework needs a mechanism to mark the sub-block and deliver it to an engine.

We use a `with-statement` to mark the sub-block as follows.

```python
with infer.power_by_engine('tensorrt'):
    o = some_op()
    o = some_op()
    ...
```

> **Contributor:** What's the type of `infer`, ProgramDesc? The following is the current transpiler interface, whose parameter is a ProgramDesc.
>
> ```python
> t = fluid.InferenceTranspiler()
> t.transpile(inference_transpiler_program, place)
> ```
>
> In my mind, the interface for the automatic detection mode is:
>
> ```python
> t = fluid.InferenceTranspiler()
> t.transpile(inference_transpiler_program, place, engine='tensorrt')
>
> def transpile(inference_transpiler_program, place, engine):
>     if engine == "tensorrt":
>         power_by_tensorrt_engine(inference_transpiler_program)
>     else:
>         ...
> ```

> **Contributor Author (@Superjomn, Apr 26, 2018):** `infer` is a module: `import paddle.inference as infer`.

> **Contributor:** What's the meaning of `o = some_op()`?

> **Contributor Author:** No practical meaning; it just shows that there are several operators there.

The operators inside the `infer.power_by_engine('tensorrt')` code block will be combined into a sub-block and transferred to a TensorRT engine. We call this API-based way of marking sub-blocks **manual sub-block marking**, which means the users directly decide which operators run on some engine.

For large models, it is tedious to mark all the sub-blocks by hand, so an elicitation method is proposed to make it automatic; we call this the **sub-block automatic detection mode**.

```python
# If min_ops is set, turn on the sub-block automatic detection mode:
# if at least min_ops adjacent operators are supported by some engine, combine them into
# a sub-block and transmit it to that engine.
infer.init_subblock_optimizer(min_ops=2)

o = some_op()
o = some_op()

# One can still set one op or a code block to use some specific engine.
with infer.power_by_engine('X'):
    o = op1()

o = some_op()

# Several different engines can be utilized in one model; the elicitation method
# will greedily detect more adjacent operators that are powered by the specified engine
# and partition them into a larger sub-block.
with infer.power_by_engine('Y'):
    o = op2()

o = some_op()
```

> **Contributor:** In the automatic detection mode, should we use `o = some_op()` again?

## Transmit the sub-blocks to an Engine

The marked code blocks will be written into a `BlockDesc`. To make the engine support in the inference phase clearer, we break the whole architecture into three layers:

- Frontend, the Python syntax, generates the basic Fluid model description with some inference-customized configurations.
- Optimizer, rewrites the Fluid description, for example by pruning unused operators and reusing some variable memory.
- Backend, simply executes the Fluid description.

> **Contributor:** For concept unity, is "transpiler" more suitable than "Optimizer"?

<p align="center">

<img src="images/inference architecture.png"/>

</p>

To support engines, there are the following phases.

1. The user uses the APIs supplied by the **Frontend** to generate a Fluid model description with some adjacent operators marked with an engine label.
2. The **Optimizer** finds the operators marked with engine labels (a toy sketch of this rewriting step follows this list):
   1. Extract these operators and combine them into sub-blocks.
   2. Delete them from the original Fluid model description.
   3. Insert a new `EngineOp` into the original Fluid model description, with the sub-block set as an attribute.
3. The **Backend** gets the optimized Fluid description:
   - the `EngineOp` is treated as a normal operator,
   - the Backend executes each operator one by one in sync mode,
   - the operators (especially different `EngineOp`s on different devices) can execute in parallel in async mode.
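
To make the Optimizer's rewriting step concrete, here is a minimal toy sketch of the pass. The `ToyOp`/`ToyBlock` types, the `RewriteWithEngineOps` function, and the way the sub-block is referenced are illustrative stand-ins for Fluid's `ProgramDesc`/`BlockDesc` and the `EngineOp` attribute machinery, not the actual Paddle implementation; the `min_ops` threshold corresponds to the Frontend configuration shown earlier.

```c++
#include <string>
#include <vector>

// Toy stand-ins for Fluid's op and block descriptions; the real pass works on ProgramDesc.
struct ToyOp {
  std::string type;
  std::string engine;  // non-empty if the Frontend marked this op with an engine label
};

struct ToyBlock {
  std::vector<ToyOp> ops;
};

// Greedily group adjacent operators that carry the same engine label. Every run of at
// least `min_ops` marked operators is moved into a new sub-block, and a single EngineOp
// that references that sub-block replaces the run in the rewritten block.
void RewriteWithEngineOps(const ToyBlock& in, ToyBlock* out,
                          std::vector<ToyBlock>* sub_blocks, size_t min_ops) {
  size_t i = 0;
  while (i < in.ops.size()) {
    const std::string label = in.ops[i].engine;
    // Find the run [i, j) of adjacent ops sharing this (possibly empty) engine label.
    size_t j = i + 1;
    while (j < in.ops.size() && in.ops[j].engine == label) ++j;

    if (!label.empty() && j - i >= min_ops) {
      // Steps 2.1-2.3: extract the run into a sub-block and insert one EngineOp for it.
      ToyBlock sub;
      sub.ops.assign(in.ops.begin() + i, in.ops.begin() + j);
      sub_blocks->push_back(sub);
      ToyOp engine_op;
      engine_op.type = "engine_op(" + label + ", sub_block=" +
                       std::to_string(sub_blocks->size() - 1) + ")";
      out->ops.push_back(engine_op);
    } else {
      // An unmarked run, or a marked run that is too small, stays as plain Fluid ops.
      for (size_t k = i; k < j; ++k) out->ops.push_back(in.ops[k]);
    }
    i = j;
  }
}
```

In the real framework, the extracted sub-block would be appended to the `ProgramDesc` and attached to the inserted `EngineOp` through its sub-block attribute, as described in the Engine-related Design section below.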

## How an Engine Works with the Fluid Framework

The `EngineOp` described above is the key: it acts like a normal Fluid operator, but has an engine embedded inside. When an `EngineOp` is created, the engine inside will build a network that performs the same function as the Fluid sub-block describes.

When the whole Fluid description is executed by the Backend, the `EngineOp` will run its embedded runtime engine.

There is a tradeoff between the sub-block size and the number of `EngineOp`s: each `EngineOp` needs a pair of input and output data format converters, which results in additional latency.

So a bigger sub-block with fewer `EngineOp`s is better, but Fluid operators that have no counterpart in the engine will break a big block into small sub-blocks; whether to execute these sub-blocks on engines or just in Fluid needs more consideration.

To help convert the input/output data format between Fluid operators and an `EngineOp`, a pair of `EngineInputConvert` and `EngineOutputConvert` interfaces is proposed:

- the converter works between the Fluid operators and `EngineOp`s; their data formats might be different and need to be specified, for example
  - `RNNOp -> xEngineOp`, the input is converted from a `LoDTensor` to an `xTensor`,
  - `MulOp -> xEngineOp`, the input is converted from a `Tensor` to an `xTensor`,
  - but `RNNOp -> MulOp -> xEngineOp`, the input is converted from a `LoDTensor` to an `xTensor`;
- the `EngineOp` cannot see the external operators that link to it (the `RNNOp` and `MulOp` above), so it is impossible for it to deduce the data format used to interact with the external Fluid framework;
- the converter will result in additional overhead; making it an independent interface is clearer and more flexible for further optimization.

## Engine-related Design

### EngineOp

`EngineOp` is just a normal Fluid operator which has an attribute called `subblock` that holds the Fluid description of a sub-block.
> **Contributor:** subblock -> sub_block


To unify the behavior of different `EngineOp`s, an underlying interface over all the engines is proposed as follows.

```c++
/*
EngineBase is used like:

  class xEngine : public EngineBase { ... };

  xEngine engine;
  engine.Build(some_desc);

  for (const auto& batch : dataset) {
    auto& input0 = engine.buffer("input0");
    if (input0.device == DeviceType::CPU) {
      std::memcpy(input0.buffer, batch.data(), batch.size());
      input0.size = batch.size();
    } else if (input0.device == DeviceType::GPU) {
      cudaMemcpy(input0.buffer, batch.data(), batch.size(), cudaMemcpyHostToDevice);
      input0.size = batch.size();
    }
    engine.Execute();

    const auto& output0 = engine.buffer("output0");
    // ...
  }
*/
enum class DeviceType {
  CPU = 0,
  GPU  // increases automatically: GPU == 1
};

struct Buffer {
  void* buffer;       // pointer to the engine's internal buffer
  int max_size;       // allocated capacity of the buffer, in bytes
  int size;           // number of valid bytes currently in the buffer
  DeviceType device;  // where the buffer lives
};

class EngineBase {
 public:
  // Build the engine's internal network from a Fluid sub-block description. Called once.
  virtual void Build(const BlockDesc& desc) = 0;
  // Run the network on the data already written into the input buffers.
  virtual void Execute() = 0;
  // Clone an engine instance with the weights shared.
  virtual EngineBase* Clone() const = 0;
  // Expose the engine's internal buffer to write/read data directly.
  virtual Buffer& buffer(const std::string& tensor) = 0;

  virtual ~EngineBase() {}
};
```

> **Contributor:** GPU = 1?

> **Contributor Author:** The enum syntax only needs to set the first element; the following elements increase automatically.
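
To connect `EngineOp` with this interface, the following is a minimal sketch of what the run path of an `EngineOp` could look like, assuming the `EngineBase`, `Buffer`, and `DeviceType` definitions above. `RunEngineOnce` and the plain `std::vector<float>` input/output are hypothetical simplifications; the real operator would move data between Fluid tensors and the engine buffers through the converters described below.

```c++
#include <cstring>
#include <vector>

// Hypothetical helper (not part of the design above): drive one engine execution.
// The engine is assumed to have been Build()-ed once from the EngineOp's sub-block.
void RunEngineOnce(EngineBase* engine, const std::vector<float>& input,
                   std::vector<float>* output) {
  // 1. Write the input data into the engine's internal input buffer.
  Buffer& in = engine->buffer("input0");
  size_t bytes = input.size() * sizeof(float);
  // Assume a CPU-visible buffer here; a GPU buffer would need cudaMemcpy instead.
  std::memcpy(in.buffer, input.data(), bytes);
  in.size = static_cast<int>(bytes);

  // 2. Let the engine execute the network it built from the sub-block.
  engine->Execute();

  // 3. Read the result back from the engine's output buffer.
  const Buffer& out = engine->buffer("output0");
  output->resize(out.size / sizeof(float));
  std::memcpy(output->data(), out.buffer, out.size);
}
```

`Clone()` allows the framework to create several engine instances that share weights, which matches the parallel dispatching of engines on different devices discussed earlier.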



### Data format convert operators

Both the `EngineInputConvertOp` and the `EngineOutputConvertOp` have similar interfaces.

The operator has the following attributes.

1. `input_op_type`
2. `output_op_type`

For a convert op that takes input from an `RNNOp` and outputs to an `xEngineOp`, the values of these two attributes are *RNNOp* and *xEngineOp*.

To make the implementation of each input/output combination more extensible, a functor and a registry should be included:

```c++
struct EngineInputConverterBase {
  // `out` points to device (e.g. CUDA) memory that has already been allocated.
  virtual void operator()(const LoDTensor& in, void* out, size_t max_size) = 0;
  virtual ~EngineInputConverterBase() {}

  // Look up and run the converter registered for this (in_op_type, out_op_type) pair.
  static void Execute(const std::string& in_op_type, const std::string& out_op_type,
                      const LoDTensor& in, void* out, size_t max_size) {
    (*converters[in_op_type + "_to_" + out_op_type])(in, out, max_size);
  }

  template <typename T>
  static void Register(const std::string& key) { converters[key].reset(new T); }

  static std::map<std::string, std::unique_ptr<EngineInputConverterBase>> converters;
};

// Some specific implementations.
struct RNN2xEngineConverter : public EngineInputConverterBase {
  void operator()(const LoDTensor& in, void* out, size_t max_size) override;
};

#define REGISTER_INPUT_CONVERTER(in_op_type__, out_op_type__, Converter__) \
  ... some logics \
  EngineInputConverterBase::Register<Converter__>(#in_op_type__ "_to_" #out_op_type__); \
  ... some logics

REGISTER_INPUT_CONVERTER(RNNOp, xEngineOp, RNN2xEngineConverter);
```

The `EngineOutputConvertOp` is similar.
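
As a minimal sketch of that mirrored output side, assuming the same registration scheme as the input converter above (the name `EngineOutputConverterBase` and the exact signature are illustrative, not fixed by this design):

```c++
// Copies data out of the engine's output buffer into a Fluid LoDTensor.
struct EngineOutputConverterBase {
  // `in` points to the engine's output buffer; `out` is the Fluid tensor to fill.
  virtual void operator()(const void* in, size_t size, LoDTensor* out) = 0;
  virtual ~EngineOutputConverterBase() {}

  // Look up and run the converter registered for this (in_op_type, out_op_type) pair.
  static void Execute(const std::string& in_op_type, const std::string& out_op_type,
                      const void* in, size_t size, LoDTensor* out) {
    (*converters[in_op_type + "_to_" + out_op_type])(in, size, out);
  }

  template <typename T>
  static void Register(const std::string& key) { converters[key].reset(new T); }

  static std::map<std::string, std::unique_ptr<EngineOutputConverterBase>> converters;
};
```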

### Optimizer for sub-block
> **Contributor:** Optimizer -> Transpiler

> **Contributor Author:** An optimizer is not a Transpiler. It corresponds to the optimization in a compiler.


```c++
// An InferenceOptimizer inputs a program desc and outputs a program desc, for example one
// with several newly inserted EngineOps whose attributes are set to the extracted sub-blocks.
// Different implementations will rewrite the original program desc with different logics.
// There might be many different optimizers, such as
// - CleanUselessOptimizer
// - PruneOpOptimizer
// - SubblockToEngineOptimizer
struct InferenceOptimizer {
  virtual void operator()(const ProgramDesc& desc, ProgramDesc* out) = 0;
  virtual ~InferenceOptimizer() {}

  // Run all the registered optimizers in order, feeding the output of one into the next.
  static void RunAll(const ProgramDesc& desc, ProgramDesc* out) {
    std::unique_ptr<ProgramDesc> in(new ProgramDesc(desc));
    for (auto& o : optimizers) {
      (*o)(*in, out);
      in.reset(new ProgramDesc(*out));
    }
  }

  template <typename T>
  static void Register() { optimizers.emplace_back(new T); }

  static std::vector<std::unique_ptr<InferenceOptimizer>> optimizers;
};

#define REGISTER_INFERENCE_OPTIMIZER(Optimizer__)    \
  static bool optimizer_##Optimizer__##_registered = \
      (InferenceOptimizer::Register<Optimizer__>(), true);

// Extract the sub-blocks from the ProgramDesc, insert an xEngineOp and set its attribute
// to the sub-block description.
struct SubblockToEngineOptimizer : public InferenceOptimizer {
  void operator()(const ProgramDesc& desc, ProgramDesc* out) override;
};

REGISTER_INFERENCE_OPTIMIZER(SubblockToEngineOptimizer);
```

> **Contributor:** Input a program desc, but the output may be a series of sub-block descs?

> **Contributor Author:** Input a program desc, output a program desc with several newly inserted `EngineOp`s whose attributes are set with the sub-blocks.

> **Contributor (@luotao1, Apr 26, 2018):** What are CleanUselessOptimizer and PruneOpOptimizer? We already have a prune method for inference, see paddle\fluid\framework\prune.cc.

> **Contributor Author:** Yes, I think a factory pattern of Operators is a better interface; maybe we'd better refactor those codes.
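
As a usage sketch, a hypothetical inference entry point (the function name and the static member definition below are illustrative, not part of the design above) would then just run the registered optimizer chain before handing the rewritten program to the Backend:

```c++
// Storage for the registry declared in InferenceOptimizer.
std::vector<std::unique_ptr<InferenceOptimizer>> InferenceOptimizer::optimizers;

// Hypothetical driver: rewrite the loaded model description for inference.
void PrepareProgramForInference(const ProgramDesc& model, ProgramDesc* optimized) {
  // Runs the registered optimizers (e.g. SubblockToEngineOptimizer) in order; the result
  // contains EngineOps whose sub-block attributes hold the extracted sub-blocks.
  InferenceOptimizer::RunAll(model, optimized);
}
```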