Commit 0767ff0

Add README.md
1 parent cf1529e commit 0767ff0

File tree

README.md: 134 additions, 1 deletion

# YOLOv4 on Triton Inference Server with TensorRT

![GitHub release (latest by date including pre-releases)](https://img.shields.io/github/v/release/Isarsoft/yolov4-triton-tensorrt?include_prereleases)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

This repository shows how to deploy YOLOv4 as an optimized [TensorRT](https://github.com/NVIDIA/tensorrt) engine to [Triton Inference Server](https://github.com/NVIDIA/triton-inference-server).

Triton Inference Server takes care of model deployment with many out-of-the-box benefits, such as gRPC and HTTP interfaces, automatic scheduling on multiple GPUs, shared memory (even on GPU), health metrics and memory resource management.

TensorRT will automatically optimize the throughput and latency of our model by fusing layers and choosing the fastest layer implementations for our specific hardware. We will use the TensorRT API to generate the network from scratch and add all unsupported layers as a plugin.

## Build TensorRT engine

There are no dependencies needed to run this code, except a working Docker environment with GPU support. We will run all compilation inside the TensorRT NGC container to avoid having to install TensorRT natively.

Run the following to get a running TensorRT container with our repo code:

```bash
cd yourworkingdirectoryhere
git clone [email protected]:isarsoft/yolov4-triton-tensorrt.git
docker run --gpus all -it --rm -v $(pwd)/yolov4-triton-tensorrt:/yolov4-triton-tensorrt nvcr.io/nvidia/tensorrt:20.06-py3
```

Docker will download the TensorRT container. You need to select the version (in this case 20.06) according to the version of Triton that you want to use later, to ensure that the TensorRT versions match; matching NGC version tags ship the same TensorRT version.

Inside the container, run the following to compile our code:

```bash
cd /yolov4-triton-tensorrt
mkdir build
cd build
cmake ..
make
```
35+
36+
This will generate two files (`liblayerplugin.so` and `main`). The library contains all unsupported TensorRT layers and the executable will build us an optimized engine in a second.
37+
38+
Download the weights for this network from [Google Drive](https://drive.google.com/drive/folders/1YUDVgEefnk2HENpGMwq599Yj45i_7-iL?usp=sharing). Instructions on how to generate this weight file from the original darknet config and weights can be found [here](https://github.com/wang-xinyu/tensorrtx/tree/master/yolov4). Place the weight file in the same folder as the executable `main`. Then run the following to generate a serialized TensorRT engine optimized for your GPU:
39+
40+
```bash
./main
```

This will generate a file called `yolov4.engine`, which is our serialized TensorRT engine. Together with `liblayerplugin.so` we can now deploy it to Triton Inference Server.

Before we do this, we can test the engine with standalone TensorRT by running:

```bash
cd /workspace/tensorrt/bin
./trtexec --loadEngine=/yolov4-triton-tensorrt/build/yolov4.engine --plugins=/yolov4-triton-tensorrt/build/liblayerplugin.so
```

```
(...)
[I] Starting inference threads
[I] Warmup completed 1 queries over 200 ms
[I] Timing trace has 204 queries over 3.00185 s
[I] Trace averages of 10 runs:
[I] Average on 10 runs - GPU latency: 14.5469 ms - Host latency: 16.1718 ms (end to end 16.1964 ms, enqueue 2.69769 ms)
[I] Average on 10 runs - GPU latency: 13.1222 ms - Host latency: 14.7452 ms (end to end 14.7681 ms, enqueue 2.89363 ms)
(...)
[I] GPU Compute
[I] min: 12.241 ms
[I] max: 15.0692 ms
[I] mean: 13.1447 ms
```
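
Optionally, the engine can also be loaded from Python to confirm that the plugin library and the serialized engine fit together, and to inspect the binding names and shapes. This is only a minimal sketch, assuming it runs inside the same TensorRT container with the paths used above:

```python
# Minimal sketch (assumption: run inside the TensorRT NGC container used above,
# where the `tensorrt` Python package is available).
import ctypes

import tensorrt as trt

# Load the custom layer plugin before deserializing the engine,
# otherwise TensorRT cannot resolve the plugin layers.
ctypes.CDLL("/yolov4-triton-tensorrt/build/liblayerplugin.so")

logger = trt.Logger(trt.Logger.INFO)
trt.init_libnvinfer_plugins(logger, "")

with open("/yolov4-triton-tensorrt/build/yolov4.engine", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

# Print binding names and shapes so we know what the Triton client has to use later.
for i in range(engine.num_bindings):
    kind = "input" if engine.binding_is_input(i) else "output"
    print(kind, engine.get_binding_name(i), engine.get_binding_shape(i))
```
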
## Deploy to Triton Inference Server

We need to create our model repository file structure first:

```bash
# Create model repository
cd yourworkingdirectoryhere
mkdir -p triton-deploy/models/yolov4/1/
mkdir triton-deploy/plugins

# Copy engine and plugins
cp yolov4-triton-tensorrt/build/yolov4.engine triton-deploy/models/yolov4/1/model.plan
cp yolov4-triton-tensorrt/build/liblayerplugin.so triton-deploy/plugins/
```

Now we can start Triton with this model repository:

```bash
docker run --gpus all --rm --shm-size=1g --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -p8000:8000 -p8001:8001 -p8002:8002 -v$(pwd)/triton-deploy/models:/models -v$(pwd)/triton-deploy/plugins:/plugins --env LD_PRELOAD=/plugins/liblayerplugin.so nvcr.io/nvidia/tritonserver:20.06-py3 tritonserver --model-repository=/models --strict-model-config=false --grpc-infer-allocation-pool-size=16 --log-verbose 1
```

This should give us a running Triton instance with our yolov4 model loaded. You can check out what to do next in the [Triton Documentation](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html).

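
To quickly verify that the server is up and the model is loaded, we can query Triton's readiness APIs. Below is a small sketch using the Python gRPC client, assuming the `tritonclient[grpc]` pip package is installed on the host (older client SDK containers expose the same calls under a different module name):

```python
# Sketch: check that Triton is live and the yolov4 model is loaded.
# Assumption: the `tritonclient[grpc]` package is installed on the host.
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

print("server live: ", client.is_server_live())
print("server ready:", client.is_server_ready())
print("model ready: ", client.is_model_ready("yolov4"))
```
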

## How to run the model in your code

Triton ships easy-to-use C++, Go and Python client SDKs with examples that show how to run inference once the model is deployed on the server. They support shared memory for essentially zero-copy transfers when client and server run on the same device. This repo will be extended with a full implementation of such a client in the future, but it is not hard to build one by looking at the examples: https://github.com/NVIDIA/triton-inference-server/tree/master/src/clients. A minimal Python sketch is shown below.

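
This is only a rough sketch, not a full client: the input and output tensor names (`data`, `prob`), the 608x608 input shape and the dummy input are assumptions and must be adapted to what the deployed model actually reports (for example via Triton's model metadata endpoint or the server log). In older Triton releases the Python module was named `tritongrpcclient`; current pip packages expose it as `tritonclient.grpc`.

```python
# Rough sketch of a Triton gRPC inference client.
# Assumptions: `tritonclient[grpc]` is installed, and the model exposes an input
# tensor "data" of shape [3, 608, 608] and an output tensor "prob" -- check the
# model metadata or the Triton log for the real names and shapes.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Dummy input batch; replace with a real preprocessed image (NCHW, float32).
image = np.random.rand(1, 3, 608, 608).astype(np.float32)

inputs = [grpcclient.InferInput("data", list(image.shape), "FP32")]
inputs[0].set_data_from_numpy(image)
outputs = [grpcclient.InferRequestedOutput("prob")]

result = client.infer(model_name="yolov4", inputs=inputs, outputs=outputs)
detections = result.as_numpy("prob")
print(detections.shape)
```
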
## Benchmark

To benchmark the performance of the model, we can run [Triton's Performance Client](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/optimization.html#perf-client).

To run `perf_client`, get the Triton Client SDK Docker container:

```bash
docker run -it --ipc=host --net=host nvcr.io/nvidia/tritonserver:20.06-py3-clientsdk /bin/bash
cd install/bin
./perf_client (...argumentshere)
# Example
./perf_client -m yolov4 -u 127.0.0.1:8001 -i grpc --concurrency-range 4
```
108+
109+
The following benchmarks were taken on a system with `2 x Nvidia 2080 TI` GPUs and an `AMD Ryzen 9 3950X` 16 Core CPU and with batchsize 1.
110+
111+
| concurrency / precision | FP32 | FP16 |
112+
|-------------------------|-------------------------------------|-------------------------------------|
113+
| 1 | 44 infer/sec, latency 22633 usec | 62.4 infer/sec, latency 15986 usec |
114+
| 2 | 84.2 infer/sec, latency 23677 usec | 136.2 infer/sec, latency 14675 usec |
115+
| 4 | 100.2 infer/sec, latency 39946 usec | 154.2 infer/sec, latency 19443 usec |
116+
| 8 | 99.2 infer/sec, latency 80552 usec | 171 infer/sec, latency 46780 usec |
117+
118+
119+
## Tasks in this repo

- [x] Layer plugin working with trtexec and Triton
- [x] FP16 optimization
- [ ] INT8 optimization
  - [ ] Implement calibrator
  - [ ] Add support for INT8 in custom layers
- [ ] Optional: use ReLU instead of Mish for layer fusion speedup
- [ ] Add Triton client code (not sure if this will be open sourced yet)
- [ ] Add image pre- and postprocessing code

INT8 will give another big boost in performance (maybe 2x - 3x?), as the Tensor Cores on Nvidia GPUs will be activated. A first naive implementation did not result in performance improvements, because the custom layers do not support INT8 and have FP32 outputs, which breaks the optimization at multiple stages in the network. Optionally, we can deactivate Mish and use standard ReLU instead; the weights and config for this are in the darknet repo. A sketch of what such a calibrator could look like follows below.

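
The engine in this repo is built in C++, so the following is only an illustrative sketch of the calibrator interface, written against the TensorRT Python API: an `IInt8EntropyCalibrator2` that feeds preprocessed images from disk and caches the calibration table. The directory layout, the `.npy` input format and the 3x608x608 shape are assumptions.

```python
# Illustrative sketch only: an INT8 entropy calibrator via the TensorRT Python API.
# Assumptions: preprocessed calibration images stored as .npy arrays of shape
# (3, 608, 608) in `calib_dir`; `pycuda` available for device buffers.
import os

import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt


class YoloEntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, calib_dir, cache_file="calibration.cache", batch_size=1):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.cache_file = cache_file
        self.batch_size = batch_size
        self.files = [os.path.join(calib_dir, f) for f in sorted(os.listdir(calib_dir))]
        self.index = 0
        # One device buffer large enough for a full batch of (3, 608, 608) float32 inputs.
        self.device_input = cuda.mem_alloc(batch_size * 3 * 608 * 608 * np.float32().nbytes)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        if self.index + self.batch_size > len(self.files):
            return None  # tells TensorRT that the calibration data is exhausted
        batch = np.stack(
            [np.load(f).astype(np.float32) for f in self.files[self.index:self.index + self.batch_size]]
        )
        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
        self.index += self.batch_size
        return [int(self.device_input)]

    def read_calibration_cache(self):
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()
        return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
```

In the C++ engine builder used in this repo, the equivalent would be an `IInt8EntropyCalibrator2` implementation attached to the builder config together with the INT8 builder flag.
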
## Acknowledgments

The initial codebase is from [Wang Xinyu](https://github.com/wang-xinyu) and his [TensorRTx](https://github.com/wang-xinyu/tensorrtx) repo. He had the idea to implement YOLO using only the TensorRT API, and it's very nice that he shares this code. The purpose of this repo is to deploy this engine and plugin to Triton and to add further performance improvements as well as benchmarks.
