
Commit 05c8997

V1.0 dev (#2)
* feat: Add v1.0 codes
* fix: Fix the repo name of README and tools
* fix: Add the '1' folder into all_models/transformer
1 parent e3d56f8 commit 05c8997

File tree

5 files changed: +25 -27 lines changed


README.md

Lines changed: 11 additions & 13 deletions
@@ -30,9 +30,7 @@

 The Triton backend for the [FasterTransformer](https://github.com/NVIDIA/FasterTransformer). This repository provides a script and recipe to run the highly optimized transformer-based encoder and decoder component, and it is tested and maintained by NVIDIA. In the FasterTransformer v4.0, it supports multi-gpu inference on GPT-3 model. This backend integrates FasterTransformer into Triton to use giant GPT-3 model serving by Triton. In the below example, we will show how to use the FasterTransformer backend in Triton to run inference on a GPT-3 model with 345M parameters trained by [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).

-Note that this is a research and prototyping tool, not a formal product or maintained framework. User can learn more about Triton backends in the [backend repo](https://github.com/triton-inference-server/backend). Ask questions or report problems on the issues page in this FasterTransformer_backend repo.
-
-<!-- TODO Add the FasterTransformer_backend issue link -->
+Note that this is a research and prototyping tool, not a formal product or maintained framework. User can learn more about Triton backends in the [backend repo](https://github.com/triton-inference-server/backend). Ask questions or report problems on the [issues page](https://github.com/triton-inference-server/fastertransformer_backend/issues) in this FasterTransformer_backend repo.

 ## Table Of Contents

@@ -50,7 +48,7 @@ We provide a docker file, which bases on Triton image `nvcr.io/nvidia/tritonserv
 ```bash
 mkdir workspace && cd workspace
 git clone https://github.com/triton-inference-server/fastertransformer_backend.git
-nvidia-docker build --tag ft_backend --file transformer_backend/Dockerfile .
+nvidia-docker build --tag ft_backend --file fastertransformer_backend/Dockerfile .
 nvidia-docker run --gpus=all -it --rm --volume $HOME:$HOME --volume $PWD:$PWD -w $PWD --name ft-work ft_backend
 cd workspace
 export WORKSPACE=$(pwd)
@@ -71,8 +69,8 @@ pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp
 cd $WORKSPACE
 git clone https://github.com/triton-inference-server/server.git
 export PATH=/usr/local/mpi/bin:$PATH
-source transformer_backend/build.env
-mkdir -p transformer_backend/build && cd $WORKSPACE/transformer_backend/build
+source fastertransformer_backend/build.env
+mkdir -p fastertransformer_backend/build && cd $WORKSPACE/fastertransformer_backend/build
 cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=1 .. && make -j32
 ```

@@ -87,20 +85,20 @@ mkdir -p models/megatron-models/345m
 unzip megatron_lm_345m_v0.0.zip -d models/megatron-models/345m
 python ../sample/pytorch/utils/megatron_ckpt_convert.py -i ./models/megatron-models/345m/release/ -o ./models/megatron-models/c-model/345m/ -t_g 1 -i_g 8
 python _deps/repo-ft-src/sample/pytorch/utils/megatron_ckpt_convert.py -i ./models/megatron-models/345m/release/ -o ./models/megatron-models/c-model/345m/ -t_g 1 -i_g 8
-cp ./models/megatron-models/c-model/345m/8-gpu $WORKSPACE/transformer_backend/all_models/transformer/1/ -r
+cp ./models/megatron-models/c-model/345m/8-gpu $WORKSPACE/fastertransformer_backend/all_models/transformer/1/ -r
 ```

 ## Run Serving

 * Run servning directly

 ```bash
-cp $WORKSPACE/transformer_backend/build/libtriton_transformer.so $WORKSPACE/transformer_backend/build/lib/libtransformer-shared.so /opt/tritonserver/backends/transformer
+cp $WORKSPACE/fastertransformer_backend/build/libtriton_transformer.so $WORKSPACE/fastertransformer_backend/build/lib/libtransformer-shared.so /opt/tritonserver/backends/transformer
 cd $WORKSPACE && ln -s server/qa/common .
-# Recommend to modify the SERVER_TIMEOUT of common/utils.sh to longer time
-cd $WORKSPACE/transformer_backend/build/
-bash $WORKSPACE/transformer_backend/tools/run_server.sh
-bash $WORKSPACE/transformer_backend/tools/run_client.sh
+# Recommend to modify the SERVER_TIMEOUT of common/util.sh to longer time
+cd $WORKSPACE/fastertransformer_backend/build/
+bash $WORKSPACE/fastertransformer_backend/tools/run_server.sh
+bash $WORKSPACE/fastertransformer_backend/tools/run_client.sh
 python _deps/repo-ft-src/sample/pytorch/utils/convert_gpt_token.py --out_file=triton_out # Used for checking result
 ```

@@ -120,4 +118,4 @@ The model configuration for Triton server is put in `all_models/transformer/conf
 - vocab_size: size of vocabulary
 - decoder_layers: number of transformer layers
 - batch_size: max supported batch size
-- is_fuse_QKV: fusing QKV in one matrix multiplication or not. It also depends on the weights of QKV.
+- is_fuse_QKV: fusing QKV in one matrix multiplication or not. It also depends on the weights of QKV.
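For reference, custom backend options like the fields listed above are usually expressed as `parameters` entries in a Triton `config.pbtxt` (top-level fields such as `max_batch_size` are separate); whether this backend reads them exactly that way is not shown in this diff. Below is a minimal sketch that only prints such a block; the numeric values are placeholder examples, not values taken from this commit.

```python
# Illustrative only: render the fields listed above in Triton's config.pbtxt
# "parameters" syntax. The values are placeholder examples, not from this repo.
example_params = {
    "vocab_size": "50304",     # size of vocabulary (example value)
    "decoder_layers": "24",    # number of transformer layers (example value)
    "batch_size": "8",         # max supported batch size (example value)
    "is_fuse_QKV": "1",        # fuse QKV into one matrix multiplication (example value)
}

for key, value in example_params.items():
    print(f'parameters {{\n  key: "{key}"\n  value: {{ string_value: "{value}" }}\n}}')
```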

all_models/transformer/1/.tmp

Whitespace-only changes.
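The `.tmp` placeholder keeps the `1` directory in version control (the commit message's "Add the '1' folder into all_models/transformer"); Triton expects every model to contain at least one numeric version subdirectory. A minimal, hypothetical check of the layout after the `cp ... 8-gpu ...` step above:

```python
from pathlib import Path

# Hypothetical layout check (not part of this repo): after copying the converted
# checkpoint, the model directory should hold config.pbtxt plus a numeric
# version subdirectory containing the 8-gpu weights.
model_dir = Path("fastertransformer_backend/all_models/transformer")

assert (model_dir / "config.pbtxt").is_file(), "missing model configuration"
assert (model_dir / "1").is_dir(), "missing version directory '1'"
assert (model_dir / "1" / "8-gpu").exists(), "8-gpu weights not copied yet"
print("[INFO] model repository layout looks complete")
```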

tools/identity_test.py

Lines changed: 10 additions & 10 deletions
@@ -45,15 +45,15 @@

 start_id = 220
 end_id = 50256
-# random_start_ids = np.random.randint(0, 50255, size=(BATCH_SIZE, START_LEN), dtype=np.uint32)
-random_start_ids = np.array([[9915, 27221, 59, 77, 383, 1853, 3327, 1462],
-                             [6601, 4237, 345, 460, 779, 284, 787, 257],
-                             [59, 77, 611, 7, 9248, 796, 657, 8],
-                             [38, 10128, 6032, 651, 8699, 4, 4048, 20753],
-                             [21448, 7006, 930, 12901, 930, 7406, 7006, 198],
-                             [13256, 11, 281, 1605, 3370, 11, 1444, 6771],
-                             [9915, 27221, 59, 77, 383, 1853, 3327, 1462],
-                             [6601, 4237, 345, 460, 779, 284, 787, 257]], np.uint32)
+random_start_ids = np.random.randint(0, 50255, size=(BATCH_SIZE, START_LEN), dtype=np.uint32)
+# random_start_ids = np.array([[9915, 27221, 59, 77, 383, 1853, 3327, 1462],
+#                              [6601, 4237, 345, 460, 779, 284, 787, 257],
+#                              [59, 77, 611, 7, 9248, 796, 657, 8],
+#                              [38, 10128, 6032, 651, 8699, 4, 4048, 20753],
+#                              [21448, 7006, 930, 12901, 930, 7406, 7006, 198],
+#                              [13256, 11, 281, 1605, 3370, 11, 1444, 6771],
+#                              [9915, 27221, 59, 77, 383, 1853, 3327, 1462],
+#                              [6601, 4237, 345, 460, 779, 284, 787, 257]], np.uint32)
 input_len = np.array([ [sentence.size] for sentence in random_start_ids ], np.uint32)
 output_len = np.ones_like(input_len).astype(np.uint32) * OUTPUT_LEN

@@ -177,4 +177,4 @@
 print(output_data.shape)
 print(output_data)
 stop_time = datetime.now()
-print("[INFO] execution time: {} ms".format((stop_time - start_time).total_seconds() * 1000.0 / request_parallelism))
+print("[INFO] execution time: {} ms".format((stop_time - start_time).total_seconds() * 1000.0 / request_parallelism))

tools/run_client.sh

Lines changed: 2 additions & 2 deletions
@@ -27,7 +27,7 @@

 # export CUDA_VISIBLE_DEVICES=0

-CLIENT_PY=$WORKSPACE/transformer_backend/tools/identity_test.py
+CLIENT_PY=$WORKSPACE/fastertransformer_backend/tools/identity_test.py
 CLIENT_LOG="./client.log"

 rm -rf client.log err.log
@@ -44,4 +44,4 @@ for PROTOCOL in http; do
 set -e
 done

-exit $RET
+exit $RET

tools/run_server.sh

Lines changed: 2 additions & 2 deletions
@@ -28,7 +28,7 @@
 # export CUDA_VISIBLE_DEVICES=0

 SERVER=/opt/tritonserver/bin/tritonserver
-SERVER_ARGS="--model-repository=$WORKSPACE/transformer_backend/all_models"
+SERVER_ARGS="--model-repository=$WORKSPACE/fastertransformer_backend/all_models"
 SERVER_LOG="./inference_server.log"
 source $WORKSPACE/common/util.sh

@@ -39,4 +39,4 @@ if [ "$SERVER_PID" == "0" ]; then
 echo -e "\n***\n*** Failed to start $SERVER\n***"
 cat $SERVER_LOG
 exit 1
-fi
+fi
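Both the README's `SERVER_TIMEOUT` recommendation and the `SERVER_PID` check above exist because loading the multi-GPU GPT weights can take a long time. As an illustrative alternative (not part of this commit), readiness can also be polled through Triton's client API; the URL, timeout, and model name below are assumptions.

```python
import time
import tritonclient.http as httpclient

# Readiness-poll sketch (illustrative): assumes Triton's default HTTP port 8000
# and a model named "transformer" in the repository started by run_server.sh.
client = httpclient.InferenceServerClient(url="localhost:8000")

deadline = time.time() + 600  # generous budget; weight loading can be slow
while time.time() < deadline:
    try:
        if client.is_server_ready() and client.is_model_ready("transformer"):
            print("[INFO] server and model are ready")
            break
    except Exception:
        pass  # the server may not be accepting connections yet
    time.sleep(5)
else:
    raise RuntimeError("Triton did not become ready before the timeout")
```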

0 commit comments
