Commit 7af0466

Merge pull request #1125 from pytorch/issue_1114
Adding user guide for batch inferencing
2 parents fd4e3e8 + 51d2cf0 commit 7af0466

File tree

3 files changed, +284 -35 lines changed


docs/batch_inference_with_ts.md

Lines changed: 216 additions & 13 deletions
@@ -6,14 +6,17 @@
 * [Prerequisites](#prerequisites)
 * [Batch Inference with TorchServe's default handlers](#batch-inference-with-torchserves-default-handlers)
 * [Batch Inference with TorchServe using ResNet-152 model](#batch-inference-with-torchserve-using-resnet-152-model)
+* [Demo to configure TorchServe ResNet-152 model with batch-supported model](#demo-to-configure-torchserve-resnet-152-model-with-batch-supported-model)
+* [Demo to configure TorchServe ResNet-152 model with batch-supported model using Docker](#demo-to-configure-torchserve-resnet-152-model-with-batch-supported-model-using-docker)
 
 ## Introduction
 
 Batch inference is the process of aggregating inference requests and sending these aggregated requests through the ML/DL framework for inference all at once.
 TorchServe was designed to natively support batching of incoming inference requests. This functionality enables you to use your host resources optimally,
 because most ML/DL frameworks are optimized for batch requests.
 This optimal use of host resources in turn reduces the operational expense of hosting an inference service using TorchServe.
-In this document we show an example of how this is done and compare the performance of running a batched inference against running single inference.
+
+In this document we show an example of how to use batch inference in TorchServe when serving models locally or in Docker containers.
 
 ## Prerequisites

@@ -30,42 +33,64 @@ TorchServe's default handlers support batch inference out of box except for `text
 
 To support batch inference, TorchServe needs the following:
 
-1. TorchServe model configuration: Configure `batch_size` and `max_batch_delay` by using the "POST /models" management API.
+1. TorchServe model configuration: Configure `batch_size` and `max_batch_delay` by using the "POST /models" management API or via settings in config.properties.
    TorchServe needs to know the maximum batch size that the model can handle and the maximum time that TorchServe should wait to fill each batch request.
 2. Model handler code: TorchServe requires the model handler to handle batch inference requests.
 
 For a full working example of a custom model handler with batch processing, see [Hugging face transformer generalized handler](https://github.com/pytorch/serve/blob/master/examples/Huggingface_Transformers/Transformer_handler_generalized.py)
 
 ### TorchServe Model Configuration
 
-To configure TorchServe to use the batching feature, provide the batch configuration information through [**POST /models** API](management_api.md).
+Starting with TorchServe 0.4.1, there are two methods to configure TorchServe to use the batching feature:
+1. Provide the batch configuration information through the [**POST /models** API](management_api.md).
+2. Provide the batch configuration information through the configuration file, config.properties.
 
-The configuration that we are interested in is the following:
+The configuration properties that we are interested in are the following:
 
 1. `batch_size`: This is the maximum batch size that a model is expected to handle.
 2. `max_batch_delay`: This is the maximum batch delay time in `ms` that TorchServe waits to receive `batch_size` number of requests. If TorchServe doesn't receive `batch_size` number of
    requests before this timer times out, it sends whatever requests were received to the model `handler`.
 
-Let's look at an example using this configuration
+Let's look at an example of using this configuration through the management API:
 
 ```bash
 # The following command will register a model "resnet-152.mar" and configure TorchServe to use a batch_size of 8 and a max batch delay of 50 milliseconds.
 curl -X POST "localhost:8081/models?url=resnet-152.mar&batch_size=8&max_batch_delay=50"
+```
+
+Here is an example of using this configuration through config.properties:
+
+```text
+# The following configuration will register a model "resnet-152.mar" and configure TorchServe to use a batch_size of 8 and a max batch delay of 50 milliseconds.
+models={\
+  "resnet-152": {\
+    "1.0": {\
+        "defaultVersion": true,\
+        "marName": "resnet-152.mar",\
+        "minWorkers": 1,\
+        "maxWorkers": 1,\
+        "batchSize": 8,\
+        "maxBatchDelay": 50,\
+        "responseTimeout": 120\
+    }\
+  }\
+}
+
 ```
 
 These configurations are used both in TorchServe and in the model's custom service code (a.k.a. the handler code).
 TorchServe associates the batch related configuration with each model.
 The frontend then tries to aggregate the batch-size number of requests and send it to the backend.
 
-## Demo to configure TorchServe with batch-supported model
+## Demo to configure TorchServe ResNet-152 model with batch-supported model
 
 In this section let's bring up the model server and launch the Resnet-152 model, which uses the default `image_classifier` handler for batch inferencing.
 
 ### Setup TorchServe and Torch Model Archiver
 
 First things first, follow the main [Readme](../README.md) and install all the required packages, including `torchserve`.
 
-### Loading Resnet-152 which handles batch inferences
+### Batch inference of Resnet-152 configured with the management API
 
 * Start the model server. In this example, we are starting the model server to run on inference port 8080 and management port 8081.
 
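Editorial aside: the interplay of `batch_size` and `max_batch_delay` described in the hunk above can be illustrated with a small stand-alone simulation. This is a sketch, not TorchServe's actual frontend code, and `collect_batch` is a hypothetical helper:

```python
import queue
import time

def collect_batch(request_queue, batch_size, max_batch_delay_ms):
    """Collect up to batch_size requests, waiting at most max_batch_delay_ms
    after the first request arrives, mirroring the behavior described above."""
    batch = [request_queue.get()]  # block until at least one request arrives
    deadline = time.monotonic() + max_batch_delay_ms / 1000.0
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # timer expired: send whatever requests were received
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived before the deadline
    return batch

q = queue.Queue()
for i in range(5):  # five requests are queued; batch_size is 3
    q.put(f"req-{i}")
print(collect_batch(q, batch_size=3, max_batch_delay_ms=50))  # -> ['req-0', 'req-1', 'req-2']
```

With fewer than `batch_size` requests pending, the function returns a partial batch once the delay expires, which is exactly the trade-off the two settings control.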

@@ -90,13 +115,16 @@ $ curl localhost:8080/ping
 
 * Now let's launch the resnet-152 model, which we have built to handle batch inference. Because this is an example, we are going to launch 1 worker which handles a batch size of 3 with a `max_batch_delay` of 10 ms.
 
 ```text
-$ curl -X POST "localhost:8081/models?url=https://torchserve.pytorch.org/mar_files/resnet-152-batch_v2.mar&batch_size=8&max_batch_delay=10&initial_workers=1"
+$ curl -X POST "localhost:8081/models?url=https://torchserve.pytorch.org/mar_files/resnet-152-batch_v2.mar&batch_size=3&max_batch_delay=10&initial_workers=1"
 {
   "status": "Processing worker updates..."
 }
 ```
 
 * Verify that the workers were started properly.
+```bash
+curl http://localhost:8081/models/resnet-152-batch_v2
+```
 
 ```json
 [
@@ -108,15 +136,17 @@ $ curl -X POST "localhost:8081/models?url=https://torchserve.pytorch.org/mar_fil
     "minWorkers": 1,
     "maxWorkers": 1,
     "batchSize": 3,
-    "maxBatchDelay": 5000,
+    "maxBatchDelay": 10,
     "loadedAtStartup": false,
     "workers": [
       {
         "id": "9000",
-        "startTime": "2020-07-28T05:04:05.465Z",
+        "startTime": "2021-06-14T23:18:21.793Z",
         "status": "READY",
-        "gpu": false,
-        "memoryUsage": 0
+        "memoryUsage": 1726554112,
+        "pid": 19946,
+        "gpu": true,
+        "gpuUsage": "gpuId::0 utilization.gpu [%]::0 % utilization.memory [%]::0 % memory.used [MiB]::678 MiB"
       }
     ]
   }
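Editorial aside: the registration call shown above is a plain HTTP POST with a query string, so it can also be assembled programmatically with the standard library. This is a sketch; `register_url` is a hypothetical helper, and the parameter names should be checked against the management API documentation:

```python
from urllib.parse import urlencode

def register_url(host, mar_url, **params):
    """Build a TorchServe management-API registration URL,
    mirroring the curl command above."""
    query = urlencode({"url": mar_url, **params})
    return f"http://{host}/models?{query}"

url = register_url(
    "localhost:8081",
    "https://torchserve.pytorch.org/mar_files/resnet-152-batch_v2.mar",
    batch_size=3,
    max_batch_delay=10,
    initial_workers=1,
)
# Note: urlencode percent-encodes the mar URL, which is valid in a query string.
print(url)
```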
@@ -143,3 +173,176 @@ $ curl -X POST "localhost:8081/models?url=https://torchserve.pytorch.org/mar_fil
   "quilt": 0.0002698268508538604
 }
 ```
+### Batch inference of Resnet-152 configured through config.properties
+
+* Here, we first set the `batch_size` and `max_batch_delay` in config.properties. Make sure the mar file is located in the model store and that the version in the `models` setting is consistent with the version of the mar file created. To read more about configurations, please refer to this [document](./configuration.md).
+
+```text
+load_models=resnet-152-batch_v2.mar
+models={\
+  "resnet-152-batch_v2": {\
+    "2.0": {\
+        "defaultVersion": true,\
+        "marName": "resnet-152-batch_v2.mar",\
+        "minWorkers": 1,\
+        "maxWorkers": 1,\
+        "batchSize": 3,\
+        "maxBatchDelay": 5000,\
+        "responseTimeout": 120\
+    }\
+  }\
+}
+```
+* Then start TorchServe, passing the config.properties using the `--ts-config` flag
+
+```bash
+torchserve --start --model-store model_store --ts-config config.properties
+```
+* Verify that TorchServe is up and running
+
+```text
+$ curl localhost:8080/ping
+{
+  "status": "Healthy"
+}
+```
+* Verify that the workers were started properly.
+```bash
+curl http://localhost:8081/models/resnet-152-batch_v2
+```
+```json
+[
+  {
+    "modelName": "resnet-152-batch_v2",
+    "modelVersion": "2.0",
+    "modelUrl": "resnet-152-batch_v2.mar",
+    "runtime": "python",
+    "minWorkers": 1,
+    "maxWorkers": 1,
+    "batchSize": 3,
+    "maxBatchDelay": 5000,
+    "loadedAtStartup": true,
+    "workers": [
+      {
+        "id": "9000",
+        "startTime": "2021-06-14T22:44:36.742Z",
+        "status": "READY",
+        "memoryUsage": 0,
+        "pid": 19116,
+        "gpu": true,
+        "gpuUsage": "gpuId::0 utilization.gpu [%]::0 % utilization.memory [%]::0 % memory.used [MiB]::678 MiB"
+      }
+    ]
+  }
+]
+```
+* Now let's test this service.
+
+* Get an image to test this service
+
+```text
+$ curl -LJO https://github.com/pytorch/serve/raw/master/examples/image_classifier/kitten.jpg
+```
+
+* Run inference to test the model.
+
+```text
+$ curl http://localhost:8080/predictions/resnet-152-batch_v2 -T kitten.jpg
+{
+    "tiger_cat": 0.5848360657691956,
+    "tabby": 0.3782736361026764,
+    "Egyptian_cat": 0.03441936895251274,
+    "lynx": 0.0005633446853607893,
+    "quilt": 0.0002698268508538604
+}
+```
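Editorial aside: the `models` value in config.properties is a JSON object written with trailing backslashes so it can span multiple lines of a properties file. Stripping those line continuations yields ordinary JSON, which the snippet below demonstrates (a sketch; it assumes the value is well-formed JSON, as in the examples above):

```python
import json

# The models property exactly as it appears in config.properties,
# with backslash line continuations (raw string keeps them literal).
models_property = r"""{\
  "resnet-152-batch_v2": {\
    "2.0": {\
        "defaultVersion": true,\
        "marName": "resnet-152-batch_v2.mar",\
        "minWorkers": 1,\
        "maxWorkers": 1,\
        "batchSize": 3,\
        "maxBatchDelay": 5000,\
        "responseTimeout": 120\
    }\
  }\
}"""

# Remove the backslash that continues each line, then parse as JSON.
models = json.loads(models_property.replace("\\\n", "\n"))
print(models["resnet-152-batch_v2"]["2.0"]["batchSize"])  # -> 3
```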
+## Demo to configure TorchServe ResNet-152 model with batch-supported model using Docker
+
+Here, we show how to register a model with batch inference support when serving the model using Docker containers. We set the `batch_size` and `max_batch_delay` in config.properties as in the previous section; this file is used by [dockerd-entrypoint.sh](../docker/dockerd-entrypoint.sh).
+
+### Batch inference of Resnet-152 using a Docker container
+
+* Set the `batch_size` and `max_batch_delay` in config.properties as referenced in [dockerd-entrypoint.sh](../docker/dockerd-entrypoint.sh)
+
+```text
+inference_address=http://0.0.0.0:8080
+management_address=http://0.0.0.0:8081
+metrics_address=http://0.0.0.0:8082
+number_of_netty_threads=32
+job_queue_size=1000
+model_store=/home/model-server/model-store
+load_models=resnet-152-batch_v2.mar
+models={\
+  "resnet-152-batch_v2": {\
+    "1.0": {\
+        "defaultVersion": true,\
+        "marName": "resnet-152-batch_v2.mar",\
+        "minWorkers": 1,\
+        "maxWorkers": 1,\
+        "batchSize": 3,\
+        "maxBatchDelay": 100,\
+        "responseTimeout": 120\
+    }\
+  }\
+}
+```
+* Build the targeted Docker image from [here](../docker); here we use the GPU image
+```bash
+./build_image.sh -g -cv cu102
+```
+
+* Start serving the model with the container, passing config.properties to the container
+
+```bash
+docker run --rm -it --gpus all -p 8080:8080 -p 8081:8081 --name mar -v /home/ubuntu/serve/model_store:/home/model-server/model-store -v <path-to-config.properties>:/home/model-server/config.properties pytorch/torchserve:latest-gpu
+```
+* Verify that the workers were started properly.
+```bash
+curl http://localhost:8081/models/resnet-152-batch_v2
+```
+```json
+[
+  {
+    "modelName": "resnet-152-batch_v2",
+    "modelVersion": "2.0",
+    "modelUrl": "resnet-152-batch_v2.mar",
+    "runtime": "python",
+    "minWorkers": 1,
+    "maxWorkers": 1,
+    "batchSize": 3,
+    "maxBatchDelay": 5000,
+    "loadedAtStartup": true,
+    "workers": [
+      {
+        "id": "9000",
+        "startTime": "2021-06-14T22:44:36.742Z",
+        "status": "READY",
+        "memoryUsage": 0,
+        "pid": 19116,
+        "gpu": true,
+        "gpuUsage": "gpuId::0 utilization.gpu [%]::0 % utilization.memory [%]::0 % memory.used [MiB]::678 MiB"
+      }
+    ]
+  }
+]
+```
+* Now let's test this service.
+
+* Get an image to test this service
+
+```text
+$ curl -LJO https://github.com/pytorch/serve/raw/master/examples/image_classifier/kitten.jpg
+```
+
+* Run inference to test the model.
+
+```text
+$ curl http://localhost:8080/predictions/resnet-152-batch_v2 -T kitten.jpg
+{
+    "tiger_cat": 0.5848360657691956,
+    "tabby": 0.3782736361026764,
+    "Egyptian_cat": 0.03441936895251274,
+    "lynx": 0.0005633446853607893,
+    "quilt": 0.0002698268508538604
+}
+```
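Editorial aside: both demos rely on a handler that accepts an entire batch. The contract described earlier in this guide is that the handler receives a list of requests and must return one result per request, in the same order. A minimal sketch of that contract (this is illustrative code, not the actual `image_classifier` handler):

```python
def preprocess(batch):
    # Each request dict carries its payload under "data" or "body";
    # falling back between the two is a simplification for this sketch.
    return [req.get("data") or req.get("body") for req in batch]

def inference(inputs):
    # Stand-in for the real model call; here we just report payload sizes.
    return [len(x) for x in inputs]

def handle(data, context):
    """Batch-aware entry point: `data` is a list of requests, and the
    return value must be a list with exactly one entry per request."""
    inputs = preprocess(data)
    return inference(inputs)

fake_batch = [{"data": b"img-bytes-1"}, {"body": b"img-bytes-22"}]
print(handle(fake_batch, context=None))  # -> [11, 12]
```

Because TorchServe maps the i-th result back to the i-th request, a real handler must preserve both the length and the order of the batch.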

examples/Huggingface_Transformers/README.md

Lines changed: 35 additions & 10 deletions
@@ -175,23 +175,48 @@ To get an explanation: `curl -X POST http://127.0.0.1:8080/explanations/my_tc -T
 
 ## Batch Inference
 
-For batch inference the main difference is that you need to set the batch size while registering the model. As an example on sequence classification:
-
-```
-mkdir model_store
-mv BERTSeqClassification.mar model_store/
-torchserve --start --model-store model_store
-
-curl -X POST "localhost:8081/models?model_name=BERTSeqClassification&url=BERTSeqClassification.mar&batch_size=4&max_batch_delay=5000&initial_workers=3&synchronous=true"
-```
-
+For batch inference the main difference is that you need to set the batch size while registering the model. This can be done either through the management API or, if using TorchServe 0.4.1 and above, through config.properties as well. Here is an example of setting the batch size for sequence classification with the management API and through config.properties. You can read more on batch inference in TorchServe [here](https://github.com/pytorch/serve/tree/master/docs/batch_inference_with_ts.md).
+
+* Management API
+```
+mkdir model_store
+mv BERTSeqClassification.mar model_store/
+torchserve --start --model-store model_store
+
+curl -X POST "localhost:8081/models?model_name=BERTSeqClassification&url=BERTSeqClassification.mar&batch_size=4&max_batch_delay=5000&initial_workers=3&synchronous=true"
+```
+
+* Config.properties
+```text
+models={\
+  "BERTSeqClassification": {\
+    "2.0": {\
+        "defaultVersion": true,\
+        "marName": "BERTSeqClassification.mar",\
+        "minWorkers": 1,\
+        "maxWorkers": 1,\
+        "batchSize": 4,\
+        "maxBatchDelay": 5000,\
+        "responseTimeout": 120\
+    }\
+  }\
+}
+```
+```
+mkdir model_store
+mv BERTSeqClassification.mar model_store/
+torchserve --start --model-store model_store --ts-config config.properties --models BERTSeqClassification=BERTSeqClassification.mar
+```
 Now to run batch inference, the following command can be used:
 
 ```
 curl -X POST http://127.0.0.1:8080/predictions/BERT_seq_Classification -T ./Seq_classification_artifacts/sample_text1.txt
 & curl -X POST http://127.0.0.1:8080/predictions/BERT_seq_Classification -T ./Seq_classification_artifacts/sample_text2.txt
 & curl -X POST http://127.0.0.1:8080/predictions/BERT_seq_Classification -T ./Seq_classification_artifacts/sample_text3.txt &
 ```
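Editorial aside: the backgrounded curl commands above simply fire several requests concurrently so the server can fill a batch before `max_batch_delay` expires. The same pattern in Python looks like the sketch below (a stub function stands in for the real HTTP POST, since this example has no live server):

```python
from concurrent.futures import ThreadPoolExecutor

def send_prediction(path):
    # Stub for an HTTP POST to /predictions/<model>; a real client
    # would send the file contents to the inference endpoint here.
    return f"result-for-{path}"

files = ["sample_text1.txt", "sample_text2.txt", "sample_text3.txt"]

# Submitting all requests at once lets the server aggregate them
# into a single batch before the delay timer expires.
with ThreadPoolExecutor(max_workers=len(files)) as pool:
    results = list(pool.map(send_prediction, files))
print(results)
```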
+
 ## More information
 
 ### Captum Explanations for Visual Insights
