* [Batch Inference with TorchServe's default handlers](#batch-inference-with-torchserves-default-handlers)
* [Batch Inference with TorchServe using ResNet-152 model](#batch-inference-with-torchserve-using-resnet-152-model)
* [Demo to configure TorchServe ResNet-152 model with batch-supported model](#demo-to-configure-torchserve-resnet-152-model-with-batch-supported-model)
* [Demo to configure TorchServe ResNet-152 model with batch-supported model using Docker](#demo-to-configure-torchserve-resnet-152-model-with-batch-supported-model-using-docker)
## Introduction
Batch inference is a process of aggregating inference requests and sending these aggregated requests through the ML/DL framework for inference all at once.
TorchServe was designed to natively support batching of incoming inference requests. This functionality enables you to use your host resources optimally,
because most ML/DL frameworks are optimized for batch requests.
This optimal use of host resources in turn reduces the operational expense of hosting an inference service using TorchServe.

In this document we show an example of how to use batch inference in TorchServe when serving models locally or using Docker containers.
## Prerequisites
To support batch inference, TorchServe needs the following:
1. TorchServe model configuration: Configure `batch_size` and `max_batch_delay` by using the "POST /models" management API or settings in config.properties.
TorchServe needs to know the maximum batch size that the model can handle and the maximum time that TorchServe should wait to fill each batch request.
2. Model handler code: TorchServe requires the Model handler to handle batch inference requests.
For a full working example of a custom model handler with batch processing, see [Hugging Face transformer generalized handler](https://github.com/pytorch/serve/blob/master/examples/Huggingface_Transformers/Transformer_handler_generalized.py)
### TorchServe Model Configuration
Starting from TorchServe 0.4.1, there are two ways to configure TorchServe to use the batching feature:
1. Provide the batch configuration information through the [**POST /models** API](management_api.md).
2. Provide the batch configuration information through the TorchServe configuration file, config.properties.

The configuration properties that we are interested in are the following:

1. `batch_size`: This is the maximum batch size that a model is expected to handle.
2. `max_batch_delay`: This is the maximum batch delay time in `ms` that TorchServe waits to receive `batch_size` number of requests. If TorchServe doesn't receive `batch_size` number of requests before this timer times out, it sends whatever requests were received to the model `handler`.

Let's look at an example of using this configuration through the management API:

```bash
# The following command will register a model "resnet-152.mar" and configure TorchServe to use a batch_size of 8 and a max batch delay of 50 milliseconds.
curl -X POST "localhost:8081/models?url=resnet-152.mar&batch_size=8&max_batch_delay=50"
```
Here is an example of using this configuration through the config.properties:
```text
# The following configuration in config.properties will register the model "resnet-152.mar" with a batch_size of 8 and a max_batch_delay of 50 milliseconds.

models={\
  "resnet-152": {\
    "1.0": {\
        "defaultVersion": true,\
        "marName": "resnet-152.mar",\
        "minWorkers": 1,\
        "maxWorkers": 1,\
        "batchSize": 8,\
        "maxBatchDelay": 50,\
        "responseTimeout": 120\
    }\
  }\
}
```
These configurations are used both in TorchServe and in the model's custom service code (a.k.a the handler code).
TorchServe associates the batch related configuration with each model.
The frontend then tries to aggregate the batch-size number of requests and send it to the backend.
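
To see this aggregation in action, you can fire several requests concurrently; below is a minimal sketch, assuming the `resnet-152` model registered above and a local test image named `kitten.jpg` (the image name is an assumption).

```bash
# Send 8 concurrent requests; with batch_size=8 and max_batch_delay=50 the frontend
# can aggregate them into a single batch before handing them to the backend worker.
for i in {1..8}; do
  curl -s http://localhost:8080/predictions/resnet-152 -T kitten.jpg &
done
wait
```
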
## Demo to configure TorchServe ResNet-152 model with batch-supported model
In this section we bring up the model server and launch the Resnet-152 model, which uses the default `image_classifier` handler for batch inferencing.
### Setup TorchServe and Torch Model Archiver
First things first, follow the main [Readme](../README.md) and install all the required packages including `torchserve`.
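
If they are not installed yet, both packages are available from pip; a minimal sketch (see the Readme for the complete, up-to-date instructions):

```bash
pip install torchserve torch-model-archiver
```
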
### Batch inference of Resnet-152 configured with management API
* Start the model server. In this example, we are starting the model server to run on inference port 8080 and management port 8081.
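
For example, a minimal sketch assuming the default ports and a local `model_store` directory for the mar files:

```bash
# Start TorchServe with an (initially empty) model store; 8080/8081 are the default inference/management ports.
torchserve --start --model-store model_store
```
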
* Now let's launch resnet-152 model, which we have built to handle batch inference. Because this is an example, we are going to launch 1 worker which handles a batch size of 3 with a `max_batch_delay` of 10ms.

```text
$ curl -X POST "localhost:8081/models?url=https://torchserve.pytorch.org/mar_files/resnet-152-batch_v2.mar&batch_size=3&max_batch_delay=10&initial_workers=1"
```

```text
...
  "quilt": 0.0002698268508538604
}
```
### Batch inference of Resnet-152 configured through config.properties
* Here, we first set the `batch_size` and `max_batch_delay` in the config.properties. Make sure the mar file is located in the model store and that the version in the `models` setting is consistent with the version of the mar file created. To read more about configurations, please refer to this [document](./configuration.md).

```text
load_models=resnet-152-batch_v2.mar
models={\
  "resnet-152-batch_v2": {\
    "2.0": {\
        "defaultVersion": true,\
        "marName": "resnet-152-batch_v2.mar",\
        "minWorkers": 1,\
        "maxWorkers": 1,\
        "batchSize": 3,\
        "maxBatchDelay": 5000,\
        "responseTimeout": 120\
    }\
  }\
}
```
* Then we will start TorchServe by passing the config.properties using the `--ts-config` flag, as shown below.
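
A minimal sketch, assuming the mar file sits in a local `model_store` directory and the configuration above is saved as `config.properties`:

```bash
torchserve --start --model-store model_store --ts-config config.properties
```
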
## Demo to configure TorchServe ResNet-152 model with batch-supported model using Docker
Here, we show how to register a model with batch inference support when serving the model using Docker containers. We set the `batch_size` and `max_batch_delay` in the config.properties similar to the previous section, which is used by [dockerd-entrypoint.sh](../docker/dockerd-entrypoint.sh).

### Batch inference of Resnet-152 using Docker container
* Set the `batch_size` and `max_batch_delay` in the config.properties as referenced in [dockerd-entrypoint.sh](../docker/dockerd-entrypoint.sh)

```text
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
number_of_netty_threads=32
job_queue_size=1000
model_store=/home/model-server/model-store
load_models=resnet-152-batch_v2.mar
models={\
  "resnet-152-batch_v2": {\
    "1.0": {\
        "defaultVersion": true,\
        "marName": "resnet-152-batch_v2.mar",\
        "minWorkers": 1,\
        "maxWorkers": 1,\
        "batchSize": 3,\
        "maxBatchDelay": 100,\
        "responseTimeout": 120\
    }\
  }\
}
```
* Build the targeted Docker image from [here](../docker); here we use the GPU image:

```bash
./build_image.sh -g -cv cu102
```

* Start serving the model with the container, passing the config.properties to it:

```bash
docker run --rm -it --gpus all -p 8080:8080 -p 8081:8081 --name mar -v /home/ubuntu/serve/model_store:/home/model-server/model-store -v <path_to_config.properties>:/home/model-server/config.properties pytorch/torchserve:latest-gpu
```
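
Once the container is up, the model registered through config.properties can be queried just like the locally served one; a minimal sketch, assuming a local test image named `kitten.jpg`:

```bash
# The model is served under the name given in config.properties.
curl http://localhost:8080/predictions/resnet-152-batch_v2 -T kitten.jpg
```
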
(From examples/Huggingface_Transformers/README.md)

## Batch Inference
For batch inference the main difference is that you need to set the batch size while registering the model. This can be done either through the management API or, if using TorchServe 0.4.1 and above, through config.properties as well. Here is an example of setting the batch size for sequence classification with the management API and through config.properties. You can read more on batch inference in TorchServe [here](https://github.com/pytorch/serve/tree/master/docs/batch_inference_with_ts.md).

* Management API
```
mkdir model_store
mv BERTSeqClassification.mar model_store/
torchserve --start --model-store model_store
curl -X POST "localhost:8081/models?model_name=BERTSeqClassification&url=BERTSeqClassification.mar&batch_size=4&max_batch_delay=5000&initial_workers=3&synchronous=true"
```
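
* config.properties (a sketch: the version, worker counts, and batch settings below are illustrative assumptions that mirror the management API call above; adjust them to match your mar file)

```
load_models=BERTSeqClassification.mar
models={\
  "BERTSeqClassification": {\
    "1.0": {\
        "defaultVersion": true,\
        "marName": "BERTSeqClassification.mar",\
        "minWorkers": 1,\
        "maxWorkers": 1,\
        "batchSize": 4,\
        "maxBatchDelay": 5000,\
        "responseTimeout": 120\
    }\
  }\
}
```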