Cannot load model #1813

Closed

LiJell opened this issue Aug 24, 2022 · 14 comments

Labels: triaged_wait (Waiting for the Reporter's response)
LiJell commented Aug 24, 2022

🐛 Describe the bug

I am trying to deploy a locally pretrained model via SageMaker to create an endpoint and use it.

I deployed the model:

```python
from sagemaker.pytorch import PyTorchModel

pytorch_model = PyTorchModel(model_data='model.tar.gz',
                             role=role,
                             entry_point='inference.py',
                             framework_version="1.9.0",
                             py_version="py38")

predictor = pytorch_model.deploy(instance_type='ml.g4dn.xlarge', initial_instance_count=1)
```

and ran a prediction on an image:

```python
from PIL import Image

data = Image.open('./samples/inputs/1.jpg')
result = predictor.predict(data)
img = Image.open(result)
img.show()
```
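For reference, a minimal sketch of how the same request could instead be sent as raw image bytes with an explicit content type; the "image/jpeg" content type and the assumption that the handler accepts raw bytes are not confirmed anywhere in this thread:

```python
# Hypothetical alternative invocation: send raw JPEG bytes directly to the endpoint.
import boto3

runtime = boto3.client("sagemaker-runtime")
with open('./samples/inputs/1.jpg', 'rb') as f:
    payload = f.read()

response = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint_name,  # endpoint created by deploy() above
    ContentType="image/jpeg",              # assumed content type
    Body=payload,
)
image_bytes = response["Body"].read()
```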

Running predictor.predict(data) as above, I got this error:

```
ModelError Traceback (most recent call last)
/tmp/ipykernel_4268/3704626012.py in <cell line: 4>()
2
3 data = Image.open('./samples/inputs/1.jpg')
----> 4 result = predictor.predict(data)
5
6 img = Image.open(result)

~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/sagemaker/predictor.py in predict(self, data, initial_args, target_model, target_variant, inference_id)
159 data, initial_args, target_model, target_variant, inference_id
160 )
--> 161 response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
162 return self._handle_response(response)
163

~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
506 )
507 # The "self" in this scope is referring to the BaseClient.
--> 508 return self._make_api_call(operation_name, kwargs)
509
510 _api_call.__name__ = str(py_operation_name)

~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
913 error_code = parsed_response.get("Error", {}).get("Code")
914 error_class = self.exceptions.from_code(error_code)
--> 915 raise error_class(parsed_response, operation_name)
916 else:
917 return parsed_response

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.".
```

I skimmed through the logs in CloudWatch and am still struggling with this. I need some help.

Error logs


```
timestamp message logStreamName
1661327528194 2022-08-24 07:52:07,987 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager... AllTraffic/i-0b6f78248b097b6c7
1661327528194 2022-08-24 07:52:08,112 [INFO ] main org.pytorch.serve.ModelServer - AllTraffic/i-0b6f78248b097b6c7
1661327528194 Torchserve version: 0.4.2 AllTraffic/i-0b6f78248b097b6c7
1661327528194 TS Home: /opt/conda/lib/python3.8/site-packages AllTraffic/i-0b6f78248b097b6c7
1661327528194 Current directory: / AllTraffic/i-0b6f78248b097b6c7
1661327528194 Temp directory: /home/model-server/tmp AllTraffic/i-0b6f78248b097b6c7
1661327528194 Number of GPUs: 1 AllTraffic/i-0b6f78248b097b6c7
1661327528194 Number of CPUs: 1 AllTraffic/i-0b6f78248b097b6c7
1661327528194 Max heap size: 3234 M AllTraffic/i-0b6f78248b097b6c7
1661327528194 Python executable: /opt/conda/bin/python3.8 AllTraffic/i-0b6f78248b097b6c7
1661327528194 Config file: /etc/sagemaker-ts.properties AllTraffic/i-0b6f78248b097b6c7
1661327528194 Inference address: http://0.0.0.0:8080 AllTraffic/i-0b6f78248b097b6c7
1661327528194 Management address: http://0.0.0.0:8080 AllTraffic/i-0b6f78248b097b6c7
1661327528194 Metrics address: http://127.0.0.1:8082 AllTraffic/i-0b6f78248b097b6c7
1661327528194 Model Store: /.sagemaker/ts/models AllTraffic/i-0b6f78248b097b6c7
1661327528194 Initial Models: model.mar AllTraffic/i-0b6f78248b097b6c7
1661327528194 Log dir: /logs AllTraffic/i-0b6f78248b097b6c7
1661327528194 Metrics dir: /logs AllTraffic/i-0b6f78248b097b6c7
1661327528194 Netty threads: 0 AllTraffic/i-0b6f78248b097b6c7
1661327528194 Netty client threads: 0 AllTraffic/i-0b6f78248b097b6c7
1661327528194 Default workers per model: 1 AllTraffic/i-0b6f78248b097b6c7
1661327528194 Blacklist Regex: N/A AllTraffic/i-0b6f78248b097b6c7
1661327528194 Maximum Response Size: 6553500 AllTraffic/i-0b6f78248b097b6c7
1661327528194 Maximum Request Size: 6553500 AllTraffic/i-0b6f78248b097b6c7
1661327528194 Prefer direct buffer: false AllTraffic/i-0b6f78248b097b6c7
1661327528194 Allowed Urls: [file://.* http(s)?://.*]
1661327528194 Custom python dependency for model allowed: false AllTraffic/i-0b6f78248b097b6c7
1661327528194 Metrics report format: prometheus AllTraffic/i-0b6f78248b097b6c7
1661327528194 Enable metrics API: true AllTraffic/i-0b6f78248b097b6c7
1661327528194 Workflow Store: /.sagemaker/ts/models AllTraffic/i-0b6f78248b097b6c7
1661327528194 Model config: N/A AllTraffic/i-0b6f78248b097b6c7
1661327528194 2022-08-24 07:52:08,120 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin... AllTraffic/i-0b6f78248b097b6c7
1661327528444 2022-08-24 07:52:08,149 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: model.mar AllTraffic/i-0b6f78248b097b6c7
1661327528444 2022-08-24 07:52:08,353 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model model loaded. AllTraffic/i-0b6f78248b097b6c7
1661327528694 2022-08-24 07:52:08,370 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel. AllTraffic/i-0b6f78248b097b6c7
1661327528694 2022-08-24 07:52:08,472 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8080 AllTraffic/i-0b6f78248b097b6c7
1661327528694 2022-08-24 07:52:08,473 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel. AllTraffic/i-0b6f78248b097b6c7
1661327528944 2022-08-24 07:52:08,474 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082 AllTraffic/i-0b6f78248b097b6c7
1661327528944 Model server started. AllTraffic/i-0b6f78248b097b6c7
1661327528944 2022-08-24 07:52:08,738 [WARN ] pool-2-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet. AllTraffic/i-0b6f78248b097b6c7
1661327528944 2022-08-24 07:52:08,786 [INFO ] pool-2-thread-1 TS_METRICS - CPUUtilization.Percent:0.0 #Level:Host
1661327528944 2022-08-24 07:52:08,787 [INFO ] pool-2-thread-1 TS_METRICS - DiskAvailable.Gigabytes:24.598094940185547 #Level:Host
1661327528944 2022-08-24 07:52:08,788 [INFO ] pool-2-thread-1 TS_METRICS - DiskUsage.Gigabytes:27.390167236328125 #Level:Host
1661327528944 2022-08-24 07:52:08,788 [INFO ] pool-2-thread-1 TS_METRICS - DiskUtilization.Percent:52.7 #Level:Host
1661327528944 2022-08-24 07:52:08,788 [INFO ] pool-2-thread-1 TS_METRICS - MemoryAvailable.Megabytes:14186.97265625 #Level:Host
1661327528944 2022-08-24 07:52:08,789 [INFO ] pool-2-thread-1 TS_METRICS - MemoryUsed.Megabytes:1227.640625 #Level:Host
1661327529195 2022-08-24 07:52:08,789 [INFO ] pool-2-thread-1 TS_METRICS - MemoryUtilization.Percent:9.9 #Level:Host
1661327529195 2022-08-24 07:52:09,004 [INFO ] W-9000-model_1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000 AllTraffic/i-0b6f78248b097b6c7
1661327529195 2022-08-24 07:52:09,004 [INFO ] W-9000-model_1-stdout MODEL_LOG - [PID]32 AllTraffic/i-0b6f78248b097b6c7
1661327529195 2022-08-24 07:52:09,004 [INFO ] W-9000-model_1-stdout MODEL_LOG - Torch worker started. AllTraffic/i-0b6f78248b097b6c7
1661327529195 2022-08-24 07:52:09,004 [INFO ] W-9000-model_1-stdout MODEL_LOG - Python runtime: 3.8.10 AllTraffic/i-0b6f78248b097b6c7
1661327529195 2022-08-24 07:52:09,011 [INFO ] W-9000-model_1 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000 AllTraffic/i-0b6f78248b097b6c7
1661327529195 2022-08-24 07:52:09,021 [INFO ] W-9000-model_1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000. AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,064 [INFO ] W-9000-model_1-stdout MODEL_LOG - model_name: model, batchSize: 1 AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,605 [INFO ] W-9000-model_1-stdout MODEL_LOG - Backend worker process died. AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,605 [INFO ] W-9000-model_1-stdout MODEL_LOG - Traceback (most recent call last): AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,606 [INFO ] W-9000-model_1-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py", line 183, in AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,606 [INFO ] W-9000-model_1-stdout MODEL_LOG - worker.run_server() AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,606 [INFO ] W-9000-model_1-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py", line 155, in run_server AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,607 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,607 [INFO ] W-9000-model_1-stdout MODEL_LOG - self.handle_connection(cl_socket) AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,608 [INFO ] W-9000-model_1-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py", line 117, in handle_connection AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,608 [WARN ] W-9000-model_1 org.pytorch.serve.wlm.BatchAggregator - Load model failed: model, error: Worker died. AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,608 [INFO ] W-9000-model_1-stdout MODEL_LOG - service, result, code = self.load_model(msg) AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,609 [WARN ] W-9000-model_1 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-model_1-stderr AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,609 [WARN ] W-9000-model_1 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-model_1-stdout AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,610 [INFO ] W-9000-model_1-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py", line 90, in load_model AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,610 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-model_1-stdout AllTraffic/i-0b6f78248b097b6c7
1661327529695 2022-08-24 07:52:09,610 [INFO ] W-9000-model_1 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds. AllTraffic/i-0b6f78248b097b6c7
1661327531196 2022-08-24 07:52:09,628 [INFO ] W-9000-model_1-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-model_1-stderr AllTraffic/i-0b6f78248b097b6c7
1661327531196 2022-08-24 07:52:11,192 [INFO ] W-9000-model_1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000 AllTraffic/i-0b6f78248b097b6c7
1661327531196 2022-08-24 07:52:11,193 [INFO ] W-9000-model_1-stdout MODEL_LOG - [PID]52 AllTraffic/i-0b6f78248b097b6c7
1661327531196 2022-08-24 07:52:11,193 [INFO ] W-9000-model_1-stdout MODEL_LOG - Torch worker started. AllTraffic/i-0b6f78248b097b6c7
1661327531196 2022-08-24 07:52:11,193 [INFO ] W-9000-model_1 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000 AllTraffic/i-0b6f78248b097b6c7
1661327531196 2022-08-24 07:52:11,194 [INFO ] W-9000-model_1-stdout MODEL_LOG - Python runtime: 3.8.10 AllTraffic/i-0b6f78248b097b6c7
1661327531446 2022-08-24 07:52:11,195 [INFO ] W-9000-model_1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000. AllTraffic/i-0b6f78248b097b6c7
1661327531446 2022-08-24 07:52:11,212 [INFO ] W-9000-model_1-stdout MODEL_LOG - model_name: model, batchSize: 1 AllTraffic/i-0b6f78248b097b6c7
1661327531446 2022-08-24 07:52:11,368 [INFO ] W-9000-model_1-stdout MODEL_LOG - Backend worker process died. AllTraffic/i-0b6f78248b097b6c7
1661327531446 2022-08-24 07:52:11,368 [INFO ] epollEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED AllTraffic/i-0b6f78248b097b6c7
1661327531446 2022-08-24 07:52:11,368 [INFO ] W-9000-model_1-stdout MODEL_LOG - Traceback (most recent call last): AllTraffic/i-0b6f78248b097b6c7
1661327531446 2022-08-24 07:52:11,369 [WARN ] W-9000-model_1 org.pytorch.serve.wlm.BatchAggregator - Load model failed: model, error: Worker died. AllTraffic/i-0b6f78248b097b6c7
1661327531446 2022-08-24 07:52:11,371 [WARN ] W-9000-model_1 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-model_1-stderr AllTraffic/i-0b6f78248b097b6c7
1661327531446 2022-08-24 07:52:11,371 [WARN ] W-9000-model_1 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-model_1-stdout AllTraffic/i-0b6f78248b097b6c7
1661327531446 2022-08-24 07:52:11,371 [INFO ] W-9000-model_1 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds. AllTraffic/i-0b6f78248b097b6c7
1661327531446 2022-08-24 07:52:11,371 [INFO ] W-9000-model_1-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py", line 183, in AllTraffic/i-0b6f78248b097b6c7
1661327531696 2022-08-24 07:52:11,372 [INFO ] W-9000-model_1-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-model_1-stdout AllTraffic/i-0b6f78248b097b6c7
1661327531696 2022-08-24 07:52:11,665 [INFO ] W-9000-model_1 ACCESS_LOG - /169.254.178.2:35288 "GET /ping HTTP/1.1" 200 15 AllTraffic/i-0b6f78248b097b6c7
1661327531696 2022-08-24 07:52:11,666 [INFO ] W-9000-model_1 TS_METRICS - Requests2XX.Count:1 #Level:Host
1661327532947 2022-08-24 07:52:11,673 [INFO ] W-9000-model_1-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-model_1-stderr AllTraffic/i-0b6f78248b097b6c7
1661327532947 2022-08-24 07:52:12,892 [INFO ] W-9000-model_1-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000 AllTraffic/i-0b6f78248b097b6c7
1661327532947 2022-08-24 07:52:12,892 [INFO ] W-9000-model_1-stdout MODEL_LOG - [PID]65 AllTraffic/i-0b6f78248b097b6c7
1661327532947 2022-08-24 07:52:12,892 [INFO ] W-9000-model_1-stdout MODEL_LOG - Torch worker started. AllTraffic/i-0b6f78248b097b6c7
1661327532947 2022-08-24 07:52:12,892 [INFO ] W-9000-model_1 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000 AllTraffic/i-0b6f78248b097b6c7
1661327532947 2022-08-24 07:52:12,892 [INFO ] W-9000-model_1-stdout MODEL_LOG - Python runtime: 3.8.10 AllTraffic/i-0b6f78248b097b6c7
1661327532947 2022-08-24 07:52:12,893 [INFO ] W-9000-model_1-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000. AllTraffic/i-0b6f78248b097b6c7
1661327533197 2022-08-24 07:52:12,894 [INFO ] W-9000-model_1-stdout MODEL_LOG - model_name: model, batchSize: 1 AllTraffic/i-0b6f78248b097b6c7
1661327533197 2022-08-24 07:52:13,026 [INFO ] W-9000-model_1-stdout MODEL_LOG - Backend worker process died. AllTraffic/i-0b6f78248b097b6c7
1661327533197 2022-08-24 07:52:13,026 [INFO ] W-9000-model_1-stdout MODEL_LOG - Traceback (most recent call last): AllTraffic/i-0b6f78248b097b6c7
1661327533197 2022-08-24 07:52:13,027 [INFO ] W-9000-model_1-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py", line 183, in AllTraffic/i-0b6f78248b097b6c7
1661327533197 2022-08-24 07:52:13,027 [INFO ] W-9000-model_1-stdout MODEL_LOG - worker.run_server() AllTraffic/i-0b6f78248b097b6c7
```

Installation instructions

I am using SageMaker.

Model Packaging

```python
from sagemaker.pytorch import PyTorchModel

pytorch_model = PyTorchModel(model_data='model.tar.gz',
                             role=role,
                             entry_point='inference.py',
                             framework_version="1.9.0",
                             py_version="py38")
```

config.properties

No response

Versions

framework_version="1.9.0",
py_version="py38"
Torchserve version: 0.4.2
Working on a conda_pytorch_p38 SageMaker notebook instance

Repro instructions

The inference file (inference.py) that I wrote:
```python
import io
import os
from io import BytesIO

import torch
import torch.nn as nn
import torch.nn.functional as F
from PIL import Image


class ConvNormLReLU(nn.Sequential):
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1, pad_mode="reflect", groups=1, bias=False):

        pad_layer = {
            "zero":    nn.ZeroPad2d,
            "same":    nn.ReplicationPad2d,
            "reflect": nn.ReflectionPad2d,
        }
        if pad_mode not in pad_layer:
            raise NotImplementedError

        super(ConvNormLReLU, self).__init__(
            pad_layer[pad_mode](padding),
            nn.Conv2d(in_ch, out_ch, kernel_size=kernel_size, stride=stride, padding=0, groups=groups, bias=bias),
            nn.GroupNorm(num_groups=1, num_channels=out_ch, affine=True),
            nn.LeakyReLU(0.2, inplace=True)
        )


class InvertedResBlock(nn.Module):
    def __init__(self, in_ch, out_ch, expansion_ratio=2):
        super(InvertedResBlock, self).__init__()

        self.use_res_connect = in_ch == out_ch
        bottleneck = int(round(in_ch*expansion_ratio))
        layers = []
        if expansion_ratio != 1:
            layers.append(ConvNormLReLU(in_ch, bottleneck, kernel_size=1, padding=0))

        # dw
        layers.append(ConvNormLReLU(bottleneck, bottleneck, groups=bottleneck, bias=True))
        # pw
        layers.append(nn.Conv2d(bottleneck, out_ch, kernel_size=1, padding=0, bias=False))
        layers.append(nn.GroupNorm(num_groups=1, num_channels=out_ch, affine=True))

        self.layers = nn.Sequential(*layers)

    def forward(self, input):
        out = self.layers(input)
        if self.use_res_connect:
            out = input + out
        return out


class Generator(nn.Module):
    def __init__(self):
        super().__init__()

        self.block_a = nn.Sequential(
            ConvNormLReLU(3,  32, kernel_size=7, padding=3),
            ConvNormLReLU(32, 64, stride=2, padding=(0,1,0,1)),
            ConvNormLReLU(64, 64)
        )

        self.block_b = nn.Sequential(
            ConvNormLReLU(64,  128, stride=2, padding=(0,1,0,1)),
            ConvNormLReLU(128, 128)
        )

        self.block_c = nn.Sequential(
            ConvNormLReLU(128, 128),
            InvertedResBlock(128, 256, 2),
            InvertedResBlock(256, 256, 2),
            InvertedResBlock(256, 256, 2),
            InvertedResBlock(256, 256, 2),
            ConvNormLReLU(256, 128),
        )

        self.block_d = nn.Sequential(
            ConvNormLReLU(128, 128),
            ConvNormLReLU(128, 128)
        )

        self.block_e = nn.Sequential(
            ConvNormLReLU(128, 64),
            ConvNormLReLU(64,  64),
            ConvNormLReLU(64,  32, kernel_size=7, padding=3)
        )

        self.out_layer = nn.Sequential(
            nn.Conv2d(32, 3, kernel_size=1, stride=1, padding=0, bias=False),
            nn.Tanh()
        )

    def forward(self, input, align_corners=True):
        out = self.block_a(input)
        half_size = out.size()[-2:]
        out = self.block_b(out)
        out = self.block_c(out)

        if align_corners:
            out = F.interpolate(out, half_size, mode="bilinear", align_corners=True)
        else:
            out = F.interpolate(out, scale_factor=2, mode="bilinear", align_corners=False)
        out = self.block_d(out)

        if align_corners:
            out = F.interpolate(out, input.size()[-2:], mode="bilinear", align_corners=True)
        else:
            out = F.interpolate(out, scale_factor=2, mode="bilinear", align_corners=False)
        out = self.block_e(out)

        out = self.out_layer(out)
        return out


def model_fn(model_dir):
    """Load the model and return it.
    Providing this function is optional.
    There is a default_model_fn available, which will load the model
    compiled using SageMaker Neo. You can override the default here.
    The model_fn only needs to be defined if your model needs extra
    steps to load, and can otherwise be left undefined.

    Keyword arguments:
    model_dir -- the directory path where the model artifacts are present
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # The compiled model is saved as "model.pt"
    model = Generator()
    model_path = os.path.join(model_dir, 'model.pt')
    with open(os.path.join(model_path, 'model.pt'), 'rb') as f:
        model.load_state_dict(torch.load(f))

    model.to(device).eval()

    return model


def transform_fn(model, request_body, request_content_type='image/', response_content_type='image/'):
    image_format = "png"  # @param ["jpeg", "png"]
    """Run prediction and return the output.
    The function
    1. Pre-processes the input request
    2. Runs prediction
    3. Post-processes the prediction output.
    """
    # preprocess
    img_in = Image.open(io.BytesIO(request_body)).convert("RGB")

    # predict
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    im_out = model(img_in)
    buffer_out = BytesIO()
    im_out.save(buffer_out, format=image_format)
    out = buffer_out.getvalue()

    return out, response_content_type
```
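For comparison, a hypothetical sketch of the three steps the transform_fn docstring describes, assuming the Generator expects a normalized NCHW float tensor in [-1, 1] and returns output in the same range (these conversion details are assumptions, not taken from this issue):

```python
# Hypothetical transform_fn sketch: decode bytes -> tensor, run the model,
# encode the output tensor back to PNG bytes. Value ranges are assumed.
import io

import torch
from PIL import Image
from torchvision import transforms


def transform_fn(model, request_body, request_content_type="application/x-image",
                 response_content_type="image/png"):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # 1. Pre-process: bytes -> PIL image -> NCHW float tensor in [-1, 1]
    img_in = Image.open(io.BytesIO(request_body)).convert("RGB")
    x = transforms.ToTensor()(img_in).unsqueeze(0).to(device) * 2.0 - 1.0

    # 2. Predict
    with torch.no_grad():
        y = model(x)

    # 3. Post-process: tensor in [-1, 1] -> PIL image -> PNG bytes
    y = (y.squeeze(0).clamp(-1.0, 1.0) + 1.0) / 2.0
    img_out = transforms.ToPILImage()(y.cpu())
    buffer_out = io.BytesIO()
    img_out.save(buffer_out, format="PNG")
    return buffer_out.getvalue(), response_content_type
```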

Possible Solution

No response

agunapal self-assigned this Aug 24, 2022
agunapal (Collaborator) commented:

@LiJell I checked with @lxning about this. Please try using version 1.11 instead of 1.9.
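Presumably this just means changing the framework_version argument in the snippet from the issue description, along these lines (the exact version string "1.11.0" is an assumption):

```python
# Same deployment as in the issue description, with the framework version bumped.
pytorch_model = PyTorchModel(model_data='model.tar.gz',
                             role=role,
                             entry_point='inference.py',
                             framework_version="1.11.0",  # was "1.9.0"
                             py_version="py38")
```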

LiJell (Author) commented Aug 25, 2022

> @LiJell I checked with @lxning about this. Please try using version 1.11 instead of 1.9.

Thank you, I will try version 1.11!!

LiJell (Author) commented Aug 25, 2022

@agunapal

I tried version 1.11, but I got more errors ;(
I am a beginner in this AI field, and I am kind of lost even though I am trying hard to find the cause.


```
timestamp message logStreamName
1661392700131 WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance. AllTraffic/i-0ebe254fbdc09af06
1661392700131 2022-08-25T01:58:19,953 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager... AllTraffic/i-0ebe254fbdc09af06
1661392700131 2022-08-25T01:58:20,041 [INFO ] main org.pytorch.serve.ModelServer - AllTraffic/i-0ebe254fbdc09af06
1661392700131 Torchserve version: 0.6.0 AllTraffic/i-0ebe254fbdc09af06
1661392700131 TS Home: /opt/conda/lib/python3.8/site-packages AllTraffic/i-0ebe254fbdc09af06
1661392700131 Current directory: / AllTraffic/i-0ebe254fbdc09af06
1661392700131 Temp directory: /home/model-server/tmp AllTraffic/i-0ebe254fbdc09af06
1661392700131 Number of GPUs: 1 AllTraffic/i-0ebe254fbdc09af06
1661392700131 Number of CPUs: 1 AllTraffic/i-0ebe254fbdc09af06
1661392700131 Max heap size: 3234 M AllTraffic/i-0ebe254fbdc09af06
1661392700131 Python executable: /opt/conda/bin/python3.8 AllTraffic/i-0ebe254fbdc09af06
1661392700131 Config file: /etc/sagemaker-ts.properties AllTraffic/i-0ebe254fbdc09af06
1661392700131 Inference address: http://0.0.0.0:8080 AllTraffic/i-0ebe254fbdc09af06
1661392700131 Management address: http://0.0.0.0:8080 AllTraffic/i-0ebe254fbdc09af06
1661392700131 Metrics address: http://127.0.0.1:8082 AllTraffic/i-0ebe254fbdc09af06
1661392700131 Model Store: /.sagemaker/ts/models AllTraffic/i-0ebe254fbdc09af06
1661392700131 Initial Models: model=/opt/ml/model AllTraffic/i-0ebe254fbdc09af06
1661392700131 Log dir: /logs AllTraffic/i-0ebe254fbdc09af06
1661392700131 Metrics dir: /logs AllTraffic/i-0ebe254fbdc09af06
1661392700131 Netty threads: 0 AllTraffic/i-0ebe254fbdc09af06
1661392700131 Netty client threads: 0 AllTraffic/i-0ebe254fbdc09af06
1661392700131 Default workers per model: 1 AllTraffic/i-0ebe254fbdc09af06
1661392700131 Blacklist Regex: N/A AllTraffic/i-0ebe254fbdc09af06
1661392700131 Maximum Response Size: 6553500 AllTraffic/i-0ebe254fbdc09af06
1661392700131 Maximum Request Size: 6553500 AllTraffic/i-0ebe254fbdc09af06
1661392700131 Limit Maximum Image Pixels: true AllTraffic/i-0ebe254fbdc09af06
1661392700131 Prefer direct buffer: false AllTraffic/i-0ebe254fbdc09af06
1661392700131 Allowed Urls: [file://.* http(s)?://.*]
1661392700131 Custom python dependency for model allowed: false AllTraffic/i-0ebe254fbdc09af06
1661392700131 Metrics report format: prometheus AllTraffic/i-0ebe254fbdc09af06
1661392700131 Enable metrics API: true AllTraffic/i-0ebe254fbdc09af06
1661392700131 Workflow Store: /.sagemaker/ts/models AllTraffic/i-0ebe254fbdc09af06
1661392700131 Model config: N/A AllTraffic/i-0ebe254fbdc09af06
1661392700131 2022-08-25T01:58:20,049 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin... AllTraffic/i-0ebe254fbdc09af06
1661392700131 2022-08-25T01:58:20,073 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: /opt/ml/model AllTraffic/i-0ebe254fbdc09af06
1661392700131 2022-08-25T01:58:20,077 [WARN ] main org.pytorch.serve.archive.model.ModelArchive - Model archive version is not defined. Please upgrade to torch-model-archiver 0.2.0 or higher AllTraffic/i-0ebe254fbdc09af06
1661392700131 2022-08-25T01:58:20,077 [WARN ] main org.pytorch.serve.archive.model.ModelArchive - Model archive createdOn is not defined. Please upgrade to torch-model-archiver 0.2.0 or higher AllTraffic/i-0ebe254fbdc09af06
1661392700131 2022-08-25T01:58:20,080 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model model loaded. AllTraffic/i-0ebe254fbdc09af06
1661392700381 2022-08-25T01:58:20,091 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel. AllTraffic/i-0ebe254fbdc09af06
1661392700381 2022-08-25T01:58:20,169 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8080 AllTraffic/i-0ebe254fbdc09af06
1661392700381 2022-08-25T01:58:20,170 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel. AllTraffic/i-0ebe254fbdc09af06
1661392700631 2022-08-25T01:58:20,173 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082 AllTraffic/i-0ebe254fbdc09af06
1661392700631 Model server started. AllTraffic/i-0ebe254fbdc09af06
1661392700631 2022-08-25T01:58:20,424 [WARN ] pool-3-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet. AllTraffic/i-0ebe254fbdc09af06
1661392700631 2022-08-25T01:58:20,504 [ERROR] Thread-1 org.pytorch.serve.metrics.MetricCollector - Traceback (most recent call last): File "ts/metrics/metric_collector.py", line 27, in system_metrics.collect_all(sys.modules['ts.metrics.system_metrics'], arguments.gpu) File "/opt/conda/lib/python3.8/site-packages/ts/metrics/system_metrics.py", line 91, in collect_all value(num_of_gpu) File "/opt/conda/lib/python3.8/site-packages/ts/metrics/system_metrics.py", line 61, in gpu_utilization import nvgpu AllTraffic/i-0ebe254fbdc09af06
1661392701132 ModuleNotFoundError: No module named 'nvgpu' AllTraffic/i-0ebe254fbdc09af06
1661392701132 2022-08-25T01:58:20,894 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000 AllTraffic/i-0ebe254fbdc09af06
1661392701132 2022-08-25T01:58:20,894 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - [PID]35 AllTraffic/i-0ebe254fbdc09af06
1661392701132 2022-08-25T01:58:20,895 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Torch worker started. AllTraffic/i-0ebe254fbdc09af06
1661392701132 2022-08-25T01:58:20,895 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Python runtime: 3.8.13 AllTraffic/i-0ebe254fbdc09af06
1661392701132 2022-08-25T01:58:20,902 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000 AllTraffic/i-0ebe254fbdc09af06
1661392701132 2022-08-25T01:58:20,909 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000. AllTraffic/i-0ebe254fbdc09af06
1661392701132 2022-08-25T01:58:20,912 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1661392700911 AllTraffic/i-0ebe254fbdc09af06
1661392701632 2022-08-25T01:58:20,939 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - model_name: model, batchSize: 1 AllTraffic/i-0ebe254fbdc09af06
1661392701632 2022-08-25T01:58:21,506 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Backend worker process died. AllTraffic/i-0ebe254fbdc09af06
1661392701632 2022-08-25T01:58:21,507 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Traceback (most recent call last): AllTraffic/i-0ebe254fbdc09af06
1661392701632 2022-08-25T01:58:21,507 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py", line 210, in AllTraffic/i-0ebe254fbdc09af06
1661392701632 2022-08-25T01:58:21,507 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - worker.run_server() AllTraffic/i-0ebe254fbdc09af06
1661392701632 2022-08-25T01:58:21,507 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py", line 181, in run_server AllTraffic/i-0ebe254fbdc09af06
1661392701632 2022-08-25T01:58:21,508 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - self.handle_connection(cl_socket) AllTraffic/i-0ebe254fbdc09af06
1661392701632 2022-08-25T01:58:21,508 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py", line 139, in handle_connection AllTraffic/i-0ebe254fbdc09af06
1661392701632 2022-08-25T01:58:21,508 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - service, result, code = self.load_model(msg) AllTraffic/i-0ebe254fbdc09af06
1661392701632 2022-08-25T01:58:21,509 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py", line 104, in load_model AllTraffic/i-0ebe254fbdc09af06
1661392701632 2022-08-25T01:58:21,509 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - service = model_loader.load( AllTraffic/i-0ebe254fbdc09af06
1661392701632 2022-08-25T01:58:21,509 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED AllTraffic/i-0ebe254fbdc09af06
1661392701632 2022-08-25T01:58:21,509 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/ts/model_loader.py", line 151, in load AllTraffic/i-0ebe254fbdc09af06
1661392701632 2022-08-25T01:58:21,510 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - initialize_fn(service.context) AllTraffic/i-0ebe254fbdc09af06
1661392701632 2022-08-25T01:58:21,510 [WARN ] W-9000-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: model, error: Worker died. AllTraffic/i-0ebe254fbdc09af06
1661392701632 2022-08-25T01:58:21,510 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/sagemaker_pytorch_serving_container/handler_service.py", line 51, in initialize AllTraffic/i-0ebe254fbdc09af06
1661392701632 2022-08-25T01:58:21,511 [WARN ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-model_1.0-stderr AllTraffic/i-0ebe254fbdc09af06
1661392701632 2022-08-25T01:58:21,511 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - super().initialize(context) AllTraffic/i-0ebe254fbdc09af06
1661392701632 2022-08-25T01:58:21,513 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/sagemaker_inference/default_handler_service.py", line 66, in initialize AllTraffic/i-0ebe254fbdc09af06
1661392701632 2022-08-25T01:58:21,513 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - self._service.validate_and_initialize(model_dir=model_dir) AllTraffic/i-0ebe254fbdc09af06
1661392701632 2022-08-25T01:58:21,511 [WARN ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-model_1.0-stdout AllTraffic/i-0ebe254fbdc09af06
1661392701632 2022-08-25T01:58:21,514 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/sagemaker_inference/transformer.py", line 162, in validate_and_initialize AllTraffic/i-0ebe254fbdc09af06
1661392701632 2022-08-25T01:58:21,515 [INFO ] W-9000-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-model_1.0-stdout AllTraffic/i-0ebe254fbdc09af06
1661392701632 2022-08-25T01:58:21,516 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds. AllTraffic/i-0ebe254fbdc09af06
1661392702132 2022-08-25T01:58:21,548 [INFO ] W-9000-model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-model_1.0-stderr AllTraffic/i-0ebe254fbdc09af06
1661392702132 2022-08-25T01:58:22,051 [INFO ] W-9000-model_1.0 ACCESS_LOG - /169.254.178.2:60708 "GET /ping HTTP/1.1" 200 15 AllTraffic/i-0ebe254fbdc09af06
1661392703383 2022-08-25T01:58:22,052 [INFO ] W-9000-model_1.0 TS_METRICS - Requests2XX.Count:1 #Level:Host
1661392703383 2022-08-25T01:58:23,230 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000 AllTraffic/i-0ebe254fbdc09af06
1661392703383 2022-08-25T01:58:23,231 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - [PID]59 AllTraffic/i-0ebe254fbdc09af06
1661392703383 2022-08-25T01:58:23,231 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Torch worker started. AllTraffic/i-0ebe254fbdc09af06
1661392703383 2022-08-25T01:58:23,232 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000 AllTraffic/i-0ebe254fbdc09af06
1661392703383 2022-08-25T01:58:23,232 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Python runtime: 3.8.13 AllTraffic/i-0ebe254fbdc09af06
1661392703383 2022-08-25T01:58:23,234 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000. AllTraffic/i-0ebe254fbdc09af06
1661392703383 2022-08-25T01:58:23,234 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1661392703234 AllTraffic/i-0ebe254fbdc09af06
1661392703633 2022-08-25T01:58:23,235 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - model_name: model, batchSize: 1 AllTraffic/i-0ebe254fbdc09af06
1661392703633 2022-08-25T01:58:23,447 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Backend worker process died. AllTraffic/i-0ebe254fbdc09af06
1661392703633 2022-08-25T01:58:23,450 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Traceback (most recent call last): AllTraffic/i-0ebe254fbdc09af06
1661392703633 2022-08-25T01:58:23,451 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py", line 210, in AllTraffic/i-0ebe254fbdc09af06
1661392703633 2022-08-25T01:58:23,451 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - worker.run_server() AllTraffic/i-0ebe254fbdc09af06
1661392703633 2022-08-25T01:58:23,448 [INFO ] epollEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED AllTraffic/i-0ebe254fbdc09af06
1661392703633 2022-08-25T01:58:23,452 [WARN ] W-9000-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: model, error: Worker died. AllTraffic/i-0ebe254fbdc09af06
1661392703633 2022-08-25T01:58:23,453 [WARN ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-model_1.0-stderr AllTraffic/i-0ebe254fbdc09af06
1661392703633 2022-08-25T01:58:23,454 [WARN ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-model_1.0-stdout AllTraffic/i-0ebe254fbdc09af06
1661392703633 2022-08-25T01:58:23,454 [INFO ] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds. AllTraffic/i-0ebe254fbdc09af06
1661392703883 2022-08-25T01:58:23,455 [INFO ] W-9000-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-model_1.0-stdout AllTraffic/i-0ebe254fbdc09af06
1661392705384 2022-08-25T01:58:23,697 [INFO ] W-9000-model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-model_1.0-stderr AllTraffic/i-0ebe254fbdc09af06
1661392705384 2022-08-25T01:58:25,177 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000 AllTraffic/i-0ebe254fbdc09af06
```

LiJell (Author) commented Aug 25, 2022

@agunapal

I am posting inference.py here again just in case; it is the same file as in the issue description above. I think I might have written inference.py incorrectly, since I still do not fully understand it.

HamidShojanazeri (Collaborator) commented:

@LiJell For 1.9 it is not clear from the logs why it's failing to load the model; I wonder if there is any further pointer in the log traces showing the exact point where it fails. I am guessing it could be a path issue: are you following this doc, and does your model artifact live in an S3 bucket?

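For context, a minimal sketch of one common way to package the artifact for the SageMaker PyTorch container; the layout (model.pt at the archive root, inference.py under code/) and the upload call are assumptions about this setup, not something confirmed in the thread:

```python
# Hypothetical packaging sketch: build model.tar.gz locally, upload it to S3,
# and pass the returned S3 URI as model_data= to PyTorchModel.
import tarfile

import sagemaker

with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("model.pt", arcname="model.pt")
    tar.add("inference.py", arcname="code/inference.py")

model_data = sagemaker.Session().upload_data("model.tar.gz", key_prefix="model")
```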
It seems that with 1.11 it is failing on importing nvgpu: "packages/ts/metrics/system_metrics.py", line 61, in gpu_utilization import nvgpu".
I think we should have updated the Docker containers for the nvgpu issue; the workaround is to use a custom container (here is an example) or to install nvgpu in your script before importing it. For the Docker nvgpu issue, cc @lxning.
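A minimal sketch of the "install nvgpu in your script before importing it" workaround mentioned above, placed at the top of the entry-point script; this is one assumed way to apply it, not a confirmed fix for the SageMaker container:

```python
# Hypothetical workaround: make nvgpu importable before TorchServe's metrics
# collector tries to import it.
import subprocess
import sys

try:
    import nvgpu  # noqa: F401
except ImportError:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "nvgpu"])
    import nvgpu  # noqa: F401
```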

HamidShojanazeri added the triaged_wait (Waiting for the Reporter's response) label Aug 26, 2022
LiJell (Author) commented Aug 29, 2022

> @LiJell For 1.9 it is not clear from the logs why it's failing to load the model; I wonder if there is any further pointer in the log traces showing the exact point where it fails. I am guessing it could be a path issue: are you following this doc, and does your model artifact live in an S3 bucket?
>
> It seems that with 1.11 it is failing on importing nvgpu: "packages/ts/metrics/system_metrics.py", line 61, in gpu_utilization import nvgpu". I think we should have updated the Docker containers for the nvgpu issue; the workaround is to use a custom container (here is an example) or to install nvgpu in your script before importing it. For the Docker nvgpu issue, cc @lxning.

Thank you for your reply!

I have already tried installing nvgpu and importing it, but I will try again since I might have made a mistake.

Actually, I uploaded the model.pt file to the Jupyter notebook and converted it into tar.gz format in the notebook that belongs to the S3 bucket, but I will follow the doc and upload it the way it should be done.

Thank you again!!

LiJell (Author) commented Aug 30, 2022

> @LiJell For 1.9 it is not clear from the logs why it's failing to load the model; I wonder if there is any further pointer in the log traces showing the exact point where it fails. I am guessing it could be a path issue: are you following this doc, and does your model artifact live in an S3 bucket?
>
> It seems that with 1.11 it is failing on importing nvgpu: "packages/ts/metrics/system_metrics.py", line 61, in gpu_utilization import nvgpu". I think we should have updated the Docker containers for the nvgpu issue; the workaround is to use a custom container (here is an example) or to install nvgpu in your script before importing it. For the Docker nvgpu issue, cc @lxning.

Hi @HamidShojanazeri, there was an improvement after restructuring the model.tar.gz file; maybe there was a mistake in it before.
However, there are still errors. I am trying a couple of things to resolve the errors and warnings.
I will share the logs again really soon.

By the way, the nvgpu error still comes up depending on the framework version.
When I use 1.9.0 it looks fine with nvgpu.

Thank you!!

LiJell (Author) commented Sep 1, 2022

Hi @HamidShojanazeri, the model loads well now!!! But I still have a problem with inference.py, so I think it will be fine once I fix it the right way.

Thank you for your help :) @HamidShojanazeri @agunapal

justmul commented Sep 2, 2022

I'm seeing this same error:

```
2022-09-02T23:21:06,034 [ERROR] Thread-77 org.pytorch.serve.metrics.MetricCollector - Traceback (most recent call last): File "ts/metrics/metric_collector.py", line 27, in system_metrics.collect_all(sys.modules['ts.metrics.system_metrics'], arguments.gpu) File "/opt/conda/lib/python3.8/site-packages/ts/metrics/system_metrics.py", line 91, in collect_all value(num_of_gpu) File "/opt/conda/lib/python3.8/site-packages/ts/metrics/system_metrics.py", line 61, in gpu_utilization import nvgpu
2022-09-02T16:21:09.155-07:00 ModuleNotFoundError: No module named 'nvgpu'
```

Building with:

```python
pytorch_model = PyTorchModel(
    model_data=s3_location,
    role=role,
    framework_version="1.12",
    py_version="py38",
    source_dir="my_model/code",
    entry_point='inference.py')

predictor = pytorch_model.deploy(instance_type='ml.g4dn.xlarge', initial_instance_count=1)
```

I tried adding nvgpu to the imports in my requirements.txt file, but that doesn't seem to make a difference. I'm not sure what metrics the error is referring to.

justmul commented Sep 2, 2022

I've also tried framework version 1.10 and had the same error.

justmul commented Sep 3, 2022

@lxning I also still have this error with framework version 1.9. Any advice on a workaround or what might be missing?

LiJell (Author) commented Sep 5, 2022

> @lxning I also still have this error with framework version 1.9. Any advice on a workaround or what might be missing?

I am a beginner, but I would like to share my experience.

I still do not get why the nvgpu error occurred, but @HamidShojanazeri said it is a Docker container issue.
Try a different framework version.

Even though I installed nvgpu and imported it just like you did, CloudWatch said "there is no module named nvgpu" until the framework version matched mine.

cheers!

lxning (Collaborator) commented Sep 7, 2022

@LiJell TorchServe doesn't own the SageMaker Docker container. Please file a ticket with AWS SageMaker if you are using TorchServe via SageMaker.

LiJell (Author) commented Sep 14, 2022

> @LiJell TorchServe doesn't own the SageMaker Docker container. Please file a ticket with AWS SageMaker if you are using TorchServe via SageMaker.

Okay!! Thank you for your help!! I will ask this question in the right place.
Thank you again!!

LiJell closed this as completed Sep 14, 2022