Ignore zombie processes when detecting TorchServe status #166
Conversation
@namannandan should we just check the process status rather than swallowing the exception?
Thanks @visinfo, that makes sense. Updated the PR.
I'm currently facing this exact issue when trying to deploy a PyTorch model in AWS SageMaker using torch==2.2.0. When will this fix be available for deploying models?
As for @adrien-code-it, I also tried on a new model and @namannandan, @visinfo is there something we need to do to deploy using the update? Or when will it be distributed to all instances?
@5agado I was able to deploy my model by adding a Although it's not a permanent solution (I would prefer pulling a fixed version, not the latest), it's working as of now.
@adrien-code-it are you deploying the model as an endpoint, or using it in batch-transform?
@5agado the fix in For batch-transform, unfortunately I didn't see any fix working...
Description of changes:
When checking to see if the TorchServe process is running, we iterate through the current list of running processes using `psutil`:
sagemaker-pytorch-inference-toolkit/src/sagemaker_pytorch_serving_container/torchserve.py
Lines 183 to 188 in 36a842e
Calling the `cmdline()` psutil API on a zombie process raises the `psutil.ZombieProcess` exception. This unhandled exception causes TorchServe to be stopped, which is not expected behavior in DLC: https://github.com/aws/deep-learning-containers/tree/master/pytorch/inference
We can ignore zombie processes when detecting the presence of a running TorchServe process. Reference: https://psutil.readthedocs.io/en/latest/#psutil.ZombieProcess
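As a rough illustration of the idea (a sketch, not the actual toolkit code at the lines referenced above), the detection loop could skip zombies by checking the process status first, as suggested in the review, and still guard against a process becoming a zombie or exiting between the two calls. The function name `find_torchserve_processes`, the `processes` parameter, and the namespace string `org.pytorch.serve.ModelServer` are assumptions made for this example.

```python
import psutil

# Assumed marker that identifies the TorchServe JVM in a command line;
# the real toolkit may use a different constant.
TS_NAMESPACE = "org.pytorch.serve.ModelServer"


def find_torchserve_processes(namespace=TS_NAMESPACE, processes=None):
    """Return the non-zombie processes whose command line contains `namespace`.

    `processes` is a hypothetical hook for passing in process-like objects;
    by default every process reported by psutil is scanned.
    """
    if processes is None:
        processes = psutil.process_iter()

    found = []
    for process in processes:
        try:
            # Skip zombies up front instead of relying on the exception.
            if process.status() == psutil.STATUS_ZOMBIE:
                continue
            # cmdline() returns the argument list, e.g. ["java", "-cp", ...].
            if namespace in process.cmdline():
                found.append(process)
        except (psutil.ZombieProcess, psutil.NoSuchProcess):
            # The process turned into a zombie or exited mid-scan; ignore it.
            continue
    return found


if __name__ == "__main__":
    running = find_torchserve_processes()
    print(f"TorchServe processes found: {len(running)}")
```

Checking `status()` first keeps the common case free of exception handling, while the `except` clause covers the race where a process changes state between the `status()` and `cmdline()` calls.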
Tests:
TorchServe continues to run and the container does not get terminated.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.