Clean up resources after worker thread is terminated #915
Conversation
I have left some comments. Most of them are questions.
frontend/server/src/main/java/com/amazonaws/ml/mms/wlm/WorkLoadManager.java (resolved comment thread)
@Override
public void run() {
    try (Scanner scanner = new Scanner(is, StandardCharsets.UTF_8.name())) {
-       while (scanner.hasNext()) {
+       while (isRunning && scanner.hasNext()) {
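For context, a minimal sketch of the flag-based shutdown pattern this diff implies. The class shape, the terminate() method, and everything other than the isRunning check are illustrative only, not the actual ReaderThread implementation:

import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Illustrative reader thread: drains a process stream until told to stop.
public class StreamReader implements Runnable {

    private final InputStream is;
    // volatile so a termination request from another thread becomes visible here
    private volatile boolean isRunning = true;

    public StreamReader(InputStream is) {
        this.is = is;
    }

    // Called by the owner (e.g. during worker shutdown) to stop the read loop.
    public void terminate() {
        isRunning = false;
    }

    @Override
    public void run() {
        try (Scanner scanner = new Scanner(is, StandardCharsets.UTF_8.name())) {
            while (isRunning && scanner.hasNext()) {
                // Real code would forward the line to a logger instead of stdout.
                System.out.println(scanner.nextLine());
            }
        }
    }
}

Note that hasNext() blocks on the underlying stream, so flipping the flag only takes effect once the next token arrives; closing the stream itself (as lifeCycle.terminateIOStreams() does later in this PR) is what actually unblocks a reader stuck in hasNext().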
Adding a flag looks a little hacky.
@@ -1312,7 +1312,7 @@ private void testLoggingUnload(Channel inferChannel, Channel mgmtChannel)
    Scanner logscanner = new Scanner(logfile, "UTF-8");
    while (logscanner.hasNextLine()) {
        String line = logscanner.nextLine();
-       if (line.contains("LoggingService exit")) {
+       if (line.contains("Model logging unregistered")) {
What's this change? I don't see a corresponding change in the source file. Curious to know how this log line gets into the logfile.
This line was being logged from the logging-model through the ReaderThread for stdout. But since we are now terminating the ReaderThreads before deleting a model, it is no longer logged. I replaced it with the last logging statement emitted when the model is unregistered.
logger.debug("Terminating IOStreams for worker thread shutdown");
lifeCycle.terminateIOStreams();
try {
    if (out != null) {
When would we have out or error equal to null? Shouldn't they always be created?
It will be null in the case of the server thread, or if an exception is thrown from runWorker before the files are created.
        err.close();
    }
} catch (IOException e) {
    logger.error("Failed to close IO file handles", e);
What's the cleanup process if we fail to close the IO streams?
Not sure I can do anything additional here. Let me know if you have any suggestions.
I was thinking of using closeQuietly, but that is deprecated.
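For what it's worth, one option is a small helper that closes each handle independently, so a failure on one does not prevent the others from being closed, and simply logs the error. This is only a sketch: the class and method names are made up, and it assumes an slf4j logger like the rest of the frontend.

import java.io.Closeable;
import java.io.IOException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative helper: close each handle on its own and log failures,
// roughly what the deprecated closeQuietly used to do.
final class IoCleanup {

    private static final Logger logger = LoggerFactory.getLogger(IoCleanup.class);

    private IoCleanup() {}

    static void closeAll(Closeable... closeables) {
        for (Closeable c : closeables) {
            if (c == null) {
                continue; // e.g. the server thread never created out/err
            }
            try {
                c.close();
            } catch (IOException e) {
                logger.error("Failed to close IO file handle", e);
            }
        }
    }
}

At shutdown this would be called as IoCleanup.closeAll(out, err); beyond logging, there is not much more the worker can do once close() has failed.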
@@ -113,6 +113,9 @@ public int getNumRunningWorkers(String modelName) {
    if (minWorker == 0) {
        threads = workers.remove(model.getModelName());
        if (threads == null) {
+           if (maxWorker == 0) {
I am not sure the maxWorker == 0 check is correct. Could you test the following sequence (sketched below):
- Register a model (no initial workers)
- Scale up the workers (maybe to 2 workers)
- Scale down the workers to 0
- Scale up the workers again (maybe to 2 workers)
minWorker and maxWorker are always the same number. So, when we scale down to 0, according to this change we would be removing the server thread. When will the server thread be created again? It is initially created during registration of the model.
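A sketch of that repro sequence, driven through the management API. It assumes the default management address 127.0.0.1:8081, the register and scale-workers endpoints (POST /models, PUT /models/{model_name}?min_worker=...), and a placeholder noop.mar archive, so treat the exact URLs and parameters as assumptions rather than a verified script.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Drives the sequence: register with no initial workers, scale to 2, to 0, back to 2.
public class ScaleRepro {

    private static final String MGMT = "http://127.0.0.1:8081";
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    public static void main(String[] args) throws Exception {
        // 1. Register a model with no initial workers (model URL is a placeholder).
        send("POST", MGMT + "/models?url=noop.mar&model_name=noop&initial_workers=0&synchronous=true");
        // 2. Scale up the workers to 2.
        send("PUT", MGMT + "/models/noop?min_worker=2&synchronous=true");
        // 3. Scale down the workers to 0.
        send("PUT", MGMT + "/models/noop?min_worker=0&synchronous=true");
        // 4. Scale back up to 2 -- the step that currently fails (#916).
        send("PUT", MGMT + "/models/noop?min_worker=2&synchronous=true");
    }

    private static void send(String method, String uri) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(uri))
                .method(method, HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<String> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(method + " " + uri + " -> " + response.statusCode());
    }
}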
This was failing even before the fix. If you initialize a model with workers > 0, scale down to 0, and scale up again, we get an exception because the server thread is null. So essentially, there is no change in behaviour.
I can create a separate issue to track this and will send out another PR.
Created a separate issue to fix this bug #916
Thanks for raising the issue and tagging it as a bug. Without this, scaling down to 0 and scaling back up is broken; the only workaround would be DELETE /models and then POST /models.
Approving the PR considering the following:
- This fix is a must-have to address some of the long-running stress test failures.
- #916 (Scaling down workers to 0 and back up again throws exception) would be fixed soon as well.
Description of changes:
Terminate the worker's ReaderThreads and close the stdout/stderr file handles when a worker thread is terminated, so resources are cleaned up after shutdown or model unregistration.
Testing done (Ubuntu 18.04):
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.