
Clean up resources after worker thread is terminated #915


Merged 2 commits on May 19, 2020

Conversation

@maaquib (Contributor) commented May 13, 2020

Description of changes:

  • Terminate the err and out ReaderThreads after their corresponding WorkerThread has been terminated
  • Clean up FDs associated with a terminated WorkerThread
  • Pin the pytest version in CI tests to avoid the error fixture is being applied more than once to the same function

Testing done (Ubuntu 18.04):

  • Stress tested for 450,000 iterations of model load and unload
  • CI tests successful

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@maaquib maaquib force-pushed the stress_test branch 3 times, most recently from e85da68 to 8530102 Compare May 14, 2020 00:25
@maaquib maaquib marked this pull request as ready for review May 14, 2020 20:07
@maaquib maaquib requested review from vdantu and mycpuorg May 14, 2020 20:21
@vdantu (Contributor) left a comment:

I have left some comments. Most of them are questions.

@Override
public void run() {
    try (Scanner scanner = new Scanner(is, StandardCharsets.UTF_8.name())) {
-       while (scanner.hasNext()) {
+       while (isRunning && scanner.hasNext()) {
@vdantu (Contributor):
Adding flag looks a little hacky.

@maaquib (Contributor, Author):
Did some reading up on this (ref). It seems like using a flag is indeed a cleaner way to do this; I'll change the boolean to an AtomicBoolean though. Also, this is using an isRunning flag too.
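The AtomicBoolean variant discussed above could look roughly like the following sketch. This is illustrative only, not the project's actual code: the class and method names (ReaderLoopSketch, terminate, getLinesRead) are hypothetical, and the real ReaderThread does more than count lines.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch of a reader loop guarded by an AtomicBoolean flag.
class ReaderLoopSketch implements Runnable {
    private final InputStream is;
    private final AtomicBoolean isRunning = new AtomicBoolean(true);
    private int linesRead;

    ReaderLoopSketch(InputStream is) {
        this.is = is;
    }

    // Called from the worker's shutdown path; AtomicBoolean gives the
    // necessary visibility guarantee across threads.
    void terminate() {
        isRunning.set(false);
    }

    int getLinesRead() {
        return linesRead;
    }

    @Override
    public void run() {
        try (Scanner scanner = new Scanner(is, StandardCharsets.UTF_8.name())) {
            // Re-check the flag before each line so the loop exits promptly
            // once terminate() has been called.
            while (isRunning.get() && scanner.hasNextLine()) {
                scanner.nextLine();
                linesRead++;
            }
        }
    }
}
```

A plain boolean field would not be guaranteed visible to the reader thread without synchronization; AtomicBoolean (or a volatile boolean) avoids that pitfall, which is presumably why the author prefers it here.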

@@ -1312,7 +1312,7 @@ private void testLoggingUnload(Channel inferChannel, Channel mgmtChannel)
Scanner logscanner = new Scanner(logfile, "UTF-8");
while (logscanner.hasNextLine()) {
String line = logscanner.nextLine();
-   if (line.contains("LoggingService exit")) {
+   if (line.contains("Model logging unregistered")) {
@vdantu (Contributor):
What's this change? I don't see a corresponding change in the source file. Curious to know how this log line gets into the logfile.

@maaquib (Contributor, Author):
This line was being logged from the logging-model through the ReaderThread attached to stdout. But since we now terminate the ReaderThreads before deleting a model, it is no longer logged. Replaced it with the last statement logged when a model is unregistered.
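For reference, the test's log check in the diff above amounts to scanning a log for a marker line. A minimal standalone sketch of that pattern (the helper class and method names are hypothetical, and a Reader stands in for the test's file handle):

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Scanner;

// Illustrative helper: true if any line of the log contains the marker string.
class LogScanSketch {
    static boolean logContains(Reader log, String marker) {
        try (Scanner scanner = new Scanner(log)) {
            while (scanner.hasNextLine()) {
                if (scanner.nextLine().contains(marker)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

The PR's test change, as described, only swaps the marker string it looks for: from a line emitted by the now-terminated ReaderThread to the unregister log statement.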

logger.debug("Terminating IOStreams for worker thread shutdown");
lifeCycle.terminateIOStreams();
try {
if (out != null) {
@vdantu (Contributor):
When would we have out or err equal to null? Shouldn't they always be created?

@maaquib (Contributor, Author):
They will be null in the case of the server thread, or if an exception is thrown from runWorker before the files are created.

err.close();
}
} catch (IOException e) {
logger.error("Failed to close IO file handles", e);
@vdantu (Contributor):
What's the cleanup process if we fail to close the IO streams?

@maaquib (Contributor, Author):
Not sure I can do anything additional here; let me know if you have any suggestions. I was thinking of using closeQuietly, but that is deprecated.
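One non-deprecated alternative to the old Commons IO closeQuietly is a small local helper that tolerates null streams (covering the server-thread case discussed above) and swallows close failures rather than propagating them. A sketch, with hypothetical names (QuietCloser, closeAll); in the real code one would log each failure instead of just counting:

```java
import java.io.Closeable;
import java.io.IOException;

// Illustrative helper: close each resource independently so one failure
// doesn't skip the rest; skip nulls such as streams that were never created.
final class QuietCloser {
    private QuietCloser() {}

    static int closeAll(Closeable... resources) {
        int failures = 0;
        for (Closeable c : resources) {
            if (c == null) {
                continue; // e.g. out/err on the server thread
            }
            try {
                c.close();
            } catch (IOException e) {
                failures++; // real code: logger.error("Failed to close IO file handles", e)
            }
        }
        return failures;
    }
}
```

As the author notes, there is little meaningful recovery once close() fails; logging and moving on (so the remaining handles still get closed) is the usual practice.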

@maaquib maaquib requested a review from vdantu May 15, 2020 22:08
@@ -113,6 +113,9 @@ public int getNumRunningWorkers(String modelName) {
    if (minWorker == 0) {
        threads = workers.remove(model.getModelName());
-       if (threads == null) {
+       if (maxWorker == 0) {
@vdantu (Contributor):
I am not sure the maxWorker == 0 check is correct. Could you test the following sequence:

  1. Register a model (no initial workers)
  2. Scale up the workers (maybe to 2 workers)
  3. Scale down the workers to 0.
  4. Scale up the workers again (maybe to 2 workers).
    minWorkers and maxWorkers are always the same number, so when we scale down to 0, according to this change we would be removing the server thread. When will the server thread be created again? It is initially created during registration of the model.

@maaquib (Contributor, Author):
This was failing even before the fix. If you initialize a model with workers > 0, scale down to 0, and scale up again, we get an exception because the server thread is null. So essentially, there is no change in behaviour.
I can create a separate issue to track this and will send out another PR.

@maaquib (Contributor, Author):
Created a separate issue to fix this bug #916

@vdantu (Contributor) commented May 18, 2020:
Thanks for raising the issue and tagging it as a bug. Without this, scaling down to 0 and scaling back up is broken; the only workaround would be DELETE /models followed by POST /models.

@maaquib maaquib requested a review from vdantu May 18, 2020 22:56
@vdantu (Contributor) left a comment:

Approving the PR considering the following:

  1. This fix is a must-have to address some of the long running stress test failures.
  2. #916 (scaling workers down to 0 and back up again throws an exception) will be fixed soon as well.
