
Clean up resources after worker thread is terminated #915


Merged 2 commits on May 19, 2020

Conversation

@maaquib (Contributor) commented May 13, 2020

Description of changes:

  • Terminate the err and out ReaderThreads after their corresponding WorkerThread has been terminated
  • Clean up FDs associated with a terminated WorkerThread
  • Pin the pytest version in CI tests to avoid the error fixture is being applied more than once to the same function

Testing done (Ubuntu 18.04):

  • Stress tested for 450,000 iterations of model load and unload
  • CI tests successful

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@maaquib maaquib force-pushed the stress_test branch 3 times, most recently from e85da68 to 8530102 Compare May 14, 2020 00:25
@maaquib maaquib marked this pull request as ready for review May 14, 2020 20:07
@maaquib maaquib requested review from vdantu and mycpuorg May 14, 2020 20:21
@vdantu (Contributor) left a comment:

I have left some comments. Most of them are questions.

@Override
public void run() {
    try (Scanner scanner = new Scanner(is, StandardCharsets.UTF_8.name())) {
-       while (scanner.hasNext()) {
+       while (isRunning && scanner.hasNext()) {
@vdantu (Contributor):
Adding flag looks a little hacky.

@maaquib (Contributor, Author):
Did some reading up on this (ref). It seems like using a flag is indeed a cleaner way to do this; I'll change the boolean to an AtomicBoolean though. Also, this is using an isRunning flag too.
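The AtomicBoolean variant discussed above could look roughly like the following sketch. This is illustrative only, not the project's actual code: the class and method names (ReaderLoopSketch, terminate, getLinesRead) are hypothetical, and the real ReaderThread does more than count lines.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch of a reader loop guarded by an AtomicBoolean flag.
class ReaderLoopSketch implements Runnable {
    private final InputStream is;
    private final AtomicBoolean isRunning = new AtomicBoolean(true);
    private int linesRead;

    ReaderLoopSketch(InputStream is) {
        this.is = is;
    }

    // Called from the worker's shutdown path; AtomicBoolean gives the
    // necessary visibility guarantee across threads.
    void terminate() {
        isRunning.set(false);
    }

    int getLinesRead() {
        return linesRead;
    }

    @Override
    public void run() {
        try (Scanner scanner = new Scanner(is, StandardCharsets.UTF_8.name())) {
            // Re-check the flag before each line so the loop exits promptly
            // once terminate() has been called.
            while (isRunning.get() && scanner.hasNextLine()) {
                scanner.nextLine();
                linesRead++;
            }
        }
    }
}
```

A plain boolean field would not be guaranteed visible to the reader thread without synchronization; AtomicBoolean (or a volatile boolean) avoids that pitfall, which is presumably why the author prefers it here.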

@@ -1312,7 +1312,7 @@ private void testLoggingUnload(Channel inferChannel, Channel mgmtChannel)
Scanner logscanner = new Scanner(logfile, "UTF-8");
while (logscanner.hasNextLine()) {
String line = logscanner.nextLine();
-   if (line.contains("LoggingService exit")) {
+   if (line.contains("Model logging unregistered")) {
@vdantu (Contributor):
What's this change? I don't see a corresponding change in the source file. Curious to know how this log line gets into the logfile.

@maaquib (Contributor, Author):
This line was being logged from the logging-model through the ReaderThread attached to stdout. But since we now terminate the ReaderThreads before deleting a model, it is no longer logged. Replaced it with the last statement logged when a model is unregistered.
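For reference, the test's log check in the diff above amounts to scanning a log for a marker line. A minimal standalone sketch of that pattern (the helper class and method names are hypothetical, and a Reader stands in for the test's file handle):

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Scanner;

// Illustrative helper: true if any line of the log contains the marker string.
class LogScanSketch {
    static boolean logContains(Reader log, String marker) {
        try (Scanner scanner = new Scanner(log)) {
            while (scanner.hasNextLine()) {
                if (scanner.nextLine().contains(marker)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

The PR's test change, as described, only swaps the marker string it looks for: from a line emitted by the now-terminated ReaderThread to the unregister log statement.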

logger.debug("Terminating IOStreams for worker thread shutdown");
lifeCycle.terminateIOStreams();
try {
if (out != null) {
@vdantu (Contributor):
When would we have out or err equal to null? Shouldn't they always be created?

@maaquib (Contributor, Author):
They will be null in the case of the server thread, or if an exception is thrown from runWorker before the files are created.

err.close();
}
} catch (IOException e) {
logger.error("Failed to close IO file handles", e);
@vdantu (Contributor):
What's the cleanup process if we fail to close the IO streams?

@maaquib (Contributor, Author):
Not sure I can do anything additional here; let me know if you have any suggestions. I was thinking of using closeQuietly, but that is deprecated.
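One non-deprecated alternative to the old Commons IO closeQuietly is a small local helper that tolerates null streams (covering the server-thread case discussed above) and swallows close failures rather than propagating them. A sketch, with hypothetical names (QuietCloser, closeAll); in the real code one would log each failure instead of just counting:

```java
import java.io.Closeable;
import java.io.IOException;

// Illustrative helper: close each resource independently so one failure
// doesn't skip the rest; skip nulls such as streams that were never created.
final class QuietCloser {
    private QuietCloser() {}

    static int closeAll(Closeable... resources) {
        int failures = 0;
        for (Closeable c : resources) {
            if (c == null) {
                continue; // e.g. out/err on the server thread
            }
            try {
                c.close();
            } catch (IOException e) {
                failures++; // real code: logger.error("Failed to close IO file handles", e)
            }
        }
        return failures;
    }
}
```

As the author notes, there is little meaningful recovery once close() fails; logging and moving on (so the remaining handles still get closed) is the usual practice.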

@maaquib maaquib requested a review from vdantu May 15, 2020 22:08
@@ -113,6 +113,9 @@ public int getNumRunningWorkers(String modelName) {
    if (minWorker == 0) {
        threads = workers.remove(model.getModelName());
-       if (threads == null) {
+       if (maxWorker == 0) {
@vdantu (Contributor):
I am not sure the maxWorker == 0 check is correct. Could you test the following sequence:

  1. Register a model (no initial workers)
  2. Scale up the workers (maybe to 2 workers)
  3. Scale down the workers to 0.
  4. Scale up the workers again (maybe to 2 workers).
    minWorkers and maxWorkers are always the same number, so when we scale down to 0, according to this change we would be removing the server thread. When will the server thread be created again? It is initially created during registration of the model.

@maaquib (Contributor, Author):
This was failing even before the fix. If you initialize a model with workers > 0, scale down to 0, and scale up again, we get an exception because the server thread is null. So essentially, there is no change in behaviour.
I can create a separate issue to track this and will send out another PR.

@maaquib (Contributor, Author):
Created a separate issue to fix this bug #916

@vdantu (Contributor) commented May 18, 2020:
Thanks for raising the issue and tagging it as a bug. Without this, scaling down to 0 and scaling back up is broken; the only workaround would be DELETE /models followed by POST /models.

@maaquib maaquib requested a review from vdantu May 18, 2020 22:56
@vdantu (Contributor) left a comment:

Approving the PR considering the following:

  1. This fix is a must-have to address some of the long running stress test failures.
  2. #916 (scaling workers down to 0 and back up again throws an exception) will be fixed soon as well.
