Test hang when set withThreadCount(1) and too many thread created in QueryBatcher #1279

Closed
llinggit opened this issue Dec 18, 2020 · 10 comments

@llinggit
Contributor

This is a corner case for task JAVA-195 and issue #1269.

For example, in the test StringQueryHostBatcherTest.testProgressListener, copied below, setting withThreadCount(1) causes a hang.

    QueryBatcher batcher60 = dmManager.newQueryBatcher(querydef)
            .withBatchSize(60)
            .withThreadCount(1); // total of 6k docs; a single thread triggers the hang
    batcher60.onUrisReady(
            new ProgressListener()
                    .onProgressUpdate(progressUpdate -> {
                        String progress = progressUpdate.getProgressAsString();
                        System.out.println("From ProgressListener (From Batch 60): " + progress);
                        // Keep only the part of the progress string before the first ';'
                        int index = progress.indexOf(";");
                        progressSet60.add(progress.substring(0, index));
                    }));
    batcher60.onQueryFailure((throwable) -> {
        System.out.println("queryFailures 60: ");
        try {
            Thread.sleep(7000L);
        } catch (Exception e) {
            e.printStackTrace();
        }
    });

    dmManager.startJob(batcher60);
    batcher60.awaitCompletion();

Other affected tests are 
TestSplitters.testWriteOpsMultipleThreads
DeleteListenerTest.massDeleteSingleThread
WriteandReadPOJOs.

@llinggit
Contributor Author

Hi @ehennum, I asked Anu to help me take a look at this one. She has no idea either. Could you help me take a look at this one please? Thanks.

@ehennum
Contributor

ehennum commented Jun 23, 2021

@llinggit , I'm having trouble reproducing the issue.

When I uncomment the .withThreadCount(1) configuration on batcher60, the testProgressListener() test runs fine.

But, I'm running with a single host. Does the cluster have to have multiple hosts to reproduce the issue?

If so, that provides a clue.

@llinggit
Contributor Author

Hi @ehennum, thanks for taking a look. I just installed the latest nightly build and I cannot reproduce the hang in StringQueryHostBatcherTest.testProgressListener either. However, with one of the other listed tests, TestSplitters.testWriteOpsMultipleThreads, I can still reproduce the issue. I forget which build I was on originally, but could you try this one?

@ehennum
Contributor

ehennum commented Jun 25, 2021

Thanks, @llinggit. That reproduces the issue: 1 thread hangs but 2 threads succeed. Investigating.

@ehennum
Contributor

ehennum commented Jun 25, 2021

@llinggit , I believe I know what's going on:

  • The success listener is never called
  • Execution hangs on the threadPool.execute() call within the for (int i = 1; i < getDocToUriBatchRatio(); i++) loop in QueryTask.run() for the 26th item
  • The QueryThreadPoolExecutor() constructor sizes the LinkedBlockingQueue to threadCount * 25, so the 26th item has to be rejected
  • The BlockingRunsPolicy queues the rejected item when space becomes available on the queue; however, because run() has blocked, the queue never drains; conversely, because the queue never drains, run() is blocked.

I'm wondering whether the easiest and correct fix is to size the QueryTask queue to (threadCount * getDocToUriBatchRatio()) + 1. In principle, that should allow run() to queue all items and thus run to completion.
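To make the circular wait above concrete, here is a minimal standalone sketch. It does not use the MarkLogic classes: `QueueSizingSketch`, `BLOCK_UNTIL_SPACE`, and `completes` are hypothetical names, and the handler is only an assumed stand-in for the behavior described for BlockingRunsPolicy. A single parent task enqueues follow-up tasks on its own pool, and a timeout is used so the demo reports the stall instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class QueueSizingSketch {

    // Hypothetical stand-in for a blocking rejection policy:
    // the submitting thread waits until the queue has room.
    static final RejectedExecutionHandler BLOCK_UNTIL_SPACE = (task, pool) -> {
        try {
            pool.getQueue().put(task);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    };

    // Runs one parent task that enqueues `children` follow-up tasks on the same pool.
    // Returns true if every child ran within the timeout, false if the work stalled.
    static boolean completes(int threadCount, int queueCapacity, int children)
            throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threadCount, threadCount, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(queueCapacity), BLOCK_UNTIL_SPACE);
        CountDownLatch done = new CountDownLatch(children);
        pool.execute(() -> {
            for (int i = 0; i < children; i++) {
                // With one worker and a full queue, this call blocks in the handler,
                // and the only thread that could drain the queue is this one.
                pool.execute(done::countDown);
            }
        });
        boolean finished = done.await(2, TimeUnit.SECONDS);
        pool.shutdownNow(); // interrupts a blocked parent so the demo can exit
        return finished;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(completes(1, 25, 26)); // false: the 26th child never fits
        System.out.println(completes(1, 26, 26)); // true: every child can be queued
        System.out.println(completes(2, 50, 26)); // true: a second worker drains the queue
    }
}
```

In this sketch, one thread with a queue of 25 stalls on the 26th submission, while a queue large enough to hold everything the parent submits lets it run to completion.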

What do you think?

By the way, I think the test has a minor issue -- it would be better to call

        wbatcher1.awaitCompletion();

instead of Thread.sleep(1000)

@llinggit
Contributor Author

@ehennum Thanks! Now I understand. Yes, that makes sense. Please go ahead with the fix. Thank you!

@ehennum
Contributor

ehennum commented Jun 25, 2021

The functional test was hanging before but ran green with this change.

@ehennum
Contributor

ehennum commented Jun 28, 2021

The previous commit bases the queue size on

(the number of forests * processor ratio * 2) + thread count

with the reasoning that this ensures the QueryBatcher can specify the work to collect and process two batches of uris for each forest with latency in each thread.

The queue could very likely be smaller, but a queue that's too large doesn't use a lot of memory and a queue that's too small causes a hang, so this change errs on the side of caution.
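As a worked example of that sizing rule, here is a small sketch. The class name, method name, and parameter names are hypothetical and are not taken from the actual commit, and the input values are made up for illustration.

```java
public class QueueCapacitySketch {

    // (number of forests * processor ratio * 2) + thread count, as described above.
    // Parameter names are assumptions for illustration only.
    static int queueCapacity(int forestCount, int processorRatio, int threadCount) {
        return (forestCount * processorRatio * 2) + threadCount;
    }

    public static void main(String[] args) {
        // Illustrative values: 3 forests, a ratio of 26, 1 thread.
        System.out.println(queueCapacity(3, 26, 1)); // prints 157
    }
}
```

With these illustrative inputs the queue holds 157 entries, well above the 25 slots that a single thread got from the threadCount * 25 sizing.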

@georgeajit added ship and removed test labels Jul 7, 2021
@georgeajit

Ran all the tests with the thread count set to 1 via withThreadCount(1), and they all pass.

@llinggit
Contributor Author

There's no server-side change in this issue.
