stub: optimize ThreadlessExecutor used for blocking calls #5516

Merged
2 commits merged into grpc:master on Mar 29, 2019

Conversation

@njhill (Contributor) commented Mar 28, 2019

The ThreadlessExecutor currently used for blocking calls uses a LinkedBlockingQueue, which is relatively heavy in terms of both allocations and synchronization overhead (e.g. compared to ConcurrentLinkedQueue). It accounts for ~10% of allocations and ~5% of allocated bytes per call in the TransportBenchmark when using the in-process transport with stats and tracing disabled.

Changing to use a ConcurrentLinkedQueue results in a ~5% speedup of that benchmark.
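
For context, the ThreadlessExecutor lets the application thread that issued a blocking call run the call's callbacks itself rather than borrowing a pool thread. Roughly, modeled loosely on grpc-java's ClientCalls.blockingUnaryCall (a simplified sketch, not the exact library code):

    import com.google.common.util.concurrent.Futures;
    import com.google.common.util.concurrent.ListenableFuture;
    import io.grpc.CallOptions;
    import io.grpc.Channel;
    import io.grpc.ClientCall;
    import io.grpc.MethodDescriptor;
    import io.grpc.stub.ClientCalls;

    static <ReqT, RespT> RespT blockingCallSketch(
        Channel channel, MethodDescriptor<ReqT, RespT> method,
        CallOptions callOptions, ReqT request) throws InterruptedException {
      // Every callback for this call is queued on the ThreadlessExecutor
      // instead of being dispatched to an application executor.
      ThreadlessExecutor executor = new ThreadlessExecutor();
      ClientCall<ReqT, RespT> call =
          channel.newCall(method, callOptions.withExecutor(executor));
      ListenableFuture<RespT> responseFuture =
          ClientCalls.futureUnaryCall(call, request);
      while (!responseFuture.isDone()) {
        // The calling thread blocks here, then runs queued callbacks itself.
        executor.waitAndDrain();
      }
      return Futures.getUnchecked(responseFuture);
    }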

Commit: Replace LinkedBlockingQueue with ConcurrentLinkedQueue and explicit blocking.
@@ -639,20 +642,33 @@ public void onClose(Status status, Metadata trailers) {
      * Waits until there is a Runnable, then executes it and all queued Runnables after it.
      */
     public void waitAndDrain() throws InterruptedException {
-      Runnable runnable = queue.take();
-      while (runnable != null) {
+      Runnable runnable = poll();
Member:

Maybe mark waitAndDrain() with @NotThreadSafe because there must not be two concurrent callers of it.

@njhill (Author):

Sure, the class is only for internal use in an SPSC context.

@carl-mastrangelo (Contributor):

One possible thing you could try is to make this SPSC. I have a POC (originally for SerializingExecutor) here: https://github.com/grpc/grpc-java/pull/3778/files

@njhill (Author):

@carl-mastrangelo sure, I remember seeing that before; it could be worth a try here too. I thought (possibly mistakenly) this would be a simpler change just to circumvent LinkedBlockingQueue, which was the main goal.
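
For reference, a Vyukov-style single-producer/single-consumer linked queue of the general kind such a POC explores might look like the following sketch (illustrative only, not the code in the linked PR):

    // Exactly one producer thread may call offer() and exactly one
    // consumer thread may call poll(); that restriction is what lets
    // head and tail stay plain, single-owner fields.
    final class SpscLinkedQueue<T> {
      private static final class Node<T> {
        T value;
        volatile Node<T> next; // volatile write publishes value to the consumer
        Node(T value) { this.value = value; }
      }

      private Node<T> head = new Node<>(null); // consumer-owned sentinel
      private Node<T> tail = head;             // producer-owned

      void offer(T value) {       // producer thread only
        Node<T> n = new Node<>(value);
        tail.next = n;            // volatile store makes n visible to poll()
        tail = n;
      }

      T poll() {                  // consumer thread only
        Node<T> next = head.next; // volatile load
        if (next == null) {
          return null;            // empty
        }
        T value = next.value;
        next.value = null;        // unlink for GC
        head = next;
        return value;
      }
    }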

Commit: including interruption handling fix
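
A minimal sketch of where the two commits appear to land, assuming LockSupport-based parking for the "explicit blocking" in the first commit message and folding in the interruption-handling fix from the second (illustrative, not the merged code):

    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.Executor;
    import java.util.concurrent.locks.LockSupport;

    // Lock-free queue plus explicit parking, in place of
    // LinkedBlockingQueue's lock-based take().
    final class ThreadlessExecutor extends ConcurrentLinkedQueue<Runnable>
        implements Executor {

      private volatile Thread waiter;

      /**
       * Waits until there is a Runnable, then executes it and all queued
       * Runnables after it. Single-consumer: must only ever be called
       * from one thread, per the discussion above.
       */
      public void waitAndDrain() throws InterruptedException {
        throwIfInterrupted();
        Runnable runnable = poll();
        if (runnable == null) {
          // Publish ourselves as the waiter, then park until execute()
          // unparks us (or we are interrupted).
          waiter = Thread.currentThread();
          try {
            while ((runnable = poll()) == null) {
              LockSupport.park(this);
              throwIfInterrupted();
            }
          } finally {
            waiter = null;
          }
        }
        do {
          runnable.run();
        } while ((runnable = poll()) != null);
      }

      private static void throwIfInterrupted() throws InterruptedException {
        if (Thread.interrupted()) {
          throw new InterruptedException();
        }
      }

      @Override
      public void execute(Runnable runnable) {
        add(runnable);              // lock-free enqueue
        LockSupport.unpark(waiter); // no-op when waiter is null
      }
    }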
@njhill (Author) commented Mar 29, 2019

Thanks @dapengzhang0 @carl-mastrangelo, have addressed the comments, PTAL

@dapengzhang0 (Member):

LGTM

@dapengzhang0 added the kokoro:run label Mar 29, 2019
@grpc-kokoro removed the kokoro:run label Mar 29, 2019
@carl-mastrangelo (Contributor):

@njhill Can you include your before and after JMH numbers for the commit? We typically include them when making performance optimizations.

@njhill (Author) commented Mar 29, 2019

@carl-mastrangelo I thought I had observed a bigger difference in the non-direct case in other runs, when the system was noisier. I know it's not a huge delta, but there are a couple more similar changes I have in mind which cumulatively add up to maybe ~15% (to be confirmed!).

Before:

Benchmark                         (direct)  (transport)  Mode  Cnt      Score     Error  Units
TransportBenchmark.unaryCall1024      true    INPROCESS  avgt   60   1877.339 ±  46.309  ns/op
TransportBenchmark.unaryCall1024     false    INPROCESS  avgt   60  12680.525 ± 208.684  ns/op

After:

Benchmark                         (direct)  (transport)  Mode  Cnt      Score     Error  Units
TransportBenchmark.unaryCall1024      true    INPROCESS  avgt   60   1779.188 ±  36.769  ns/op
TransportBenchmark.unaryCall1024     false    INPROCESS  avgt   60  12532.470 ± 238.271  ns/op
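
(From these numbers the direct-case improvement is (1877.339 − 1779.188) / 1877.339 ≈ 5.2%, in line with the ~5% above; the non-direct delta is ≈ 1.2%, within the reported error bars.)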

This is with the following changes to default config:

  • Set tracingEnabled and statsEnabled to false in channel and server builders
  • Bumped forks 1 -> 2, iterations 10 -> 30 and changed mode to AverageTime
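
In JMH terms those overrides might look roughly like this (an illustrative sketch; the real benchmark body and the stats/tracing toggles live in grpc-java's TransportBenchmark):

    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Fork;
    import org.openjdk.jmh.annotations.Measurement;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;

    @BenchmarkMode(Mode.AverageTime)   // mode changed to AverageTime
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @Fork(2)                           // forks bumped 1 -> 2
    @Measurement(iterations = 30)      // measurement iterations 10 -> 30
    @State(Scope.Benchmark)
    public class TransportBenchmarkSketch {
      // setup would build the in-process channel/server with stats and
      // tracing disabled, as described above (details omitted)

      @Benchmark
      public void unaryCall1024() {
        // ... issue a blocking unary call with a 1024-byte payload ...
      }
    }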

@carl-mastrangelo (Contributor) left a review:

LGTM

@carl-mastrangelo merged commit 5f88bc4 into grpc:master Mar 29, 2019
@carl-mastrangelo (Contributor):

@njhill merged, thanks!

@dapengzhang0 (Member):

Thanks a lot for your PR @njhill

@njhill deleted the threadless branch March 29, 2019 18:09
@lock bot locked as resolved and limited conversation to collaborators Jun 27, 2019