Enable CPU compression offload for EB mode in Thrift server

Yong Tan · meta-codesync[bot] · commit 878ab3909c30 · 2026-04-18T13:55:41.000-07:00
Summary:
D97565705 moved Thrift server response compression from IO threads to CPU threads, but only when executor_ is non-null. In EB mode, executor_ is null (the defaultSync resource pool has no executor), so compression still runs on the IO thread — defeating the purpose.

This diff fixes the EB mode gap by computing the compression executor as a local variable in sendReply() with a fallback chain: executor_ (TM mode) → server's handler executor via the context chain (EB mode) → folly::getGlobalCPUExecutor() (safety net). The dispatch methods now accept the executor as a parameter instead of reading executor_ directly.
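
The fallback chain can be modeled in isolation as the minimal sketch below. All types here (`Executor`, `ServerContext`, `WorkerContext`, `ConnectionContext`, `RequestContext`), the constant `kGlobalCpuExecutor`, and the function `pickCompressionExecutor` are hypothetical stand-ins invented for illustration. They are not the real Thrift or folly types, and the real code uses `folly::Executor::KeepAlive<>` rather than raw pointers.

```cpp
// Illustrative model only: these structs are hypothetical stand-ins for the
// Thrift/folly types, reduced to the pointers the fallback walk needs.
struct Executor {
  const char* name;
};
struct ServerContext {
  const Executor* handlerExecutor; // may be null
};
struct WorkerContext {
  const ServerContext* server;
};
struct ConnectionContext {
  const WorkerContext* worker;
};
struct RequestContext {
  const ConnectionContext* connection;
};

// Stand-in for folly::getGlobalCPUExecutor(); never null in this model.
constexpr Executor kGlobalCpuExecutor{"global-cpu"};

// Steps 2 and 3 of the chain: the server's handler executor reached through
// the context chain, otherwise the global CPU executor.
constexpr const Executor* compressionExecutorFallback(
    const RequestContext* reqCtx) {
  if (reqCtx && reqCtx->connection && reqCtx->connection->worker &&
      reqCtx->connection->worker->server &&
      reqCtx->connection->worker->server->handlerExecutor) {
    return reqCtx->connection->worker->server->handlerExecutor;
  }
  return &kGlobalCpuExecutor;
}

// Full chain: the callback's own executor (TM mode) wins when it is set;
// the fallback is consulted only when it is null (EB mode).
constexpr const Executor* pickCompressionExecutor(
    const Executor* ownExecutor, const RequestContext* reqCtx) {
  return ownExecutor ? ownExecutor : compressionExecutorFallback(reqCtx);
}
```

In this model, a non-null `ownExecutor` is always returned unchanged, and a null or incomplete context chain always resolves to `kGlobalCpuExecutor`, so the result is never null.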

## Key properties
- No new members on any object, no new virtual methods
- executor_ is never mutated — EB method semantics are unchanged
- The fallback chain (~4 ns) runs only when the flag is on, the payload exceeds the compression threshold, and we're on the IO thread in EB mode. That cost is negligible compared to the compression work it enables (microseconds to milliseconds)
- Gated by the existing thrift_server_compress_response_on_cpu flag

## Benchmark Results: CPU Compression Offload (Echo32k_semi_random_eb)

| Metric | Baseline | Server Offload | Change |
|---|---|---|---|
| Average QPS | 27,824 | 50,614 | +82% |
| p50 Latency (ms) | 5.352 | 3.153 | -41% |
| p99 Latency (ms) | 10.275 | 4.196 | -59% |
| p100 Latency (ms) | 26.030 | 11.664 | -55% |
| Server CPU Utilization | 2.27 | 4.89 | +115% |
| Client CPU Utilization | 2.37 | 4.04 | +71% |

### Summary

Offloading compression to CPU threads delivers an **82% QPS improvement** and cuts **p99 latency by 59%** for semi-compressible 32KB payloads on EB-thread handlers. The trade-off is higher CPU utilization (+115% server-side), which is expected — the IO threads are no longer blocked by compression and can accept requests faster, driving more total throughput. The CPU threads absorb the compression work in parallel, converting idle CPU capacity into lower latency and higher throughput.

### Limitations

- **IO thread saturation required.** The feature only helps when IO threads are the bottleneck. If IO threads have spare capacity, inline compression is fast enough and the dispatch overhead provides no benefit.
- **Thread-hop cost.** Each dispatched response pays a fixed overhead for executor queue insertion, CPU thread dequeue, reply queue notification (eventfd syscall), and IO thread wakeup. This fixed cost is independent of payload size, so it becomes proportionally less significant for larger payloads.
- **Minimum payload size.** Payloads must be large enough that compression time significantly exceeds the thread-hop overhead. The `thrift_server_min_cpu_compression_payload_size` flag (default 1024 bytes) enforces a minimum so that small responses (e.g., pings) are not dispatched at a net loss.
- **Data compressibility matters.** The feature benefits semi-compressible data (structured Thrift responses, JSON-like content) where compression is both CPU-expensive and effective at reducing wire size. Trivially compressible data (repeated bytes) compresses too fast to justify the hop. Incompressible data (random bytes) gains nothing from compression and bottlenecks on network IO instead.

### Two-threshold interaction

There are two independent size thresholds that gate compression behavior. They serve different purposes and do not conflict:

- `compressionSizeLimit` (existing, per-connection): configured via the client's compression config (`compressionConfig_.compressionSizeLimit()`). Controls whether compression happens at all. Payloads at or below this limit skip compression entirely (no algorithm is selected). This threshold is unchanged by this diff.
- `thrift_server_min_cpu_compression_payload_size` (new, global flag, default 1024 bytes): controls where compression runs (CPU thread vs. inline on the IO thread). Payloads below this threshold that are otherwise eligible are still compressed, but inline on the current thread rather than dispatched to a CPU executor. This avoids the thread-hop overhead for small payloads where inline compression is cheaper than the dispatch cost.

Evaluation order in `shouldDispatchCompressionToCpu(payloadSize)`:

- If `payloadSize` < `thrift_server_min_cpu_compression_payload_size` → compress inline (no dispatch)
- If `getEligibleCompressionAlgorithm(payloadSize)` returns nullopt (no algorithm, or payload ≤ `compressionSizeLimit`) → no compression at all
- Otherwise → dispatch compression to CPU thread
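
The interaction of the two thresholds can be sketched as a small, self-contained classifier. This is an illustration under stated assumptions, not the server's code: the `Limits` struct, the `Outcome` enum, and the functions `eligibleForCompression`, `shouldDispatch`, and `classify` are hypothetical names, and the sketch assumes the `thrift_server_compress_response_on_cpu` flag is on, the reply is being sent from the IO thread, and a compression executor is available.

```cpp
#include <cstddef>

// Hypothetical outcome labels for illustration.
enum class Outcome { NoCompression, InlineOnIoThread, DispatchToCpu };

// Hypothetical bundle of the inputs described above.
struct Limits {
  std::size_t compressionSizeLimit; // per-connection limit
  std::size_t minCpuPayloadSize;    // global flag, default 1024
  bool algorithmNegotiated;         // false models "no algorithm"
};

// Models getEligibleCompressionAlgorithm(payloadSize).has_value():
// payloads at or below compressionSizeLimit are never compressed.
constexpr bool eligibleForCompression(const Limits& l, std::size_t payloadSize) {
  return l.algorithmNegotiated && payloadSize > l.compressionSizeLimit;
}

// Same check order as the list above: minimum size first, then eligibility.
constexpr bool shouldDispatch(const Limits& l, std::size_t payloadSize) {
  if (payloadSize < l.minCpuPayloadSize) {
    return false;
  }
  return eligibleForCompression(l, payloadSize);
}

// Combines the dispatch decision with the inline path that runs otherwise.
constexpr Outcome classify(const Limits& l, std::size_t payloadSize) {
  if (shouldDispatch(l, payloadSize)) {
    return Outcome::DispatchToCpu;
  }
  return eligibleForCompression(l, payloadSize) ? Outcome::InlineOnIoThread
                                                : Outcome::NoCompression;
}
```

With a hypothetical `compressionSizeLimit` of 256 and the default minimum of 1024, a 512-byte payload is classified as inline compression and a 32 KB payload as CPU dispatch. If `compressionSizeLimit` is set above the minimum, the inline band disappears and payloads go straight from no compression to CPU dispatch.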

This does not change default behavior. The dispatch check, and therefore the new minimum-size threshold, only takes effect when `thrift_server_compress_response_on_cpu` is true (default false), so services that have not opted in see zero behavior change. For services that have opted in, the new minimum size threshold adds a small-payload bypass that wasn't previously needed (because the prior code only dispatched when `executor_` was non-null, which excluded EB mode entirely).

Reviewed By: robertroeser

Differential Revision: D100902596

fbshipit-source-id: 583199eb8d05d14af0a5119a7cf72a17736b91c3
diff --git a/thrift/lib/cpp2/async/processor/HandlerCallbackBase.cpp b/thrift/lib/cpp2/async/processor/HandlerCallbackBase.cpp
@@ -16,6 +16,7 @@
 
 #include <folly/ExceptionWrapper.h>
 #include <folly/Executor.h>
+#include <folly/executors/GlobalExecutor.h>
 #include <folly/stop_watch.h>
 #include <thrift/lib/cpp/TApplicationException.h>
 #include <thrift/lib/cpp/concurrency/ThreadManager.h>
@@ -149,6 +150,26 @@ void HandlerCallbackBase::doExceptionWrapped(folly::exception_wrapper ew) {
   }
 }
 
+folly::Executor::KeepAlive<>
+HandlerCallbackBase::getCompressionExecutorFallback() {
+  // Walk the context chain to the server's handler executor (defaultAsync pool
+  // in resource-pools mode, or the ThreadManager in legacy mode). Fall back to
+  // the global CPU executor as a safety net — it is always non-null.
+  if (reqCtx_) {
+    if (auto* connCtx = reqCtx_->getConnectionContext()) {
+      if (auto* workerCtx = connCtx->getWorkerContext()) {
+        if (auto* serverCtx = workerCtx->getServerContext()) {
+          auto exec = serverCtx->getHandlerExecutorKeepAlive();
+          if (exec) {
+            return exec;
+          }
+        }
+      }
+    }
+  }
+  return folly::getGlobalCPUExecutor();
+}
+
 void HandlerCallbackBase::sendReply(SerializedResponse response) {
   this->ctx_.reset();
 
@@ -161,9 +182,14 @@ void HandlerCallbackBase::sendReply(SerializedResponse response) {
     auto payloadSize =
         response.buffer ? response.buffer->computeChainDataLength() : 0;
     if (req_->shouldDispatchCompressionToCpu(payloadSize) && getEventBase() &&
-        getEventBase()->inRunningEventBaseThread() && executor_) {
-      dispatchReplyToCpuThread(std::move(response), payloadSize);
-      return;
+        getEventBase()->inRunningEventBaseThread()) {
+      auto compressionExec =
+          executor_ ? executor_ : getCompressionExecutorFallback();
+      if (compressionExec) {
+        dispatchReplyToCpuThread(
+            std::move(response), payloadSize, compressionExec);
+        return;
+      }
     }
     preCompressed = req_->compressResponse(response, reqCtx_, payloadSize);
   }
@@ -194,7 +220,9 @@ void HandlerCallbackBase::sendReply(SerializedResponse response) {
 }
 
 void HandlerCallbackBase::dispatchReplyToCpuThread(
-    SerializedResponse response, size_t payloadSize) {
+    SerializedResponse response,
+    size_t payloadSize,
+    const folly::Executor::KeepAlive<>& compressionExecutor) {
   // Capture all state needed on the CPU thread. After this method returns,
   // HandlerCallbackBase may be destroyed (req_ is moved out, so the
   // destructor's cleanup path will skip the active-request error).
@@ -204,13 +232,13 @@ void HandlerCallbackBase::dispatchReplyToCpuThread(
   auto writeTransforms = reqCtx->getHeader()->getWriteTTransforms();
   auto* replyQueue = &getReplyQueue();
 
-  executor_->add([req = std::move(req),
-                  reqCtx,
-                  protoSeqId,
-                  writeTransforms = std::move(writeTransforms),
-                  replyQueue,
-                  payloadSize,
-                  response = std::move(response)]() mutable {
+  compressionExecutor->add([req = std::move(req),
+                            reqCtx,
+                            protoSeqId,
+                            writeTransforms = std::move(writeTransforms),
+                            replyQueue,
+                            payloadSize,
+                            response = std::move(response)]() mutable {
     // On CPU thread: attempt compression.
     bool preCompressed = req->compressResponse(response, reqCtx, payloadSize);
 
@@ -252,9 +280,14 @@ void HandlerCallbackBase::sendReply(
         ? responseAndStream.response.buffer->computeChainDataLength()
         : 0;
     if (req_->shouldDispatchCompressionToCpu(payloadSize) && getEventBase() &&
-        getEventBase()->inRunningEventBaseThread() && executor_) {
-      dispatchStreamReplyToCpuThread(std::move(responseAndStream), payloadSize);
-      return;
+        getEventBase()->inRunningEventBaseThread()) {
+      auto compressionExec =
+          executor_ ? executor_ : getCompressionExecutorFallback();
+      if (compressionExec) {
+        dispatchStreamReplyToCpuThread(
+            std::move(responseAndStream), payloadSize, compressionExec);
+        return;
+      }
     }
     preCompressed = req_->compressResponse(
         responseAndStream.response, reqCtx_, payloadSize);
@@ -325,7 +358,9 @@ void HandlerCallbackBase::setupStreamFactory(
 }
 
 void HandlerCallbackBase::dispatchStreamReplyToCpuThread(
-    ResponseAndServerStreamFactory&& responseAndStream, size_t payloadSize) {
+    ResponseAndServerStreamFactory&& responseAndStream,
+    size_t payloadSize,
+    const folly::Executor::KeepAlive<>& compressionExecutor) {
   // Capture all state needed on the CPU thread. After this method returns,
   // HandlerCallbackBase may be destroyed (req_ is moved out).
   auto req = std::move(req_);
@@ -338,13 +373,14 @@ void HandlerCallbackBase::dispatchStreamReplyToCpuThread(
   auto& stream = responseAndStream.stream;
   setupStreamFactory(stream);
 
-  executor_->add([req = std::move(req),
-                  reqCtx,
-                  protoSeqId,
-                  writeTransforms = std::move(writeTransforms),
-                  replyQueue,
-                  payloadSize,
-                  responseAndStream = std::move(responseAndStream)]() mutable {
+  compressionExecutor->add([req = std::move(req),
+                            reqCtx,
+                            protoSeqId,
+                            writeTransforms = std::move(writeTransforms),
+                            replyQueue,
+                            payloadSize,
+                            responseAndStream =
+                                std::move(responseAndStream)]() mutable {
     // On CPU thread: attempt compression.
     bool preCompressed =
         req->compressResponse(responseAndStream.response, reqCtx, payloadSize);
diff --git a/thrift/lib/cpp2/async/processor/HandlerCallbackBase.h b/thrift/lib/cpp2/async/processor/HandlerCallbackBase.h
@@ -380,13 +380,22 @@ class HandlerCallbackBase {
   void sendReply(ResponseAndServerStreamFactory&& responseAndStream);
 
  private:
-  // Dispatches compression + reply to the CPU executor when sendReply is
+  // Returns an executor for compression offload when executor_ is null (EB
+  // mode). Walks the context chain to the server's handler executor, falling
+  // back to folly::getGlobalCPUExecutor() as a safety net.
+  folly::Executor::KeepAlive<> getCompressionExecutorFallback();
+
+  // Dispatches compression + reply to the given CPU executor when sendReply is
   // called on the IO thread. Moves all needed state into the lambda so
   // HandlerCallbackBase can be destroyed after this returns.
   void dispatchReplyToCpuThread(
-      SerializedResponse response, size_t payloadSize);
+      SerializedResponse response,
+      size_t payloadSize,
+      const folly::Executor::KeepAlive<>& compressionExecutor);
   void dispatchStreamReplyToCpuThread(
-      ResponseAndServerStreamFactory&& responseAndStream, size_t payloadSize);
+      ResponseAndServerStreamFactory&& responseAndStream,
+      size_t payloadSize,
+      const folly::Executor::KeepAlive<>& compressionExecutor);
 
   // Sets up stream factory with interaction, context stack, method name, and
   // interceptor context. Shared by sendReply and
diff --git a/thrift/lib/cpp2/server/ServerFlags.cpp b/thrift/lib/cpp2/server/ServerFlags.cpp
@@ -39,6 +39,8 @@ THRIFT_FLAG_DEFINE_bool(
 
 THRIFT_FLAG_DEFINE_bool(thrift_server_compress_response_on_cpu, false);
 
+THRIFT_FLAG_DEFINE_int64(thrift_server_min_cpu_compression_payload_size, 1024);
+
 FOLLY_GFLAGS_DEFINE_bool(
     thrift_use_token_bucket_concurrency_controller,
     false,
diff --git a/thrift/lib/cpp2/server/ServerFlags.h b/thrift/lib/cpp2/server/ServerFlags.h
@@ -32,6 +32,14 @@ THRIFT_FLAG_DECLARE_bool(allow_resource_pools_set_thread_manager_from_executor);
 
 THRIFT_FLAG_DECLARE_bool(thrift_server_compress_response_on_cpu);
 
+// This flag does not control whether compression happens — that is solely
+// determined by compressionSizeLimit. It only controls where compression runs:
+// payloads below this threshold are compressed inline on the IO thread
+// (skipping the thread-hop overhead), while larger payloads are dispatched to a
+// CPU thread. Only effective when thrift_server_compress_response_on_cpu is
+// enabled.
+THRIFT_FLAG_DECLARE_int64(thrift_server_min_cpu_compression_payload_size);
+
 // Use TokenBucketConcurrencyController as a standard concurrency controller in
 // ThriftServer
 FOLLY_GFLAGS_DECLARE_bool(thrift_use_token_bucket_concurrency_controller);
diff --git a/thrift/lib/cpp2/transport/core/ThriftRequest.cpp b/thrift/lib/cpp2/transport/core/ThriftRequest.cpp
@@ -28,6 +28,7 @@
 THRIFT_FLAG_DEFINE_int64(queue_time_logging_threshold_ms, 5);
 THRIFT_FLAG_DEFINE_bool(enable_request_event_logging, true);
 THRIFT_FLAG_DECLARE_bool(thrift_server_compress_response_on_cpu);
+THRIFT_FLAG_DECLARE_int64(thrift_server_min_cpu_compression_payload_size);
 
 namespace apache::thrift {
 
@@ -521,6 +522,11 @@ ThriftRequestCore::getEligibleCompressionAlgorithm(size_t payloadSize) const {
 
 bool ThriftRequestCore::shouldDispatchCompressionToCpu(
     size_t payloadSize) const {
+  auto minSize = static_cast<size_t>(
+      THRIFT_FLAG(thrift_server_min_cpu_compression_payload_size));
+  if (payloadSize < minSize) {
+    return false;
+  }
   return getEligibleCompressionAlgorithm(payloadSize).has_value();
 }