Commit 78477de

abrarsheikh and peterxcli
authored and committed
[Serve][2/n] add batching metrics (ray-project#59232)
fixes ray-project#59218 ### Performance Delta ```python from ray import serve from typing import List @serve.deployment(max_ongoing_requests=1000) class MyDeployment: @serve.batch(max_batch_size=10, batch_wait_timeout_s=1) async def handle_batch(self, requests: List[int]) -> List[int]: return [request + 1 for request in requests] async def __call__(self) -> List[int]: return await self.handle_batch(1) app = MyDeployment.bind() ``` `ray start --head --metrics-export-port=8080` -> `serve run batch_test:app` locust 100 users Metric | With Change | Master | Δ (Master – With Change) -- | -- | -- | -- Requests | 32,033 | 33,541 | +1,508 Fails | 0 | 0 | 0 Median (ms) | 170 | 170 | 0 95%ile (ms) | 240 | 240 | 0 99%ile (ms) | 280 | 270 | –10 ms Average (ms) | 172.98 | 171.87 | –1.11 ms Min (ms) | 70 | 84 | +14 ms Max (ms) | 352 | 365 | +13 ms Average size (bytes) | 1 | 1 | 0 Current RPS | 581.9 | 604.1 | +22.2 Current Failures/s | 0 | 0 | 0 --------- Signed-off-by: abrar <abrar@anyscale.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
1 parent d933787 commit 78477de

File tree

5 files changed: +290 −13 lines changed


doc/source/serve/monitoring.md

Lines changed: 23 additions & 2 deletions

```diff
@@ -493,15 +493,23 @@ You can customize these buckets using environment variables:
   - `ray_serve_multiplexed_model_load_latency_ms`
   - `ray_serve_multiplexed_model_unload_latency_ms`
 
-Set these as comma-separated values, for example: `RAY_SERVE_REQUEST_LATENCY_BUCKETS_MS="10,50,100,500,1000,5000"`.
+- **`RAY_SERVE_BATCH_UTILIZATION_BUCKETS_PERCENT`**: Controls bucket boundaries for batch utilization histogram. Default: `[5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 99, 100]` (percentage).
+  - `ray_serve_batch_utilization_percent`
+
+- **`RAY_SERVE_BATCH_SIZE_BUCKETS`**: Controls bucket boundaries for batch size histogram. Default: `[1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]`.
+  - `ray_serve_actual_batch_size`
+
+Note: `ray_serve_batch_wait_time_ms` and `ray_serve_batch_execution_time_ms` use the same buckets as `RAY_SERVE_REQUEST_LATENCY_BUCKETS_MS`.
+
+Set these as comma-separated values, for example: `RAY_SERVE_REQUEST_LATENCY_BUCKETS_MS="10,50,100,500,1000,5000"` or `RAY_SERVE_BATCH_SIZE_BUCKETS="1,4,8,16,32,64"`.
 
 **Histogram accuracy considerations**
 
 Prometheus histograms aggregate data into predefined buckets, which can affect the accuracy of percentile calculations (e.g., p50, p95, p99) displayed on dashboards:
 
 - **Values outside bucket range**: If your latencies exceed the largest bucket boundary (default: 600,000ms / 10 minutes), they all fall into the `+Inf` bucket and percentile estimates become inaccurate.
 - **Sparse bucket coverage**: If your actual latencies cluster between two widely-spaced buckets, the calculated percentiles are interpolated and may not reflect true values.
-- **Bucket boundaries are fixed at startup**: Changes to `RAY_SERVE_REQUEST_LATENCY_BUCKETS_MS` or `RAY_SERVE_MODEL_LOAD_LATENCY_BUCKETS_MS` require restarting Serve actors to take effect.
+- **Bucket boundaries are fixed at startup**: Changes to bucket environment variables (such as `RAY_SERVE_REQUEST_LATENCY_BUCKETS_MS`, `RAY_SERVE_BATCH_SIZE_BUCKETS`, etc.) require restarting Serve actors to take effect.
 
 For accurate percentile calculations, configure bucket boundaries that closely match your expected latency distribution. For example, if most requests complete in 10-100ms, use finer-grained buckets in that range.
 :::
@@ -599,6 +607,19 @@ These metrics track request throughput, errors, and latency at the replica level
 | `ray_serve_deployment_processing_latency_ms` **[D]** | Histogram | `deployment`, `replica`, `route`, `application` | Histogram of request processing time in milliseconds (excludes queue wait time). |
 | `ray_serve_deployment_error_counter_total` **[D]** | Counter | `deployment`, `replica`, `route`, `application` | Total number of exceptions raised while processing requests. |
 
+### Batching metrics
+
+These metrics track request batching behavior for deployments using `@serve.batch`. Use them to tune batching parameters and debug latency issues.
+
+| Metric | Type | Tags | Description |
+|--------|------|------|-------------|
+| `ray_serve_batch_wait_time_ms` | Histogram | `deployment`, `replica`, `application`, `function_name` | Time requests waited for the batch to fill in milliseconds. High values indicate batch timeout may be too long. |
+| `ray_serve_batch_execution_time_ms` | Histogram | `deployment`, `replica`, `application`, `function_name` | Time to execute the batch function in milliseconds. |
+| `ray_serve_batch_queue_length` | Gauge | `deployment`, `replica`, `application`, `function_name` | Current number of requests waiting in the batch queue. High values indicate a batching bottleneck. |
+| `ray_serve_batch_utilization_percent` | Histogram | `deployment`, `replica`, `application`, `function_name` | Batch utilization as percentage (`computed_batch_size / max_batch_size * 100`). Low utilization suggests `batch_wait_timeout_s` is too aggressive or traffic is too low. |
+| `ray_serve_actual_batch_size` | Histogram | `deployment`, `replica`, `application`, `function_name` | The computed size of each batch. When `batch_size_fn` is configured, this reports the custom computed size (such as total tokens). Otherwise, it reports the number of requests. |
+| `ray_serve_batches_processed_total` | Counter | `deployment`, `replica`, `application`, `function_name` | Total number of batches executed. Compare with request counter to measure batching efficiency. |
+
 ### Replica lifecycle metrics
 
 These metrics track replica health and restarts.
```
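The docs above say bucket environment variables are set as comma-separated values with a built-in default. As an illustration only (this is a hedged sketch, not Ray's actual `parse_latency_buckets` implementation), parsing and validating such a value might look like:

```python
import os

def parse_buckets(env_var: str, default: list) -> list:
    """Parse a comma-separated bucket spec like "1,4,8,16,32,64".

    Falls back to `default` when the variable is unset or empty.
    Boundaries must be positive and strictly increasing, as required
    for histogram buckets.
    """
    raw = os.environ.get(env_var, "").strip()
    if not raw:
        return default
    buckets = [float(v) for v in raw.split(",")]
    if any(b <= 0 for b in buckets):
        raise ValueError(f"{env_var}: bucket boundaries must be positive")
    if buckets != sorted(buckets) or len(set(buckets)) != len(buckets):
        raise ValueError(f"{env_var}: bucket boundaries must be strictly increasing")
    return buckets

# Example: override the batch size buckets as the docs describe.
os.environ["RAY_SERVE_BATCH_SIZE_BUCKETS"] = "1,4,8,16,32,64"
print(parse_buckets("RAY_SERVE_BATCH_SIZE_BUCKETS", [1, 2, 4]))
```

The function name and validation details here are assumptions; the doc only specifies the comma-separated format and the defaults.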

python/ray/serve/_private/constants.py

Lines changed: 52 additions & 0 deletions

```diff
@@ -125,6 +125,58 @@
     DEFAULT_LATENCY_BUCKET_MS,
 )
 
+#: Histogram buckets for batch execution time in milliseconds.
+BATCH_EXECUTION_TIME_BUCKETS_MS = REQUEST_LATENCY_BUCKETS_MS
+
+#: Histogram buckets for batch wait time in milliseconds.
+BATCH_WAIT_TIME_BUCKETS_MS = REQUEST_LATENCY_BUCKETS_MS
+
+#: Histogram buckets for batch utilization percentage.
+DEFAULT_BATCH_UTILIZATION_BUCKETS_PERCENT = [
+    5,
+    10,
+    20,
+    30,
+    40,
+    50,
+    60,
+    70,
+    80,
+    90,
+    95,
+    99,
+    100,
+]
+BATCH_UTILIZATION_BUCKETS_PERCENT = parse_latency_buckets(
+    get_env_str(
+        "RAY_SERVE_BATCH_UTILIZATION_BUCKETS_PERCENT",
+        "",
+    ),
+    DEFAULT_BATCH_UTILIZATION_BUCKETS_PERCENT,
+)
+
+#: Histogram buckets for actual batch size.
+DEFAULT_BATCH_SIZE_BUCKETS = [
+    1,
+    2,
+    4,
+    8,
+    16,
+    32,
+    64,
+    128,
+    256,
+    512,
+    1024,
+]
+BATCH_SIZE_BUCKETS = parse_latency_buckets(
+    get_env_str(
+        "RAY_SERVE_BATCH_SIZE_BUCKETS",
+        "",
+    ),
+    DEFAULT_BATCH_SIZE_BUCKETS,
+)
+
 #: Name of deployment health check method implemented by user.
 HEALTH_CHECK_METHOD = "check_health"
```
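As a reminder of how these boundaries behave at observation time (a generic Prometheus-style sketch, not Ray's metric internals): an observed value is counted in every cumulative bucket whose upper bound is at least the value, and the smallest such bucket can be found with a binary search over the boundaries.

```python
import bisect

# Default boundaries added by this commit.
BATCH_SIZE_BUCKETS = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]

def smallest_bucket(value: float, boundaries: list) -> float:
    """Return the smallest bucket upper bound the value falls under,
    or float("inf") if it exceeds every boundary (the +Inf bucket)."""
    i = bisect.bisect_left(boundaries, value)
    return boundaries[i] if i < len(boundaries) else float("inf")

print(smallest_bucket(10, BATCH_SIZE_BUCKETS))    # a batch of 10 lands in the 16 bucket
print(smallest_bucket(1024, BATCH_SIZE_BUCKETS))  # exact match on the largest boundary
print(smallest_bucket(5000, BATCH_SIZE_BUCKETS))  # beyond all boundaries -> +Inf bucket
```

This is why the docs warn about sparse coverage: with power-of-two boundaries, batch sizes of 9 through 16 are indistinguishable on a dashboard.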

python/ray/serve/batching.py

Lines changed: 108 additions & 10 deletions

```diff
@@ -27,9 +27,16 @@
 from ray import serve
 from ray._common.signature import extract_signature, flatten_args, recover_args
 from ray._common.utils import get_or_create_event_loop
-from ray.serve._private.constants import SERVE_LOGGER_NAME
+from ray.serve._private.constants import (
+    BATCH_EXECUTION_TIME_BUCKETS_MS,
+    BATCH_SIZE_BUCKETS,
+    BATCH_UTILIZATION_BUCKETS_PERCENT,
+    BATCH_WAIT_TIME_BUCKETS_MS,
+    SERVE_LOGGER_NAME,
+)
 from ray.serve._private.utils import extract_self_if_method_call
 from ray.serve.exceptions import RayServeException
+from ray.serve.metrics import Counter, Gauge, Histogram
 from ray.util.annotations import PublicAPI
 
 logger = logging.getLogger(SERVE_LOGGER_NAME)
@@ -148,6 +155,46 @@ def __init__(
         # Used for observability.
         self.curr_iteration_start_times: Dict[asyncio.Task, float] = {}
 
+        # Initialize batching metrics.
+        self._batch_wait_time_histogram = Histogram(
+            "serve_batch_wait_time_ms",
+            description="Time requests waited for batch to fill (in milliseconds).",
+            boundaries=BATCH_WAIT_TIME_BUCKETS_MS,
+            tag_keys=("function_name",),
+        )
+        self._batch_execution_time_histogram = Histogram(
+            "serve_batch_execution_time_ms",
+            description="Time to execute the batch function (in milliseconds).",
+            boundaries=BATCH_EXECUTION_TIME_BUCKETS_MS,
+            tag_keys=("function_name",),
+        )
+        self._batch_queue_length_gauge = Gauge(
+            "serve_batch_queue_length",
+            description="Number of requests waiting in the batch queue.",
+            tag_keys=("function_name",),
+        )
+        self._batch_utilization_histogram = Histogram(
+            "serve_batch_utilization_percent",
+            description="Batch utilization as percentage (actual_batch_size / max_batch_size * 100).",
+            boundaries=BATCH_UTILIZATION_BUCKETS_PERCENT,
+            tag_keys=("function_name",),
+        )
+        self._batch_size_histogram = Histogram(
+            "serve_actual_batch_size",
+            description="The actual number of requests in each batch.",
+            boundaries=BATCH_SIZE_BUCKETS,
+            tag_keys=("function_name",),
+        )
+        self._batches_processed_counter = Counter(
+            "serve_batches_processed",
+            description="Counter of batches executed.",
+            tag_keys=("function_name",),
+        )
+
+        self._function_name = (
+            handle_batch_func.__name__ if handle_batch_func is not None else "unknown"
+        )
+
         self._handle_batch_task = None
         self._loop = get_or_create_event_loop()
         if handle_batch_func is not None:
@@ -199,12 +246,13 @@ def _compute_batch_size(self, batch: List[_SingleRequest]) -> int:
 
         return self.batch_size_fn(items)
 
-    async def wait_for_batch(self) -> List[_SingleRequest]:
+    async def wait_for_batch(self) -> Tuple[List[_SingleRequest], int]:
         """Wait for batch respecting self.max_batch_size and self.timeout_s.
 
-        Returns a batch of up to self.max_batch_size items. Waits for up to
-        to self.timeout_s after receiving the first request that will be in
-        the next batch. After the timeout, returns as many items as are ready.
+        Returns a tuple of (batch, computed_batch_size) where batch contains
+        up to self.max_batch_size items. Waits for up to self.timeout_s after
+        receiving the first request that will be in the next batch. After the
+        timeout, returns as many items as are ready.
 
         Always returns a batch with at least one item - will block
         indefinitely until an item comes in.
@@ -228,13 +276,18 @@ async def wait_for_batch(self) -> List[_SingleRequest]:
             )
             # Set exception on the future so the caller receives it
             first_item.future.set_exception(exc)
-            return []
+            return [], 0
 
         batch.append(first_item)
 
         # Wait self.timeout_s seconds for new queue arrivals.
         batch_start_time = time.time()
         while True:
+            # Record queue length metric.
+            self._batch_queue_length_gauge.set(
+                self.queue.qsize(), tags={"function_name": self._function_name}
+            )
+
             remaining_batch_time_s = max(
                 batch_wait_timeout_s - (time.time() - batch_start_time), 0
             )
@@ -270,6 +323,9 @@ async def wait_for_batch(self) -> List[_SingleRequest]:
                 # so newer requests may be processed before it. Consider using
                 # asyncio.PriorityQueue if strict ordering is required.
                 self.queue.put_nowait(deferred_item)
+                # Compute final batch size before breaking (batch is now valid
+                # after popping the deferred item).
+                current_batch_size = self._compute_batch_size(batch)
                 # break the loop early because the deferred item is too large to fit in the batch
                 break
             else:
@@ -293,7 +349,13 @@ async def wait_for_batch(self) -> List[_SingleRequest]:
             ):
                 break
 
-        return batch
+        # Record batch wait time metric (time spent waiting for batch to fill).
+        batch_wait_time_ms = (time.time() - batch_start_time) * 1000
+        self._batch_wait_time_histogram.observe(
+            batch_wait_time_ms, tags={"function_name": self._function_name}
+        )
+
+        return batch, current_batch_size
 
     def _validate_results(
         self, results: Iterable[Any], input_batch_length: int
@@ -379,29 +441,57 @@ async def _process_batches(self, func: Callable) -> None:
         # So we unset the request context so the current context is not inherited by the task, _process_batch.
         serve.context._unset_request_context()
         while not self._loop.is_closed():
-            batch = await self.wait_for_batch()
-            promise = self._process_batch(func, batch)
+            batch, computed_batch_size = await self.wait_for_batch()
+            promise = self._process_batch(func, batch, computed_batch_size)
             task = asyncio.create_task(promise)
             self.tasks.add(task)
             self.curr_iteration_start_times[task] = time.time()
             task.add_done_callback(self._handle_completed_task)
 
-    async def _process_batch(self, func: Callable, batch: List[_SingleRequest]) -> None:
+    async def _process_batch(
+        self, func: Callable, batch: List[_SingleRequest], computed_batch_size: int
+    ) -> None:
         """Processes queued request batch."""
         # NOTE: this semaphore caps the number of concurrent batches specified by `max_concurrent_batches`
         async with self.semaphore:
             # Remove requests that have been cancelled from the batch. If
             # all requests have been cancelled, simply return and wait for
             # the next batch.
+            original_batch_len = len(batch)
             batch = [req for req in batch if not req.future.cancelled()]
             if len(batch) == 0:
                 return
 
+            # Record batch utilization metric.
+            # Use computed_batch_size from wait_for_batch for efficiency.
+            # If requests were cancelled, we need to recompute since the batch changed.
+            if len(batch) != original_batch_len:
+                computed_batch_size = self._compute_batch_size(batch)
+
+            # Calculate and record batch utilization percentage.
+            batch_utilization_percent = (
+                computed_batch_size / self.max_batch_size
+            ) * 100
+            self._batch_utilization_histogram.observe(
+                batch_utilization_percent, tags={"function_name": self._function_name}
+            )
+
+            # Record actual batch size (number of requests in the batch computed by the batch_size_fn).
+            self._batch_size_histogram.observe(
+                computed_batch_size, tags={"function_name": self._function_name}
+            )
+
+            # Increment batches processed counter.
+            self._batches_processed_counter.inc(
+                tags={"function_name": self._function_name}
+            )
+
            futures = [item.future for item in batch]
 
            # Most of the logic in the function should be wrapped in this try-
            # except block, so the futures' exceptions can be set if an exception
            # occurs. Otherwise, the futures' requests may hang indefinitely.
+            batch_execution_start_time = time.time()
             try:
                 self_arg = batch[0].self_arg
                 args, kwargs = _batch_args_kwargs(
@@ -436,6 +526,14 @@ async def _process_batch(self, func: Callable, batch: List[_SingleRequest]) -> N
 
                 for future in futures:
                     _set_exception_if_not_done(future, e)
+            finally:
+                # Record batch execution time.
+                batch_execution_time_ms = (
+                    time.time() - batch_execution_start_time
+                ) * 1000
+                self._batch_execution_time_histogram.observe(
+                    batch_execution_time_ms, tags={"function_name": self._function_name}
+                )
 
     def _handle_completed_task(self, task: asyncio.Task) -> None:
         self.tasks.remove(task)
```
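The utilization bookkeeping in `_process_batch` above boils down to a small pure function; a standalone sketch (illustrative only, with Ray identifiers simplified and the recompute hook passed in as a callable):

```python
def batch_utilization_percent(
    computed_batch_size: int,
    original_batch_len: int,
    surviving_batch_len: int,
    max_batch_size: int,
    recompute=None,
) -> float:
    """Mirror of the metric logic: if requests were cancelled after the
    batch size was computed, recompute the size for the surviving batch,
    then report size / max_batch_size * 100."""
    if surviving_batch_len != original_batch_len and recompute is not None:
        computed_batch_size = recompute()
    return computed_batch_size / max_batch_size * 100

# No cancellations: 7 requests with max_batch_size=10 -> 70% utilization.
print(batch_utilization_percent(7, 7, 7, 10))
# Two of seven requests cancelled; recompute gives 5 -> 50% utilization.
print(batch_utilization_percent(7, 7, 5, 10, recompute=lambda: 5))
```

Note that with a custom `batch_size_fn`, the "size" here is whatever that function returns (e.g., total tokens), so utilization can legitimately exceed the request count.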

python/ray/serve/tests/test_batching.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -635,7 +635,7 @@ async def f(self, request: Request):
 
 
 def test_batch_size_fn_deferred_item_processed(serve_instance):
-    @serve.deployment
+    @serve.deployment(max_ongoing_requests=15)
     class DeferredItemBatcher:
         def __init__(self):
             self.batch_sizes = []
```
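The test above exercises `batch_size_fn` deferral. For intuition, a custom size function is just a callable over the batch items; a hypothetical token-counting example (the function name and tokenization are illustrative, not taken from the PR):

```python
from typing import List

def token_batch_size(prompts: List[str]) -> int:
    """Hypothetical batch_size_fn: measure a batch by total whitespace
    tokens rather than by number of requests, so one long prompt can
    fill a batch as much as several short ones."""
    return sum(len(p.split()) for p in prompts)

batch = ["hello world", "ray serve batches requests", "hi"]
print(token_batch_size(batch))  # 2 + 4 + 1 = 7 tokens
```

With such a function configured, `ray_serve_actual_batch_size` records this token total, which is why the docs describe it as the "computed" rather than literal batch size.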
