Note: `ray_serve_batch_wait_time_ms` and `ray_serve_batch_execution_time_ms` use the same buckets as `RAY_SERVE_REQUEST_LATENCY_BUCKETS_MS`.
+
+Set these as comma-separated values, for example: `RAY_SERVE_REQUEST_LATENCY_BUCKETS_MS="10,50,100,500,1000,5000"` or `RAY_SERVE_BATCH_SIZE_BUCKETS="1,4,8,16,32,64"`.
**Histogram accuracy considerations**
Prometheus histograms aggregate data into predefined buckets, which can affect the accuracy of percentile calculations (e.g., p50, p95, p99) displayed on dashboards:
- **Values outside bucket range**: If your latencies exceed the largest bucket boundary (default: 600,000 ms, or 10 minutes), they all fall into the `+Inf` bucket and percentile estimates become inaccurate.
- **Sparse bucket coverage**: If your actual latencies cluster between two widely spaced buckets, the calculated percentiles are interpolated and may not reflect true values.
-- **Bucket boundaries are fixed at startup**: Changes to `RAY_SERVE_REQUEST_LATENCY_BUCKETS_MS` or `RAY_SERVE_MODEL_LOAD_LATENCY_BUCKETS_MS` require restarting Serve actors to take effect.
+- **Bucket boundaries are fixed at startup**: Changes to bucket environment variables (such as `RAY_SERVE_REQUEST_LATENCY_BUCKETS_MS` and `RAY_SERVE_BATCH_SIZE_BUCKETS`) require restarting Serve actors to take effect.
For accurate percentile calculations, configure bucket boundaries that closely match your expected latency distribution. For example, if most requests complete in 10-100ms, use finer-grained buckets in that range.
:::
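
To make the accuracy considerations above concrete, the following is a minimal sketch of how a bucketed percentile estimate behaves. It is not Ray Serve or Prometheus code: `parse_buckets` and `estimate_quantile` are hypothetical helpers, and the interpolation only approximates the linear-within-bucket approach that Prometheus's `histogram_quantile` uses, under the assumption that the lowest bucket starts at 0.

```python
def parse_buckets(spec: str) -> list[float]:
    """Parse a comma-separated bucket string such as "10,50,100" (hypothetical helper)."""
    bounds = [float(part) for part in spec.split(",") if part.strip()]
    if not bounds or bounds[0] <= 0 or bounds != sorted(set(bounds)):
        raise ValueError("bucket bounds must be positive and strictly increasing")
    return bounds


def estimate_quantile(q: float, bounds: list[float], samples: list[float]) -> float:
    """Estimate quantile q from bucket counts by linear interpolation inside a bucket."""
    if not 0 < q <= 1 or not samples:
        raise ValueError("q must be in (0, 1] and samples must be non-empty")
    # Cumulative count of samples at or below each bucket's upper bound.
    cumulative = [sum(1 for s in samples if s <= b) for b in bounds]
    rank = q * len(samples)
    for i, (upper, count) in enumerate(zip(bounds, cumulative)):
        if count >= rank:
            lower = bounds[i - 1] if i > 0 else 0.0
            below = cumulative[i - 1] if i > 0 else 0
            return lower + (upper - lower) * (rank - below) / (count - below)
    # The rank falls in the +Inf bucket: only the largest finite bound is known.
    return bounds[-1]


# 100 requests that all took exactly 60 ms.
samples = [60.0] * 100

coarse = parse_buckets("10,50,100,500,1000,5000")
fine = parse_buckets("50,55,60,65,70")

coarse_p50 = estimate_quantile(0.5, coarse, samples)  # 75.0, although the true p50 is 60
fine_p50 = estimate_quantile(0.5, fine, samples)      # 57.5, closer to the true value

# Samples beyond the largest bound collapse onto that bound.
capped_p50 = estimate_quantile(0.5, parse_buckets("1000,600000"), [700000.0] * 10)
```

In this sketch the coarse buckets place the median at 75 ms because the estimate is interpolated across the whole 50-100 ms bucket, while the finer buckets narrow the error. The last call shows the `+Inf` case: every sample exceeds 600,000 ms, so the estimate cannot rise above the largest finite boundary.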
@@ -599,6 +607,19 @@ These metrics track request throughput, errors, and latency at the replica level
| `ray_serve_deployment_processing_latency_ms` **[D]** | Histogram | `deployment`, `replica`, `route`, `application` | Histogram of request processing time in milliseconds (excludes queue wait time). |
| `ray_serve_deployment_error_counter_total` **[D]** | Counter | `deployment`, `replica`, `route`, `application` | Total number of exceptions raised while processing requests. |

+### Batching metrics
+
+These metrics track request batching behavior for deployments using `@serve.batch`. Use them to tune batching parameters and debug latency issues.
+
+| Metric | Type | Tags | Description |
+|--------|------|------|-------------|
+| `ray_serve_batch_wait_time_ms` | Histogram | `deployment`, `replica`, `application`, `function_name` | Time, in milliseconds, that requests waited for the batch to fill. High values indicate that the batch wait timeout may be too long. |
+| `ray_serve_batch_execution_time_ms` | Histogram | `deployment`, `replica`, `application`, `function_name` | Time, in milliseconds, to execute the batch function. |
+| `ray_serve_batch_queue_length` | Gauge | `deployment`, `replica`, `application`, `function_name` | Current number of requests waiting in the batch queue. High values indicate a batching bottleneck. |
+| `ray_serve_batch_utilization_percent` | Histogram | `deployment`, `replica`, `application`, `function_name` | Batch utilization as a percentage (`computed_batch_size / max_batch_size * 100`). Low utilization suggests `batch_wait_timeout_s` is too short or traffic is too low. |
+| `ray_serve_actual_batch_size` | Histogram | `deployment`, `replica`, `application`, `function_name` | The computed size of each batch. When `batch_size_fn` is configured, this reports the custom computed size (such as total tokens). Otherwise, it reports the number of requests. |
+| `ray_serve_batches_processed_total` | Counter | `deployment`, `replica`, `application`, `function_name` | Total number of batches executed. Compare with the request counter to measure batching efficiency. |
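
As a rough illustration of how the values in the table relate, the sketch below reproduces the utilization formula from the `ray_serve_batch_utilization_percent` row and one possible reading of "batching efficiency" as requests per executed batch. Both function names are hypothetical and are not part of Ray Serve, and the requests-per-batch ratio is an assumed interpretation rather than a documented definition.

```python
def batch_utilization_percent(computed_batch_size: float, max_batch_size: int) -> float:
    """Utilization as described in the table: computed_batch_size / max_batch_size * 100."""
    if max_batch_size <= 0:
        raise ValueError("max_batch_size must be positive")
    return computed_batch_size / max_batch_size * 100


def average_requests_per_batch(requests_total: int, batches_total: int) -> float:
    """Assumed efficiency measure: requests handled per executed batch."""
    if batches_total == 0:
        return 0.0  # No batches executed yet, so there is nothing to average.
    return requests_total / batches_total


# A batch of 6 requests with max_batch_size=8 is 75% utilized.
utilization = batch_utilization_percent(6, 8)

# 1200 requests served in 300 batches averages 4 requests per batch.
efficiency = average_requests_per_batch(1200, 300)
```

When `batch_size_fn` is configured, the computed batch size may count something other than requests (such as total tokens), so the two numbers are not directly comparable in that case.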