153 changes: 153 additions & 0 deletions doc/source/serve/advanced-guides/external-scaling-webhook.md
@@ -0,0 +1,153 @@
(serve-external-scale-webhook)=

:::{warning}
This API is in alpha and may change before becoming stable.
:::

# External Scaling Webhook

Ray Serve exposes a REST API endpoint that you can use to dynamically scale your deployments from outside the Ray cluster. This endpoint gives you the flexibility to implement custom scaling logic based on any signals you choose, such as metrics from external monitoring systems, business metrics, or predictive models.

## Overview

The external scaling webhook provides programmatic control over the number of replicas for any deployment in your Ray Serve application. Unlike Ray Serve's built-in autoscaling, which scales based on queue depth and ongoing requests, this webhook allows you to scale based on any external criteria you define.

## Prerequisites

Before you can use the external scaling webhook, you must enable it in your Ray Serve application configuration:

### Enable external scaler

Set `external_scaler_enabled: true` in your application configuration:

```yaml
applications:
  - name: my-app
    import_path: my_module:app
    external_scaler_enabled: true
    deployments:
      - name: my-deployment
        num_replicas: 1
```

:::{warning}
External scaling and built-in autoscaling are mutually exclusive. You can't use both for the same application.

- If you set `external_scaler_enabled: true`, you **must not** configure `autoscaling_config` on any deployment in that application.
- If you configure `autoscaling_config` on any deployment, you **must not** set `external_scaler_enabled: true` for the application.

Attempting to use both results in an error.
:::

### Get authentication token

The external scaling webhook requires authentication using a bearer token. You can obtain this token from the Ray Dashboard UI:

1. Open the Ray Dashboard in your browser (typically at `http://localhost:8265`).
2. Navigate to the Serve section.
3. Find and copy the authentication token for your application.

## API endpoint

The webhook is available at the following endpoint:

```
POST /api/v1/applications/{application_name}/deployments/{deployment_name}/scale
```

**Path Parameters:**
- `application_name`: The name of your Serve application.
- `deployment_name`: The name of the deployment you want to scale.

**Headers:**
- `Authorization` (required): Bearer token for authentication. Format: `Bearer <token>`
- `Content-Type` (required): Must be `application/json`

**Request Body:**

The following example shows the request body structure:

```json
{
  "target_num_replicas": 5
}
```

The request body must conform to the [`ScaleDeploymentRequest`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.schema.ScaleDeploymentRequest.html) schema:

- `target_num_replicas` (integer, required): The target number of replicas for the deployment. Must be a non-negative integer.


## Example: Predictive scaling

Implement predictive scaling based on historical patterns or forecasts. For instance, you can preemptively scale up before anticipated traffic spikes:

```python
import requests
from datetime import datetime


def predictive_scale(
    application_name: str,
    deployment_name: str,
    auth_token: str,
    serve_endpoint: str = "http://localhost:8000",
) -> bool:
    """Scale based on time of day and historical patterns."""
    hour = datetime.now().hour

    # Define scaling profile based on historical traffic patterns.
    if 9 <= hour < 17:  # Business hours
        target_replicas = 10
    elif 17 <= hour < 22:  # Evening peak
        target_replicas = 15
    else:  # Off-peak hours
        target_replicas = 3

    url = (
        f"{serve_endpoint}/api/v1/applications/{application_name}"
        f"/deployments/{deployment_name}/scale"
    )

    headers = {
        "Authorization": f"Bearer {auth_token}",
        "Content-Type": "application/json",
    }

    response = requests.post(
        url,
        headers=headers,
        json={"target_num_replicas": target_replicas},
    )

    return response.status_code == 200
```
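Calls to the webhook can fail transiently, so operational scripts often wrap the scaling call in a retry loop. The following sketch is one way to do that; it accepts any zero-argument callable that returns `True` on success, such as `lambda: predictive_scale(...)` from the example above. The retry and backoff parameters are illustrative, not part of the Serve API.

```python
import time
from typing import Callable

import requests


def scale_with_retries(
    scale_fn: Callable[[], bool],
    attempts: int = 3,
    backoff_s: float = 1.0,
) -> bool:
    """Call scale_fn until it reports success, with exponential backoff."""
    for attempt in range(attempts):
        try:
            if scale_fn():
                return True
        except requests.RequestException:
            # Transient network error; fall through to the backoff below.
            pass
        if attempt < attempts - 1:
            time.sleep(backoff_s * (2 ** attempt))
    return False
```

For example, `scale_with_retries(lambda: predictive_scale("my-app", "my-deployment", token))` retries the scaling request up to three times before giving up.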

## Use cases

The external scaling webhook is useful for several scenarios where you need custom scaling logic beyond what Ray Serve's built-in autoscaling provides:

### Custom metric-based scaling

Scale your deployments based on business or application metrics that Ray Serve doesn't track automatically:

- External monitoring systems such as Prometheus, Datadog, or CloudWatch metrics.
- Database query latencies or connection pool sizes.
- Cost metrics to optimize for budget constraints.
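As a concrete sketch of metric-based scaling, the following example polls a Prometheus server for request throughput and forwards a bounded replica target to the webhook. This is a sketch under assumptions: the Prometheus URL, the `http_requests_total` metric name, and the per-replica throughput figure are hypothetical and need to match your own monitoring setup.

```python
import math

import requests


def replicas_for_qps(
    qps: float,
    qps_per_replica: float = 50.0,
    min_replicas: int = 1,
    max_replicas: int = 20,
) -> int:
    """Translate a throughput measurement into a bounded replica count."""
    needed = math.ceil(qps / qps_per_replica)
    return max(min_replicas, min(max_replicas, needed))


def scale_from_prometheus(
    application_name: str,
    deployment_name: str,
    auth_token: str,
    prometheus_url: str = "http://localhost:9090",
    serve_endpoint: str = "http://localhost:8000",
) -> bool:
    """Scale a deployment based on a Prometheus throughput query."""
    # Hypothetical query: total request rate over the last minute.
    resp = requests.get(
        f"{prometheus_url}/api/v1/query",
        params={"query": "sum(rate(http_requests_total[1m]))"},
    )
    resp.raise_for_status()
    qps = float(resp.json()["data"]["result"][0]["value"][1])

    scale_resp = requests.post(
        f"{serve_endpoint}/api/v1/applications/{application_name}"
        f"/deployments/{deployment_name}/scale",
        headers={
            "Authorization": f"Bearer {auth_token}",
            "Content-Type": "application/json",
        },
        json={"target_num_replicas": replicas_for_qps(qps)},
    )
    return scale_resp.status_code == 200
```

Bounding the target with `min_replicas` and `max_replicas` guards against a noisy or missing metric scaling the deployment to zero or to an unbounded replica count.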

### Predictive and scheduled scaling

Implement predictive scaling based on historical patterns or business schedules:

- Preemptive scaling before anticipated traffic spikes (such as daily or weekly patterns).
- Event-driven scaling for known traffic events (such as sales, launches, or scheduled batch jobs).
- Time-of-day based scaling profiles for predictable workloads.

### Manual and operational control

Direct control over replica counts for operational scenarios:

- Manual scaling for load testing or performance testing.
- Cost optimization by scaling down during off-peak hours or weekends.
- Development and staging environment management.

2 changes: 2 additions & 0 deletions doc/source/serve/advanced-guides/index.md
@@ -16,6 +16,7 @@ deploy-vm
multi-app-container
custom-request-router
multi-node-gpu-troubleshooting
external-scaling-webhook
```

If you’re new to Ray Serve, start with the [Ray Serve Quickstart](serve-getting-started).
@@ -33,3 +34,4 @@ Use these advanced guides for more options and configurations:
- [Run Applications in Different Containers](serve-container-runtime-env-guide)
- [Use Custom Algorithm for Request Routing](custom-request-router)
- [Troubleshoot multi-node GPU setups for serving LLMs](multi-node-gpu-troubleshooting)
- [External Scaling Webhook API](external-scaling-webhook)
4 changes: 3 additions & 1 deletion doc/source/serve/production-guide/config.md
@@ -40,7 +40,8 @@ applications:
- name: ...
route_prefix: ...
import_path: ...
runtime_env: ...
runtime_env: ...
external_scaler_enabled: ...
deployments:
- name: ...
num_replicas: ...
@@ -99,6 +100,7 @@ These are the fields per `application`:
- **`route_prefix`**: An application can be called via HTTP at the specified route prefix. It defaults to `/`. The route prefix for each application must be unique.
- **`import_path`**: The path to your top-level Serve deployment (or the same path passed to `serve run`). The most minimal config file consists of only an `import_path`.
- **`runtime_env`**: Defines the environment that the application runs in. Use this parameter to package application dependencies such as `pip` packages (see {ref}`Runtime Environments <runtime-environments>` for supported fields). The `import_path` must be available _within_ the `runtime_env` if it's specified. The Serve config's `runtime_env` can only use [remote URIs](remote-uris) in its `working_dir` and `py_modules`; it can't use local zip files or directories. [More details on runtime env](serve-runtime-env).
- **`external_scaler_enabled`**: Enables the external scaling webhook, which lets you scale deployments from outside the Ray cluster using a REST API. When enabled, you can't use built-in autoscaling (`autoscaling_config`) for any deployment in this application. Defaults to `False`. See [External Scaling Webhook](serve-external-scale-webhook) for details.
- **`deployments (optional)`**: A list of deployment options that allows you to override the `@serve.deployment` settings specified in the deployment graph code. Each entry in this list must include the deployment `name`, which must match one in the code. If this section is omitted, Serve launches all deployments in the graph with the parameters specified in the code. See how to [configure serve deployment options](serve-configure-deployment).
- **`args`**: Arguments that are passed to the [application builder](serve-app-builder-guide).
