153 changes: 153 additions & 0 deletions doc/source/serve/advanced-guides/external-scaling-webhook.md
@@ -0,0 +1,153 @@
(serve-external-scale-webhook)=

:::{warning}
This API is in alpha and may change before becoming stable.
:::

# External Scaling Webhook

Ray Serve exposes a REST API endpoint that you can use to dynamically scale your deployments from outside the Ray cluster. This endpoint gives you the flexibility to implement custom scaling logic based on any signals you choose, such as metrics from external monitoring systems, business metrics, or predictive models.

## Overview

The external scaling webhook provides programmatic control over the number of replicas for any deployment in your Ray Serve application. Unlike Ray Serve's built-in autoscaling, which scales based on queue depth and ongoing requests, this webhook allows you to scale based on any external criteria you define.

## Prerequisites

Before you can use the external scaling webhook, you must enable it in your Ray Serve application configuration:

### Enable external scaler

Set `external_scaler_enabled: true` in your application configuration:

```yaml
applications:
  - name: my-app
    import_path: my_module:app
    external_scaler_enabled: true
    deployments:
      - name: my-deployment
        num_replicas: 1
```

:::{warning}
External scaling and built-in autoscaling are mutually exclusive. You can't use both for the same application.

- If you set `external_scaler_enabled: true`, you **must not** configure `autoscaling_config` on any deployment in that application.
- If you configure `autoscaling_config` on any deployment, you **must not** set `external_scaler_enabled: true` for the application.

Attempting to use both results in an error.
:::

### Get authentication token

The external scaling webhook requires authentication using a bearer token. You can obtain this token from the Ray Dashboard UI:

1. Open the Ray Dashboard in your browser (typically at `http://localhost:8265`).
2. Navigate to the Serve section.
3. Find and copy the authentication token for your application.

## API endpoint

The webhook is available at the following endpoint:

```
POST /api/v1/applications/{application_name}/deployments/{deployment_name}/scale
```

**Path Parameters:**
- `application_name`: The name of your Serve application.
- `deployment_name`: The name of the deployment you want to scale.

**Headers:**
- `Authorization` (required): Bearer token for authentication. Format: `Bearer <token>`
- `Content-Type` (required): Must be `application/json`

**Request Body:**

The following example shows the request body structure:

```json
{
  "target_num_replicas": 5
}
```

The request body must conform to the [`ScaleDeploymentRequest`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.schema.ScaleDeploymentRequest.html) schema:

- `target_num_replicas` (integer, required): The target number of replicas for the deployment. Must be a non-negative integer.


## Example: Predictive scaling

Implement predictive scaling based on historical patterns or forecasts. For instance, you can preemptively scale up before anticipated traffic spikes:

```python
import requests
from datetime import datetime


def predictive_scale(
    application_name: str,
    deployment_name: str,
    auth_token: str,
    serve_endpoint: str = "http://localhost:8000",
) -> bool:
    """Scale based on time of day and historical patterns."""
    hour = datetime.now().hour

    # Define scaling profile based on historical traffic patterns.
    if 9 <= hour < 17:  # Business hours
        target_replicas = 10
    elif 17 <= hour < 22:  # Evening peak
        target_replicas = 15
    else:  # Off-peak hours
        target_replicas = 3

    url = (
        f"{serve_endpoint}/api/v1/applications/{application_name}"
        f"/deployments/{deployment_name}/scale"
    )

    headers = {
        "Authorization": f"Bearer {auth_token}",
        "Content-Type": "application/json",
    }

    response = requests.post(
        url,
        headers=headers,
        json={"target_num_replicas": target_replicas},
    )

    return response.status_code == 200
```
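Calls to the webhook can fail transiently, so operational scripts often wrap the scaling call in a retry loop. The following sketch is one way to do that; it accepts any zero-argument callable that returns `True` on success, such as `lambda: predictive_scale(...)` from the example above. The retry and backoff parameters are illustrative, not part of the Serve API.

```python
import time
from typing import Callable

import requests


def scale_with_retries(
    scale_fn: Callable[[], bool],
    attempts: int = 3,
    backoff_s: float = 1.0,
) -> bool:
    """Call scale_fn until it reports success, with exponential backoff."""
    for attempt in range(attempts):
        try:
            if scale_fn():
                return True
        except requests.RequestException:
            # Transient network error; fall through to the backoff below.
            pass
        if attempt < attempts - 1:
            time.sleep(backoff_s * (2 ** attempt))
    return False
```

For example, `scale_with_retries(lambda: predictive_scale("my-app", "my-deployment", token))` retries the scaling request up to three times before giving up.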

## Use cases

The external scaling webhook is useful for several scenarios where you need custom scaling logic beyond what Ray Serve's built-in autoscaling provides:

### Custom metric-based scaling

Scale your deployments based on business or application metrics that Ray Serve doesn't track automatically:

- External monitoring systems such as Prometheus, Datadog, or CloudWatch metrics.
- Database query latencies or connection pool sizes.
- Cost metrics to optimize for budget constraints.
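As a concrete sketch of metric-based scaling, the following example polls a Prometheus server for request throughput and forwards a bounded replica target to the webhook. This is a sketch under assumptions: the Prometheus URL, the `http_requests_total` metric name, and the per-replica throughput figure are hypothetical and need to match your own monitoring setup.

```python
import math

import requests


def replicas_for_qps(
    qps: float,
    qps_per_replica: float = 50.0,
    min_replicas: int = 1,
    max_replicas: int = 20,
) -> int:
    """Translate a throughput measurement into a bounded replica count."""
    needed = math.ceil(qps / qps_per_replica)
    return max(min_replicas, min(max_replicas, needed))


def scale_from_prometheus(
    application_name: str,
    deployment_name: str,
    auth_token: str,
    prometheus_url: str = "http://localhost:9090",
    serve_endpoint: str = "http://localhost:8000",
) -> bool:
    """Scale a deployment based on a Prometheus throughput query."""
    # Hypothetical query: total request rate over the last minute.
    resp = requests.get(
        f"{prometheus_url}/api/v1/query",
        params={"query": "sum(rate(http_requests_total[1m]))"},
    )
    resp.raise_for_status()
    qps = float(resp.json()["data"]["result"][0]["value"][1])

    scale_resp = requests.post(
        f"{serve_endpoint}/api/v1/applications/{application_name}"
        f"/deployments/{deployment_name}/scale",
        headers={
            "Authorization": f"Bearer {auth_token}",
            "Content-Type": "application/json",
        },
        json={"target_num_replicas": replicas_for_qps(qps)},
    )
    return scale_resp.status_code == 200
```

Bounding the target with `min_replicas` and `max_replicas` guards against a noisy or missing metric scaling the deployment to zero or to an unbounded replica count.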

### Predictive and scheduled scaling

Implement predictive scaling based on historical patterns or business schedules:

- Preemptive scaling before anticipated traffic spikes (such as daily or weekly patterns).
- Event-driven scaling for known traffic events (such as sales, launches, or scheduled batch jobs).
- Time-of-day based scaling profiles for predictable workloads.

### Manual and operational control

Direct control over replica counts for operational scenarios:

- Manual scaling for load testing or performance testing.
- Cost optimization by scaling down during off-peak hours or weekends.
- Development and staging environment management.

2 changes: 2 additions & 0 deletions doc/source/serve/advanced-guides/index.md
@@ -16,6 +16,7 @@ deploy-vm
multi-app-container
custom-request-router
multi-node-gpu-troubleshooting
external-scaling-webhook
```

If you’re new to Ray Serve, start with the [Ray Serve Quickstart](serve-getting-started).
@@ -33,3 +34,4 @@ Use these advanced guides for more options and configurations:
- [Run Applications in Different Containers](serve-container-runtime-env-guide)
- [Use Custom Algorithm for Request Routing](custom-request-router)
- [Troubleshoot multi-node GPU setups for serving LLMs](multi-node-gpu-troubleshooting)
- [External Scaling Webhook API](external-scaling-webhook)
4 changes: 3 additions & 1 deletion doc/source/serve/production-guide/config.md
@@ -40,7 +40,8 @@ applications:
- name: ...
route_prefix: ...
import_path: ...
runtime_env: ...
runtime_env: ...
external_scaler_enabled: ...
deployments:
- name: ...
num_replicas: ...
@@ -99,6 +100,7 @@ These are the fields per `application`:
- **`route_prefix`**: An application can be called via HTTP at the specified route prefix. It defaults to `/`. The route prefix for each application must be unique.
- **`import_path`**: The path to your top-level Serve deployment (or the same path passed to `serve run`). The most minimal config file consists of only an `import_path`.
- **`runtime_env`**: Defines the environment that the application runs in. Use this parameter to package application dependencies such as `pip` packages (see {ref}`Runtime Environments <runtime-environments>` for supported fields). The `import_path` must be available _within_ the `runtime_env` if it's specified. The Serve config's `runtime_env` can only use [remote URIs](remote-uris) in its `working_dir` and `py_modules`; it can't use local zip files or directories. [More details on runtime env](serve-runtime-env).
- **`external_scaler_enabled`**: Enables the external scaling webhook, which lets you scale deployments from outside the Ray cluster using a REST API. When enabled, you can't use built-in autoscaling (`autoscaling_config`) for any deployment in this application. Defaults to `False`. See [External Scaling Webhook](serve-external-scale-webhook) for details.
- **`deployments (optional)`**: A list of deployment options that allows you to override the `@serve.deployment` settings specified in the deployment graph code. Each entry in this list must include the deployment `name`, which must match one in the code. If this section is omitted, Serve launches all deployments in the graph with the parameters specified in the code. See how to [configure serve deployment options](serve-configure-deployment).
- **`args`**: Arguments that are passed to the [application builder](serve-app-builder-guide).
