1 change: 1 addition & 0 deletions mkdocs.yml
@@ -68,6 +68,7 @@ nav:
- Getting started (Latest/Main): guides/getting-started-latest.md
- Use Cases:
  - Serving Multiple Inference Pools (Latest/Main): guides/serving-multiple-inference-pools-latest.md
  - Deploy As a Standalone Request Scheduler: guides/epp-standalone.md
- Rollout:
  - Adapter Rollout: guides/adapter-rollout.md
  - InferencePool Rollout: guides/inferencepool-rollout.md
124 changes: 124 additions & 0 deletions site-src/guides/epp-standalone.md
@@ -0,0 +1,124 @@
# Deploy As A Standalone Request Scheduler
The endpoint picker (EPP) is, at its core, a smart request scheduler for LLM requests. It currently implements a number of LLM-specific load-balancing optimizations, including:

* Prefix-cache aware scheduling
* Load-aware scheduling

When used with the Gateway API, the EPP runs as an ext-proc extension to an Envoy-based proxy fronting model servers in a Kubernetes cluster;
examples of such proxies are cloud-managed ones like GKE's L7 load balancer and open source counterparts like Istio and kGateway.
Running the EPP as an ext-proc here offers several key advantages:

* It utilizes robust, pre-existing L7 proxies, including both managed and open source options.
* It integrates seamlessly with the Kubernetes networking ecosystem via the Gateway API, which allows for:
    * Transforming a Kubernetes gateway into an inference scheduler using familiar APIs.
    * Leveraging Gateway API features like traffic splitting for gradual rollouts and HTTP rule matching.
    * Access to provider-specific features.

These benefits are critical for online services such as MaaS (Model-as-a-Service) offerings, which require multi-tenancy, high availability, scalability, and streamlined operations.

However, for some batch inference workloads, tight integration with the Gateway API and the need to deploy an external proxy separately are, in practice, operational overhead.
Consider an offline RL post-training job, where the sampler (the inference service within the job) is a single tenant/workload whose lifecycle is tied to the training job:
this inference service is specific to the job and is continuously updated during post-training, so it never serves any other traffic.
A simpler deployment mode would lower the barrier to adopting the EPP for such single-tenant workloads.

## How
A proxy is deployed as a sidecar to the EPP, and the two continue to communicate over the ext-proc protocol on localhost.
For endpoint discovery, you pass the model server pod selector to the EPP as a flag instead of depending on an InferencePool resource.
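
Once the example below is deployed, one way to see this layout is to list the containers in the EPP pod: you should see the EPP container alongside its proxy sidecar. The pod name placeholder below is not from this guide; substitute the pod created by your release.

```bash
# Inspect the EPP pod to see both containers (the EPP itself and its proxy
# sidecar) running side by side. Replace <epp-pod-name> with your pod's name.
kubectl get pod <epp-pod-name> \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.image}{"\n"}{end}'
```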

## Example

### **Prerequisites**

--8<-- "site-src/_includes/prereqs.md"

### **Steps**

#### Deploy Sample Model Server

--8<-- "site-src/_includes/model-server-intro.md"

--8<-- "site-src/_includes/model-server-gpu.md"

```bash
kubectl create secret generic hf-token --from-literal=token=$HF_TOKEN # Your Hugging Face Token with access to the set of Llama models
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml
```

--8<-- "site-src/_includes/model-server-cpu.md"

```bash
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/cpu-deployment.yaml
```

--8<-- "site-src/_includes/model-server-sim.md"

```bash
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/sim-deployment.yaml
```
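
Whichever option you deploy, the model server pods should carry the `app=vllm-llama3-8b-instruct` label that the EPP's endpoint selector matches in the next step. A quick sanity check:

```bash
# The EPP discovers endpoints by this label, so the model server pods
# must show up in this listing before you deploy the EPP.
kubectl get pods -l app=vllm-llama3-8b-instruct -o wide
```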

#### Deploy Endpoint Picker Extension with Envoy sidecar

Deploy an Endpoint Picker Extension named `vllm-llama3-8b-instruct` that selects endpoints with the label `app=vllm-llama3-8b-instruct`, listening on port 8000. The Helm install command automatically installs the endpoint-picker-specific resources.

Set the chart version and, optionally, your provider, then run the Helm install:

```bash
export EPP_STANDALONE_CHART_VERSION=v0
export PROVIDER=<YOUR_PROVIDER> # optional; set to gke to install the GKE-specific EPP monitoring resources
helm install vllm-llama3-8b-instruct \
  --dependency-update \
  --set inferenceExtension.endpointsServer.endpointSelector="app=vllm-llama3-8b-instruct" \
  --set provider.name=$PROVIDER \
  --version $EPP_STANDALONE_CHART_VERSION \
  oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/epp-standalone
```
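
If you want to see exactly how the chart wires the proxy sidecar to the EPP and how the endpoint selector value is turned into EPP flags, you can render the manifests locally with `helm template`. This is purely informational and reuses the same values as the install above; the flag names you will see come from the chart itself rather than from this guide.

```bash
# Render the chart locally to inspect the generated Deployment: the proxy
# sidecar, the EPP container, and the flags derived from the endpointSelector value.
helm template vllm-llama3-8b-instruct \
  --dependency-update \
  --set inferenceExtension.endpointsServer.endpointSelector="app=vllm-llama3-8b-instruct" \
  --set provider.name=$PROVIDER \
  --version $EPP_STANDALONE_CHART_VERSION \
  oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/epp-standalone
```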

#### Try it out

Wait until the EPP deployment is ready.
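
You can block until it is ready with `kubectl rollout status`; the Deployment name below is an assumption based on the Service name used in the request step and may differ in your release.

```bash
# Wait (up to 5 minutes) for the EPP Deployment to finish rolling out.
kubectl rollout status deployment/vllm-llama3-8b-instruct-epp --timeout=300s
```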

Once your epp-standalone pod is running, install the curl pod as follows:
```bash
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: curl
  labels:
    app: curl
spec:
  containers:
  - name: curl
    image: curlimages/curl:7.83.1
    imagePullPolicy: IfNotPresent
    command:
    - tail
    - -f
    - /dev/null
  restartPolicy: Never
EOF
```
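Wait for the curl pod to become ready before sending requests:
```bash
# Block until the curl pod reports Ready (up to 60s).
kubectl wait --for=condition=Ready pod/curl --timeout=60s
```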
Send an inference request to the EPP service:
```bash
kubectl exec curl -- curl -i http://vllm-llama3-8b-instruct-epp:8081/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "food-review-1","prompt": "Write as if you were a critic: San Francisco","max_tokens": 100,"temperature": 0}'
```
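
You should receive a completion response generated by one of the model server pods. To confirm the request was handled by the EPP and its sidecar, you can tail the EPP logs; as above, the Deployment name is assumed to match the Service name used in the request.

```bash
# Tail recent logs from all containers in the EPP Deployment (proxy sidecar and EPP)
# to see the request being scheduled.
kubectl logs deployment/vllm-llama3-8b-instruct-epp --all-containers --tail=50
```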

#### Cleanup
The following commands remove ALL resources created in this guide; please be careful not to delete resources you would like to keep.

1. Uninstall the EPP, curl pod and model server resources:

```bash
helm uninstall vllm-llama3-8b-instruct
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/inferenceobjective.yaml --ignore-not-found
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/cpu-deployment.yaml --ignore-not-found
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml --ignore-not-found
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/sim-deployment.yaml --ignore-not-found
kubectl delete secret hf-token --ignore-not-found
kubectl delete pod curl --ignore-not-found
```