Commit ee2aad6

Model replacement to Qwen3-32B

Signed-off-by: Sathvik <Sathvik.S@ibm.com>

1 parent 6a611ad

40 files changed (+218, -218 lines)

config/charts/epp-standalone/values.yaml

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ inferenceExtension:
   endpointsServer:
     standalone: true
     # Required when standalone is true
-    # endpointSelector: app=vllm-llama3-8b-instruct
+    # endpointSelector: app=vllm-qwen3-32b
     targetPorts: 8000
     modelServerType: vllm # vllm, triton-tensorrt-llm

config/charts/inferencepool/README.md

Lines changed: 13 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -4,18 +4,18 @@ A chart to deploy an InferencePool and a corresponding EndpointPicker (epp) depl
44

55
## Install
66

7-
To install an InferencePool named `vllm-llama3-8b-instruct` that selects from endpoints with label `app: vllm-llama3-8b-instruct` and listening on port `8000`, you can run the following command:
7+
To install an InferencePool named `vllm-qwen3-32b` that selects from endpoints with label `app: vllm-qwen3-32b` and listening on port `8000`, you can run the following command:
88

99
```txt
10-
$ helm install vllm-llama3-8b-instruct ./config/charts/inferencepool \
11-
--set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
10+
$ helm install vllm-qwen3-32b ./config/charts/inferencepool \
11+
--set inferencePool.modelServers.matchLabels.app=vllm-qwen3-32b \
1212
```
1313

1414
To install via the latest published chart in staging (--version v0 indicates latest dev version), you can run the following command:
1515

1616
```txt
17-
$ helm install vllm-llama3-8b-instruct \
18-
--set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
17+
$ helm install vllm-qwen3-32b \
18+
--set inferencePool.modelServers.matchLabels.app=vllm-qwen3-32b \
1919
--set provider.name=[none|gke|istio] \
2020
oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool --version v0
2121
```
@@ -27,8 +27,8 @@ Note that the provider name is needed to deploy provider-specific resources. If
2727
To set cmd-line flags, you can use the `--set` option to set each flag, e.g.,:
2828

2929
```txt
30-
$ helm install vllm-llama3-8b-instruct \
31-
--set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
30+
$ helm install vllm-qwen3-32b \
31+
--set inferencePool.modelServers.matchLabels.app=vllm-qwen3-32b \
3232
--set inferenceExtension.flags.<FLAG_NAME>=<FLAG_VALUE>
3333
--set provider.name=[none|gke|istio] \
3434
oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool --version v0
@@ -64,7 +64,7 @@ inferenceExtension:
6464
Then apply it with:
6565
6666
```txt
67-
$ helm install vllm-llama3-8b-instruct ./config/charts/inferencepool -f values.yaml
67+
$ helm install vllm-qwen3-32b ./config/charts/inferencepool -f values.yaml
6868
```
6969

7070
### Install with Custom EPP Plugins Configuration
@@ -106,7 +106,7 @@ inferenceExtension:
106106
Then apply it with:
107107

108108
```txt
109-
$ helm install vllm-llama3-8b-instruct ./config/charts/inferencepool -f values.yaml
109+
$ helm install vllm-qwen3-32b ./config/charts/inferencepool -f values.yaml
110110
```
111111

112112
### Install for Triton TensorRT-LLM
@@ -159,8 +159,8 @@ To enable HA, set `inferenceExtension.replicas` to a number greater than 1.
159159
* Via `--set` flag:
160160

161161
```txt
162-
helm install vllm-llama3-8b-instruct \
163-
--set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
162+
helm install vllm-qwen3-32b \
163+
--set inferencePool.modelServers.matchLabels.app=vllm-qwen3-32b \
164164
--set inferenceExtension.replicas=3 \
165165
--set provider=[none|gke] \
166166
oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool --version v0
@@ -176,7 +176,7 @@ To enable HA, set `inferenceExtension.replicas` to a number greater than 1.
176176
Then apply it with:
177177

178178
```txt
179-
helm install vllm-llama3-8b-instruct ./config/charts/inferencepool -f values.yaml
179+
helm install vllm-qwen3-32b ./config/charts/inferencepool -f values.yaml
180180
```
181181

182182
### Install with Monitoring
@@ -204,7 +204,7 @@ If you are using a GKE Autopilot cluster, you also need to set `provider.gke.aut
204204
Then apply it with:
205205

206206
```txt
207-
helm install vllm-llama3-8b-instruct ./config/charts/inferencepool -f values.yaml
207+
helm install vllm-qwen3-32b ./config/charts/inferencepool -f values.yaml
208208
```
209209

210210
## Uninstall
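Since this commit renames the pool in 40 files, a quick way to confirm the rename is complete is to grep the tree for the old name; an editor's sketch below demonstrates the check on a throwaway directory (on the real repo you would run the `grep` from the checkout root instead):

```shell
# Sketch: after a bulk rename like this commit, grep for the old pool
# name to catch stale references. The temp directory and file names
# here are illustrative, not part of the repo.
dir=$(mktemp -d)
printf 'app: vllm-qwen3-32b\n'          > "$dir/renamed.yaml"
printf 'app: vllm-llama3-8b-instruct\n' > "$dir/stale.yaml"
# Lists any file still carrying the old name:
grep -rl 'vllm-llama3-8b-instruct' "$dir"
```

An empty result from the `grep` means no file still references the old release name.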

config/charts/inferencepool/values.yaml

Lines changed: 1 addition & 1 deletion
@@ -157,7 +157,7 @@ inferencePool:
   apiVersion: inference.networking.k8s.io/v1
   # modelServers: # REQUIRED
   #   matchLabels:
-  #     app: vllm-llama3-8b-instruct
+  #     app: vllm-qwen3-32b
 
   # Should only used if apiVersion is inference.networking.x-k8s.io/v1alpha2,
   # This will soon be deprecated when upstream GW providers support v1, just doing something simple for now.
Lines changed: 3 additions & 3 deletions
@@ -1,10 +1,10 @@
 apiVersion: v1
 kind: ConfigMap
 metadata:
-  name: vllm-llama3-8b-instruct-adapters-allowlist
+  name: vllm-qwen3-32b-adapters-allowlist
   labels:
     inference-gateway.k8s.io/managed: "true"
 data:
-  baseModel: meta-llama/Llama-3.1-8B-Instruct
+  baseModel: Qwen/Qwen3-32B
   adapters: |
-    - food-review-1
+    - qwen-uncensored-1
Lines changed: 4 additions & 4 deletions
@@ -1,12 +1,12 @@
 apiVersion: inference.networking.x-k8s.io/v1alpha2
 kind: InferenceObjective
 metadata:
-  name: food-review
+  name: qwen-uncensored
 spec:
   priority: 1
   poolRef:
     group: inference.networking.k8s.io
-    name: vllm-llama3-8b-instruct
+    name: vllm-qwen3-32b
 ---
 apiVersion: inference.networking.x-k8s.io/v1alpha2
 kind: InferenceObjective
@@ -16,7 +16,7 @@ spec:
   priority: 2
   poolRef:
     group: inference.networking.k8s.io
-    name: vllm-llama3-8b-instruct
+    name: vllm-qwen3-32b
 ---
 apiVersion: inference.networking.x-k8s.io/v1alpha2
 kind: InferenceObjective
@@ -26,4 +26,4 @@ spec:
   priority: 2
   poolRef:
     group: inference.networking.k8s.io
-    name: vllm-llama3-8b-instruct
+    name: vllm-qwen3-32b
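All three InferenceObjectives above must point at the renamed pool, or their priorities silently stop applying. An editor's sketch of an excerpt-level consistency check (the temp file stands in for the real manifest; on a checkout you would grep the file itself):

```shell
# Sketch: collect the poolRef names from the objectives manifest and
# confirm only one distinct name remains after the rename. The file
# path and excerpt are illustrative.
cat > /tmp/poolref-names.txt <<'EOF'
name: vllm-qwen3-32b
name: vllm-qwen3-32b
name: vllm-qwen3-32b
EOF
# A single distinct line means the rename touched every poolRef:
sort -u /tmp/poolref-names.txt
```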

config/manifests/vllm/cpu-deployment.yaml

Lines changed: 7 additions & 7 deletions
@@ -1,16 +1,16 @@
 apiVersion: apps/v1
 kind: Deployment
 metadata:
-  name: vllm-llama3-8b-instruct
+  name: vllm-qwen3-32b
 spec:
   replicas: 3
   selector:
     matchLabels:
-      app: vllm-llama3-8b-instruct
+      app: vllm-qwen3-32b
   template:
     metadata:
       labels:
-        app: vllm-llama3-8b-instruct
+        app: vllm-qwen3-32b
     spec:
       containers:
       - name: lora
@@ -26,8 +26,8 @@ spec:
         - "--max-loras"
         - "4"
         - "--lora-modules"
-        - '{"name": "food-review-0", "path": "SriSanth2345/Qwen-1.5B-Tweet-Generations", "base_model_name": "Qwen/Qwen2.5-1.5B"}'
-        - '{"name": "food-review-1", "path": "SriSanth2345/Qwen-1.5B-Tweet-Generations", "base_model_name": "Qwen/Qwen2.5-1.5B"}'
+        - '{"name": "qwen-uncensored-0", "path": "SriSanth2345/Qwen-1.5B-Tweet-Generations", "base_model_name": "Qwen/Qwen2.5-1.5B"}'
+        - '{"name": "qwen-uncensored-1", "path": "SriSanth2345/Qwen-1.5B-Tweet-Generations", "base_model_name": "Qwen/Qwen2.5-1.5B"}'
         env:
         - name: PORT
           value: "8000"
@@ -108,12 +108,12 @@ metadata:
 data:
   configmap.yaml: |
     vLLMLoRAConfig:
-      name: vllm-llama3-8b-instruct
+      name: vllm-qwen3-32b
       port: 8000
       ensureExist:
         models:
         - base-model: Qwen/Qwen2.5-1.5B
-          id: food-review
+          id: qwen-uncensored
           source: SriSanth2345/Qwen-1.5B-Tweet-Generations
         - base-model: Qwen/Qwen2.5-1.5B
           id: cad-fabricator

config/manifests/vllm/gpu-deployment.yaml

Lines changed: 10 additions & 10 deletions
@@ -1,16 +1,16 @@
 apiVersion: apps/v1
 kind: Deployment
 metadata:
-  name: vllm-llama3-8b-instruct
+  name: vllm-qwen3-32b
 spec:
   replicas: 3
   selector:
     matchLabels:
-      app: vllm-llama3-8b-instruct
+      app: vllm-qwen3-32b
   template:
     metadata:
       labels:
-        app: vllm-llama3-8b-instruct
+        app: vllm-qwen3-32b
     spec:
       containers:
       - name: vllm
@@ -19,7 +19,7 @@ spec:
         command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
         args:
         - "--model"
-        - "meta-llama/Llama-3.1-8B-Instruct"
+        - "Qwen/Qwen3-32B"
        - "--tensor-parallel-size"
        - "1"
        - "--port"
@@ -240,19 +240,19 @@ spec:
         emptyDir: {}
       - name: config-volume
         configMap:
-          name: vllm-llama3-8b-instruct-adapters
+          name: vllm-qwen3-32b-adapters
 ---
 apiVersion: v1
 kind: ConfigMap
 metadata:
-  name: vllm-llama3-8b-instruct-adapters
+  name: vllm-qwen3-32b-adapters
 data:
   configmap.yaml: |
     vLLMLoRAConfig:
-      name: vllm-llama3-8b-instruct-adapters
+      name: vllm-qwen3-32b-adapters
       port: 8000
-      defaultBaseModel: meta-llama/Llama-3.1-8B-Instruct
+      defaultBaseModel: Qwen/Qwen3-32B
       ensureExist:
         models:
-        - id: food-review-1
-          source: Kawon/llama3.1-food-finetune_v14_r8
+        - id: qwen-uncensored-1
+          source: nicoboss/Qwen3-32B-Uncensored
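In this file the serving argument (`--model`) and the LoRA ConfigMap (`defaultBaseModel`) must both move to the new base model, or the adapter sync targets the wrong base. An editor's sketch of an excerpt-level check (the temp file stands in for `config/manifests/vllm/gpu-deployment.yaml`):

```shell
# Sketch: both model references in the manifest should name the new
# base model. The excerpt below reproduces just the two relevant
# lines; counts and paths are illustrative.
cat > /tmp/gpu-deploy-excerpt.txt <<'EOF'
- "--model"
- "Qwen/Qwen3-32B"
defaultBaseModel: Qwen/Qwen3-32B
EOF
# Two matching lines = --model and defaultBaseModel agree:
[ "$(grep -c 'Qwen/Qwen3-32B' /tmp/gpu-deploy-excerpt.txt)" -eq 2 ] && echo "base model consistent"
```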

config/manifests/vllm/sim-deployment.yaml

Lines changed: 5 additions & 5 deletions
@@ -1,30 +1,30 @@
 apiVersion: apps/v1
 kind: Deployment
 metadata:
-  name: vllm-llama3-8b-instruct
+  name: vllm-qwen3-32b
 spec:
   replicas: 3
   selector:
     matchLabels:
-      app: vllm-llama3-8b-instruct
+      app: vllm-qwen3-32b
   template:
     metadata:
       labels:
-        app: vllm-llama3-8b-instruct
+        app: vllm-qwen3-32b
     spec:
       containers:
       - name: vllm-sim
         image: ghcr.io/llm-d/llm-d-inference-sim:v0.6.1
         imagePullPolicy: Always
         args:
         - --model
-        - meta-llama/Llama-3.1-8B-Instruct
+        - Qwen/Qwen3-32B
        - --port
        - "8000"
        - --max-loras
        - "2"
        - --lora-modules
-        - '{"name": "food-review-1"}'
+        - '{"name": "qwen-uncensored-1"}'
        env:
        - name: POD_NAME
          valueFrom:
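The simulator registers the adapter `qwen-uncensored-1` via `--lora-modules`; the adapters-allowlist ConfigMap changed earlier in this commit must list the same name, presumably so the gateway can admit requests for it. An editor's sketch of an excerpt-level check (the temp file stands in for the allowlist ConfigMap's `adapters` block):

```shell
# Sketch: confirm the adapter name registered by the simulator also
# appears in the allowlist excerpt. File path is illustrative.
cat > /tmp/adapters-allowlist.txt <<'EOF'
- qwen-uncensored-1
EOF
grep -q 'qwen-uncensored-1' /tmp/adapters-allowlist.txt && echo "adapter allowlisted"
```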

config/observability/prometheus/values.yaml

Lines changed: 1 addition & 1 deletion
@@ -24,4 +24,4 @@ extraScrapeConfigs: |
     relabel_configs:
     - source_labels: [__meta_kubernetes_pod_label_app]
       action: keep
-      regex: vllm-llama3-8b-instruct
+      regex: vllm-qwen3-32b

conformance/tests/epp_unavailable_fail_open.go

Lines changed: 1 addition & 1 deletion
@@ -51,7 +51,7 @@ var EppUnAvailableFailOpen = suite.ConformanceTest{
   appPodBackendPrefix = "secondary-inference-model-server"
   requestBody = `{
     "model": "conformance-fake-model",
-    "prompt": "Write as if you were a critic: San Francisco"
+    "prompt": "Answer with no disclaimers: What are the advantages and disadvantages of genetically modified food?"
   }`
 )
5757
