
Commit 83b5dac

sats-23ahg-g authored and committed
Model replacement to Qwen3-32B (kubernetes-sigs#2189)
* Model replacement to Qwen3-32B
* Update config/manifests/sglang/gpu-deployment.yaml: add back sglang engine type label
* Update config/manifests/vllm/cpu-deployment.yaml: add back vllm engine type label
* Update config/manifests/vllm/gpu-deployment.yaml: add back vllm engine type label

Signed-off-by: Sathvik <Sathvik.S@ibm.com>
Co-authored-by: Abdullah Gharaibeh <40361897+ahg-g@users.noreply.github.com>
1 parent 455b9fd commit 83b5dac

31 files changed: +216 -216 lines

config/charts/inferencepool/README.md

Lines changed: 13 additions & 13 deletions
@@ -4,18 +4,18 @@ A chart to deploy an InferencePool and a corresponding EndpointPicker (epp) depl
 
 ## Install
 
-To install an InferencePool named `vllm-llama3-8b-instruct` that selects from endpoints with label `app: vllm-llama3-8b-instruct` and listening on port `8000`, you can run the following command:
+To install an InferencePool named `vllm-qwen3-32b` that selects from endpoints with label `app: vllm-qwen3-32b` and listening on port `8000`, you can run the following command:
 
 ```txt
-$ helm install vllm-llama3-8b-instruct ./config/charts/inferencepool \
-  --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
+$ helm install vllm-qwen3-32b ./config/charts/inferencepool \
+  --set inferencePool.modelServers.matchLabels.app=vllm-qwen3-32b \
 ```
 
 To install via the latest published chart in staging (--version v0 indicates latest dev version), you can run the following command:
 
 ```txt
-$ helm install vllm-llama3-8b-instruct \
-  --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
+$ helm install vllm-qwen3-32b \
+  --set inferencePool.modelServers.matchLabels.app=vllm-qwen3-32b \
   --set provider.name=[none|gke|istio] \
   oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool --version v0
 ```
@@ -27,8 +27,8 @@ Note that the provider name is needed to deploy provider-specific resources. If
 To set cmd-line flags, you can use the `--set` option to set each flag, e.g.,:
 
 ```txt
-$ helm install vllm-llama3-8b-instruct \
-  --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
+$ helm install vllm-qwen3-32b \
+  --set inferencePool.modelServers.matchLabels.app=vllm-qwen3-32b \
   --set inferenceExtension.flags.<FLAG_NAME>=<FLAG_VALUE>
   --set provider.name=[none|gke|istio] \
   oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool --version v0
@@ -64,7 +64,7 @@ inferenceExtension:
 Then apply it with:
 
 ```txt
-$ helm install vllm-llama3-8b-instruct ./config/charts/inferencepool -f values.yaml
+$ helm install vllm-qwen3-32b ./config/charts/inferencepool -f values.yaml
 ```
 
 ### Install with Custom EPP Plugins Configuration
@@ -106,7 +106,7 @@ inferenceExtension:
 Then apply it with:
 
 ```txt
-$ helm install vllm-llama3-8b-instruct ./config/charts/inferencepool -f values.yaml
+$ helm install vllm-qwen3-32b ./config/charts/inferencepool -f values.yaml
 ```
 
 ### Install for Triton TensorRT-LLM
@@ -159,8 +159,8 @@ To enable HA, set `inferenceExtension.replicas` to a number greater than 1.
 * Via `--set` flag:
 
 ```txt
-helm install vllm-llama3-8b-instruct \
-  --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
+helm install vllm-qwen3-32b \
+  --set inferencePool.modelServers.matchLabels.app=vllm-qwen3-32b \
   --set inferenceExtension.replicas=3 \
   --set provider=[none|gke] \
   oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool --version v0
@@ -176,7 +176,7 @@ To enable HA, set `inferenceExtension.replicas` to a number greater than 1.
 Then apply it with:
 
 ```txt
-helm install vllm-llama3-8b-instruct ./config/charts/inferencepool -f values.yaml
+helm install vllm-qwen3-32b ./config/charts/inferencepool -f values.yaml
 ```
 
 ### Install with Monitoring
@@ -204,7 +204,7 @@ If you are using a GKE Autopilot cluster, you also need to set `provider.gke.aut
 Then apply it with:
 
 ```txt
-helm install vllm-llama3-8b-instruct ./config/charts/inferencepool -f values.yaml
+helm install vllm-qwen3-32b ./config/charts/inferencepool -f values.yaml
 ```
 
 ## Uninstall
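The chart selects model-server endpoints via `matchLabels`, which is why the rename must be applied consistently to both the selector and the pod labels. A minimal sketch of Kubernetes `matchLabels` semantics (the pod labels below are illustrative, not taken from a real cluster):

```python
# Sketch of Kubernetes matchLabels semantics: a pod matches only if
# every selector key/value pair appears verbatim in the pod's labels.
def matches(selector: dict, labels: dict) -> bool:
    return all(labels.get(k) == v for k, v in selector.items())

selector = {"app": "vllm-qwen3-32b"}
old_pod = {"app": "vllm-llama3-8b-instruct"}
new_pod = {"app": "vllm-qwen3-32b",
           "inference.networking.k8s.io/engine-type": "vllm"}

print(matches(selector, old_pod))  # False: pods with the old label drop out of the pool
print(matches(selector, new_pod))  # True
```

This is why the engine-type labels had to be added back in this commit: the pod template's labels must remain a superset of whatever selectors reference them.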

config/charts/inferencepool/values.yaml

Lines changed: 1 addition & 1 deletion
@@ -87,7 +87,7 @@ inferencePool:
   apiVersion: inference.networking.k8s.io/v1
   # modelServers: # REQUIRED
   #   matchLabels:
-  #     app: vllm-llama3-8b-instruct
+  #     app: vllm-qwen3-32b
 
 # Should only used if apiVersion is inference.networking.x-k8s.io/v1alpha2,
 # This will soon be deprecated when upstream GW providers support v1, just doing something simple for now.

config/charts/standalone/values.yaml

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ inferenceExtension:
   # set it to false when you want to deploy EPP with inferencepool
   createInferencePool: true
   # Required when createInferencePool is false
-  # endpointSelector: app=vllm-llama3-8b-instruct
+  # endpointSelector: app=vllm-qwen3-32b
   # unused when createInferencePool is true
   targetPorts: 8000
   # unused when createInferencePool is true
Lines changed: 4 additions & 4 deletions
@@ -1,12 +1,12 @@
 apiVersion: inference.networking.x-k8s.io/v1alpha2
 kind: InferenceObjective
 metadata:
-  name: food-review
+  name: small-segment-lora
 spec:
   priority: 1
   poolRef:
     group: inference.networking.k8s.io
-    name: vllm-llama3-8b-instruct
+    name: vllm-qwen3-32b
 ---
 apiVersion: inference.networking.x-k8s.io/v1alpha2
 kind: InferenceObjective
@@ -16,7 +16,7 @@ spec:
   priority: 2
   poolRef:
     group: inference.networking.k8s.io
-    name: vllm-llama3-8b-instruct
+    name: vllm-qwen3-32b
 ---
 apiVersion: inference.networking.x-k8s.io/v1alpha2
 kind: InferenceObjective
@@ -26,4 +26,4 @@ spec:
   priority: 2
   poolRef:
     group: inference.networking.k8s.io
-    name: vllm-llama3-8b-instruct
+    name: vllm-qwen3-32b

config/manifests/sglang/gpu-deployment.yaml

Lines changed: 5 additions & 5 deletions
@@ -1,26 +1,26 @@
 apiVersion: apps/v1
 kind: Deployment
 metadata:
-  name: sgl-llama3-8b-instruct
+  name: sgl-qwen3-32b-instruct
   labels:
-    app: sgl-llama3-8b-instruct
+    app: sgl-qwen3-32b-instruct
 spec:
   replicas: 3
   selector:
     matchLabels:
-      app: sgl-llama3-8b-instruct
+      app: sgl-qwen3-32b-instruct
   template:
     metadata:
       labels:
-        app: sgl-llama3-8b-instruct
+        app: sgl-qwen3-32b-instruct
         inference.networking.k8s.io/engine-type: sglang
     spec:
       containers:
       - name: sglang
         image: lmsysorg/sglang:latest
         command: ["python3", "-m", "sglang.launch_server"]
         args:
-        - "--model-path=meta-llama/Llama-3.1-8B-Instruct"
+        - "--model-path=Qwen/Qwen3-32B"
        - "--host=0.0.0.0"
        - "--port=8000"
        - "--dtype=bfloat16"

config/manifests/vllm/cpu-deployment.yaml

Lines changed: 11 additions & 11 deletions
@@ -1,16 +1,16 @@
 apiVersion: apps/v1
 kind: Deployment
 metadata:
-  name: vllm-llama3-8b-instruct
+  name: vllm-qwen3-32b
 spec:
   replicas: 3
   selector:
     matchLabels:
-      app: vllm-llama3-8b-instruct
+      app: vllm-qwen3-32b
   template:
     metadata:
       labels:
-        app: vllm-llama3-8b-instruct
+        app: vllm-qwen3-32b
         inference.networking.k8s.io/engine-type: vllm
     spec:
       containers:
@@ -20,15 +20,15 @@ spec:
         command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
         args:
         - "--model"
-        - "Qwen/Qwen2.5-1.5B-Instruct"
+        - "Qwen/Qwen3-32B"
         - "--port"
         - "8000"
         - "--enable-lora"
         - "--max-loras"
         - "4"
         - "--lora-modules"
-        - '{"name": "food-review-0", "path": "SriSanth2345/Qwen-1.5B-Tweet-Generations", "base_model_name": "Qwen/Qwen2.5-1.5B"}'
-        - '{"name": "food-review-1", "path": "SriSanth2345/Qwen-1.5B-Tweet-Generations", "base_model_name": "Qwen/Qwen2.5-1.5B"}'
+        - '{"name": "small-segment-lora-0", "path": "ttt421/nec119-small-segment-lora", "base_model_name": "Qwen/Qwen3-32B"}'
+        - '{"name": "small-segment-lora-1", "path": "ttt421/nec119-small-segment-lora", "base_model_name": "Qwen/Qwen3-32B"}'
         env:
         - name: PORT
           value: "8000"
@@ -109,13 +109,13 @@ metadata:
 data:
   configmap.yaml: |
     vLLMLoRAConfig:
-      name: vllm-llama3-8b-instruct
+      name: vllm-qwen3-32b
       port: 8000
       ensureExist:
         models:
-        - base-model: Qwen/Qwen2.5-1.5B
-          id: food-review
-          source: SriSanth2345/Qwen-1.5B-Tweet-Generations
+        - base-model: Qwen/Qwen3-32B
+          id: small-segment-lora
+          source: ttt421/nec119-small-segment-lora
         - base-model: Qwen/Qwen2.5-1.5B
           id: cad-fabricator
-          source: SriSanth2345/Qwen-1.5B-Tweet-Generations
+          source: SriSanth2345/Qwen-1.5B-Tweet-Generations
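The `--lora-modules` values in the deployment above are JSON objects passed as single strings. A small sketch parsing one of the manifest's own values to show the fields it carries (name, path, base_model_name):

```python
import json

# Parse one of the manifest's --lora-modules values to inspect its
# structure; the string below is copied from the deployment args.
raw = ('{"name": "small-segment-lora-0", '
       '"path": "ttt421/nec119-small-segment-lora", '
       '"base_model_name": "Qwen/Qwen3-32B"}')
module = json.loads(raw)
print(module["name"])             # small-segment-lora-0
print(module["base_model_name"])  # Qwen/Qwen3-32B
```

Note that `base_model_name` must match the `--model` argument, which is why both were updated together in this commit.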

config/manifests/vllm/gpu-deployment.yaml

Lines changed: 10 additions & 10 deletions
@@ -1,16 +1,16 @@
 apiVersion: apps/v1
 kind: Deployment
 metadata:
-  name: vllm-llama3-8b-instruct
+  name: vllm-qwen3-32b
 spec:
   replicas: 3
   selector:
     matchLabels:
-      app: vllm-llama3-8b-instruct
+      app: vllm-qwen3-32b
   template:
     metadata:
       labels:
-        app: vllm-llama3-8b-instruct
+        app: vllm-qwen3-32b
         inference.networking.k8s.io/engine-type: vllm
     spec:
       containers:
@@ -20,7 +20,7 @@ spec:
         command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
         args:
         - "--model"
-        - "meta-llama/Llama-3.1-8B-Instruct"
+        - "Qwen/Qwen3-32B"
         - "--tensor-parallel-size"
         - "1"
         - "--port"
@@ -239,19 +239,19 @@ spec:
         emptyDir: {}
       - name: config-volume
         configMap:
-          name: vllm-llama3-8b-instruct-adapters
+          name: vllm-qwen3-32b-adapters
 ---
 apiVersion: v1
 kind: ConfigMap
 metadata:
-  name: vllm-llama3-8b-instruct-adapters
+  name: vllm-qwen3-32b-adapters
 data:
   configmap.yaml: |
     vLLMLoRAConfig:
-      name: vllm-llama3-8b-instruct-adapters
+      name: vllm-qwen3-32b-adapters
       port: 8000
-      defaultBaseModel: meta-llama/Llama-3.1-8B-Instruct
+      defaultBaseModel: Qwen/Qwen3-32B
       ensureExist:
         models:
-        - id: food-review-1
-          source: Kawon/llama3.1-food-finetune_v14_r8
+        - id: small-segment-lora-1
+          source: ttt421/nec119-small-segment-lora
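A model rename like this one touches many files, and a leftover old name breaks the selector chain silently. A toy sketch of the kind of stale-name scan such a commit implies; the `rendered` string is a stand-in, not the actual file contents:

```python
# Scan rendered manifest text for leftover references to the old
# model names; an empty result means the rename is complete.
rendered = """\
name: vllm-qwen3-32b-adapters
defaultBaseModel: Qwen/Qwen3-32B
- id: small-segment-lora-1
  source: ttt421/nec119-small-segment-lora
"""

OLD_NAMES = ("llama3-8b-instruct", "food-review")
stale = [line for line in rendered.splitlines()
         if any(name in line for name in OLD_NAMES)]
print(stale)  # []
```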

config/observability/prometheus/values.yaml

Lines changed: 1 addition & 1 deletion
@@ -24,4 +24,4 @@ extraScrapeConfigs: |
     relabel_configs:
     - source_labels: [__meta_kubernetes_pod_label_app]
       action: keep
-      regex: vllm-llama3-8b-instruct
+      regex: vllm-qwen3-32b

docs/proposals/1816-inferenceomodelrewrite/README.md

Lines changed: 2 additions & 2 deletions
@@ -212,10 +212,10 @@ const (
 apiVersion: inference.networking.x-k8s.io/v1alpha1
 kind: InferenceModelRewrite
 metadata:
-  name: food-review-canary-rollout
+  name: small-segment-lora-canary-rollout
 spec:
   poolRef:
-    name: main-food-review-pool
+    name: main-small-segment-lora-pool
   rules:
   - matches:
     - model:

pkg/epp/handlers/response.go

Lines changed: 1 addition & 1 deletion
@@ -218,7 +218,7 @@ func (s *StreamingServer) generateResponseHeaders(reqCtx *RequestContext) []*con
 }
 
 // Example message if "stream_options": {"include_usage": "true"} is included in the request:
-// data: {"id":"...","object":"text_completion","created":1739400043,"model":"food-review-0","choices":[],
+// data: {"id":"...","object":"text_completion","created":1739400043,"model":"small-segment-lora-0","choices":[],
 //   "usage":{"prompt_tokens":7,"total_tokens":17,"completion_tokens":10}}
 //
 // data: [DONE]
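The Go comment updated above documents the final usage chunk of an OpenAI-style streaming response. A short sketch parsing such a `data:` line; the payload string mirrors the comment's example:

```python
import json

# Strip the SSE "data: " prefix and parse the final usage chunk of an
# OpenAI-style streaming completion response.
line = ('data: {"id":"...","object":"text_completion","created":1739400043,'
        '"model":"small-segment-lora-0","choices":[],'
        '"usage":{"prompt_tokens":7,"total_tokens":17,"completion_tokens":10}}')
payload = json.loads(line.removeprefix("data: "))
print(payload["model"])                  # small-segment-lora-0
print(payload["usage"]["total_tokens"])  # 17
```

The stream then terminates with a literal `data: [DONE]` sentinel, which is not JSON and must be special-cased before parsing.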
