
Commit 6f4b937

author
Sandermage
committed
docs(CONFIGS): fix script naming + add 8 friction-point fixes from noonghunna feedback
Closes 8 friction points raised in club-3090 discussion #19 comment (2026-05-01 by noonghunna). All factual claims verified against actual repo state before edit.

CHANGES:

1. **Script naming aligned to filesystem reality.** CONFIGS.md was referencing fp8_e5m2-suffixed scripts that don't exist. Updated to match actual file names + paths in scripts/ AND scripts/launch/. Added explicit naming convention table: start_* (Docker) vs bare_metal_* (host install), _no_TQ_* (legacy name for fp8_e5m2), _TQ_k8v4_*, _DFLASH_*, _single_card suffix.
2. **New Step 2b: Docker compose mirror.** Worked compose snippet showing how to translate scripts/start_27b_int4_TQ_k8v4.sh env vars + flags into compose.yaml environment + command blocks. Explicit pointer to the script as source-of-truth for env-var tracking.
3. **TQ k8v4 deps consolidated.** Step 4 now has a copy-paste 'minimal' block (P4 + P67 + P98 + P101 + PN8, required for boot) AND recommended PROD additions (~25 env vars from start_27b_int4_TQ_k8v4.sh). Added explicit P82 caveat: OFF in our 27B PROD launch (factual correction; external readers wrongly assumed it was PROD-on).
4. **API key fallback note.** Step 5's smoke test curl explicitly notes that --api-key genesis-local must match what was set on the server, or drop the Authorization header for unauthenticated mode.
5. **Bench input format note.** Step 5's bench cmd now warns that chat template mismatch (Qwen vs Llama vs Mistral) is a silent killer: every reply gets a ~2-token stop and the bench looks slow.
6. **Spec-decode trio cross-references.** Step 3 spec-decode block now back-links Step 1 §4 (model capability check) for each method: ngram (always works) / MTP (needs mtp_* weights) / DFlash (needs z-lab drafter download).
7. **DFlash hybrid PR #40898 caveat.** Spec-decode DFlash line now notes the OPEN upstream PR + Genesis PN21 partial backport status.
8. **Step 7 already links CONTRIBUTING.md** (no change needed).

P4 description was already present (line 243 — 'P4 — required, removes hybrid TQ rejection') so #3 was a doc-discovery issue, not missing content. The minimal block consolidation should make it visible. Source: noonghunna/club-3090#19
1 parent f289e07 commit 6f4b937

1 file changed

Lines changed: 143 additions & 12 deletions


docs/CONFIGS.md

Lines changed: 143 additions & 12 deletions
Original file line number | Diff line number | Diff line change
@@ -53,19 +53,97 @@ Write these five things down. The rest of this guide refers back to them.
5353

5454
## Step 2: Pick a base launch script
5555

56-
Genesis ships launch scripts at `scripts/`. Pick the closest match to your config:
56+
Genesis ships launch scripts in **two locations** for two deployment modes:
57+
58+
- `scripts/*.sh` — Docker container launch (recommended for most users; uses `docker run`).
59+
- `scripts/launch/*.sh` — both Docker (`start_*`) and bare-metal (`bare_metal_*`, host shell with vLLM installed via pip) variants.
60+
61+
Pick the closest match to your config:
5762

5863
| Your config | Start with |
5964
|---|---|
60-
| Qwen3-family + INT4 + hybrid + 1× 24GB | `start_27b_int4_fp8_e5m2_short_single_card.sh` |
61-
| Qwen3-family + INT4 + hybrid + 2× 24GB long ctx | `start_27b_int4_fp8_e5m2_long_256K.sh` |
62-
| Qwen3-family + INT4 + hybrid + TurboQuant k8v4 (5× KV pool) | `start_27b_int4_TQ_k8v4.sh` |
63-
| Qwen3-family + MoE + FP8 | `start_35b_fp8_PROD.sh` |
64-
| Qwen3-family + MoE + FP8 + 1× 48GB | `start_35b_fp8_PROD_single_card.sh` |
65-
| Coding-agent workload (DFlash drafter) | `start_27b_int4_DFLASH.sh` |
66-
| Llama / Mistral / Gemma dense | `start_35b_fp8_PROD.sh` (drop MoE-specific flags) |
65+
| Qwen3-family + INT4 + hybrid + 1× 24GB short ctx | `scripts/launch/start_27b_int4_no_TQ_short_single_card.sh` |
66+
| Qwen3-family + INT4 + hybrid + 1× 24GB long ctx (256K) | `scripts/launch/start_27b_int4_no_TQ_long_256K_single_card.sh` |
67+
| Qwen3-family + INT4 + hybrid + 2× 24GB short ctx | `scripts/start_27b_int4_fp8_e5m2_short.sh` |
68+
| Qwen3-family + INT4 + hybrid + 2× 24GB long ctx (256K) | `scripts/start_27b_int4_fp8_e5m2_long_256K.sh` |
69+
| Qwen3-family + INT4 + hybrid + TurboQuant k8v4 (5× KV pool) | `scripts/start_27b_int4_TQ_k8v4.sh` |
70+
| Qwen3-family + MoE + FP8 (2× 24GB) | `scripts/start_35b_fp8_PROD.sh` |
71+
| Qwen3-family + MoE + FP8 + 1× 48GB | `scripts/launch/start_35b_fp8_PROD_single_card.sh` |
72+
| Coding-agent workload (DFlash drafter) | `scripts/start_27b_int4_DFLASH.sh` |
73+
| Bare-metal (no Docker), 27B INT4 + TQ k8v4 | `scripts/launch/bare_metal_27b_int4_TQ_k8v4.sh` |
74+
| Bare-metal (no Docker), 35B FP8 PROD | `scripts/launch/bare_metal_35b_fp8_PROD.sh` |
75+
| Llama / Mistral / Gemma dense | `scripts/start_35b_fp8_PROD.sh` (drop MoE-specific flags) |
76+
77+
**Naming convention:**
78+
79+
- `start_*` = Docker container launch (uses `docker run`, mounts /models, applies patches inside container)
80+
- `bare_metal_*` = host shell launch (assumes vLLM already installed via `pip install`)
81+
- `_no_TQ_*` = historical name for the fp8_e5m2 KV cache variant (no TurboQuant); the file is for fp8 KV
82+
- `_TQ_k8v4_*` = TurboQuant 8-bit-key 4-bit-value KV cache (5× pool, requires P4 + P67 + P98)
83+
- `_DFLASH_*` = DFlash speculator (coding-agent workload)
84+
- `_single_card` suffix = TP=1 variant; without suffix = TP=2
85+
86+
**Rule of thumb:** start with a script that matches your *attention type* (hybrid vs. dense), *KV dtype* (auto / fp8_e5m2 / turboquant), and *card count*. The rest you'll edit in step 3.
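
The naming convention above can be read mechanically. The following is a minimal sketch, assuming the prefix and suffix rules listed in this section hold for every script; the function name `describe_script` and the keys it returns are hypothetical and not part of the Genesis repo.

```python
from pathlib import PurePosixPath


def describe_script(path: str) -> dict:
    """Decode a launch-script name using the naming convention above."""
    name = PurePosixPath(path).name

    # Prefix: deployment mode.
    if name.startswith("bare_metal_"):
        launcher = "bare_metal"
    elif name.startswith("start_"):
        launcher = "docker"
    else:
        launcher = "unknown"

    # KV cache dtype: `_no_TQ_` is the legacy spelling for fp8_e5m2.
    if "_TQ_k8v4" in name:
        kv_dtype = "turboquant_k8v4"
    elif "_no_TQ_" in name or "fp8_e5m2" in name:
        kv_dtype = "fp8_e5m2"
    else:
        kv_dtype = None  # not encoded in the name; read the script itself

    return {
        "launcher": launcher,
        "kv_dtype": kv_dtype,
        "dflash": "_DFLASH" in name,
        # `_single_card` means TP=1; no suffix means TP=2.
        "tensor_parallel": 1 if "_single_card" in name else 2,
    }


print(describe_script("scripts/launch/start_27b_int4_no_TQ_short_single_card.sh"))
```

A `kv_dtype` of `None` only means the name does not say; the `--kv-cache-dtype` flag inside the script remains the authority.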
87+
88+
### Step 2b: Docker compose mirror (if you don't use the bash scripts directly)
89+
90+
If you deploy via Docker compose (rather than `bash scripts/...`), the
91+
bare-metal scripts won't run as-is inside the container — they expect
92+
`pip install`-writable layers. Mirror the env vars + flags into your
93+
compose's `environment:` and `command:` blocks. Reference: each
94+
`start_*.sh` script lists the full env-var set baked for that config —
95+
copy the `-e GENESIS_ENABLE_*` blocks into compose `environment:` and
96+
the `vllm serve` flags into `command:`.
97+
98+
Worked compose snippet (27B + TQ k8v4 PROD, mirrors `scripts/start_27b_int4_TQ_k8v4.sh`):
99+
100+
```yaml
101+
services:
102+
  vllm-27b:
103+
    image: vllm/vllm-openai:nightly
104+
    environment:
105+
      GENESIS_ENABLE_P4: "1"
106+
      GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL: "1"
107+
      GENESIS_ENABLE_P98: "1"
108+
      GENESIS_ENABLE_P85: "1"
109+
      GENESIS_ENABLE_P87: "1"
110+
      GENESIS_ENABLE_P91: "1"
111+
      GENESIS_ENABLE_P99: "1"
112+
      GENESIS_ENABLE_P100: "1"
113+
      GENESIS_ENABLE_P101: "1"
114+
      GENESIS_ENABLE_P58_ASYNC_PLACEHOLDER_FIX: "1"
115+
      GENESIS_ENABLE_P60_GDN_NGRAM_FIX: "1"
116+
      GENESIS_ENABLE_P60B_TRITON_KERNEL: "1"
117+
      GENESIS_ENABLE_P61_QWEN3_MULTI_TOOL: "1"
118+
      GENESIS_ENABLE_P61B_STREAMING_OVERLAP: "1"
119+
      GENESIS_ENABLE_P62_STRUCT_OUT_SPEC_TIMING: "1"
120+
      GENESIS_ENABLE_P64_QWEN3CODER_MTP_STREAMING: "1"
121+
      GENESIS_ENABLE_P66_CUDAGRAPH_SIZE_FILTER: "1"
122+
      GENESIS_ENABLE_P68_AUTO_FORCE_TOOL: "1"
123+
      GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER: "1"
124+
      GENESIS_ENABLE_P72_PROFILE_RUN_CAP: "1"
125+
      GENESIS_PROFILE_RUN_CAP_M: "4096"
126+
      GENESIS_ENABLE_P74_CHUNK_CLAMP: "1"
127+
      GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT: "1"
128+
      GENESIS_ENABLE_PN9_INDEPENDENT_DRAFTER_ATTN: "1"
129+
      GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS: "1"
130+
      GENESIS_ENABLE_PN13_CUDA_GRAPH_LAMBDA_ARITY: "1"
131+
      GENESIS_ENABLE_PN17_FA2_LSE_CLAMP: "1"
132+
      GENESIS_PREALLOC_TOKEN_BUDGET: "4096"
133+
      GENESIS_BUFFER_MODE: "shared"
134+
    command:
135+
      - --model
136+
      - /models/Qwen3.6-27B-int4-AutoRound
137+
      - --kv-cache-dtype
138+
      - turboquant_k8v4
139+
      - --tensor-parallel-size
140+
      - "2"
141+
      - --speculative-config
142+
      - '{"method":"mtp","num_speculative_tokens":3}'
143+
      # ... copy remaining flags from scripts/start_27b_int4_TQ_k8v4.sh
144+
```
67145

68-
**Rule of thumb:** start with a script that matches your *attention type* (hybrid vs. dense) and *KV dtype* (auto / fp8_e5m2 / turboquant). The rest you'll edit in step 3.
146+
The full source of truth for env vars is the `start_*.sh` script header — keep your compose in sync when Genesis updates the env-var set.
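
Keeping a compose file in sync by hand is error-prone, so the copy step can be scripted. This is a minimal sketch, assuming each variable appears in the script as a `-e GENESIS_NAME=value` flag to `docker run`; the helper name `compose_environment` and the sample text are illustrative and not taken from the repo.

```python
import re

# Assumed shape of a script line: `-e GENESIS_ENABLE_P4=1 \`
ENV_FLAG = re.compile(r"-e\s+(GENESIS_[A-Z0-9_]+)=(\S+)")


def compose_environment(script_text: str) -> str:
    """Render `-e GENESIS_*=value` docker flags as a compose `environment:` block."""
    lines = ["environment:"]
    for name, value in ENV_FLAG.findall(script_text):
        value = value.strip("\"'")  # drop shell quoting, if any
        lines.append(f'  {name}: "{value}"')  # quote so YAML keeps it a string
    return "\n".join(lines)


# Illustrative input only; a real run would read the start_*.sh file.
sample = """docker run --gpus all \\
  -e GENESIS_ENABLE_P4=1 \\
  -e GENESIS_BUFFER_MODE=shared \\
  vllm/vllm-openai:nightly"""

print(compose_environment(sample))
```

Values that rely on shell expansion (for example `$HOME`) would need manual handling; the sketch copies them verbatim.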
69147

70148
---
71149

@@ -85,12 +163,16 @@ Open the script and change these:
85163
--max-model-len 65536 # set to your target context
86164
--max-num-seqs 4 # lower for long ctx, raise for high concurrency
87165

88-
# Spec-decode (pick one)
166+
# Spec-decode (pick one — see Step 1 for which methods YOUR model supports)
89167
--speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_min": 2, "prompt_lookup_max": 5}'
90-
# or for MTP:
168+
# or for MTP (REQUIRES the HF repo to carry mtp_* weight prefix — see Step 1 §4):
91169
--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'
92-
# or for DFlash (gated download required):
170+
# or for DFlash (REQUIRES separate z-lab drafter checkpoint download — see Step 1 §4):
93171
--speculative-config '{"method": "dflash", "model": "/path/to/dflash-draft", "num_speculative_tokens": 4}'
172+
# Note for DFlash on hybrid GDN models (Qwen3.6 family): vllm PR #40898 (DFlash SWA support)
173+
# is OPEN as of 2026-05-01. Genesis ships PN21 partial backport — full SWA enabler awaits
174+
# upstream merge. Without it, ~25% acceptance-length gap on long context with sliding-window
175+
# attention layers. Track at https://github.com/vllm-project/vllm/pull/40898.
94176

95177
# KV cache dtype (pick one)
96178
--kv-cache-dtype auto # default
@@ -168,6 +250,51 @@ If `--kv-cache-dtype turboquant_k8v4`:
168250
- P101 — TQ packed-slot layout opt.
169251
- PN8 — VRAM savings.
170252

253+
**Copy-paste TQ k8v4 minimal env block:**
254+
255+
```bash
256+
# Required for TQ k8v4 — engine won't boot without these
257+
export GENESIS_ENABLE_P4=1 # removes hybrid TQ rejection
258+
export GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL=1 # Genesis multi-query kernel
259+
export GENESIS_ENABLE_P98=1 # WorkspaceManager revert (vllm#40941 fix)
260+
export GENESIS_ENABLE_P101=1 # TQ packed-slot layout
261+
export GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1 # ~600 MiB VRAM savings on draft
262+
```
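
Since the engine will not boot without the block above, a preflight check before launch can save a failed start. This is a minimal sketch, assuming `"1"` is the only value that enables a patch; the helper name `missing_tq_k8v4_vars` is hypothetical.

```python
import os

# Variables from the minimal TQ k8v4 block above.
REQUIRED_TQ_K8V4 = (
    "GENESIS_ENABLE_P4",
    "GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL",
    "GENESIS_ENABLE_P98",
    "GENESIS_ENABLE_P101",
    "GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT",
)


def missing_tq_k8v4_vars(env=None) -> list:
    """Return the required variables that are not set to "1" in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_TQ_K8V4 if env.get(name) != "1"]


if __name__ == "__main__":
    missing = missing_tq_k8v4_vars()
    if missing:
        print("Missing TQ k8v4 env vars:", ", ".join(missing))
```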
263+
264+
**Recommended additional patches for TQ k8v4 PROD (per `start_27b_int4_TQ_k8v4.sh`):**
265+
266+
```bash
267+
# Performance (composes with TQ k8v4)
268+
export GENESIS_ENABLE_P85=1 # hybrid fine-shadow prefix cache
269+
export GENESIS_ENABLE_P87=1 # ~+24% on AutoRound INT4 path (Marlin)
270+
export GENESIS_ENABLE_P91=1 # AutoRound row-parallel scales fix
271+
export GENESIS_ENABLE_P99=1 # WorkspaceManager memoize
272+
export GENESIS_ENABLE_P100=1 # FlashInfer FULL cudagraph spec-decode
273+
274+
# Recommended quality patches
275+
export GENESIS_ENABLE_P58_ASYNC_PLACEHOLDER_FIX=1
276+
export GENESIS_ENABLE_P60_GDN_NGRAM_FIX=1
277+
export GENESIS_ENABLE_P60B_TRITON_KERNEL=1
278+
export GENESIS_ENABLE_P61_QWEN3_MULTI_TOOL=1
279+
export GENESIS_ENABLE_P61B_STREAMING_OVERLAP=1
280+
export GENESIS_ENABLE_P62_STRUCT_OUT_SPEC_TIMING=1
281+
export GENESIS_ENABLE_P64_QWEN3CODER_MTP_STREAMING=1
282+
export GENESIS_ENABLE_P66_CUDAGRAPH_SIZE_FILTER=1
283+
export GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1
284+
export GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER=1
285+
export GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1
286+
export GENESIS_PROFILE_RUN_CAP_M=4096
287+
export GENESIS_ENABLE_P74_CHUNK_CLAMP=1
288+
export GENESIS_ENABLE_PN9_INDEPENDENT_DRAFTER_ATTN=1
289+
export GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1
290+
export GENESIS_ENABLE_PN13_CUDA_GRAPH_LAMBDA_ARITY=1
291+
export GENESIS_ENABLE_PN17_FA2_LSE_CLAMP=1
292+
export GENESIS_PREALLOC_TOKEN_BUDGET=4096
293+
export GENESIS_BUFFER_MODE=shared
294+
```
295+
296+
> **Note**: P82 (SGLang acceptance threshold OR-clause) is **OFF by default** on the 27B PROD launch: historical bench data showed it's biased on small-batch, single-stream Lorbus INT4 + MTP K=3. Enable it via `GENESIS_P82_THRESHOLD_SINGLE=0.3 GENESIS_ENABLE_P82=1` only after an A/B test on your specific workload.
297+
171298
### Compile / cudagraph safety
172299

173300
If you hit boot crashes related to torch.compile or cudagraph capture:
@@ -209,6 +336,8 @@ curl -s http://localhost:8000/v1/chat/completions \
209336
}' | jq '.choices[0].message'
210337
```
211338

339+
> **API key note:** the launch scripts include `--api-key genesis-local` so the `Authorization: Bearer genesis-local` header is required. If you launched **without** `--api-key`, drop the header (and any `OPENAI_API_KEY` you set client-side). If you set a different key, replace it.
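
For scripted clients, the same rule amounts to adding the header only when a key is configured. This is a minimal sketch; the helper name `chat_headers` is hypothetical, and `genesis-local` is the key the launch scripts set.

```python
def chat_headers(api_key=None) -> dict:
    """Build request headers; omit Authorization when the server has no --api-key."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return headers


print(chat_headers("genesis-local"))  # server launched with --api-key genesis-local
print(chat_headers())                 # server launched without --api-key
```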
340+
212341
You should see a `tool_calls` field with `name: "get_time"` and `arguments: {"city": "Paris"}`. If you get garbage tokens (`<tool_call><tool_call>...`) or repetition, you've hit a cliff — see [docs/CLIFFS.md](CLIFFS.md).
213342

214343
5. **Bench:**
@@ -220,6 +349,8 @@ python tools/genesis_bench_suite.py \
220349
--runs 5
221350
```
222351

352+
> **Bench input format:** the script ships with default `NARR_PROMPTS` (narrative / 600-char) and `CODE_PROMPTS` (code / ~80-char) lists targeting the Qwen3-Coder chat template. If your model uses a different template (e.g. Llama-3 or Mistral instruct), edit the prompt lists in `tools/genesis_bench_suite.py` to match the format your tokenizer expects. A wrong template makes the bench look slow because every reply stops after a few tokens, typically with a refusal such as "I cannot help with that".
353+
223354
Record `wall_TPS` mean, std, CV. CV under 8% is a clean run. Above that, investigate (other tenants, thermal, allocator fragmentation).
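
CV (coefficient of variation) is the standard deviation divided by the mean, expressed as a percentage. This is a minimal sketch of the arithmetic, assuming the sample (n-1) standard deviation; whether the bench suite uses the sample or population form is not stated here, and the `wall_TPS` values below are made up for illustration.

```python
import statistics


def summarize_tps(samples):
    """Return mean, std, and CV (%) for a list of wall_TPS measurements."""
    mean = statistics.mean(samples)
    std = statistics.stdev(samples)  # sample std (n-1); an assumption
    cv_pct = 100.0 * std / mean
    return {"mean": mean, "std": std, "cv_pct": cv_pct, "clean": cv_pct < 8.0}


runs = [100.0, 102.0, 98.0, 101.0, 99.0]  # made-up values for 5 runs
print(summarize_tps(runs))
```

With `--runs 5` the estimate of the standard deviation is itself noisy, so a CV near the 8% line is worth a rerun before drawing conclusions.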
224355

225356
---
