You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Closes 8 friction points raised in club-3090 discussion #19 comment
(2026-05-01 by noonghunna). All factual claims verified against
actual repo state before edit.
CHANGES:
1. **Script naming aligned to filesystem reality.** CONFIGS.md was
referencing fp8_e5m2-suffixed scripts that don't exist. Updated to
match actual file names + paths in scripts/ AND scripts/launch/.
Added explicit naming convention table: start_* (Docker) vs
bare_metal_* (host install), _no_TQ_* (legacy name for fp8_e5m2),
_TQ_k8v4_*, _DFLASH_*, _single_card suffix.
2. **New Step 2b — Docker compose mirror.** Worked compose snippet
showing how to translate scripts/start_27b_int4_TQ_k8v4.sh
env vars + flags into compose.yaml environment + command blocks.
Explicit pointer to the script as source-of-truth for env-var
tracking.
3. **TQ k8v4 deps consolidated.** Step 4 now has copy-paste 'minimal'
block (P4 + P67 + P98 + P101 + PN8 — required for boot) AND
recommended PROD additions (~25 env vars from start_27b_int4_TQ_k8v4.sh).
Added explicit P82 caveat: OFF in our 27B PROD launch (factual
correction — was wrongly assumed PROD-on by external readers).
4. **API key fallback note.** Step 5's smoke test curl explicitly
notes that --api-key genesis-local must match what was set on
server, or drop the Authorization header for unauthenticated mode.
5. **Bench input format note.** Step 5's bench cmd now warns about
chat template mismatch (Qwen vs Llama vs Mistral) being silent
killer — every reply gets ~2-token stop and the bench looks slow.
6. **Spec-decode trio cross-references.** Step 3 spec-decode block
now back-links Step 1 §4 (model capability check) for each method:
ngram (always works) / MTP (needs mtp_* weights) / DFlash (needs
z-lab drafter download).
7. **DFlash hybrid PR #40898 caveat.** Spec-decode DFlash line now
notes the OPEN upstream PR + Genesis PN21 partial backport status.
8. **Step 7 already links CONTRIBUTING.md** (no change needed).
P4 description was already present (line 243 — 'P4 — required, removes
hybrid TQ rejection') so #3 was a doc-discovery issue, not missing
content. The minimal block consolidation should make it visible.
Source: noonghunna/club-3090#19
-`_single_card` suffix = TP=1 variant; without suffix = TP=2
85
+
86
+
**Rule of thumb:** start with a script that matches your *attention type* (hybrid vs. dense), *KV dtype* (auto / fp8_e5m2 / turboquant), and *card count*. The rest you'll edit in step 3.
87
+
88
+
### Step 2b: Docker compose mirror (if you don't use the bash scripts directly)
89
+
90
+
If you deploy via Docker compose (rather than `bash scripts/...`), the
91
+
bare-metal scripts won't run as-is inside the container — they expect
92
+
`pip install`-writable layers. Mirror the env vars + flags into your
93
+
compose's `environment:` and `command:` blocks. Reference: each
94
+
`start_*.sh` script lists the full env-var set baked for that config —
95
+
copy the `-e GENESIS_ENABLE_*` blocks into compose `environment:` and
96
+
the `vllm serve` flags into `command:`.
97
+
98
+
Worked compose snippet (27B + TQ k8v4 PROD, mirrors `scripts/start_27b_int4_TQ_k8v4.sh`):
99
+
100
+
```yaml
101
+
services:
102
+
vllm-27b:
103
+
image: vllm/vllm-openai:nightly
104
+
environment:
105
+
GENESIS_ENABLE_P4: "1"
106
+
GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL: "1"
107
+
GENESIS_ENABLE_P98: "1"
108
+
GENESIS_ENABLE_P85: "1"
109
+
GENESIS_ENABLE_P87: "1"
110
+
GENESIS_ENABLE_P91: "1"
111
+
GENESIS_ENABLE_P99: "1"
112
+
GENESIS_ENABLE_P100: "1"
113
+
GENESIS_ENABLE_P101: "1"
114
+
GENESIS_ENABLE_P58_ASYNC_PLACEHOLDER_FIX: "1"
115
+
GENESIS_ENABLE_P60_GDN_NGRAM_FIX: "1"
116
+
GENESIS_ENABLE_P60B_TRITON_KERNEL: "1"
117
+
GENESIS_ENABLE_P61_QWEN3_MULTI_TOOL: "1"
118
+
GENESIS_ENABLE_P61B_STREAMING_OVERLAP: "1"
119
+
GENESIS_ENABLE_P62_STRUCT_OUT_SPEC_TIMING: "1"
120
+
GENESIS_ENABLE_P64_QWEN3CODER_MTP_STREAMING: "1"
121
+
GENESIS_ENABLE_P66_CUDAGRAPH_SIZE_FILTER: "1"
122
+
GENESIS_ENABLE_P68_AUTO_FORCE_TOOL: "1"
123
+
GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER: "1"
124
+
GENESIS_ENABLE_P72_PROFILE_RUN_CAP: "1"
125
+
GENESIS_PROFILE_RUN_CAP_M: "4096"
126
+
GENESIS_ENABLE_P74_CHUNK_CLAMP: "1"
127
+
GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT: "1"
128
+
GENESIS_ENABLE_PN9_INDEPENDENT_DRAFTER_ATTN: "1"
129
+
GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS: "1"
130
+
GENESIS_ENABLE_PN13_CUDA_GRAPH_LAMBDA_ARITY: "1"
131
+
GENESIS_ENABLE_PN17_FA2_LSE_CLAMP: "1"
132
+
GENESIS_PREALLOC_TOKEN_BUDGET: "4096"
133
+
GENESIS_BUFFER_MODE: "shared"
134
+
command:
135
+
- --model
136
+
- /models/Qwen3.6-27B-int4-AutoRound
137
+
- --kv-cache-dtype
138
+
- turboquant_k8v4
139
+
- --tensor-parallel-size
140
+
- "2"
141
+
- --speculative-config
142
+
- '{"method":"mtp","num_speculative_tokens":3}'
143
+
# ... copy remaining flags from scripts/start_27b_int4_TQ_k8v4.sh
144
+
```
67
145
68
-
**Rule of thumb:** start with a script that matches your *attention type* (hybrid vs. dense) and *KV dtype* (auto / fp8_e5m2 / turboquant). The rest you'll edit in step 3.
146
+
The full source of truth for env vars is the `start_*.sh` script header — keep your compose in sync when Genesis updates the env-var set.
69
147
70
148
---
71
149
@@ -85,12 +163,16 @@ Open the script and change these:
85
163
--max-model-len 65536 # set to your target context
86
164
--max-num-seqs 4 # lower for long ctx, raise for high concurrency
87
165
88
-
# Spec-decode (pick one)
166
+
# Spec-decode (pick one — see Step 1 for which methods YOUR model supports)
> **Note**: P82 (SGLang acceptance threshold OR-clause) is **OFF by default** on the 27B PROD launch — historical bench data showed it's biased on small batch single-stream Lorbus INT4 + MTP K=3. Enable via `GENESIS_P82_THRESHOLD_SINGLE=0.3 GENESIS_ENABLE_P82=1` only after A/B on your specific workload.
297
+
171
298
### Compile / cudagraph safety
172
299
173
300
If you hit boot crashes related to torch.compile or cudagraph capture:
> **API key note:** the launch scripts include `--api-key genesis-local` so the `Authorization: Bearer genesis-local` header is required. If you launched **without**`--api-key`, drop the header (and any `OPENAI_API_KEY` you set client-side). If you set a different key, replace it.
340
+
212
341
You should see a `tool_calls` field with `name: "get_time"` and `arguments: {"city": "Paris"}`. If you get garbage tokens (`<tool_call><tool_call>...`) or repetition, you've hit a cliff — see [docs/CLIFFS.md](CLIFFS.md).
> **Bench input format:** the script ships with default `NARR_PROMPTS` (narrative / 600-char) and `CODE_PROMPTS` (code / ~80-char) lists targeting the Qwen3-Coder chat template. If your model has a different template (e.g. Llama-3 ChatML, Mistral instruct), edit the prompt lists in `tools/genesis_bench_suite.py` to match the format your tokenizer expects — wrong template causes the bench to look slow because every reply gets a "I cannot help with that" stop after a few tokens.
353
+
223
354
Record `wall_TPS` mean, std, CV. CV under 8% is a clean run. Above that, investigate (other tenants, thermal, allocator fragmentation).
0 commit comments