Skip to content

Commit f387622

Browse files
committed
power-cap-sweep: load-mode flag + concurrent/prefill modes (Codex iteration)
Implements the three load modes drafted in /tmp/codex-prompt-power- cap-loadmodes.md after my partial implementation hit a hang during decode-concurrent execution. Codex root-caused the hang and added several robustness improvements I'd missed. ## Bug fixes - **decode-concurrent hang**: bare `wait` was waiting on the infinite background power sampler PID (which never exits). Fix: track curl PIDs explicitly and wait only on those, leaving the sampler running cleanly through the bench. - **prefill-heavy JSON**: replaced inline-bash JSON construction with Python-via-stdin to avoid escaping issues at 50K-token prompt sizes. - **Prefix-cache leakage between caps**: prefill-heavy now prepends a random nonce to each cap's prompt — defeats vLLM's prefix cache reusing the previous cap's prefill, which would have masked the real prefill compute under each cap. ## Robustness improvements - EXIT/INT/TERM trap reset GPU to stock (was EXIT-only) - Sampler PID cleanup is idempotent across all exit paths - Best-effort concurrency probe for decode-concurrent — quick check that the running compose can accept N parallel requests before starting the sweep - Summary footer is now load-mode-aware (decode-single vs aggregate vs prefill columns each get matching footnotes) - Header docs document mode selection per card class (3090/4090 decode-single fine, 5090 decode-concurrent N=8+, RTX PRO 6000 prefill-heavy or large-model swap) ## Validation (Codex's smoke tests on running gemma-mtp endpoint) - decode-single 280W/330W: 111 narr / 146 code @ 280W, 122 narr / 151 code @ 330W - decode-concurrent N=4 280W/330W: 286 narr / 360 code @ 280W aggregate, 305 narr / 378 code @ 330W aggregate (the curve we couldn't see in decode-single emerges clearly here — +7% TPS for +18% power, classic diminishing-returns shape) - prefill-heavy 280W/330W: 732 prefill TPS @ 280W, 749 prefill TPS @ 330W (compute-bound by construction; produces a clean curve on any card class regardless of model fit) All three modes correctly reset GPU 0 to 370W stock at end. No hangs on any path. ## Implication for cross-rig matrix @apnar's earlier 5090 sweep produced flat curves on decode-single because the workload was compute-undersaturated for the 5090. With decode-concurrent N=4 (or N=8) the same sweep on his rig should now surface a real curve. Re-ping pending.
1 parent e46f1e8 commit f387622

1 file changed

Lines changed: 362 additions & 27 deletions

File tree

0 commit comments

Comments
 (0)