Commit f387622
committed
power-cap-sweep: load-mode flag + concurrent/prefill modes (Codex iteration)
Implements the three load modes drafted in /tmp/codex-prompt-power-
cap-loadmodes.md after my partial implementation hit a hang during
decode-concurrent execution. Codex root-caused the hang and added
several robustness improvements I'd missed.
## Bug fixes
- **decode-concurrent hang**: bare `wait` was waiting on the infinite
background power sampler PID (which never exits). Fix: track curl
PIDs explicitly and wait only on those, leaving the sampler running
cleanly through the bench.
- **prefill-heavy JSON**: replaced inline-bash JSON construction with
Python-via-stdin to avoid escaping issues at 50K-token prompt sizes.
- **Prefix-cache leakage between caps**: prefill-heavy now prepends a
random nonce to each cap's prompt — defeats vLLM's prefix cache
reusing the previous cap's prefill, which would have masked the real
prefill compute under each cap.
## Robustness improvements
- EXIT/INT/TERM trap reset GPU to stock (was EXIT-only)
- Sampler PID cleanup is idempotent across all exit paths
- Best-effort concurrency probe for decode-concurrent — quick check
that the running compose can accept N parallel requests before
starting the sweep
- Summary footer is now load-mode-aware (decode-single vs aggregate
vs prefill columns each get matching footnotes)
- Header docs document mode selection per card class (3090/4090
decode-single fine, 5090 decode-concurrent N=8+, RTX PRO 6000
prefill-heavy or large-model swap)
## Validation (Codex's smoke tests on running gemma-mtp endpoint)
- decode-single 280W/330W: 111 narr / 146 code @ 280W, 122 narr / 151 code @ 330W
- decode-concurrent N=4 280W/330W: 286 narr / 360 code @ 280W aggregate,
305 narr / 378 code @ 330W aggregate (the curve we couldn't see in
decode-single emerges clearly here — +7% TPS for +18% power, classic
diminishing-returns shape)
- prefill-heavy 280W/330W: 732 prefill TPS @ 280W, 749 prefill TPS @
330W (compute-bound by construction; produces a clean curve on any
card class regardless of model fit)
All three modes correctly reset GPU 0 to 370W stock at end. No hangs
on any path.
## Implication for cross-rig matrix
@apnar's earlier 5090 sweep produced flat curves on decode-single
because the workload was compute-undersaturated for the 5090. With
decode-concurrent N=4 (or N=8) the same sweep on his rig should now
surface a real curve. Re-ping pending.1 parent e46f1e8 commit f387622
1 file changed
Lines changed: 362 additions & 27 deletions
0 commit comments