INT8 Quant for Dual-Max Recipe (To replace the FP8 one) #515
Thanks Chris — the hardware read is correct: Ampere has native INT8 (IMMA) but no native FP8, which is why we treat FP8 as storage-only here. Two things worth clarifying though — first on how we think about the tiers, then on why this particular swap doesn't move them. The fast / balanced / max tiers are a fidelity ladder, not a speed ladder. They're ordered by how much weight + KV precision they keep, not by throughput:
We name the lean end "fast" because lean is fast — and yes, there's a real speed cost climbing the ladder: Why an INT8-weights swap on
Your low-bit-on-Ampere instinct is right, just on the KV, not the weights. One snag: those KV dtypes land in v0.24.0 (still in dev), and we pin the production tiers to a stable digest. So rather than rebuild the tiers against a nightly, we're parking a re-evaluation of all three (weights and KV together) for when v0.24.0 ships stable. If you're up for it, a KV-dtype A/B on the dual recipe (int8-PTH baseline vs int4-PTH / turboquant-4bit) is the contribution with genuine upside on this hardware; happy to point you at the bench + NIAH harness when v0.24.0 lands. 🙏
Closing the loop with controlled numbers on vLLM v0.24.0. I ran your INT8 idea head-to-head against the FP8 dual-max — same compose, only
Results (2× 3090, v0.24.0, caps 370/420 W):
Takeaways:
Huge thanks for driving this, Chris: it filled a genuinely empty corner of the tier map. Open follow-ups: a thinking-on agentic re-run (tracking tied so far) and a bf16-KV W8A8 (your config; likely even faster prefill at a ctx cost).
RTX 3090s process INT8 natively, but not FP8. This suggests that using an INT8 quant for the vllm/qwen-27b-dual-max recipe would give a performance boost with no quality loss.
I found an INT8 quant here, and was considering putting up a PR for it.
https://huggingface.co/Minachist/Qwen3.6-27B-INT8-AutoRound
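For reference, the hardware point (native INT8 but no native FP8 on a 3090) can be sketched as a lookup on CUDA compute capability. This is a rough illustration, not part of the recipe: the function name `native_tensor_core_dtypes` is hypothetical, and the thresholds are assumptions based on public architecture specs (INT8 tensor-core support from Turing, 7.5, treated here as a conservative cutoff; FP8 from Ada, 8.9, and Hopper, 9.0).

```python
def native_tensor_core_dtypes(major: int, minor: int) -> dict:
    """Map a CUDA compute capability to assumed native low-precision support.

    Thresholds are assumptions taken from public architecture specs,
    not from this recipe:
      - INT8 tensor-core MMA: compute capability 7.5 (Turing) and newer
      - FP8 tensor-core MMA:  compute capability 8.9 (Ada) and newer
    """
    cc = (major, minor)
    return {
        "int8": cc >= (7, 5),
        "fp8": cc >= (8, 9),
    }


# RTX 3090 is Ampere, compute capability 8.6.
rtx_3090 = native_tensor_core_dtypes(8, 6)
print(rtx_3090)
```

On a machine with PyTorch installed, the `(major, minor)` pair can be read with `torch.cuda.get_device_capability()`. Native support only says the math units exist; whether a given quant is actually faster end to end still needs a benchmark.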