INT8 Quant for Dual-Max Recipe (To replace the FP8 one) #515

chriskerley78910 · 2026-06-29T14:18:21Z

RTX 3090s process INT8 natively, but not FP8. This suggests that using an INT8 quant for the vllm/qwen-27b-dual-max recipe would give a performance boost with no quality loss.

I found an INT8 quant here and was considering putting up a PR for it:
https://huggingface.co/Minachist/Qwen3.6-27B-INT8-AutoRound

noonghunna · 2026-06-30T01:20:54Z

Maintainer

Thanks Chris — the hardware read is correct: Ampere has native INT8 (IMMA) but no native FP8, which is why we treat FP8 as storage-only here. Two things worth clarifying though — first on how we think about the tiers, then on why this particular swap doesn't move them.

The fast / balanced / max tiers are a fidelity ladder, not a speed ladder. They're ordered by how much weight + KV precision they keep, not by throughput:

fast — AutoRound INT4 weights + fp8 KV (leanest)
balanced — AWQ mixed bf16/int4 weights + int8-PTH KV
max — FP8 weights + int8-PTH KV (highest fidelity)

We name the lean end "fast" because lean is fast — and yes, there's a real speed cost climbing the ladder: max decodes ~56 TPS vs fast's ~89/115 (≈0.6×). Decode is memory-bandwidth-bound, so the cost is mostly the weights — INT4→FP8 doubles the bytes/token (½→1 byte). The KV format is ~a throughput wash at TP=2. The point of the higher tiers is fidelity headroom at long context / under KV pressure, traded against that decode speed.
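To make the bytes-per-token argument concrete, here is a rough back-of-envelope sketch. It assumes 27e9 weights (read off the model name), the RTX 3090's 936 GB/s memory bandwidth, and an even weight split across TP=2. It ignores KV reads, activations, inter-GPU traffic, and MTP speculative decoding, so the absolute numbers are only a loose per-forward-pass ceiling; the ratio between formats is the point.

```python
# Back-of-envelope decode ceiling for a bandwidth-bound model.
# All constants below are assumptions for illustration, not measurements.

N_PARAMS = 27e9          # assumed from the "27B" in the model name
BANDWIDTH_GBPS = 936.0   # RTX 3090 memory bandwidth, GB/s per GPU
N_GPUS = 2               # TP=2, weights assumed evenly sharded


def weight_bytes_per_token(n_params: float, bits_per_weight: float) -> float:
    """Bytes of weights that must be read to produce one token (one forward pass)."""
    return n_params * bits_per_weight / 8


def decode_ceiling_tps(bits_per_weight: float) -> float:
    """Upper bound on forward passes per second if weight reads were the only cost."""
    bytes_per_gpu = weight_bytes_per_token(N_PARAMS, bits_per_weight) / N_GPUS
    return BANDWIDTH_GBPS * 1e9 / bytes_per_gpu


int4_ceiling = decode_ceiling_tps(4)
fp8_ceiling = decode_ceiling_tps(8)

print(f"INT4 ceiling: {int4_ceiling:.1f} passes/s")
print(f"FP8  ceiling: {fp8_ceiling:.1f} passes/s")
print(f"FP8 / INT4:   {fp8_ceiling / int4_ceiling:.2f}x")
```

Doubling the bytes per weight halves the ceiling regardless of the constants chosen, which is the same direction as the measured ~0.6× gap between max and fast.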

Why an INT8-weights swap on max is a no-op:

The "FP8" weights already aren't computed in FP8 — with no FP8 hardware, vLLM runs them via Marlin FP8 W8A16 (dequant → 16-bit matmul). No FP8-compute penalty to recover.
An AutoRound INT8 quant is weight-only, so vLLM serves it via Marlin INT8 W8A16 — the same dequant path. Ampere's native INT8 IMMA cores only engage for W8A8 (quantized activations too), which AutoRound doesn't produce.
FP8 and INT8 weights are both 1 byte → identical VRAM, bandwidth, and decode speed. And fidelity-wise they're a wash too: our weight-quant A/B on the 8-pack is a tie (FP8 110 / INT4 109 / AWQ 105). So the swap moves max on neither axis.
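The W8A16 vs W8A8 distinction above can be sketched in a few lines of NumPy. This is a toy illustration under my own assumptions (symmetric per-row scales, hypothetical function names), not how Marlin or vLLM's kernels are actually written. It only shows where the arithmetic happens: the weight-only path converts weights back to floating point before the matmul, while the W8A8 path also quantizes activations per token and multiplies integers.

```python
import numpy as np


def quantize_per_row(x: np.ndarray):
    """Symmetric INT8 quantization with one scale per row. Returns (q, scale)."""
    qmax = 127
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale).astype(np.float32)
    q = np.clip(np.rint(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale


def matmul_w8a16(x: np.ndarray, w_q: np.ndarray, w_scale: np.ndarray) -> np.ndarray:
    """Weight-only path: dequantize the weights, then do a floating-point matmul."""
    w = w_q.astype(np.float32) * w_scale          # (out, in), back to float
    return x.astype(np.float32) @ w.T


def matmul_w8a8(x: np.ndarray, w_q: np.ndarray, w_scale: np.ndarray) -> np.ndarray:
    """W8A8 path: quantize activations per token, multiply integers, rescale once."""
    x_q, x_scale = quantize_per_row(x)            # dynamic per-token scales
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T   # integer accumulate
    return acc.astype(np.float32) * x_scale * w_scale.T


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64)).astype(np.float32)    # (tokens, in_features)
w = rng.standard_normal((32, 64)).astype(np.float32)   # (out_features, in_features)
w_q, w_scale = quantize_per_row(w)

ref = x @ w.T
err16 = np.linalg.norm(matmul_w8a16(x, w_q, w_scale) - ref) / np.linalg.norm(ref)
err8 = np.linalg.norm(matmul_w8a8(x, w_q, w_scale) - ref) / np.linalg.norm(ref)
print(f"relative error  W8A16: {err16:.4f}   W8A8: {err8:.4f}")
```

Both paths store the same one byte per weight; only the second one does integer multiplies, which is the part that would map onto native INT8 hardware.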

Your low-bit-on-Ampere instinct is right — just on the KV, not the weights. max's differentiator is its KV cache (int8_per_token_head), running a tight 1.13× pool at 262K. That's where low-bit pays off, and vLLM's --kv-cache-dtype menu is growing exactly the right options: int4_per_token_head, turboquant_4bit_nc, the asymmetric turboquant_k3v4_nc / k8v4, fp8_per_token_head. Those are a real lever to widen the KV pool / push max-ctx, and they want a long-context NIAH A/B to evaluate (which the 8-pack can't see).
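As a rough illustration of why the KV dtype is the lever for pool size, the sketch below rescales a fixed KV byte budget across formats. The 295K pool and 262K context come from this thread; the head dimension and the per-(token, head) scale size are hypothetical placeholders, since the model's real attention config and vLLM's actual storage layout are not stated here. Treat the projected pools as directional only.

```python
# Hypothetical layout parameters, chosen only for illustration.
HEAD_DIM = 128      # assumed elements per head
SCALE_BYTES = 4     # assumed one fp32 scale per (token, head) for quantized formats

POOL_K = 295        # KV pool in thousands of tokens (int8-PTH, from the thread)
MAX_CTX_K = 262     # max context in thousands of tokens (from the thread)


def kv_bytes_per_token_head(head_dim: int, bits: int, scale_bytes: int) -> float:
    """Bytes stored for one head's K (or V) vector for one token."""
    return head_dim * bits / 8 + scale_bytes


def rescaled_pool(pool_tokens: float, old_cost: float, new_cost: float) -> float:
    """Tokens that fit in the same byte budget after changing the per-token cost."""
    return pool_tokens * old_cost / new_cost


int8_cost = kv_bytes_per_token_head(HEAD_DIM, 8, SCALE_BYTES)
int4_cost = kv_bytes_per_token_head(HEAD_DIM, 4, SCALE_BYTES)
bf16_cost = kv_bytes_per_token_head(HEAD_DIM, 16, 0)   # no scales for bf16

headroom = POOL_K / MAX_CTX_K
print(f"pool headroom at max context: {headroom:.2f}x")
print(f"projected int4 pool: {rescaled_pool(POOL_K, int8_cost, int4_cost):.0f}K tokens")
print(f"projected bf16 pool: {rescaled_pool(POOL_K, int8_cost, bf16_cost):.0f}K tokens")
```

Under these assumptions a 4-bit KV format nearly doubles the pool, and bf16 roughly halves it, which is why a quality A/B at long context is the test that matters for those dtypes.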

One snag: those KV dtypes land in v0.24.0 (still in dev), and we pin the production tiers to a stable digest. So rather than rebuild the tiers against a nightly, we're parking a re-evaluation of all three — weights and KV together — for when v0.24.0 ships stable. If you're up for it, a KV-dtype A/B on the dual recipe (int8-PTH baseline vs int4-PTH / turboquant-4bit) is the contribution with genuine upside on this hardware — happy to point you at the bench + NIAH harness when v0.24.0 lands. 🙏

4 replies

chriskerley78910 Jun 30, 2026
Author

@noonghunna Some brief tests with an INT8 quant model:

Used bash scripts/soak-test.sh --continuous for all tests.

| Model Quant | KV-Cache Quant | p50 Decode TPS | p95 TTFT (ms) |
|---|---|---|---|
| INT4 Model | FP8 | 179.55 | 6,023 |
| FP8 Model | FP8 | 87.06 | 8,812 |
| INT8 Model | INT8 | 83.50 | 5,602 |
| INT8 Model | BF16 | 96.19 | 3,305 |

INT8 Model Link: https://huggingface.co/Avesed/Qwen3.6-27B-INT8-W8A8

noonghunna Jun 30, 2026
Maintainer

Great datapoint, Chris — and it confirms exactly the split we expected. Through the prefill-vs-decode lens:

TTFT (prefill, compute-bound): the W8A8 model is the fastest of everything — 3.3s (bf16 KV) / 5.6s (int8 KV) vs FP8's 8.8s and even INT4's 6.0s. That's the native-INT8 win: W8A8 runs INT8×INT8 on Ampere's IMMA cores with no dequant, while FP8 and INT4 both pay a Marlin dequant tax in prefill. This is the W8A8 upside from up-thread, and TTFT is exactly where it shows.

Decode (bandwidth-bound): W8A8 (~84–96) ≈ FP8 (~87); INT4 ~2× both. Also as expected — 8-bit weights move the same bytes whether FP8 or INT8, so they tie on decode; only INT4's half-bytes wins it. The format doesn't move decode, the bit-width does.

So W8A8 fills a real gap on this hardware — 8-bit fidelity + the fastest TTFT:

INT4 — decode king (long generation), lower fidelity
FP8-max — 8-bit fidelity, slow TTFT
W8A8 — 8-bit fidelity and fast TTFT → the one to want for prefill / TTFT-bound work (long prompts, agentic loops, RAG)

Nice sub-finding too: on W8A8, bf16 KV beat int8 KV on both axes (96 / 3.3s vs 84 / 5.6s) — the int8-KV quant/dequant overhead isn't paying for itself at that context length. The catch is VRAM (bf16 KV ≈ 2× the bytes → caps max-ctx well below int8-PTH), so it's a speed-vs-context dial.

The one axis your soak doesn't cover — and the gate before this is a tier — is quality. W8A8 quantizes activations too (dynamic per-token), which the weight-only FP8/INT4 don't — exactly where quality can quietly drop (tool-call arg precision, structured output, numeric reasoning). Our weight-only A/B was a tie (FP8 110 / INT4 109); W8A8 is the unknown. If you're up for it, the missing piece is the 8-pack — and since activation quant can interact with reasoning differently, run both arms so we can see the delta:

bash scripts/quality-test.sh --full --no-thinking      # reasoning OFF
bash scripts/quality-test.sh --full --enable-thinking  # reasoning ON

If it holds ~108–110 (and reasoning-on doesn't regress the strict-format packs) alongside that TTFT, W8A8 is a genuinely compelling 8-bit tier — your numbers would feed straight into the fast/balanced/max re-evaluation we've already got lined up for the v0.24.0 stable bump. Excellent find. 🙏

noonghunna Jun 30, 2026
Maintainer

One more thread that's directly related, since you're clearly in the prefill/TTFT zone — there are really two levers here and they're complementary:

W8A8 makes the prefill compute cheaper (native INT8, no dequant) → faster cold prefill — what your soak just showed.
LMCache skips prefill for prefixes it has already seen (persistent prefix-KV reuse) → warm TTFT ~7–8× lower than cold in our testing.
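The reuse lever can be pictured with a toy prefix cache. This is a minimal sketch of the general idea (block-wise chained hashing of token IDs), not LMCache's API or storage design; the class name, block size, and hashing scheme are all made up for illustration.

```python
import hashlib

BLOCK = 16  # tokens per cached block (hypothetical)


def block_keys(tokens, block=BLOCK):
    """One key per full block; each key covers the whole prefix up to that block."""
    keys = []
    h = hashlib.sha256()
    full = len(tokens) - len(tokens) % block
    for i in range(0, full, block):
        h.update(repr(tokens[i:i + block]).encode())
        keys.append(h.copy().hexdigest())   # chained: depends on all earlier blocks
    return keys


class ToyPrefixCache:
    """Tracks which prefixes have been prefilled before (no real KV is stored)."""

    def __init__(self):
        self.seen = set()

    def tokens_to_prefill(self, tokens):
        """Return how many tokens still need prefill compute, then record the prompt."""
        keys = block_keys(tokens)
        hits = 0
        for key in keys:
            if key not in self.seen:
                break
            hits += 1
        self.seen.update(keys)
        return len(tokens) - hits * BLOCK


cache = ToyPrefixCache()
system_prompt = list(range(64))                       # shared 64-token prefix
cold = cache.tokens_to_prefill(system_prompt + [1000, 1001, 1002])
warm = cache.tokens_to_prefill(system_prompt + [2000, 2001])
print(f"cold request prefills {cold} tokens, warm request prefills {warm}")
```

A cheaper prefill kernel shrinks the cost of every token that still has to be computed, while a prefix cache shrinks the number of such tokens, which is why the two levers stack.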

We've already got LMCache wired as an opt-in incubating slug — vllm/qwen-27b-dual-lmcache: byte-identical serving to dual-max plus an MP/HMA tiered prefix-KV cache, validated in #133 with zero decode penalty (the offload is async/overlapped). It bases on dual-max (FP8) today, but the connector is weight-quant-agnostic, so the interesting endgame for agentic/long-prompt work is W8A8 (cheap compute) × LMCache (reuse) on the dual base — the fastest-possible TTFT, with INT4 staying the decode tier.

Separate investigation from the W8A8 quality gate, but the same niche — flagging it in case it's useful for what you're exploring. 🙏

walmis Jul 1, 2026

While looking at prefix reuse strategies, it's also worth noting that vLLM now has a native CPU offloading path built right in. It achieves a similar CPU RAM offloading result for the KV cache but keeps things entirely within the native architecture without introducing external dependencies.

For anyone testing similar architectures, you can spin it up directly using:

--kv_offloading_backend native
--kv_offloading_size 16

Definitely a parallel path worth benchmarking alongside LMCache to see how the async overlapping compares against the native implementation.

noonghunna · 2026-06-30T23:59:42Z

Maintainer

Closing the loop with controlled numbers on vLLM v0.24.0. I ran your INT8 idea head-to-head against the FP8 dual-max — same compose, only --model + --quantization differ, so same int8-PTH KV / 295K pool / MTP n=3 / TP=2 / 262K / sampling (a clean isolation of the weight scheme). Both on the new v0.24.0 stable, which is also where W8A8 loads via the native CutlassInt8ScaledMM kernel — FP8 falls back to MarlinFP8ScaledMM weight-only dequant, and vLLM literally warns it "may degrade performance for compute-heavy workloads."

⚠️ Quant note: your first link, Minachist/…-INT8-AutoRound, is W8A16 (INT8 weights only, 16-bit activations) — that dequants like FP8 and does not hit the native-INT8 path. The one that does is a W8A8 (INT8 weights and activations) — I used Avesed/Qwen3.6-27B-INT8-W8A8.

Results (2× 3090, v0.24.0, caps 370/420 W):

| Metric | FP8 | W8A8 |
|---|---|---|
| Quality (8-pack `--full`) | 107/150 | 107/150 (TIE, ±1 per-pack) |
| Decode TPS (narr / code) | 82 / 105 | 76 / 96 |
| Short-prompt TTFT | 158 ms | 122 ms |
| Prefill t/s @10k → 90K | 1364 → 875 | 2062 → 1021 (+17–51%) |
| KV pool · VRAM · NIAH | 295K · 21.4 GB · 240K | identical |

Takeaways:

"No quality loss" — confirmed. 107=107 on the verifier-backed 8-pack, pack-for-pack within ±1 — including tool-call / structured / numeric, where activation-quant would bite if anywhere. So quantizing activations cost nothing measurable here.
The boost is real — but it's prefill, not decode. W8A8's native INT8 wins TTFT/prefill by +17–51% (lowest TTFT of any quant — matches your soak's 3.3 s). But FP8 actually wins decode by ~8% (its Marlin kernel is better-tuned for single-token). Your premise — "3090s do INT8 natively → boost" — is dead-on for the compute-bound prefill path; decode is bandwidth-bound and tips the other way.
So it reads as a prefill-vs-decode tradeoff, not a replacement. Equal quality, equal pool, equal VRAM — W8A8 is the prefill/agentic corner, FP8 the decode corner. That points toward keeping both rather than swapping one for the other — but this is still an experiment on my rig; nothing's committed to the catalog yet, and I'd want cross-rig numbers + the thinking-on agentic arm before deciding.
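To see where the tradeoff flips, the sketch below recomputes the deltas from the results table and estimates a break-even output length. It uses a deliberately simple latency model (prompt tokens divided by prefill rate, plus output tokens divided by decode rate) with the 10k-prompt prefill rates and the narrative decode rates. That model is my simplification, not something measured in the thread, and it ignores fixed per-request overhead and any rate change with context length.

```python
# Measured rates from the results table (2x 3090, v0.24.0).
FP8 = {"prefill_tps": 1364, "decode_tps": 82}    # prefill @10k, narrative decode
W8A8 = {"prefill_tps": 2062, "decode_tps": 76}


def pct_gain(new: float, old: float) -> float:
    """Percent by which `new` exceeds `old`."""
    return (new / old - 1) * 100


def request_seconds(prompt_tokens: int, output_tokens: int, rates: dict) -> float:
    """Simplified end-to-end time: prefill phase plus decode phase."""
    return prompt_tokens / rates["prefill_tps"] + output_tokens / rates["decode_tps"]


def break_even_output_tokens(prompt_tokens: int) -> float:
    """Output length at which FP8's faster decode cancels W8A8's prefill saving."""
    prefill_saved = prompt_tokens / FP8["prefill_tps"] - prompt_tokens / W8A8["prefill_tps"]
    extra_per_token = 1 / W8A8["decode_tps"] - 1 / FP8["decode_tps"]
    return prefill_saved / extra_per_token


print(f"prefill gain @10k: {pct_gain(2062, 1364):.0f}%")
print(f"prefill gain @90K: {pct_gain(1021, 875):.0f}%")
print(f"FP8 decode lead (narr): {pct_gain(82, 76):.0f}%")
print(f"break-even output for a 10k prompt: {break_even_output_tokens(10_000):.0f} tokens")
```

Under this model, a 10k-token prompt favors W8A8 until the response runs to a few thousand tokens, which matches the reading that W8A8 suits long-prompt and agentic work while FP8 suits long generations.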

Huge thanks for driving this, Chris — it filled a genuinely empty corner of the tier map. Open follow-ups: a thinking-on agentic re-run (tracking tied so far) and a bf16-KV W8A8 (your config — likely even faster prefill at a ctx cost).

1 reply

chriskerley78910 Jul 1, 2026
Author

@noonghunna You are way ahead of me! I was thinking of running the benchmark today, but you already did it!
