Recommended settings for 2x3090s for 8-bit weight and 8-bit KV-cache #341

@chriskerley78910 — ran the controlled A/B for exactly this. Short version: 8-bit weights + 8-bit KV boot and serve fine at full 262K on 2× 3090, but on our 8-pack they don't out-quality the 4-bit default — and they're slower with a smaller KV pool. Where 8-bit should help (long-context fidelity) is still untested.

Three dual configs, stock vLLM v0.22.0, TP=2, MTP n=3, same harness/day:

Serving (2× 3090 PCIe, @262K)

| tier | weights | KV | decode TPS | KV pool @262K |
|---|---|---|---|---|
| fast `vllm/dual` | AutoRound int4 | fp8_e5m2 | ~89 code | 622K / 2.37× |
| balanced | AWQ int4 | int8-PTH | ~67 | 370K / 1.41× |
| max `vllm/qwen-27b-dual-max` | FP8 (8-bit) | int8-PTH | ~56 | 295K / 1.13× |
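The second number in the "KV pool @262K" column reads as the pool size divided by the context window, i.e. how many full-length contexts fit in the pool. A minimal sketch of that arithmetic, assuming "262K" means 262,144 tokens and that the pool sizes are rounded to the nearest thousand tokens (neither is stated in the table):

```python
# Assumptions (not stated in the thread): "262K" = 262,144 tokens, and the
# pool sizes ("622K" etc.) are rounded to the nearest thousand tokens.
CONTEXT_TOKENS = 262_144

kv_pool_tokens = {"fast": 622_000, "balanced": 370_000, "max": 295_000}


def full_context_multiple(pool_tokens: int, context_tokens: int = CONTEXT_TOKENS) -> float:
    """Return how many full-length contexts fit in the KV pool."""
    return pool_tokens / context_tokens


multiples = {
    tier: round(full_context_multiple(pool), 2)
    for tier, pool in kv_pool_tokens.items()
}
print(multiples)
```

Under those assumptions the ratios match the table's 2.37×, 1.41× and 1.13×.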

Quality — 8-pack --full, same harness (2026-06-07)

| | fast | balanced | max |
|---|---|---|---|
| deterministic /75 | 64 | 64 | 65 |
| sandbox /75 | 45 | 41 | 45 |
| TOTAL /150 | 109 | 105 | 110 |
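A quick way to sanity-check the totals and see how far apart the three tiers are. The `NOISE_BAND` value is an assumption: it takes the lower edge of the ±5–7 run-to-run noise quoted in this thread as the cut-off for calling results a tie.

```python
# Scores copied from the quality table above.
scores = {
    "fast": {"deterministic": 64, "sandbox": 45},
    "balanced": {"deterministic": 64, "sandbox": 41},
    "max": {"deterministic": 65, "sandbox": 45},
}

# Assumption: use the lower edge of the quoted ±5–7 noise as the tie threshold.
NOISE_BAND = 5

totals = {tier: sum(parts.values()) for tier, parts in scores.items()}
spread = max(totals.values()) - min(totals.values())
within_noise = spread <= NOISE_BAND

print(totals, spread, within_noise)
```

The whole spread comes from the sandbox packs; the deterministic packs differ by a single point.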

Takeaways

Quality is a tie — 105/109/110 is inside ±5–7 8-pack noise; deterministic packs 64/64/65. The 8-bit max does not beat the 4-bit default on behavioral quality.
8-bit costs speed + KV pool. FP8 weights run via Marlin W8A16 on Ampere (memory layout only — no native FP8 compute on sm_86), so ~37% slower decode, and the heavier weights shrink the KV pool.
Where 8-bit KV should win is long context (needle recall / many-turn drift) — the short-ctx 8-pack can't see it. That NIAH A/B is the open follow-up; until it lands, the int4 default is right for 2× 3090.
On your "4-bit degrades as context grows for tool calls" point — that's exactly the long-ctx-fidelity question the 8-pack doesn't test. If you're seeing int4 degrade specifically at high ctx, that's the strongest case for the int8-PTH KV tiers, and the test we still owe.
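The "~37% slower decode" figure follows from the decode TPS column of the serving table. A small sketch of that calculation, using the approximate TPS values as given (the function name is just illustrative):

```python
# Approximate decode TPS values from the serving table.
decode_tps = {"fast": 89, "balanced": 67, "max": 56}


def slowdown_pct(baseline_tps: float, candidate_tps: float) -> float:
    """Percentage drop in decode throughput relative to the baseline tier."""
    return 100.0 * (baseline_tps - candidate_tps) / baseline_tps


max_vs_fast = slowdown_pct(decode_tps["fast"], decode_tps["max"])
balanced_vs_fast = slowdown_pct(decode_tps["fast"], decode_tps["balanced"])

print(f"max vs fast: {max_vs_fast:.1f}% slower")
print(f"balanced vs fast: {balanced_vs_fast:.1f}% slower")
```

Since the table values are rounded ("~89", "~56"), the result is only good to about a percentage point.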

vllm/qwen-27b-dual-max shipped in #340; vllm/qwen-27b-dual-balanced is in #343.

(All vLLM. We have llama.cpp/ik/beellama 8-pack numbers too, but single-card only so far — a different, more-constrained regime — so I'm not claiming a cross-engine quality verdict; dual-card Q8 is queued.)

1 reply

chriskerley78910 Jun 11, 2026
Author

I am running the 8-bit one now (max, `vllm/qwen-27b-dual-max`) and am unlikely to go back to the others until I can do some more first-hand experiments. Someone ran tests on the effect of KV-cache quantisation, and the results strongly suggest that both 8-bit weights and an 8-bit KV cache are likely necessary for most long-context work.

They defined long documents as follows:

"Long documents: extended context inputs up to ~30k tokens" [2].

Most multi-file refactors I've seen go over 20k tokens of context (and often over 30k). With a 4-bit KV cache on Qwen 3.6 27b, the KL divergence was 0.6 in source [2]. That's roughly equivalent to a 1.1 standard deviation shift from the model's original distribution when handling long contexts. So, assuming their measurements are accurate, 8-bit seems like a reasonable choice for anyone doing professional software engineering work.
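One way to arrive at the "1.1 standard deviation" reading: for two Gaussians with equal variance whose means differ by δ standard deviations, the KL divergence is δ²/2 nats, so δ = √(2·KL). This is an assumed interpretation, not something source [2] is confirmed to use, and it assumes the 0.6 figure is in nats. A minimal sketch:

```python
import math


def equivalent_mean_shift_sigmas(kl_nats: float) -> float:
    """Mean shift, in standard deviations, between two equal-variance
    Gaussians whose KL divergence equals kl_nats (KL = delta**2 / 2)."""
    if kl_nats < 0:
        raise ValueError("KL divergence cannot be negative")
    return math.sqrt(2.0 * kl_nats)


# 0.6 is the KL divergence quoted from source [2]; nats are assumed.
shift = equivalent_mean_shift_sigmas(0.6)
print(f"{shift:.2f} standard deviations")
```

A token distribution is not a Gaussian, so this is only an intuition aid for how large a KL of 0.6 is.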

The source also suggests that increasing precision on the KV cache alone will not make up for low-precision weights; i.e., using Q5 weights inflicted damage similar to using a 4-bit KV cache in terms of KL divergence [2].

Source 1: https://localbench.substack.com/p/gguf-benchmark-methodology
Source 2: https://localbench.substack.com/p/kv-cache-quantization-benchmark

chriskerley78910 Jun 7, 2026

Replies: 2 comments · 2 replies

noonghunna Jun 7, 2026 Maintainer

chriskerley78910 Jun 7, 2026 Author

noonghunna Jun 7, 2026 Maintainer

chriskerley78910 Jun 11, 2026 Author