Recommended settings for 2x3090s for 8-bit weight and 8-bit KV-cache #341
The 4-bit weights seem to degrade too much as the context grows, especially for agent tool calls. Are there any recommended recipes someone can point me to that use 8-bit weights?
Replies: 2 comments 2 replies
Qwen3.6-27B on 2× 3090: does 8-bit weight + 8-bit KV actually help? (measured A/B)

@chriskerley78910, I ran the controlled A/B for exactly this. Short version: 8-bit weights + 8-bit KV boot and serve fine at the full 262K context on 2× 3090, but on our 8-pack they don't beat the 4-bit default on quality, and they're slower with a smaller KV pool. Where 8-bit should help (long-context fidelity) is still untested.

Three dual-card configs, stock vLLM v0.22.0, TP=2, MTP n=3, same harness, same day:

Serving (2× 3090 PCIe, @262K)
Quality — 8-pack
Takeaways
(All results are from vLLM. We also have llama.cpp/ik/beellama 8-pack numbers, but so far only single-card, which is a different, more constrained regime, so I'm not claiming a cross-engine quality verdict; dual-card Q8 is queued.)
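For anyone who wants to try the 8-bit setup described above, here is a rough sketch of how the launch command might be assembled. It is not the exact recipe used for the A/B: the model ID is a hypothetical placeholder, 262144 is my assumed token count for "262K", `fp8` is assumed to be the 8-bit KV format, and the MTP (n=3) speculative settings are left out because the thread doesn't spell them out. The flags themselves (`--tensor-parallel-size`, `--max-model-len`, `--kv-cache-dtype`) are standard `vllm serve` options.

```python
import shlex


def build_vllm_serve_command(model, tensor_parallel_size=2,
                             max_model_len=262144, kv_cache_dtype="fp8"):
    """Assemble a `vllm serve` argument list for a dual-GPU, 8-bit KV setup.

    Defaults mirror the thread (TP=2, 262K context, 8-bit KV); the exact
    values are assumptions, not the author's published recipe.
    """
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),  # one shard per 3090
        "--max-model-len", str(max_model_len),                # assumed 262K = 262144 tokens
        "--kv-cache-dtype", kv_cache_dtype,                   # assumed 8-bit KV format
    ]


# Hypothetical placeholder: substitute the 8-bit checkpoint you actually use.
MODEL = "your-org/your-8bit-checkpoint"

cmd = build_vllm_serve_command(MODEL)
print(shlex.join(cmd))  # copy-pasteable shell command
```

Building the argument list in Python rather than hand-editing a shell line makes it easy to flip one knob at a time (for example `kv_cache_dtype="auto"`) when running your own A/B.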
just released
vllm/qwen-27b-dual-fast
vllm/qwen-27b-dual-max
fast aims for speed (intel/autoround-int4)
max aims for accuracy (qwen/fp8)
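To put the fast/max trade-off in rough numbers, here is a back-of-the-envelope sketch of the weight footprint at 4 and 8 bits per weight. It assumes a flat 27B parameters, treats each 3090 as a nominal 24 GiB, and ignores quantization scales, unquantized layers, activations, and runtime overhead, so read it as an estimate and not a measurement.

```python
def weight_gib(params_billion, bits_per_weight):
    """Approximate weight storage in GiB: parameters x bits / 8 bytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30


PARAMS_BILLION = 27      # assumed flat parameter count for a 27B model
TOTAL_VRAM_GIB = 2 * 24  # two 3090s, nominal capacity (assumption)

for name, bits in (("int4 (fast)", 4), ("fp8 (max)", 8)):
    weights = weight_gib(PARAMS_BILLION, bits)
    left_over = TOTAL_VRAM_GIB - weights
    # Whatever is left must cover KV cache, activations, and overhead.
    print(f"{name}: ~{weights:.1f} GiB weights, ~{left_over:.1f} GiB for everything else")
```

Doubling the bits per weight doubles the weight footprint, which is consistent with the smaller KV pool reported for the 8-bit configuration.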