-
|
@RossNE99, short answer: technically yes, but it'd run slower than your current INT4 path with only marginal quality gain. Worth unpacking why, since it's the kind of misconception that catches a lot of 3090 users.

The Ampere-FP8 catch

The 3090 (Ampere, sm_86) doesn't have native FP8 tensor cores. Those were introduced with Hopper (sm_90) and Ada (sm_89), one generation after Ampere:
When you load
So FP8-on-Ampere = "compressed weight storage with BF16 compute + dequant overhead." You get the VRAM savings but pay 5-10% in compute speed for the dequant tax. No native FP8 acceleration.

How that compares to what you're already running

For 2× 3090 specifically, here's the kernel-path tradeoff:
Marlin INT4 has had years more optimization work than Marlin FP8 on Ampere. The W4A16 path is genuinely the fastest weight-quantized matmul on a 3090.

Net for FP8 on your rig: you'd give up ~30-40% of your INT4 TPS for ~2 percentage points of quality recovery. That's a steep trade for a marginal quality win on this hardware class.

Better paths for "quality between INT4 and Q8"

If the goal is closing the quality gap you're feeling vs llama.cpp Q8, FP8 is the wrong tool on Ampere. Two practical options that don't require building anything new, plus a third if you can wait:

1. Q8 GGUF on dual-card llama.cpp (exact quality match)

You're already familiar with the Q8 quality. llama.cpp can split the model across both 3090s with

2. Stay on INT4 if the speed delta matters more than the quality delta

The 2-3 percentage points of quality between AutoRound INT4 and Q8 is real but mostly visible on long-context, code-edit, and structured-output edge cases. For chat / day-to-day coding it's usually within noise. Worth A/B-ing both paths on YOUR workload before deciding the gap is meaningful; sometimes it isn't.

3. Wait for Qwen 3.6 27B INT8 GPTQ / W8A16 publication

INT8 on Ampere via Marlin W8A16 is the proper "between INT4 and Q8" sweet spot: same VRAM as FP8, faster than FP8 on this hardware, ~99% of BF16 quality. I haven't found one published for 3.6 27B yet (search shows GPTQ-4bit, AWQ-INT4, and Qwen's official FP8, but no INT8). Red Hat / Neural Magic typically catch up within 4-6 weeks of model release for popular ones.

TL;DR
Side note for any future 3090 users reading this: "FP8 quants are great" is a Hopper/Ada/Blackwell statement, not an Ampere one. On 3090 the W4A16 INT4 path (Marlin) is the speed-optimal weight quant; INT8 GPTQ is a quality-friendlier middle ground when published; FP8 is the worst trade. |
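To make the "compressed weight storage with BF16 compute + dequant overhead" point concrete, here is a minimal pure-Python sketch. It assumes the common E4M3 ("fn") FP8 layout and a single per-tensor scale; the function names are made up for illustration, Python floats stand in for BF16, and this is not how vLLM or Marlin actually implement it.

```python
def has_native_fp8(capability):
    """True if a (major, minor) compute capability has FP8 tensor cores.

    Follows the rule of thumb in this thread: Ada (sm_89), Hopper (sm_90)
    and newer have them, Ampere (sm_80 / sm_86) does not.
    """
    return tuple(capability) >= (8, 9)


def decode_e4m3(byte):
    """Decode one FP8 E4M3 byte (1 sign, 4 exponent, 3 mantissa bits, bias 7)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F
    mant = byte & 0x07
    if exp == 0x0F and mant == 0x07:
        return float("nan")  # this variant has NaN but no infinities
    if exp == 0:
        return sign * (mant / 8.0) * 2.0 ** -6  # subnormal values
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)


def dequant_matvec(weight_bytes, scale, x):
    """Weight-only path: expand FP8 weights to wider floats, then multiply.

    The decode step is extra work on every weight, which is the
    "dequant tax" paid when the hardware has no native FP8 math.
    """
    return [
        sum(decode_e4m3(b) * scale * xi for b, xi in zip(row, x))
        for row in weight_bytes
    ]


if __name__ == "__main__":
    print("3090 (8, 6) native FP8:", has_native_fp8((8, 6)))
    print("4090 (8, 9) native FP8:", has_native_fp8((8, 9)))
    print(dequant_matvec([[0x38, 0x40]], 0.5, [2.0, 1.0]))
```

On a card where `has_native_fp8` is true, the matmul can consume the FP8 values directly, so the decode step in `dequant_matvec` does not sit on the hot path in the same way.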
-
|
Hi @noonghunna
How about https://huggingface.co/Minachist/Qwen3.6-27B-INT8-AutoRound? However, I'm getting this vLLM error with this model: torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device. |
-
|
Hi @gm843838383, thanks for surfacing the Minachist quant; I wasn't aware it existed. The CUDA error you're seeing is a known Marlin gotcha on Ampere consumer cards. It typically means vLLM dispatched the INT8 W8A16 path to Marlin, but Marlin doesn't have a compiled kernel for the specific (8-bit weights, 16-bit activations, sm_86) tuple your config produces. Marlin's well-trodden Ampere path is INT4 W4A16 (what Lorbus AutoRound INT4 uses); INT8 W8A16 support on sm_86 is partial / regressing across vLLM versions.

Quick diagnostic steps
Realistic recommendation

For Qwen3.6-27B on Ampere consumer today, the well-trodden production path is Lorbus AutoRound INT4 + TQ3 KV ( ships this). It nets ~85 TPS @ 125K context single-card and handles tool calls, vision, MTP, all working. INT8 weights would give marginally better quality, but Ampere's Marlin INT8 W8A16 path isn't reliable enough today to recommend over INT4 W4A16.

If you specifically need INT8 quality (some users do for nuanced reasoning), the INT4 weights + INT8 PTH KV path on Gemma 4 () is an alternate point on the curve: it gives you 8-bit precision in the KV cache without relying on Marlin's INT8 weight kernels. Different model family, but if model choice is flexible it's a known-working path.

Worth filing with Minachist + Intel AutoRound (https://github.com/intel/auto-round/issues) to cross-check whether their INT8 W8A16 quant has known sm_86 compatibility caveats. If you do file, link this discussion and I'll subscribe to track it for the broader rig audience. |
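For the "no kernel image is available" error discussed above, one generic sanity check is whether a build's list of compiled architectures can run on the card at all. The sketch below is an assumption-laden helper, not vLLM's or Marlin's dispatch logic: the function names are hypothetical, and it only encodes the usual CUDA rules that an `sm_XY` binary runs on the same major version with an equal or higher minor, and that `compute_XY` PTX can be JIT-compiled for equal or newer devices.

```python
import re


def parse_arch(entry):
    """Split e.g. 'sm_86' or 'compute_90' into (kind, major, minor, suffix)."""
    m = re.fullmatch(r"(sm|compute)_(\d+)(\d)([a-z]?)", entry)
    if m is None:
        return None
    kind, major, minor, suffix = m.groups()
    return kind, int(major), int(minor), suffix


def can_run(capability, arch_list):
    """Return True if any entry in arch_list is usable on this device."""
    major, minor = capability
    for entry in arch_list:
        parsed = parse_arch(entry)
        if parsed is None:
            continue  # skip entries this sketch does not understand
        kind, a_major, a_minor, suffix = parsed
        if suffix:
            # Suffixed targets such as 'sm_90a' are treated as exact-match only
            if (a_major, a_minor) == (major, minor):
                return True
            continue
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True  # binary compatible within the same major version
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True  # PTX can be JIT-compiled for this device
    return False


if __name__ == "__main__":
    # A 3090 reports compute capability (8, 6)
    print(can_run((8, 6), ["sm_80", "sm_90"]))
    print(can_run((8, 6), ["sm_89", "sm_90"]))
```

With PyTorch installed, the two inputs can be read from `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`. That only describes the PyTorch build itself; vLLM's custom kernels are compiled separately, so a passing check here does not rule out a missing Marlin kernel.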
-
|
I'm currently running the official FP8 model on dual 3090s with NVLink:
vllm serve /mnt/data/AI/Qwen3.6-27B-FP8
With these settings I get roughly 40 T/s, which is already enough for me, although I'm still learning how to optimize it further. Of course, I'm also hoping for an INT8 version. |
-
Would it be possible to run Qwen 3.6 27B FP8 on 2x 3090s? I'm really impressed with the INT4 speed, but coming from a llama.cpp Q8 quant, I feel like there is a bit of a quality hit.
I wouldn't mind dropping to 1 parallel request if needed, but ideally I wouldn't want to use less than 200k context.
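A rough way to reason about whether a given weight format plus 200k context fits in 48 GB is back-of-the-envelope arithmetic like the sketch below. The layer count, KV-head count, and head dimension are hypothetical placeholders, not Qwen 3.6 27B's actual architecture, and the formula assumes a plain attention KV cache in every layer while ignoring activations, quantization scales, and framework overhead.

```python
GIB = 2**30


def weight_bytes(n_params, bits_per_weight):
    """Bytes needed to store the weights alone."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Bytes for the KV cache of one sequence of the given length."""
    # Factor 2: one key and one value vector per layer and KV head per token
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * tokens


def headroom_gib(total_vram_gib, n_params, bits_per_weight, **kv_args):
    """VRAM left after weights and KV cache; negative means it does not fit."""
    used = weight_bytes(n_params, bits_per_weight) + kv_cache_bytes(**kv_args)
    return total_vram_gib - used / GIB


if __name__ == "__main__":
    # HYPOTHETICAL architecture numbers, for illustration only
    arch = dict(n_layers=64, n_kv_heads=8, head_dim=128)
    for bits in (4, 8):            # INT4 vs FP8/INT8 weights
        for kv_elem in (1, 2):     # 8-bit vs 16-bit KV cache entries
            left = headroom_gib(48, 27e9, bits, tokens=200_000,
                                bytes_per_elem=kv_elem, **arch)
            print(f"{bits}-bit weights, {kv_elem}-byte KV: {left:.1f} GiB left")
```

Swapping in the real values from the model's config is what makes the numbers meaningful; the point of the sketch is only that 8-bit weights roughly double the weight footprint relative to INT4, which comes straight out of the long-context budget.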