Qwen 3.6 27B FP8 ? #100

RossNE99 · 2026-05-07T20:00:56Z

RossNE99
May 7, 2026

Would it be possible to run Qwen 3.6 27B FP8 on 2x 3090's? Really impressed with the INT4 speed, but coming from Llama cpp q8 quant, I feel like there is a bit of a quality hit.

I wouldn't mind dropping to 1 parallel request if needed, but ideally, I wouldn't want to use less than 200k context

noonghunna · 2026-05-07T21:28:09Z

noonghunna
May 7, 2026
Maintainer

@RossNE99 — short answer: technically yes, but it'd run slower than your current INT4 path with only marginal quality gain. Worth unpacking why, since it's the kind of misconception that catches a lot of 3090 users.

The Ampere-FP8 catch

The 3090 (Ampere sm_86) doesn't have native FP8 tensor cores — those were introduced with Hopper (sm_90) and Ada (sm_89), one generation after Ampere:

GPU	Native FP8 tensor cores	FP8 compute throughput
3090 (Ampere sm_86)	❌ No	N/A — falls back to BF16 path
4090 (Ada sm_89)	✅ E4M3/E5M2	~1.3 PFLOPS
5090 (Blackwell sm_120)	✅ E4M3/E5M2 + FP4	~3.4 PFLOPS
H100 (Hopper sm_90)	✅ E4M3/E5M2	~2.0 PFLOPS

When you load Qwen/Qwen3.6-27B-FP8 on a 3090, vLLM doesn't error — but it routes through the Marlin FP8 W8A16 path:

Weights stay stored as FP8 in VRAM (saves ~50% vs BF16 storage)
At each matmul, a dequant kernel converts FP8 → BF16 in registers
Standard Ampere BF16 × BF16 tensor core GEMM runs

So FP8-on-Ampere = "compressed weight storage with BF16 compute + dequant overhead." You get the VRAM savings but pay 5-10% in compute speed for the dequant tax. No native FP8 acceleration.

How that compares to what you're already running

For 2× 3090 specifically, here's the kernel-path tradeoff:

Path	VRAM (TP=2, per card)	Compute speed on Ampere	Quality vs BF16
BF16 native	won't fit at 27B	100% (baseline)	identical
FP8 (Qwen official, Marlin W8A16 on Ampere)	~13.5 GB weights	~95% (dequant tax)	~99%
GPTQ/AWQ INT8	~13.5 GB weights	~110% (Marlin W8A16 mature)	~99%
INT4 AutoRound (Lorbus, what you're on)	~7 GB weights	~150% (Marlin W4A16, peak Ampere kernel)	~97%

Marlin INT4 has had years more optimization work than Marlin FP8 on Ampere. The W4A16 path is genuinely the fastest weight-quantized matmul on a 3090.

Net for FP8 on your rig: you'd give up ~30-40% of your INT4 TPS for ~2 percentage points of quality recovery. That's a steep trade for a marginal quality win on this hardware class.

Better paths for "quality between INT4 and Q8"

If the goal is closing the quality gap you're feeling vs llama.cpp Q8, FP8 is the wrong tool on Ampere. Two practical options that don't require building anything new:

1. Q8 GGUF on dual-card llama.cpp (exact quality match)

You're already familiar with the Q8 quality. llama.cpp can split the model across both 3090s with --split-mode layer --tensor-split 1,1. Slower than vLLM INT4 (probably 25-30 TPS aggregate vs 170 INT4) but exactly the quality you measured was good. The shipped models/qwen3.6-27b/llama-cpp/compose/docker-compose.yml already supports dual-card via env override.

2. Stay on INT4 if the speed delta matters more than the quality delta

The 2-3 percentage points of quality between AutoRound INT4 and Q8 is real but mostly visible on long-context, code-edit, and structured-output edge cases. For chat / day-to-day coding it's usually within noise. Worth A/B-ing both paths on YOUR workload before deciding the gap is meaningful — sometimes it isn't.

3. Wait for Qwen 3.6 27B INT8 GPTQ / W8A16 publication

INT8 on Ampere via Marlin W8A16 is the proper "between INT4 and Q8" sweet spot — same VRAM as FP8, faster than FP8 on this hardware, ~99% of BF16 quality. I haven't found one published for 3.6 27B yet (search shows GPTQ-4bit, AWQ-INT4, and Qwen's official FP8, but no INT8). RedHat / Neural Magic typically catch up within 4-6 weeks of model release for popular ones.

TL;DR

Goal	Best path on 2× 3090
Match Q8 quality exactly	llama.cpp Q8 dual-card — slower but exact quality match
Keep current INT4 speed	Stay on Lorbus INT4 — A/B against Q8 first to see if the gap actually matters for your workload
Future-proof "quality middle"	Watch for INT8 GPTQ publication on HF
FP8 specifically	Skip — Ampere doesn't have native FP8, you'd pay overhead for no benefit

Side note for any future 3090 users reading this: "FP8 quants are great" is a Hopper/Ada/Blackwell statement, not an Ampere one. On 3090 the W4A16 INT4 path (Marlin) is the speed-optimal weight quant; INT8 GPTQ is a quality-friendlier middle ground when published; FP8 is the worst trade.

2 replies

chriskerley78910 Jun 30, 2026

@noonghunna Seems there is now a INT8 W8A8 publication (close enough?)

https://huggingface.co/Avesed/Qwen3.6-27B-INT8-W8A8

I did some benchmarks, and it seems faster than the current fp8 max in terms of TTFT.

noonghunna Jun 30, 2026
Maintainer

@chriskerley78910 meaningful update to this thread — when the original answer here was written there was no INT8 W8A8 published (only weight-only W8A16, which doesn't help on Ampere). Avesed/Qwen3.6-27B-INT8-W8A8 is the real thing: W8A8 engages Ampere's native INT8 tensor cores (FP8 has no native path here), so your faster-TTFT result is expected — prefill is compute-bound and W8A8 skips the Marlin dequant tax FP8 pays. It won't beat INT4 on decode (8-bit = 2× INT4's bytes → bandwidth-bound), but it's the fastest prefill / TTFT option at 8-bit fidelity. Full numbers + where it fits in #515 — where I've asked for a with/without-reasoning 8-pack, since W8A8 quantizes activations and that's the open quality question. Nice hunt. 🙏

gm843838383 · 2026-05-08T14:55:40Z

gm843838383
May 8, 2026

Hi @noonghunna
First, thank you for the great work you are putting daily into this project!

I haven't found one published for 3.6 27B yet (search shows GPTQ-4bit, AWQ-INT4, and Qwen's official FP8, but no INT8). RedHat / Neural Magic typically catch up within 4-6 weeks of model release for popular ones.

How about https://huggingface.co/Minachist/Qwen3.6-27B-INT8-AutoRound ?
It's Autoround W8A16.

However I'm getting this vllm error with this model : torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device.

0 replies

noonghunna · 2026-05-08T19:33:22Z

noonghunna
May 8, 2026
Maintainer

Hi @gm843838383, thanks for surfacing the Minachist quant — wasn't aware it existed. The CUDA error you're seeing is a known Marlin gotcha on Ampere consumer:

torch.AcceleratorError: CUDA error: no kernel image is available for execution on the device

This typically means vLLM dispatched the INT8 W8A16 path to Marlin, but Marlin doesn't have a compiled kernel for the specific (8-bit weights, 16-bit activations, sm_86) tuple your config produces. Marlin's well-trodden Ampere path is INT4 W4A16 (what Lorbus AutoRound INT4 uses); INT8 W8A16 support on sm_86 is partial / regressing across vLLM versions.

Quick diagnostic steps

Check vLLM's boot log for the actual quantization config Marlin tried to use:
```
docker logs <container> 2>&1 | grep -E "Marlin|quant_config|quantization" | head -20
```
You're looking for something like bits=8, group_size=128, sym=False, weight_dtype=int8. The combination of those four parameters tells you which Marlin kernel slot needs to exist on sm_86.
**Try ** to confirm the issue is in cudagraph capture vs the compute path itself. If error moves earlier in boot, it's definitely a kernel-dispatch issue (not graph capture).
Try forcing the quant backend explicitly: or (depending on the model card metadata). Some quants get auto-routed to different backends based on heuristics.

Realistic recommendation

For Qwen3.6-27B on Ampere consumer today, the well-trodden production path is Lorbus AutoRound INT4 + TQ3 KV ( ships this). It nets ~85 TPS @ 125K context single-card and handles tool calls, vision, MTP — all working. INT8 weights would give marginally better quality but Ampere's Marlin INT8 W8A16 path isn't reliable enough today to recommend over INT4 W4A16.

If you specifically need INT8 quality (some users do for nuanced reasoning), the INT4 weights + INT8 PTH KV path on Gemma 4 () is an alternate point on the curve — gives you 8-bit precision in the KV cache without relying on Marlin's INT8 weight kernels. Different model family, but if model choice is flexible it's a known-working path.

Worth filing with Minachist + Intel AutoRound (https://github.com/intel/auto-round/issues) — cross-checking whether their INT8 W8A16 quant has known sm_86 compatibility caveats. If you do file, link this discussion and I'll subscribe to track it for the broader rig audience.

0 replies

ghost1252 · 2026-05-12T13:25:31Z

ghost1252
May 12, 2026

I am currently using the official FP8 model and I am using the dual 3090 with NVLINK.
You can see from my parameter configuration that I used it without any third-party optimization:

vllm serve /mnt/data/AI/Qwen3.6-27B-FP8
--tensor-parallel-size 2
--gpu-memory-utilization 0.97
--max-model-len 132000
--max-num-seqs 2
--trust-remote-code
--port 8000
--reasoning-parser qwen3
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--default-chat-template-kwargs '{"enable_thinking": false}'

Under such settings, a speed of approximately 40T/S can be achieved - which is already sufficient for me, although I am still learning how to further optimize it.

Of course, I also hope to have INT8

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Qwen 3.6 27B FP8 ? #100

Uh oh!

{{title}}

Uh oh!

Replies: 4 comments 2 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Qwen 3.6 27B FP8 ? #100

Uh oh!

RossNE99 May 7, 2026

Replies: 4 comments · 2 replies

Uh oh!

Uh oh!

noonghunna May 7, 2026 Maintainer

The Ampere-FP8 catch

How that compares to what you're already running

Better paths for "quality between INT4 and Q8"

1. Q8 GGUF on dual-card llama.cpp (exact quality match)

2. Stay on INT4 if the speed delta matters more than the quality delta

3. Wait for Qwen 3.6 27B INT8 GPTQ / W8A16 publication

TL;DR

Uh oh!

chriskerley78910 Jun 30, 2026

Uh oh!

noonghunna Jun 30, 2026 Maintainer

Uh oh!

gm843838383 May 8, 2026

Uh oh!

noonghunna May 8, 2026 Maintainer

Quick diagnostic steps

Realistic recommendation

Uh oh!

ghost1252 May 12, 2026

RossNE99
May 7, 2026

Replies: 4 comments 2 replies

noonghunna
May 7, 2026
Maintainer

noonghunna Jun 30, 2026
Maintainer

gm843838383
May 8, 2026

noonghunna
May 8, 2026
Maintainer

ghost1252
May 12, 2026