NOTE: Gemma models have not been tested for perplexity (PPL) or quality. YMMV.
Serve Qwen3.6-35B-A3B, Qwen3.6-27B, Qwen3.5-27B, Gemma 4 26B, and other large models with long context on a single consumer GPU using RotorQuant KV cache compression (4.9x at 3-bit, 97% of fp16 decode speed).
KV defaults: `iso3` for MoE models (Qwen3.6-35B-A3B), `planar3` for dense models (Qwen3.6-27B, Qwen3.5-27B). Both are 3.125 bits per element (4.9x compression). Benchmarks show the rotation type matters: iso beats planar on MoE; planar beats iso on dense.
Before doing perf work or planning a new substrate, read:

- `docs/INFERENCE_LESSONS.md`: verdicts on optimization techniques (what works, what doesn't, and when), a profile-first decision tree, and roadmap recommendations carried over from the vortex Sprint 024 investigation.
- `docs/THROUGHPUT_CONFIGURATION_MODEL.md`: the bandwidth-vs-compute regime model for picking resident-vs-streaming × N-parallel × ctx configurations.
Constants in those docs were measured on RTX 4090 / Qwen 27B; recalibrate on RTX 5090 / Qwen 3.6 before citing.
```bash
# 1. Build
make build

# 2. Run (first run downloads the model, ~22 GB)
make run-qwen

# 3. Query
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen","messages":[{"role":"user","content":"Hello!"}],"temperature":1.0,"top_p":0.95,"max_tokens":500}'
```
| Command | Model | Weights | Active Params | KV default |
|---|---|---|---|---|
| `make run-qwen` | Qwen3.6-35B-A3B (default) | UD-Q4_K_XL (20.8 GB) | 3B active (MoE) | iso3 |
| `docker compose --profile qwen36-q3 up` | Qwen3.6-35B-A3B | UD-Q3_K_XL (16.8 GB) | 3B active (MoE) | planar4 |
| `docker compose --profile qwen36-iq3 up` | Qwen3.6-35B-A3B | UD-IQ3_XXS (13.2 GB) | 3B active (MoE) | planar4 |
| `make run-qwen36-27b` | Qwen3.6-27B (dense) | UD-Q4_K_XL (16.4 GB) | 27B dense | planar3 |
| `docker compose --profile qwen36-27b-q3 up` | Qwen3.6-27B | UD-Q3_K_XL (~12 GB) | 27B dense | planar3 |
| `docker compose --profile qwen36-27b-iq3 up` | Qwen3.6-27B | UD-IQ3_XXS (~9 GB) | 27B dense | planar3 |
| `make run-reasoning` | Qwen3.5-27B Claude Opus Distilled | i1-Q4_K_M (16.6 GB) | 27B dense | planar3 |
| `make run-gemma` | Gemma 4 26B MoE | UD-Q4_K_XL (17.1 GB) | 3.8B active | planar4 |
Or use Docker Compose directly:
```bash
docker compose --profile qwen up            # Qwen3.6-35B-A3B Q4 (32 GB+), iso3
docker compose --profile qwen36-q3 up       # Qwen3.6-35B-A3B Q3 (24 GB)
docker compose --profile qwen36-iq3 up      # Qwen3.6-35B-A3B IQ3 (16 GB)
docker compose --profile qwen36-27b up      # Qwen3.6-27B dense (24 GB+), planar3
docker compose --profile qwen36-27b-q3 up   # Qwen3.6-27B Q3 (16 GB)
docker compose --profile reasoning up       # Qwen3.5-27B reasoning-tuned
docker compose --profile gemma up           # Gemma 4 MoE
```

Recommended sampling defaults:

| Use Case | Temperature | Top-P | Top-K | Presence Penalty |
|---|---|---|---|---|
| General | 1.0 | 0.95 | 20 | 1.5 |
| Coding / precise | 0.6 | 0.95 | 20 | 0.0 |
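These map directly onto request fields. For example, the coding/precise row as a payload for the client sketch above; `top_k` and `presence_penalty` follow llama.cpp's OpenAI-compatible extensions, so treat those two fields as assumptions on other backends:

```python
# Coding / precise sampling profile from the table above.
coding_payload = {
    "model": "qwen",
    "messages": [{"role": "user", "content": "Write a binary search in Python."}],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "presence_penalty": 0.0,
    "max_tokens": 500,
}
```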
Override defaults via environment variables:
```bash
# Reduce context for lower-VRAM GPUs
CTX_SIZE=32768 make run-qwen

# Disable RotorQuant (fp16 KV cache)
KV_CACHE_TYPE=f16 make run-qwen

# Custom port
PORT=9090 make run-qwen

# Gated model access
HF_TOKEN=hf_xxx make run-reasoning
```
| Variable | Default | Description |
|---|---|---|
| `KV_CACHE_TYPE` | per-profile | KV cache type: `iso3` (MoE default), `planar3` (dense default), `planar4`, `iso4`, `f16` |
| `CTX_SIZE` | per-model | Context window (e.g. 114688 for Q4_K_M on 24 GB) |
| `PORT` | 8080 | API port |
| `GPU_LAYERS` | 99 | Layers on GPU (99 = all) |
| `N_PARALLEL` | 2 | Concurrent request slots (set higher for throughput mode) |
| `CACHE_RAM` | 8192 | Prompt cache size in MiB (system RAM, not VRAM) |
| `HF_TOKEN` | — | HuggingFace token for gated models |
Recommended configurations per VRAM tier, based on measured perplexity (see docs/BENCHMARK-REPORT.md and docs/QUANTIZATION-GUIDE.md for full data).
On 32 GB GPUs, use Qwen3.6-35B-A3B (`make run-qwen`): 196 tok/s decode, 262K context, iso3 KV (4.9x compression).
| Use Case | Quant | Size | KV | Context | Command |
|---|---|---|---|---|---|
| Default | UD-Q4_K_XL | 20.8 GB | iso3 | 262K | make run-qwen |
| Max context | UD-Q4_K_XL | 20.8 GB | iso3 | 262K × 2 slots | default |
| Max throughput | UD-Q4_K_XL | 20.8 GB | iso3 | 65K × 8 slots | --profile qwen36-throughput |
On 24 GB GPUs, use Qwen3.6-27B (`--profile qwen36-27b`): 131K context, planar3 KV (4.9x compression).
| Use Case | Quant | Size | KV | Context |
|---|---|---|---|---|
| Recommended | UD-Q4_K_XL | 16.4 GB | planar3 | 131K |
| More context | UD-Q3_K_XL | ~12 GB | planar3 | ~180K |
| Max context | UD-IQ3_XXS | ~9 GB | planar3 | ~240K |
On 16 GB GPUs, Qwen3.5-27B Q4_K_M doesn't fit; use the Unsloth imatrix quants (Qwen3.5-27B):
| Use Case | Quant | Size | PPL | Context (planar3) |
|---|---|---|---|---|
| Best quality | IQ4_XS | 13.9 GB | 6.29 | ~18K |
| Recommended | UD-Q3_K_XL | 13.4 GB | 6.38 | ~36K |
| Max context | UD-IQ3_XXS | 10.7 GB | 6.62 | ~74K |
UD-Q3_K_XL is the sweet spot: only +0.09 PPL over 4-bit (6.38 vs 6.29) for 2× more context.
| GPU | Model | Quant | PPL | KV | Context | Command |
|---|---|---|---|---|---|---|
| 32 GB | Qwen3.6-35B-A3B | UD-Q4_K_XL | 6.13 | iso3 | 262K | make run-qwen |
| 24 GB | Qwen3.6-27B | UD-Q4_K_XL | 7.09 | planar3 | 131K | --profile qwen36-27b |
| 24 GB | Qwen3.6-27B | UD-Q3_K_XL | — | planar3 | ~180K | --profile qwen36-27b-q3 |
| 16 GB | Qwen3.6-35B-A3B | UD-IQ3_XXS | — | planar4 | ~32K | --profile qwen36-iq3 |
| 16 GB | Qwen3.6-27B | UD-IQ3_XXS | — | planar3 | ~90K | --profile qwen36-27b-iq3 |
| 16 GB | Qwen3.5-27B | UD-Q3_K_XL | 6.38 | planar3 | ~36K | --profile qwen-q3 |
| 16 GB | Qwen3.5-27B | UD-IQ3_XXS | 6.62 | planar3 | ~74K | --profile qwen-q3-xxs |
| 16 GB | Gemma4 26B | UD-Q3_K_M | — | planar4 | ~40K | --profile gemma-q3 |
PPL values are f16 baseline (wikitext-2). All profiles auto-download the correct model on first run.
For serving multiple concurrent users, trade context length for parallel slots. During decode, single-user inference is memory-bandwidth bound (mat-vec). With N parallel slots, the weight multiplications become mat-mat and tensor cores engage — aggregate throughput scales near-linearly.
| Slots | Per-slot tok/s | Aggregate tok/s | Profile |
|---|---|---|---|
| 1 | 193 | 193 | --profile qwen |
| 2 | 150 | 299 | --profile qwen (default) |
| 4 | 99 | 397 | --profile qwen36-throughput |
| 8 | 54 | 431 | --profile qwen36-throughput |
P=4 is the practical optimum: 397 tok/s aggregate at 99 tok/s per slot.
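To see the slot scaling yourself, here is a minimal load sketch in Python. It assumes the server is already up (e.g. via the throughput profile below) with `N_PARALLEL` ≥ 4, and that responses carry an OpenAI-style `usage` block, as llama.cpp's server provides:

```python
import threading
import time

import requests

# Fire one request per slot and report per-slot and aggregate tok/s.
PAYLOAD = {
    "model": "qwen",
    "messages": [{"role": "user", "content": "Write a limerick about GPUs."}],
    "max_tokens": 256,
}
rates = []

def worker():
    t0 = time.time()
    r = requests.post("http://localhost:8080/v1/chat/completions",
                      json=PAYLOAD, timeout=600)
    r.raise_for_status()
    tokens = r.json()["usage"]["completion_tokens"]
    rates.append(tokens / (time.time() - t0))   # list.append is thread-safe

threads = [threading.Thread(target=worker) for _ in range(4)]  # P=4 slots
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"per-slot tok/s: {[round(x) for x in rates]}, aggregate: {sum(rates):.0f}")
```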
```bash
docker compose --profile qwen36-throughput up   # 8 slots × 65K ctx
```

RotorQuant is the multiplier: planar4 KV at 16K ctx is 1.06 GB/slot vs 4.0 GB/slot for f16, i.e. 3.8× more concurrent users in the same VRAM.
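The slot counts in the table below are roughly this arithmetic. A sizing sketch: the 4.0 GB f16 figure comes from the line above, the compression factors from the benchmark tables, and the free-VRAM budget in the usage line is a hypothetical:

```python
# KV-cache sizing sketch: how many parallel slots fit in a VRAM budget.
F16_GB_PER_16K_SLOT = 4.0   # measured f16 figure quoted above
COMPRESSION = {"f16": 1.0, "iso4": 3.8, "planar4": 3.8, "iso3": 4.9, "planar3": 4.9}

def kv_gb_per_slot(ctx: int, kv_type: str) -> float:
    """KV-cache footprint of a single slot at the given context length."""
    return F16_GB_PER_16K_SLOT * (ctx / 16384) / COMPRESSION[kv_type]

def max_slots(free_vram_gb: float, ctx: int, kv_type: str) -> int:
    """Slots that fit in the VRAM left over after weights and activations."""
    return int(free_vram_gb // kv_gb_per_slot(ctx, kv_type))

print(f"{kv_gb_per_slot(16384, 'planar4'):.2f} GB/slot")  # ~1.05, cf. 1.06 above
print(max_slots(7.0, 16384, "planar4"))                   # 6 slots on a 7 GB budget
```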
| GPU | Slots (planar4) | Slots (f16) | Est. Aggregate tok/s | Command |
|---|---|---|---|---|
| 24 GB | 6 | 1 | ~320 | N_PARALLEL=6 docker compose --profile qwen-throughput up |
| 32 GB | 14 | 3 | ~660 | docker compose --profile qwen-throughput up |
| 40 GB | 22 | 5 | ~880 | N_PARALLEL=22 docker compose --profile qwen-throughput up |
| 80 GB | 59 | 15 | ~2,400 | N_PARALLEL=59 docker compose --profile qwen-throughput up |
```bash
make run-throughput   # Qwen3.5-27B, 14 slots × 16K, RTX 5090
```

Prerequisites:

- Docker 24+
- Docker Compose v2.20+
- NVIDIA Container Toolkit (`nvidia-ctk`)
- NVIDIA driver 570+ (CUDA 13.1 support)
- GPU with 16+ GB VRAM (RTX 5060/4060 Ti with Q3 quants, RTX 4090+, A100, H100)
Benchmarked on RTX 5090 (32 GB). Full results in docs/BENCHMARK-REPORT.md.
| Metric | iso3 (default) | iso4 | planar3 | f16 |
|---|---|---|---|---|
| Decode speed (1 slot) | 196 tok/s | ~196 tok/s | ~196 tok/s | ~196 tok/s |
| KV compression | 4.9x | 3.8x | 4.9x | 1x |
| Max ctx (RTX 5090, 2 slots) | 262K | ~190K | 262K | ~30K |
| PPL wikitext-2 (ctx=2048) | 6.25 | 6.23 | 6.29 | 6.13 |
| NIAH recall | — | — | — | — |
`iso3` and `iso4` both beat `planar` on this Gated Delta Net MoE architecture. `iso3` is the default: near-identical PPL to `iso4` (6.25 vs 6.23) with 4.9× vs 3.8× compression, so more context.
| Metric | planar3 (default) | planar4 | iso3 | f16 |
|---|---|---|---|---|
| Decode speed | ~67 tok/s | ~67 tok/s | 67.5 tok/s | 69.3 tok/s |
| KV compression | 4.9x | 3.8x | 4.9x | 1x |
| Max ctx (RTX 5090) | 300K | 236K | 300K | ~45K |
| PPL wikitext-2 (ctx=2048) | 7.01 | 7.02 | 7.10 | 6.64 |
| NIAH recall (4K) | 100% | 100% | 100% | 100% |
`planar3` wins on all dense models: same compression as `iso3` (4.9×) but lower PPL.
```
.
├── docker-compose.yml            # Compose profiles: qwen, reasoning, gemma
├── Makefile                      # build, run-qwen, run-reasoning, run-gemma, test, bench
├── docker/
│   ├── Dockerfile                # Multi-stage CUDA 13.1 build from source
│   ├── entrypoint.sh             # Model download + config + server launch
│   └── test.sh                   # Automated smoke test
├── scripts/
│   ├── battery_test.py           # Performance + competence test suite
│   ├── benchmark_rotorquant.py   # Full benchmark (throughput, PPL, NIAH)
│   └── bench_tps.sh              # Quick tok/s benchmark via llama-bench
├── turboquant/                   # Python TurboQuant library (research)
│   ├── core.py                   # TurboQuantMSE, TurboQuantProd
│   ├── kv_cache.py               # HuggingFace DynamicCache subclass
│   └── ...
├── tests/                        # Python unit tests (pytest)
└── docs/
    ├── BENCHMARK-REPORT.md       # Full benchmark results
    └── sprints/                  # Sprint planning documents
```
RotorQuant replaces the dense d×d rotation matrix in TurboQuant (ICLR 2026) with Clifford algebra rotors operating on groups of 3-4 dimensions. This gives:
- 44x fewer parameters (372 vs 16,384 for d=128)
- 10-19x faster fused CUDA kernels (no cuBLAS GEMM, all in-register)
- Identical attention fidelity on real model data
The KV cache is compressed from 16-bit to 3-bit per element, reducing memory by ~5x. Decode speed is unaffected because attention computation dominates — the dequantization overhead is hidden.
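For intuition, here is a toy NumPy sketch of the group-rotor idea. The group size, plane-rotation parameterization, and quantization grid are illustrative assumptions, not the shipped CUDA kernel:

```python
import numpy as np

# Rotate small groups of dimensions with an orthogonal transform before
# 3-bit quantization, instead of one dense d x d rotation (TurboQuant-style).

def rotor_4d(angles: np.ndarray) -> np.ndarray:
    """4x4 rotation from two independent plane rotations (a rotor stand-in)."""
    c0, s0 = np.cos(angles[0]), np.sin(angles[0])
    c1, s1 = np.cos(angles[1]), np.sin(angles[1])
    r = np.eye(4)
    r[0, 0], r[0, 1], r[1, 0], r[1, 1] = c0, -s0, s0, c0  # rotate plane (0,1)
    r[2, 2], r[2, 3], r[3, 2], r[3, 3] = c1, -s1, s1, c1  # rotate plane (2,3)
    return r

def quant_dequant_3bit(x: np.ndarray) -> np.ndarray:
    """Symmetric 3-bit round-trip with one scale per group (levels -3..3)."""
    scale = np.abs(x).max() / 3.0 + 1e-8
    return np.clip(np.round(x / scale), -3, 3) * scale

d, group = 128, 4
rng = np.random.default_rng(0)
v = rng.standard_normal(d).astype(np.float32)         # one KV head vector
out = np.empty_like(v)
for g in range(d // group):
    sl = slice(g * group, (g + 1) * group)
    R = rotor_4d(rng.uniform(0, 2 * np.pi, size=2))   # per-group rotor
    out[sl] = R.T @ quant_dequant_3bit(R @ v[sl])     # rotate, quantize, un-rotate

print("rms error:", float(np.sqrt(np.mean((out - v) ** 2))))
```

Each 4-dim group needs only a handful of rotation parameters instead of a share of a dense d×d matrix, which is where the parameter and kernel savings come from.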
MIT