
RotorQuant LLM Server

NOTE: Gemma models have not been tested for PPL and quality. YMMV.

Serve Qwen3.6-35B-A3B, Qwen3.6-27B, Qwen3.5-27B, Gemma 4 26B, and other large models at long context on a single consumer GPU, using RotorQuant KV cache compression (4.9x at 3-bit, 97% of fp16 decode speed).

KV defaults: iso3 for MoE models (Qwen3.6-35B-A3B), planar3 for dense models (Qwen3.6-27B, Qwen3.5-27B). Both are 3.125 bits per element (4.9x compression). Benchmarks show the rotation type matters: iso beats planar on MoE; planar beats iso on dense.

Inference engineering reference

Before doing perf work or planning a new substrate, read:

  • docs/INFERENCE_LESSONS.md — verdicts on optimization techniques (what works, what doesn't, and when), a profile-first decision tree, and roadmap recommendations carried over from the vortex Sprint 024 investigation.
  • docs/THROUGHPUT_CONFIGURATION_MODEL.md — bandwidth-vs-compute regime model for picking resident-vs-streaming × N-parallel × ctx configurations.

Constants in those docs were measured on RTX 4090 / Qwen 27B; recalibrate on RTX 5090 / Qwen 3.6 before citing.

Quick Start

# 1. Build
make build

# 2. Run (first run downloads the model, ~22 GB)
make run-qwen

# 3. Query
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen","messages":[{"role":"user","content":"Hello!"}],"temperature":1.0,"top_p":0.95,"max_tokens":500}'

Available Models

| Command | Model | Weights | Active Params | KV default |
|---|---|---|---|---|
| make run-qwen | Qwen3.6-35B-A3B (default) | UD-Q4_K_XL (20.8 GB) | 3B active (MoE) | iso3 |
| docker compose --profile qwen36-q3 up | Qwen3.6-35B-A3B | UD-Q3_K_XL (16.8 GB) | 3B active (MoE) | planar4 |
| docker compose --profile qwen36-iq3 up | Qwen3.6-35B-A3B | UD-IQ3_XXS (13.2 GB) | 3B active (MoE) | planar4 |
| make run-qwen36-27b | Qwen3.6-27B (dense) | UD-Q4_K_XL (16.4 GB) | 27B dense | planar3 |
| docker compose --profile qwen36-27b-q3 up | Qwen3.6-27B | UD-Q3_K_XL (~12 GB) | 27B dense | planar3 |
| docker compose --profile qwen36-27b-iq3 up | Qwen3.6-27B | UD-IQ3_XXS (~9 GB) | 27B dense | planar3 |
| make run-reasoning | Qwen3.5-27B Claude Opus Distilled | i1-Q4_K_M (16.6 GB) | 27B dense | planar3 |
| make run-gemma | Gemma 4 26B MoE | UD-Q4_K_XL (17.1 GB) | 3.8B active | planar4 |

Or use Docker Compose directly:

docker compose --profile qwen up           # Qwen3.6-35B-A3B Q4 (32 GB+), iso3
docker compose --profile qwen36-q3 up     # Qwen3.6-35B-A3B Q3 (24 GB)
docker compose --profile qwen36-iq3 up    # Qwen3.6-35B-A3B IQ3 (16 GB)
docker compose --profile qwen36-27b up    # Qwen3.6-27B dense (24 GB+), planar3
docker compose --profile qwen36-27b-q3 up # Qwen3.6-27B Q3 (16 GB)
docker compose --profile reasoning up     # Qwen3.5-27B reasoning-tuned
docker compose --profile gemma up         # Gemma 4 MoE

Recommended Sampling (Qwen3.6)

| Use Case | Temperature | Top-P | Top-K | Presence Penalty |
|---|---|---|---|---|
| General | 1.0 | 0.95 | 20 | 1.5 |
| Coding / precise | 0.6 | 0.95 | 20 | 0.0 |
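
As a request payload, the coding preset from the table looks like this (a sketch: the server appears to be llama.cpp-based, which accepts top_k and presence_penalty alongside the standard OpenAI fields, but check that your client library passes them through):

# "Coding / precise" sampling preset as OpenAI-style request fields.
coding_request = {
    "model": "qwen",
    "messages": [{"role": "user", "content": "Write a binary search in Python."}],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "presence_penalty": 0.0,
}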

Configuration

Override defaults via environment variables:

# Reduce context for lower VRAM GPUs
CTX_SIZE=32768 make run-qwen

# Disable RotorQuant (fp16 KV cache)
KV_CACHE_TYPE=f16 make run-qwen

# Custom port
PORT=9090 make run-qwen

# Gated model access
HF_TOKEN=hf_xxx make run-reasoning
| Variable | Default | Description |
|---|---|---|
| KV_CACHE_TYPE | per-profile | KV cache type: iso3 (MoE default), planar3 (dense default), planar4, iso4, f16 |
| CTX_SIZE | per-model | Context window (e.g. 114688 for Q4_K_M on 24 GB) |
| PORT | 8080 | API port |
| GPU_LAYERS | 99 | Layers on GPU (99 = all) |
| N_PARALLEL | 2 | Concurrent request slots (set higher for throughput mode) |
| CACHE_RAM | 8192 | Prompt cache size in MiB (system RAM, not VRAM) |
| HF_TOKEN | | HuggingFace token for gated models |

Best Config by GPU

Recommended configurations per VRAM tier, based on measured perplexity (see docs/BENCHMARK-REPORT.md and docs/QUANTIZATION-GUIDE.md for full data).

32 GB+ (RTX 5090) — Recommended Default

Use Qwen3.6-35B-A3B (make run-qwen). 196 tok/s decode, 262K context, iso3 KV (4.9x compression).

| Use Case | Quant | Size | KV | Context | Command |
|---|---|---|---|---|---|
| Default | UD-Q4_K_XL | 20.8 GB | iso3 | 262K | make run-qwen |
| Max context | UD-Q4_K_XL | 20.8 GB | iso3 | 262K × 2 slots | default |
| Max throughput | UD-Q4_K_XL | 20.8 GB | iso3 | 65K × 8 slots | --profile qwen36-throughput |

24 GB (RTX 4090, RTX 3090) — Qwen3.6-27B Dense

Use Qwen3.6-27B (--profile qwen36-27b). 131K context, planar3 KV (4.9x compression).

| Use Case | Quant | Size | KV | Context |
|---|---|---|---|---|
| Recommended | UD-Q4_K_XL | 16.4 GB | planar3 | 131K |
| More context | UD-Q3_K_XL | ~12 GB | planar3 | ~180K |
| Max context | UD-IQ3_XXS | ~9 GB | planar3 | ~240K |

16 GB (RTX 4060 Ti, RTX 5060, RTX 4080) — Qwen3.5-27B Quants

Qwen3.5-27B Q4_K_M doesn't fit in 16 GB. Use the Unsloth imatrix quants of Qwen3.5-27B instead:

| Use Case | Quant | Size | PPL | Context (planar3) |
|---|---|---|---|---|
| Best quality | IQ4_XS | 13.9 GB | 6.29 | ~18K |
| Recommended | UD-Q3_K_XL | 13.4 GB | 6.38 | ~36K |
| Max context | UD-IQ3_XXS | 10.7 GB | 6.62 | ~74K |

UD-Q3_K_XL is the sweet spot: only +0.09 PPL over IQ4_XS for 2x the context.

Quick Reference

| GPU | Model | Quant | PPL | KV | Context | Command |
|---|---|---|---|---|---|---|
| 32 GB | Qwen3.6-35B-A3B | UD-Q4_K_XL | 6.13 | iso3 | 262K | make run-qwen |
| 24 GB | Qwen3.6-27B | UD-Q4_K_XL | 7.09 | planar3 | 131K | --profile qwen36-27b |
| 24 GB | Qwen3.6-27B | UD-Q3_K_XL | | planar3 | ~180K | --profile qwen36-27b-q3 |
| 16 GB | Qwen3.6-35B-A3B | UD-IQ3_XXS | | planar4 | ~32K | --profile qwen36-iq3 |
| 16 GB | Qwen3.6-27B | UD-IQ3_XXS | | planar3 | ~90K | --profile qwen36-27b-iq3 |
| 16 GB | Qwen3.5-27B | UD-Q3_K_XL | 6.38 | planar3 | ~36K | --profile qwen-q3 |
| 16 GB | Qwen3.5-27B | UD-IQ3_XXS | 6.62 | planar3 | ~74K | --profile qwen-q3-xxs |
| 16 GB | Gemma4 26B | UD-Q3_K_M | | planar4 | ~40K | --profile gemma-q3 |

PPL values are measured with an f16 KV cache on wikitext-2. All profiles auto-download the correct model on first run.

Throughput Mode (Parallel Slots)

For serving multiple concurrent users, trade context length for parallel slots. During decode, single-user inference is memory-bandwidth bound (mat-vec). With N parallel slots, the weight multiplications become mat-mat and tensor cores engage — aggregate throughput scales near-linearly.

Qwen3.6-35B-A3B MoE — measured on RTX 5090 (iso3, 65K ctx/slot)

| Slots | Per-slot tok/s | Aggregate tok/s | Profile |
|---|---|---|---|
| 1 | 193 | 193 | --profile qwen |
| 2 | 150 | 299 | --profile qwen (default) |
| 4 | 99 | 397 | --profile qwen36-throughput |
| 8 | 54 | 431 | --profile qwen36-throughput |

P=4 is the practical optimum: 397 tok/s aggregate at 99 tok/s per slot.

docker compose --profile qwen36-throughput up   # 8 slots × 65K ctx
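
To observe the aggregate scaling from the client side, a load sketch like the following works (illustrative: the slot count, prompt, and max_tokens are assumptions; it relies only on the documented endpoint and the OpenAI-style usage field in the response):

import concurrent.futures, json, time, urllib.request

URL = "http://localhost:8080/v1/chat/completions"
SLOTS = 4  # match the profile's N_PARALLEL

def one_request(i):
    # Each in-flight request occupies one server slot until it completes.
    body = json.dumps({
        "model": "qwen",
        "messages": [{"role": "user", "content": f"Write a short story #{i}."}],
        "max_tokens": 256,
    }).encode()
    req = urllib.request.Request(URL, body, {"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["usage"]["completion_tokens"]

start = time.time()
with concurrent.futures.ThreadPoolExecutor(SLOTS) as pool:
    total = sum(pool.map(one_request, range(SLOTS)))
print(f"aggregate: {total / (time.time() - start):.0f} tok/s")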

Qwen3.5-27B Dense — estimated on RTX 5090 (planar4, 16K ctx/slot)

RotorQuant is the multiplier: planar4 KV at 16K costs 1.06 GB/slot vs 4.0 GB/slot for f16, so 3.8x more concurrent users fit in the same VRAM.

| GPU | Slots (planar4) | Slots (f16) | Est. Aggregate tok/s | Command |
|---|---|---|---|---|
| 24 GB | 6 | 1 | ~320 | N_PARALLEL=6 docker compose --profile qwen-throughput up |
| 32 GB | 14 | 3 | ~660 | docker compose --profile qwen-throughput up |
| 40 GB | 22 | 5 | ~880 | N_PARALLEL=22 docker compose --profile qwen-throughput up |
| 80 GB | 59 | 15 | ~2,400 | N_PARALLEL=59 docker compose --profile qwen-throughput up |

make run-throughput   # Qwen3.5-27B, 14 slots × 16K, RTX 5090
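
The slot counts above follow from simple VRAM arithmetic. A back-of-envelope sketch using the per-slot figures quoted above (the ~1 GB runtime overhead is an assumption, so this lands within a slot of the table rather than reproducing it exactly):

WEIGHTS_GB = 16.6              # Qwen3.5-27B i1-Q4_K_M weights
F16_KV_GB_PER_SLOT = 4.0       # 16K ctx, f16 KV (quoted above)
PLANAR4_KV_GB_PER_SLOT = 1.06  # 16K ctx, planar4 KV (quoted above)
OVERHEAD_GB = 1.0              # activations, buffers -- assumed

def max_slots(vram_gb, kv_gb_per_slot):
    free = vram_gb - WEIGHTS_GB - OVERHEAD_GB
    return int(free // kv_gb_per_slot)

for vram in (24, 32, 40, 80):
    print(f"{vram} GB: {max_slots(vram, PLANAR4_KV_GB_PER_SLOT):2d} slots planar4"
          f" vs {max_slots(vram, F16_KV_GB_PER_SLOT)} slots f16")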

Requirements

  • Docker 24+
  • Docker Compose v2.20+
  • NVIDIA Container Toolkit (nvidia-ctk)
  • NVIDIA driver 570+ (CUDA 13.1 support)
  • GPU with 16+ GB VRAM (RTX 5060/4060 Ti with Q3 quants, RTX 4090+, A100, H100)

Performance

Benchmarked on RTX 5090 (32 GB). Full results in docs/BENCHMARK-REPORT.md.

Qwen3.6-35B-A3B MoE (default, --profile qwen)

| Metric | iso3 (default) | iso4 | planar3 | f16 |
|---|---|---|---|---|
| Decode speed (1 slot) | 196 tok/s | ~196 tok/s | ~196 tok/s | ~196 tok/s |
| KV compression | 4.9x | 3.8x | 4.9x | 1x |
| Max ctx (RTX 5090, 2 slots) | 262K | ~190K | 262K | ~30K |
| PPL wikitext-2 (ctx=2048) | 6.25 | 6.23 | 6.29 | 6.13 |
| NIAH recall | | | | |

iso3 and iso4 both beat planar on this Gated Delta Net MoE architecture. iso3 is the default: same context as iso4 with 4.9× vs 3.8× compression.

Qwen3.5-27B Q4_K_M (legacy dense, --profile reasoning)

| Metric | planar3 (default) | planar4 | iso3 | f16 |
|---|---|---|---|---|
| Decode speed | ~67 tok/s | ~67 tok/s | 67.5 tok/s | 69.3 tok/s |
| KV compression | 4.9x | 3.8x | 4.9x | 1x |
| Max ctx (RTX 5090) | 300K | 236K | 300K | ~45K |
| PPL wikitext-2 (ctx=2048) | 7.01 | 7.02 | 7.10 | 6.64 |
| NIAH recall (4K) | 100% | 100% | 100% | 100% |

planar3 wins on all dense models: same compression as iso3 (4.9×) but lower PPL.

Project Structure

.
├── docker-compose.yml        # Compose profiles: qwen, reasoning, gemma
├── Makefile                  # build, run-qwen, run-reasoning, run-gemma, test, bench
├── docker/
│   ├── Dockerfile            # Multi-stage CUDA 13.1 build from source
│   ├── entrypoint.sh         # Model download + config + server launch
│   └── test.sh               # Automated smoke test
├── scripts/
│   ├── battery_test.py       # Performance + competence test suite
│   ├── benchmark_rotorquant.py  # Full benchmark (throughput, PPL, NIAH)
│   └── bench_tps.sh          # Quick tok/s benchmark via llama-bench
├── turboquant/               # Python TurboQuant library (research)
│   ├── core.py               # TurboQuantMSE, TurboQuantProd
│   ├── kv_cache.py           # HuggingFace DynamicCache subclass
│   └── ...
├── tests/                    # Python unit tests (pytest)
└── docs/
    ├── BENCHMARK-REPORT.md   # Full benchmark results
    └── sprints/              # Sprint planning documents

How It Works

RotorQuant replaces the dense d×d rotation matrix in TurboQuant (ICLR 2026) with Clifford algebra rotors operating on groups of 3-4 dimensions. This gives:

  • 44x fewer parameters (372 vs 16,384 for d=128)
  • 10-19x faster fused CUDA kernels (no cuBLAS GEMM, all in-register)
  • Identical attention fidelity on real model data

The KV cache is compressed from 16-bit to 3-bit per element, reducing memory by ~5x. Decode speed is essentially unaffected (97% of fp16) because attention computation dominates and the dequantization overhead is hidden.
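
To make the mechanism concrete, here is an illustrative NumPy sketch of the rotate-then-quantize round trip. Everything in it is an assumption for illustration (random orthogonal 4×4 matrices standing in for learned Clifford rotors, a symmetric per-group scale, naive rounding); the repo's actual path uses learned rotors in fused CUDA kernels.

import numpy as np

GROUP = 4        # rotor acts on groups of 4 dims (assumed group size)
LEVELS = 2 ** 3  # 3-bit quantization -> 8 levels

def random_rotor(rng, n=GROUP):
    # Stand-in for a learned rotor: any orthogonal n x n matrix.
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def quantize_3bit(x):
    # Symmetric per-group scale; round to the nearest of 8 levels.
    scale = np.abs(x).max() / (LEVELS / 2 - 0.5) + 1e-12
    codes = np.clip(np.round(x / scale + LEVELS / 2 - 0.5), 0, LEVELS - 1)
    return codes.astype(np.uint8), scale

def dequantize_3bit(codes, scale):
    return (codes.astype(np.float32) - (LEVELS / 2 - 0.5)) * scale

rng = np.random.default_rng(0)
key = rng.standard_normal(128).astype(np.float32)  # one head's key vector
rotors = [random_rotor(rng) for _ in range(128 // GROUP)]

# Compress: rotate each group, then store 3-bit codes + one scale per group.
groups = key.reshape(-1, GROUP)
packed = [quantize_3bit(r @ g) for r, g in zip(rotors, groups)]

# Decompress: dequantize, then undo the rotation (orthogonal -> transpose).
restored = np.concatenate(
    [r.T @ dequantize_3bit(codes, scale)
     for r, (codes, scale) in zip(rotors, packed)])
print("max abs error:", float(np.abs(key - restored).max()))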

License

MIT
