A lightweight, cross-platform C inference engine for Qwen3-TTS text-to-speech models (0.6B and 1.7B). No Python, no PyTorch, no ONNX runtime — just C, a BLAS library, and raw model weights.
The engine runs the complete TTS pipeline: BPE tokenization, a 28-layer causal transformer (Talker), a multi-pass code predictor, and a convolutional speech decoder. Weights are memory-mapped directly from safetensors files in BF16, so loading is near-instant and memory usage stays low.
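BF16 is the upper 16 bits of an IEEE-754 float32, so weights read straight from a mapped file can be widened with a shift. Below is a minimal sketch of that conversion, assuming the standard BF16 layout. The function names are hypothetical and are not taken from this project's source.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Widen one BF16 value to float. BF16 keeps the sign, the 8 exponent bits
 * and the top 7 mantissa bits of a float32, so a 16-bit left shift
 * reconstructs the float32 bit pattern. (Hypothetical helper name.) */
float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);   /* type-pun without aliasing problems */
    return f;
}

/* Convert a run of BF16 values, e.g. one weight row inside a mapped file. */
void bf16_row_to_f32(const uint16_t *src, float *dst, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = bf16_to_f32(src[i]);
}
```

Converting rows on the fly like this is one way to keep the file in BF16 on disk and in the page cache while still feeding float32 to BLAS routines.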
All samples generated with the 0.6B model (RTF ~1.3–1.7, Apple M1):
| Language | Speaker | Sample | Text |
|---|---|---|---|
| English | ryan | listen | Hello, this is a test of the text to speech system. |
| Italian | ryan | listen | Buongiorno a tutti, questa e una dimostrazione del sistema di sintesi vocale. |
| Italian | vivian | listen | Buongiorno a tutti, questa e una dimostrazione del sistema di sintesi vocale. |
| Spanish | ryan | listen | Hola, esta es una demostracion del sistema de sintesis de voz. |
| Portuguese | ryan | listen | Ola, esta e uma demonstracao do sistema de sintese de voz. |
| French | ryan | listen | Bonjour a tous, ceci est une demonstration du systeme de synthese vocale. |
| German | ryan | listen | Guten Tag, dies ist eine Demonstration des Sprachsynthesesystems. |
| Japanese | Ono_Anna | listen | こんにちは、私の名前はアンナです。今日はとても良い天気ですね。東京の桜がとても綺麗です。 |
| Japanese | Ono_Anna | listen | 頑張れ、アンドレア!あなたならできるよ。毎日少しずつ前に進もう。夢を諦めないで。応援してるよ! |
Clone and play locally:
afplay samples/english_ryan.wav (macOS) or aplay samples/english_ryan.wav (Linux)
# Clone and build
git clone https://github.com/gabriele-mastrapasqua/qwen3-tts.git
cd qwen3-tts
make blas
# Download a model (interactive: small, large, voice-design, base-small, base-large)
./download_model.sh
# Synthesize speech
./qwen_tts -d qwen3-tts-0.6b --text "Hello, how are you today?" -o hello.wav

Dependencies: only a C compiler and BLAS (Accelerate on macOS, OpenBLAS on Linux). See docs/building.md for Linux, Windows/WSL2, and other build targets.
- Pure C, minimal dependencies: requires only a C compiler and BLAS. No Python runtime needed.
- Cross-platform: macOS (ARM/x86) and Linux (ARM/x86). NEON and AVX SIMD paths. Windows/WSL2 beta.
- Both model sizes: automatically detects 0.6B or 1.7B from the weight files.
- 9 preset voices: `ryan`, `vivian`, `serena`, `aiden`, `eric`, `dylan`, `uncle_fu`, `ono_anna`, `sohee`.
- 10 languages: English, Chinese, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian.
- Memory-mapped weights: BF16 safetensors mmap'd directly. 0.6B ~3 GB, 1.7B ~8 GB.
- Voice cloning: clone any voice from a short WAV clip. Save it as a reusable `.qvoice` profile.
- Custom voices with Delta `.qvoice`: bit-identical cloned voices on the CustomVoice model. Create once, use forever, with style control, streaming, and the server.
- Voice management: list, inspect, and delete `.qvoice` profiles (`--list-voices`, `--delete-voice`). No model required.
- Style control: `--instruct` for emotion/style on 1.7B (angry, whisper, cheerful, and more).
- VoiceDesign: create new voices from text descriptions.
- HTTP server: `/v1/tts`, `/v1/tts/stream`, and an OpenAI-compatible `/v1/audio/speech`.
- Streaming: real-time audio via `--stream` (WAV) or `--stdout` (raw PCM).
- INT8/INT4 quantization: up to ~15% speedup on 1.7B with `--int8`.
- Configurable sampling: temperature, top-k, top-p, and repetition penalty.
- 24 kHz WAV output: 16-bit PCM, mono.
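The output format listed above (24 kHz, 16-bit PCM, mono) maps onto the canonical 44-byte RIFF/WAVE header. The sketch below shows one way to write such a file. `wav_header` and `wav_write` are illustrative names, not the engine's actual functions, and the code assumes a plain PCM file with no extra chunks.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Store an integer little-endian, byte by byte, independent of host order. */
static void put_le(uint8_t *p, uint32_t v, int nbytes) {
    for (int i = 0; i < nbytes; i++)
        p[i] = (uint8_t)(v >> (8 * i));
}

/* Fill a 44-byte PCM WAV header for 24 kHz, 16-bit, mono audio. */
void wav_header(uint8_t hdr[44], uint32_t n_samples) {
    const uint32_t rate = 24000, channels = 1, bits = 16;
    const uint32_t block_align = channels * bits / 8;   /* bytes per frame */
    const uint32_t data_bytes = n_samples * block_align;

    memcpy(hdr, "RIFF", 4);
    put_le(hdr + 4, 36 + data_bytes, 4);      /* file size minus 8 */
    memcpy(hdr + 8, "WAVEfmt ", 8);
    put_le(hdr + 16, 16, 4);                  /* fmt chunk size */
    put_le(hdr + 20, 1, 2);                   /* format tag 1 = PCM */
    put_le(hdr + 22, channels, 2);
    put_le(hdr + 24, rate, 4);
    put_le(hdr + 28, rate * block_align, 4);  /* byte rate */
    put_le(hdr + 32, block_align, 2);
    put_le(hdr + 34, bits, 2);
    memcpy(hdr + 36, "data", 4);
    put_le(hdr + 40, data_bytes, 4);
}

/* Write header plus samples. Returns 0 on success, -1 on any I/O error. */
int wav_write(const char *path, const int16_t *pcm, uint32_t n_samples) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;

    uint8_t hdr[44];
    wav_header(hdr, n_samples);
    int ok = fwrite(hdr, 1, sizeof hdr, f) == sizeof hdr;

    for (uint32_t i = 0; ok && i < n_samples; i++) {
        uint8_t s[2];
        put_le(s, (uint16_t)pcm[i], 2);       /* s16le sample */
        ok = fwrite(s, 1, 2, f) == 2;
    }
    if (fclose(f) != 0) ok = 0;
    return ok ? 0 : -1;
}
```

Raw PCM output is the same sample stream without the header, which is why a player reading it has to be told the rate, encoding, and channel count explicitly.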
./qwen_tts [options]
Required:
-d, --model-dir <path> Model directory
--text <string> Text to synthesize
Optional:
-o, --output <path> Output WAV file (default: output.wav)
-s, --speaker <name> Speaker voice (default: ryan)
-l, --language <lang> Target language (default: English)
-I, --instruct <text> Style/emotion instruction (1.7B model only)
--temperature <f> Sampling temperature (default: 0.5)
--top-k <n> Top-k sampling (default: 50)
--top-p <f> Top-p nucleus sampling (default: 1.0)
--rep-penalty <f> Repetition penalty (default: 1.05)
--max-tokens <n> Max audio tokens (default: 8192)
--max-duration <secs> Max audio duration in seconds
--seed <n> Random seed for reproducible output
--ref-audio <path> Reference audio for voice cloning (Base model)
--save-voice <path> Save voice profile (.qvoice)
--load-voice <path> Load voice profile (.qvoice)
--target-cv <dir> CV model dir for delta encoding (bit-identical cross-model)
--list-voices <dir> List .qvoice files in directory (no model needed)
--delete-voice <path> Delete a .qvoice file
--voice-name <name> Name for the voice (stored in .qvoice metadata)
--voice-design VoiceDesign mode (create voice from --instruct)
--stream Stream audio (decode chunks during generation)
--stdout Output raw s16le PCM to stdout (implies --stream)
--int8 INT8 quantized (1.7B recommended)
--int4 Q4_0 quantized (1.7B only, experimental)
-j, --threads <n> Worker threads (default: 4)
--silent Suppress status output
--debug Verbose diagnostics
--serve <port> Start HTTP server
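The sampling options above are conventionally applied as a chain of logit transforms before a token is drawn. The sketch below illustrates three of them (repetition penalty, temperature, top-k) in their common textbook form. It is an assumption about how such options usually work, not a copy of this engine's sampler, and all function names are hypothetical. Top-p is left out because it operates on normalized probabilities rather than raw logits.

```c
#include <math.h>    /* INFINITY */
#include <stdlib.h>
#include <string.h>

/* qsort comparator: floats in descending order. */
static int cmp_desc(const void *a, const void *b) {
    float x = *(const float *)a, y = *(const float *)b;
    return (x < y) - (x > y);
}

/* Make already-generated tokens less likely: positive logits are divided
 * by the penalty, negative ones multiplied. `seen` must hold distinct ids,
 * otherwise a token would be penalized more than once. */
void apply_rep_penalty(float *logits, const int *seen, int n_seen, float penalty) {
    for (int i = 0; i < n_seen; i++) {
        float *l = &logits[seen[i]];
        *l = (*l > 0.0f) ? *l / penalty : *l * penalty;
    }
}

/* Divide by the temperature (must be > 0). Values below 1 sharpen the
 * distribution, values above 1 flatten it. */
void apply_temperature(float *logits, int n, float temperature) {
    for (int i = 0; i < n; i++)
        logits[i] /= temperature;
}

/* Keep the k largest logits and mask the rest with -INFINITY so they get
 * zero probability after softmax. Ties at the cutoff are all kept.
 * Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int apply_top_k(float *logits, int n, int k) {
    if (k <= 0 || k >= n) return 0;           /* nothing to filter */

    float *sorted = malloc((size_t)n * sizeof *sorted);
    if (!sorted) return -1;
    memcpy(sorted, logits, (size_t)n * sizeof *sorted);
    qsort(sorted, (size_t)n, sizeof *sorted, cmp_desc);
    float cutoff = sorted[k - 1];             /* k-th largest value */
    free(sorted);

    for (int i = 0; i < n; i++)
        if (logits[i] < cutoff) logits[i] = -INFINITY;
    return 0;
}
```

After these transforms, a sampler would typically apply softmax to the surviving logits and draw one token from the resulting distribution, seeded by `--seed` when reproducibility is needed.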
# Basic English
./qwen_tts -d qwen3-tts-0.6b --text "The quick brown fox jumps over the lazy dog." -o fox.wav
# Italian with a specific voice
./qwen_tts -d qwen3-tts-0.6b -s ryan -l Italian \
--text "Ciao, questa e una prova del sistema di sintesi vocale." -o test_it.wav
# Style/emotion control (1.7B only)
./qwen_tts -d qwen3-tts-1.7b -s ryan -l English \
--text "I cannot believe you did that to me." \
--instruct "Speak in a very angry and aggressive tone" -o angry.wav
# Reproducible output with seed
./qwen_tts -d qwen3-tts-0.6b --text "Hello world" --seed 42 -o hello.wav

Clone any voice from a reference audio clip. Requires a Base model.
# Clone a voice
./qwen_tts -d qwen3-tts-0.6b-base --ref-audio reference.wav \
--text "Hello, this is my cloned voice." -o cloned.wav

Full guide: reference audio tips, model comparison, samples → docs/voice-cloning.md
The killer feature: clone a voice once, save it as a .qvoice with --target-cv,
and use it forever on the CustomVoice model — bit-identical to the original clone.
Works with --instruct, streaming, and the HTTP server.
# Create (one-time: needs both Base + CV models)
./qwen_tts -d qwen3-tts-0.6b-base --ref-audio mario.wav -l Italian \
--voice-name "Mario" --target-cv qwen3-tts-0.6b \
--save-voice mario.qvoice
# Use forever (only CV model + .qvoice needed)
./qwen_tts -d qwen3-tts-0.6b --load-voice mario.qvoice \
--text "Ciao, come stai?" -o output.wav
# On the server
./qwen_tts -d qwen3-tts-0.6b --load-voice mario.qvoice --serve 8080
# Manage voice profiles (no model needed)
./qwen_tts --list-voices ./my_voices/
./qwen_tts --delete-voice ./my_voices/old.qvoice

Voice clone samples, all generated via .qvoice delta on 0.6B CustomVoice:
| Language | Voice | Source | Output | Text |
|---|---|---|---|---|
| Italian | Pirandello Reader | LibriVox Public Domain | input → clone | Buongiorno a tutti, questa e una dimostrazione della clonazione vocale. |
| English | Sarac (F) | LibriTTS-R CC-BY | listen | Good morning everyone, this is a demonstration of voice cloning using a custom voice profile. |
| English | Peter (M) | LibriTTS-R CC-BY | listen | I love reading books aloud, there is something magical about bringing stories to life with your voice. |
| French | Baudelaire Reader | LibriVox Public Domain | listen | Bonjour a tous, ceci est une demonstration du clonage vocal avec un profil de voix personnalise. |
| Spanish | Lu | LibriVox Public Domain | listen | Buenos dias a todos, esta es una demostracion de la clonacion de voz con un perfil de voz personalizado. |
Full guide: delta vs standard, format internals, troubleshooting → docs/custom-voices.md
# Start server
./qwen_tts -d qwen3-tts-0.6b --serve 8080
# Generate speech
curl -s http://localhost:8080/v1/tts \
-d '{"text":"Hello, how are you?"}' -o output.wav
# Stream with real-time playback
curl -sN http://localhost:8080/v1/tts/stream \
-d '{"text":"Hello, how are you?"}' | \
play -t raw -r 24000 -e signed -b 16 -c 1 -
# OpenAI-compatible endpoint
curl -s http://localhost:8080/v1/audio/speech \
-d '{"input":"Hello world","voice":"ryan"}' -o output.wav

Full guide: all endpoints, request body, performance → docs/server.md
# Stream to WAV file
./qwen_tts -d qwen3-tts-0.6b --text "Hello world" --stream -o hello.wav
# Pipe raw PCM to audio player
./qwen_tts -d qwen3-tts-0.6b --text "Hello world" --stdout | \
play -t raw -r 24000 -e signed -b 16 -c 1 -

Text --> BPE Tokenizer --> Talker (LLM) --> Code Predictor --> Speech Decoder --> 24 kHz WAV
| Component | What it does |
|---|---|
| Talker | 28-layer Qwen3 transformer with GQA, RoPE, SwiGLU. Generates one audio frame token per step. |
| Code Predictor | 5-layer transformer running 15 sequential passes per frame. Predicts the remaining 15 codebook entries. |
| Speech Decoder | Causal ConvNet with 16-codebook RVQ dequantization and 480x upsampling. Converts codes to waveform. |
| | 0.6B | 1.7B |
|---|---|---|
| Hidden dim | 1024 | 2048 |
| Heads (Q/KV) | 16/8 | 16/8 |
| Layers | 28 | 28 |
| Code Predictor | 1024 hidden, 5 layers | 1024 hidden, 5 layers (+2048→1024 projection) |
| Memory | ~3 GB | ~8 GB |
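The 16/8 head split in the table means grouped-query attention pairs every two query heads with one shared KV head. The sketch below shows the usual index mapping and its effect on KV cache size. The head count and layer count come from the table; `head_dim` is not stated in this README, so it is left as a caller-supplied parameter, and the function names are hypothetical.

```c
#include <stddef.h>

enum { N_Q_HEADS = 16, N_KV_HEADS = 8, N_LAYERS = 28 };  /* from the table */

/* KV head used by query head `q`. Consecutive query heads share one KV
 * head, which is the conventional GQA grouping (an assumption here). */
int gqa_kv_head(int q) {
    return q / (N_Q_HEADS / N_KV_HEADS);
}

/* Number of cached elements (keys plus values) for `seq_len` positions.
 * `head_dim` is supplied by the caller because the README does not
 * state it. */
size_t kv_cache_elems(size_t head_dim, size_t seq_len) {
    return (size_t)2 * N_LAYERS * N_KV_HEADS * head_dim * seq_len;
}
```

With 8 KV heads instead of 16, the cache is half the size it would be under full multi-head attention at the same head dimension.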
Benchmarked on Apple M1 8-core, 16 GB RAM, 4 threads (make bench-full):
| Config | 0.6B RTF | 1.7B BF16 RTF | 1.7B INT8 RTF |
|---|---|---|---|
| CLI short | 1.37–1.69 | 4.10–4.40 | 3.69 |
| CLI long | 1.29–1.32 | 1.97–2.11 | 2.15 |
| CLI stream short | 1.31 | 2.59–4.01 | — |
| CLI stream long | 1.30–1.33 | 2.06–2.43 | — |
| Server (cold) | 1.34 | — | — |
| Server (warm) | 1.33 | — | — |
RTF = processing_time / audio_duration. Lower is better; <1.0 = faster than real-time.
Longer audio improves RTF (fixed costs amortize): 0.6B long text reaches RTF 1.29.
Streaming performs on par with normal mode. --int8 gives ~11% speedup on 1.7B for short text,
with no gain on long text in the table above (details).
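The RTF definition above can be computed directly from the elapsed time and the number of output samples. A small sketch follows; `rtf` is an illustrative helper, not a function from this project.

```c
#include <stddef.h>

/* Real-time factor: processing time divided by audio duration, where the
 * duration is n_samples / sample_rate. Returns -1.0 for empty or invalid
 * input instead of dividing by zero. */
double rtf(double processing_secs, size_t n_samples, int sample_rate) {
    if (n_samples == 0 || sample_rate <= 0) return -1.0;
    double audio_secs = (double)n_samples / (double)sample_rate;
    return processing_secs / audio_secs;
}
```

For example, 6.5 s of processing for 120,000 samples at 24 kHz (5 s of audio) gives an RTF of 1.3.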
Run benchmarks on your machine:
make bench # Quick: short+long, normal+stream (both models)
make bench-full # Full: + server, instruct, INT8, .qvoice (if available)

vs other implementations:
| Implementation / hardware | 0.6B RTF | Notes |
|---|---|---|
| This project (C, Apple M1 CPU) | 1.26–1.39 | Pure C, no GPU |
| Python + PyTorch (Ryzen 9 7950X CPU) | 4.5–5.8 | Official Python, CPU-only |
| NVIDIA RTX 3090 | 0.52–0.68 | Python + PyTorch + FlashAttention 2 |
3–4x faster than Python on CPU, and roughly 2x slower than an RTX 3090, on a 2020 laptop without using a GPU.
Per-component breakdown, full GPU table, optimization history → docs/performance.md
| Guide | Contents |
|---|---|
| Voice Cloning | Reference audio tips, ECAPA-TDNN internals, model comparison, samples |
| Custom Voices | .qvoice format, delta vs standard, managing profiles, troubleshooting |
| HTTP Server | All endpoints, request body, streaming, server performance |
| VoiceDesign | Creating voices from text descriptions |
| Quantization | INT8/INT4, comparison table, recommendations |
| Performance | RTF benchmarks, component breakdown, CPU vs GPU, optimization history |
| Building | All platforms, build targets, testing |
| Post | Topic |
|---|---|
| Voice Cloning Internals | ECAPA-TDNN architecture deep-dive |
| Cross-Model Voice Analysis | Why delta format works (weight analysis) |
| Optimization Notes | RTF 3.5 → 1.3: the full optimization story |
- Salvatore Sanfilippo (antirez): This project wouldn't exist without his qwen-asr, a pure C Qwen2-Audio ASR engine that proved you can do real neural inference in plain C with mmap'd safetensors, BF16 NEON kernels, and zero dependencies. The entire architecture of this TTS engine (the approach, the style, the philosophy of minimal C inference) is directly inspired by his work. If you like this project, go star qwen-asr first.
- Michael Abrash: His Graphics Programming Black Book (1997) shaped how we think about performance. The chapters on data alignment, struct layout, and cache-friendly access patterns for the 386/486 are still relevant today: we got a 24% speedup from cache-line alignment (`posix_memalign(64)`), applying the same principles Abrash taught 30 years ago to modern SIMD and BLAS.
- John Carmack: His `.plan` files and QuakeCon talks on micro-optimization and cache friendliness were a constant reference. Where Abrash gave you the systematic rules and benchmarks, Carmack showed you the mindset: always think about how data flows through the CPU.
- Qwen3-TTS by the Qwen team at Alibaba: the model architecture, weights, and research. Models on Hugging Face. Paper.
- Qwen2.5 by the Qwen team: the base LLM architecture (GQA, RoPE, SwiGLU) used in the Talker and Code Predictor.
MIT