Qwen3-TTS Pure C Implementation

A lightweight, cross-platform C inference engine for Qwen3-TTS text-to-speech models (0.6B and 1.7B). No Python, no PyTorch, no ONNX runtime — just C, a BLAS library, and raw model weights.

The engine runs the complete TTS pipeline: BPE tokenization, a 28-layer causal transformer (Talker), a multi-pass code predictor, and a convolutional speech decoder. Weights are memory-mapped directly from safetensors files in BF16, so loading is near-instant and memory usage stays low.
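
Because weights stay in BF16 both on disk and in memory, kernels widen values to float32 on the fly. BF16 is simply the upper 16 bits of an IEEE-754 float32, so the scalar conversion is a single shift (a minimal sketch; the engine's actual kernels use NEON/AVX SIMD, and the function name here is illustrative):

```c
#include <stdint.h>
#include <string.h>

/* BF16 is the upper 16 bits of an IEEE-754 float32, so widening a
 * BF16 value is just a shift into the high half of a 32-bit word. */
static inline float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f); /* bit copy avoids strict-aliasing UB */
    return f;
}
```

This is also why mmap'd loading works: no conversion pass over the file is needed before inference starts.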

Audio Samples

All samples generated with the 0.6B model (RTF ~1.3–1.7, Apple M1):

| Language | Speaker | Sample | Text |
|----------|---------|--------|------|
| English | ryan | listen | Hello, this is a test of the text to speech system. |
| Italian | ryan | listen | Buongiorno a tutti, questa e una dimostrazione del sistema di sintesi vocale. |
| Italian | vivian | listen | Buongiorno a tutti, questa e una dimostrazione del sistema di sintesi vocale. |
| Spanish | ryan | listen | Hola, esta es una demostracion del sistema de sintesis de voz. |
| Portuguese | ryan | listen | Ola, esta e uma demonstracao do sistema de sintese de voz. |
| French | ryan | listen | Bonjour a tous, ceci est une demonstration du systeme de synthese vocale. |
| German | ryan | listen | Guten Tag, dies ist eine Demonstration des Sprachsynthesesystems. |
| Japanese | Ono_Anna | listen | こんにちは、私の名前はアンナです。今日はとても良い天気ですね。東京の桜がとても綺麗です。 |
| Japanese | Ono_Anna | listen | 頑張れ、アンドレア!あなたならできるよ。毎日少しずつ前に進もう。夢を諦めないで。応援してるよ! |

Clone and play locally: afplay samples/english_ryan.wav (macOS) or aplay samples/english_ryan.wav (Linux)

Quick Start

# Clone and build
git clone https://github.com/gabriele-mastrapasqua/qwen3-tts.git
cd qwen3-tts
make blas

# Download a model (interactive: small, large, voice-design, base-small, base-large)
./download_model.sh

# Synthesize speech
./qwen_tts -d qwen3-tts-0.6b --text "Hello, how are you today?" -o hello.wav

Dependencies: Only a C compiler and BLAS (Accelerate on macOS, OpenBLAS on Linux). See docs/building.md for Linux, Windows/WSL2, and other build targets.

Features

  • Pure C, minimal dependencies — Only requires a C compiler and BLAS. No Python runtime needed.
  • Cross-platform — macOS (ARM/x86) and Linux (ARM/x86). NEON and AVX SIMD paths. Windows/WSL2 beta.
  • Both model sizes — Automatically detects 0.6B or 1.7B from weight files.
  • 9 preset voices — ryan, vivian, serena, aiden, eric, dylan, uncle_fu, ono_anna, sohee.
  • 10 languages — English, Chinese, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian.
  • Memory-mapped weights — BF16 safetensors mmap'd directly. 0.6B ~3 GB, 1.7B ~8 GB.
  • Voice cloning — Clone any voice from a short WAV clip. Save as reusable .qvoice profile.
  • Custom voices with Delta .qvoice — Bit-identical cloned voices on CustomVoice model. Create once, use forever — with style control, streaming, server.
  • Voice management — List, inspect, delete .qvoice profiles (--list-voices, --delete-voice). No model required.
  • Style control — --instruct for emotion/style on 1.7B: angry, whisper, cheerful, and more.
  • VoiceDesign — Create new voices from text descriptions.
  • HTTP server — /v1/tts, /v1/tts/stream, OpenAI-compatible /v1/audio/speech.
  • Streaming — Real-time audio via --stream (WAV) or --stdout (raw PCM).
  • INT8/INT4 quantization — 15% speedup on 1.7B with --int8.
  • Configurable sampling — Temperature, top-k, top-p, and repetition penalty.
  • 24 kHz WAV output — 16-bit PCM, mono.
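
The 24 kHz, 16-bit mono output corresponds to a standard 44-byte RIFF/WAVE header ahead of the raw PCM data. A minimal sketch of such a writer (hypothetical, assuming a little-endian host; the engine's own code may differ in detail):

```c
#include <stdio.h>
#include <stdint.h>

/* Write a canonical 44-byte RIFF/WAVE header for 24 kHz, 16-bit,
 * mono PCM. Field values are written in native byte order, so this
 * assumes a little-endian host (as WAV fields are little-endian). */
static void write_wav_header(FILE *f, uint32_t num_samples) {
    const uint32_t sample_rate = 24000;
    const uint16_t channels = 1, bits = 16, pcm = 1;
    const uint16_t block_align = channels * bits / 8;  /* bytes per frame */
    const uint32_t byte_rate   = sample_rate * block_align;
    const uint32_t data_bytes  = num_samples * block_align;
    const uint32_t riff_size   = 36 + data_bytes;
    const uint32_t fmt_size    = 16;

    fwrite("RIFF", 1, 4, f); fwrite(&riff_size, 4, 1, f);
    fwrite("WAVE", 1, 4, f);
    fwrite("fmt ", 1, 4, f); fwrite(&fmt_size, 4, 1, f);
    fwrite(&pcm, 2, 1, f);         fwrite(&channels, 2, 1, f);
    fwrite(&sample_rate, 4, 1, f); fwrite(&byte_rate, 4, 1, f);
    fwrite(&block_align, 2, 1, f); fwrite(&bits, 2, 1, f);
    fwrite("data", 1, 4, f); fwrite(&data_bytes, 4, 1, f);
}
```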

Usage

./qwen_tts [options]

Required:
  -d, --model-dir <path>     Model directory
  --text <string>            Text to synthesize

Optional:
  -o, --output <path>        Output WAV file (default: output.wav)
  -s, --speaker <name>       Speaker voice (default: ryan)
  -l, --language <lang>      Target language (default: English)
  -I, --instruct <text>      Style/emotion instruction (1.7B model only)
  --temperature <f>          Sampling temperature (default: 0.5)
  --top-k <n>                Top-k sampling (default: 50)
  --top-p <f>                Top-p nucleus sampling (default: 1.0)
  --rep-penalty <f>          Repetition penalty (default: 1.05)
  --max-tokens <n>           Max audio tokens (default: 8192)
  --max-duration <secs>      Max audio duration in seconds
  --seed <n>                 Random seed for reproducible output
  --ref-audio <path>         Reference audio for voice cloning (Base model)
  --save-voice <path>        Save voice profile (.qvoice)
  --load-voice <path>        Load voice profile (.qvoice)
  --target-cv <dir>          CV model dir for delta encoding (bit-identical cross-model)
  --list-voices <dir>        List .qvoice files in directory (no model needed)
  --delete-voice <path>      Delete a .qvoice file
  --voice-name <name>        Name for the voice (stored in .qvoice metadata)
  --voice-design             VoiceDesign mode (create voice from --instruct)
  --stream                   Stream audio (decode chunks during generation)
  --stdout                   Output raw s16le PCM to stdout (implies --stream)
  --int8                     INT8 quantized (1.7B recommended)
  --int4                     Q4_0 quantized (1.7B only, experimental)
  -j, --threads <n>          Worker threads (default: 4)
  --silent                   Suppress status output
  --debug                    Verbose diagnostics
  --serve <port>             Start HTTP server

Examples

# Basic English
./qwen_tts -d qwen3-tts-0.6b --text "The quick brown fox jumps over the lazy dog." -o fox.wav

# Italian with a specific voice
./qwen_tts -d qwen3-tts-0.6b -s ryan -l Italian \
    --text "Ciao, questa e una prova del sistema di sintesi vocale." -o test_it.wav

# Style/emotion control (1.7B only)
./qwen_tts -d qwen3-tts-1.7b -s ryan -l English \
    --text "I cannot believe you did that to me." \
    --instruct "Speak in a very angry and aggressive tone" -o angry.wav

# Reproducible output with seed
./qwen_tts -d qwen3-tts-0.6b --text "Hello world" --seed 42 -o hello.wav

Voice Cloning

Clone any voice from a reference audio clip. Requires a Base model.

# Clone a voice
./qwen_tts -d qwen3-tts-0.6b-base --ref-audio reference.wav \
    --text "Hello, this is my cloned voice." -o cloned.wav

Full guide: reference audio tips, model comparison, samples → docs/voice-cloning.md

Custom Voices with Delta .qvoice

The killer feature: clone a voice once, save it as a .qvoice with --target-cv, and use it forever on the CustomVoice model — bit-identical to the original clone. Works with --instruct, streaming, and the HTTP server.

# Create (one-time: needs both Base + CV models)
./qwen_tts -d qwen3-tts-0.6b-base --ref-audio mario.wav -l Italian \
    --voice-name "Mario" --target-cv qwen3-tts-0.6b \
    --save-voice mario.qvoice

# Use forever (only CV model + .qvoice needed)
./qwen_tts -d qwen3-tts-0.6b --load-voice mario.qvoice \
    --text "Ciao, come stai?" -o output.wav

# On the server
./qwen_tts -d qwen3-tts-0.6b --load-voice mario.qvoice --serve 8080

# Manage voice profiles (no model needed)
./qwen_tts --list-voices ./my_voices/
./qwen_tts --delete-voice ./my_voices/old.qvoice

Voice clone samples — all generated via .qvoice delta on 0.6B CustomVoice:

| Language | Voice | Source | Output | Text |
|----------|-------|--------|--------|------|
| Italian | Pirandello Reader | LibriVox Public Domain | input, clone | Buongiorno a tutti, questa e una dimostrazione della clonazione vocale. |
| English | Sarac (F) | LibriTTS-R CC-BY | listen | Good morning everyone, this is a demonstration of voice cloning using a custom voice profile. |
| English | Peter (M) | LibriTTS-R CC-BY | listen | I love reading books aloud, there is something magical about bringing stories to life with your voice. |
| French | Baudelaire Reader | LibriVox Public Domain | listen | Bonjour a tous, ceci est une demonstration du clonage vocal avec un profil de voix personnalise. |
| Spanish | Lu | LibriVox Public Domain | listen | Buenos dias a todos, esta es una demostracion de la clonacion de voz con un perfil de voz personalizado. |

Full guide: delta vs standard, format internals, troubleshooting → docs/custom-voices.md

HTTP Server

# Start server
./qwen_tts -d qwen3-tts-0.6b --serve 8080

# Generate speech
curl -s http://localhost:8080/v1/tts \
  -d '{"text":"Hello, how are you?"}' -o output.wav

# Stream with real-time playback
curl -sN http://localhost:8080/v1/tts/stream \
  -d '{"text":"Hello, how are you?"}' | \
  play -t raw -r 24000 -e signed -b 16 -c 1 -

# OpenAI-compatible endpoint
curl -s http://localhost:8080/v1/audio/speech \
  -d '{"input":"Hello world","voice":"ryan"}' -o output.wav

Full guide: all endpoints, request body, performance → docs/server.md

Streaming

# Stream to WAV file
./qwen_tts -d qwen3-tts-0.6b --text "Hello world" --stream -o hello.wav

# Pipe raw PCM to audio player
./qwen_tts -d qwen3-tts-0.6b --text "Hello world" --stdout | \
    play -t raw -r 24000 -e signed -b 16 -c 1 -
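
A consumer of the --stdout stream just reads interleaved little-endian 16-bit samples. As an illustration (a hypothetical helper, not part of the engine), here is how a consumer might scan a block of s16le samples for peak amplitude:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>

/* Peak absolute amplitude over a block of s16le samples, as produced
 * by --stdout (24 kHz, mono). Illustrative consumer-side helper. */
int peak_s16(const int16_t *buf, size_t n) {
    int peak = 0;
    for (size_t i = 0; i < n; i++) {
        int v = abs((int)buf[i]); /* widen first: abs(INT16_MIN) fits in int */
        if (v > peak) peak = v;
    }
    return peak;
}
```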

How It Works

Text --> BPE Tokenizer --> Talker (LLM) --> Code Predictor --> Speech Decoder --> 24 kHz WAV
| Component | What it does |
|-----------|--------------|
| Talker | 28-layer Qwen3 transformer with GQA, RoPE, SwiGLU. Generates one audio frame token per step. |
| Code Predictor | 5-layer transformer running 15 sequential passes per frame. Predicts the remaining 15 codebook entries. |
| Speech Decoder | Causal ConvNet with 16-codebook RVQ dequantization and 480x upsampling. Converts codes to waveform. |

| | 0.6B | 1.7B |
|---|------|------|
| Hidden dim | 1024 | 2048 |
| Heads (Q/KV) | 16/8 | 16/8 |
| Layers | 28 | 28 |
| Code Predictor | 1024 hidden, 5 layers | 1024 hidden, 5 layers (+2048→1024 projection) |
| Memory | ~3 GB | ~8 GB |
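
The RVQ dequantization step in the speech decoder can be sketched as follows: each frame selects one entry per codebook, and the decoder input is the sum of the selected embedding vectors, because each codebook quantizes the residual left by the previous one. (Names and dimensions below are illustrative, not the engine's actual API; the real embedding dimension is much larger.)

```c
/* Residual vector quantization (RVQ) dequantization sketch:
 * out = sum over all codebooks of the embedding picked by that
 * codebook's code for this frame. */
#define N_CODEBOOKS 16
#define CB_SIZE 8   /* tiny codebook for illustration */
#define EMB_DIM 4   /* tiny embedding dim for illustration */

void rvq_dequantize(float codebooks[N_CODEBOOKS][CB_SIZE][EMB_DIM],
                    const int codes[N_CODEBOOKS],
                    float out[EMB_DIM]) {
    for (int d = 0; d < EMB_DIM; d++) out[d] = 0.0f;
    for (int q = 0; q < N_CODEBOOKS; q++)          /* residual stages */
        for (int d = 0; d < EMB_DIM; d++)
            out[d] += codebooks[q][codes[q]][d];   /* accumulate pick */
}
```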

Performance

Benchmarked on Apple M1 8-core, 16 GB RAM, 4 threads (make bench-full):

| Config | 0.6B RTF | 1.7B BF16 RTF | 1.7B INT8 RTF |
|--------|----------|---------------|---------------|
| CLI short | 1.37–1.69 | 4.10–4.40 | 3.69 |
| CLI long | 1.29–1.32 | 1.97–2.11 | 2.15 |
| CLI stream short | 1.31 | 2.59–4.01 | |
| CLI stream long | 1.30–1.33 | 2.06–2.43 | |
| Server (cold) | 1.34 | | |
| Server (warm) | 1.33 | | |

RTF = processing_time / audio_duration. Lower is better; <1.0 = faster than real-time.

Longer audio improves RTF (fixed costs amortize): 0.6B long text reaches RTF 1.29. Streaming has identical performance to normal mode. --int8 gives ~11% speedup on 1.7B (details).
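
As a concrete check of the RTF definition above: with 24 kHz output, audio duration is num_samples / 24000, so RTF reduces to a one-liner (illustrative helper, not part of the engine):

```c
/* RTF = processing_time / audio_duration.
 * Output is 24 kHz mono, so audio_duration = num_samples / 24000. */
static double rtf(double processing_seconds, unsigned long num_samples) {
    return processing_seconds / ((double)num_samples / 24000.0);
}
```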

Run benchmarks on your machine:

make bench         # Quick: short+long, normal+stream (both models)
make bench-full    # Full: + server, instruct, INT8, .qvoice (if available)

vs other implementations:

| Hardware | 0.6B RTF | Notes |
|----------|----------|-------|
| This project (C, Apple M1 CPU) | 1.26–1.39 | Pure C, no GPU |
| Python + PyTorch (Ryzen 9 7950X CPU) | 4.5–5.8 | Official Python, CPU-only |
| NVIDIA RTX 3090 | 0.52–0.68 | Python + PyTorch + FlashAttention 2 |

3–4x faster than Python on CPU. Within 2x of an RTX 3090 — on a 2020 laptop with no GPU.

Per-component breakdown, full GPU table, optimization history → docs/performance.md

Documentation

| Guide | Contents |
|-------|----------|
| Voice Cloning | Reference audio tips, ECAPA-TDNN internals, model comparison, samples |
| Custom Voices | .qvoice format, delta vs standard, managing profiles, troubleshooting |
| HTTP Server | All endpoints, request body, streaming, server performance |
| VoiceDesign | Creating voices from text descriptions |
| Quantization | INT8/INT4, comparison table, recommendations |
| Performance | RTF benchmarks, component breakdown, CPU vs GPU, optimization history |
| Building | All platforms, build targets, testing |

Blog Posts

| Post | Topic |
|------|-------|
| Voice Cloning Internals | ECAPA-TDNN architecture deep-dive |
| Cross-Model Voice Analysis | Why delta format works (weight analysis) |
| Optimization Notes | RTF 3.5 → 1.3: the full optimization story |

Credits & Acknowledgments

  • Salvatore Sanfilippo (antirez) — This project wouldn't exist without his qwen-asr, a pure C Qwen2-Audio ASR engine that proved you can do real neural inference in plain C with mmap'd safetensors, BF16 NEON kernels, and zero dependencies. The entire architecture of this TTS engine — the approach, the style, the philosophy of minimal C inference — is directly inspired by his work. If you like this project, go star qwen-asr first.
  • Michael Abrash — His Graphics Programming Black Book (1997) shaped how we think about performance. The chapters on data alignment, struct layout, and cache-friendly access patterns for the 386/486 are still relevant today — we got a 24% speedup from cache-line alignment (posix_memalign(64)), applying the same principles Abrash taught 30 years ago to modern SIMD and BLAS.
  • John Carmack — His .plan files and QuakeCon talks on micro-optimization and cache friendliness were a constant reference. Where Abrash gave you the systematic rules and benchmarks, Carmack showed you the mindset: always think about how data flows through the CPU.
  • Qwen3-TTS by the Qwen team at Alibaba — the model architecture, weights, and research. Models on Hugging Face. Paper.
  • Qwen2.5 by the Qwen team — the base LLM architecture (GQA, RoPE, SwiGLU) used in the Talker and Code Predictor.

License

MIT
