Qwen3-TTS Pure C Implementation

A lightweight, cross-platform C inference engine for Qwen3-TTS text-to-speech models (0.6B and 1.7B). No Python, no PyTorch, no ONNX runtime — just C, a BLAS library, and raw model weights.

The engine runs the complete TTS pipeline: BPE tokenization, a 28-layer causal transformer (Talker), a multi-pass code predictor, and a convolutional speech decoder. Weights are memory-mapped directly from safetensors files in BF16, so loading is near-instant and memory usage stays low.
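
Because weights stay in BF16 both on disk and in memory, kernels widen values to float32 on the fly. BF16 is simply the upper 16 bits of an IEEE-754 float32, so the scalar conversion is a single shift (a minimal sketch; the engine's actual kernels use NEON/AVX SIMD, and the function name here is illustrative):

```c
#include <stdint.h>
#include <string.h>

/* BF16 is the upper 16 bits of an IEEE-754 float32, so widening a
 * BF16 value is just a shift into the high half of a 32-bit word. */
static inline float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f); /* bit copy avoids strict-aliasing UB */
    return f;
}
```

This is also why mmap'd loading works: no conversion pass over the file is needed before inference starts.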

Audio Samples

All samples generated with the 0.6B model (RTF ~1.3–1.7, Apple M1):

| Language | Speaker | Sample | Text |
|----------|---------|--------|------|
| English | ryan | listen | Hello, this is a test of the text to speech system. |
| Italian | ryan | listen | Buongiorno a tutti, questa e una dimostrazione del sistema di sintesi vocale. |
| Italian | vivian | listen | Buongiorno a tutti, questa e una dimostrazione del sistema di sintesi vocale. |
| Spanish | ryan | listen | Hola, esta es una demostracion del sistema de sintesis de voz. |
| Portuguese | ryan | listen | Ola, esta e uma demonstracao do sistema de sintese de voz. |
| French | ryan | listen | Bonjour a tous, ceci est une demonstration du systeme de synthese vocale. |
| German | ryan | listen | Guten Tag, dies ist eine Demonstration des Sprachsynthesesystems. |
| Japanese | Ono_Anna | listen | こんにちは、私の名前はアンナです。今日はとても良い天気ですね。東京の桜がとても綺麗です。 |
| Japanese | Ono_Anna | listen | 頑張れ、アンドレア!あなたならできるよ。毎日少しずつ前に進もう。夢を諦めないで。応援してるよ! |

Clone and play locally: afplay samples/english_ryan.wav (macOS) or aplay samples/english_ryan.wav (Linux)

Quick Start

# Clone and build
git clone https://github.com/gabriele-mastrapasqua/qwen3-tts.git
cd qwen3-tts
make blas

# Download a model (interactive: small, large, voice-design, base-small, base-large)
./download_model.sh

# Synthesize speech
./qwen_tts -d qwen3-tts-0.6b --text "Hello, how are you today?" -o hello.wav

Dependencies: Only a C compiler and BLAS (Accelerate on macOS, OpenBLAS on Linux). See docs/building.md for Linux, Windows/WSL2, and other build targets.

Features

  • Pure C, minimal dependencies — Only requires a C compiler and BLAS. No Python runtime needed.
  • Cross-platform — macOS (ARM/x86) and Linux (ARM/x86). NEON and AVX SIMD paths. Windows/WSL2 beta.
  • Both model sizes — Automatically detects 0.6B or 1.7B from weight files.
  • 9 preset voices — ryan, vivian, serena, aiden, eric, dylan, uncle_fu, ono_anna, sohee.
  • 10 languages — English, Chinese, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian.
  • Memory-mapped weights — BF16 safetensors mmap'd directly. 0.6B ~3 GB, 1.7B ~8 GB.
  • Voice cloning — Clone any voice from a short WAV clip. Save as reusable .qvoice profile.
  • Custom voices with Delta .qvoice — Bit-identical cloned voices on CustomVoice model. Create once, use forever — with style control, streaming, server.
  • Voice management — List, inspect, delete .qvoice profiles (--list-voices, --delete-voice). No model required.
  • Style control — --instruct for emotion/style on 1.7B: angry, whisper, cheerful, and more.
  • VoiceDesign — Create new voices from text descriptions.
  • HTTP server — /v1/tts, /v1/tts/stream, OpenAI-compatible /v1/audio/speech.
  • Streaming — Real-time audio via --stream (WAV) or --stdout (raw PCM).
  • INT8/INT4 quantization — 15% speedup on 1.7B with --int8.
  • Configurable sampling — Temperature, top-k, top-p, and repetition penalty.
  • 24 kHz WAV output — 16-bit PCM, mono.
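
The 24 kHz, 16-bit mono output corresponds to a standard 44-byte RIFF/WAVE header ahead of the raw PCM data. A minimal sketch of such a writer (hypothetical, assuming a little-endian host; the engine's own code may differ in detail):

```c
#include <stdio.h>
#include <stdint.h>

/* Write a canonical 44-byte RIFF/WAVE header for 24 kHz, 16-bit,
 * mono PCM. Field values are written in native byte order, so this
 * assumes a little-endian host (as WAV fields are little-endian). */
static void write_wav_header(FILE *f, uint32_t num_samples) {
    const uint32_t sample_rate = 24000;
    const uint16_t channels = 1, bits = 16, pcm = 1;
    const uint16_t block_align = channels * bits / 8;  /* bytes per frame */
    const uint32_t byte_rate   = sample_rate * block_align;
    const uint32_t data_bytes  = num_samples * block_align;
    const uint32_t riff_size   = 36 + data_bytes;
    const uint32_t fmt_size    = 16;

    fwrite("RIFF", 1, 4, f); fwrite(&riff_size, 4, 1, f);
    fwrite("WAVE", 1, 4, f);
    fwrite("fmt ", 1, 4, f); fwrite(&fmt_size, 4, 1, f);
    fwrite(&pcm, 2, 1, f);         fwrite(&channels, 2, 1, f);
    fwrite(&sample_rate, 4, 1, f); fwrite(&byte_rate, 4, 1, f);
    fwrite(&block_align, 2, 1, f); fwrite(&bits, 2, 1, f);
    fwrite("data", 1, 4, f); fwrite(&data_bytes, 4, 1, f);
}
```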

Usage

./qwen_tts [options]

Required:
  -d, --model-dir <path>     Model directory
  --text <string>            Text to synthesize

Optional:
  -o, --output <path>        Output WAV file (default: output.wav)
  -s, --speaker <name>       Speaker voice (default: ryan)
  -l, --language <lang>      Target language (default: English)
  -I, --instruct <text>      Style/emotion instruction (1.7B model only)
  --temperature <f>          Sampling temperature (default: 0.5)
  --top-k <n>                Top-k sampling (default: 50)
  --top-p <f>                Top-p nucleus sampling (default: 1.0)
  --rep-penalty <f>          Repetition penalty (default: 1.05)
  --max-tokens <n>           Max audio tokens (default: 8192)
  --max-duration <secs>      Max audio duration in seconds
  --seed <n>                 Random seed for reproducible output
  --ref-audio <path>         Reference audio for voice cloning (Base model)
  --save-voice <path>        Save voice profile (.qvoice)
  --load-voice <path>        Load voice profile (.qvoice)
  --target-cv <dir>          CV model dir for delta encoding (bit-identical cross-model)
  --list-voices <dir>        List .qvoice files in directory (no model needed)
  --delete-voice <path>      Delete a .qvoice file
  --voice-name <name>        Name for the voice (stored in .qvoice metadata)
  --voice-design             VoiceDesign mode (create voice from --instruct)
  --stream                   Stream audio (decode chunks during generation)
  --stdout                   Output raw s16le PCM to stdout (implies --stream)
  --int8                     INT8 quantized (1.7B recommended)
  --int4                     Q4_0 quantized (1.7B only, experimental)
  -j, --threads <n>          Worker threads (default: 4)
  --silent                   Suppress status output
  --debug                    Verbose diagnostics
  --serve <port>             Start HTTP server

Examples

# Basic English
./qwen_tts -d qwen3-tts-0.6b --text "The quick brown fox jumps over the lazy dog." -o fox.wav

# Italian with a specific voice
./qwen_tts -d qwen3-tts-0.6b -s ryan -l Italian \
    --text "Ciao, questa e una prova del sistema di sintesi vocale." -o test_it.wav

# Style/emotion control (1.7B only)
./qwen_tts -d qwen3-tts-1.7b -s ryan -l English \
    --text "I cannot believe you did that to me." \
    --instruct "Speak in a very angry and aggressive tone" -o angry.wav

# Reproducible output with seed
./qwen_tts -d qwen3-tts-0.6b --text "Hello world" --seed 42 -o hello.wav

Voice Cloning

Clone any voice from a reference audio clip. Requires a Base model.

# Clone a voice
./qwen_tts -d qwen3-tts-0.6b-base --ref-audio reference.wav \
    --text "Hello, this is my cloned voice." -o cloned.wav

Full guide: reference audio tips, model comparison, samples → docs/voice-cloning.md

Custom Voices with Delta .qvoice

The killer feature: clone a voice once, save it as a .qvoice with --target-cv, and use it forever on the CustomVoice model — bit-identical to the original clone. Works with --instruct, streaming, and the HTTP server.

# Create (one-time: needs both Base + CV models)
./qwen_tts -d qwen3-tts-0.6b-base --ref-audio mario.wav -l Italian \
    --voice-name "Mario" --target-cv qwen3-tts-0.6b \
    --save-voice mario.qvoice

# Use forever (only CV model + .qvoice needed)
./qwen_tts -d qwen3-tts-0.6b --load-voice mario.qvoice \
    --text "Ciao, come stai?" -o output.wav

# On the server
./qwen_tts -d qwen3-tts-0.6b --load-voice mario.qvoice --serve 8080

# Manage voice profiles (no model needed)
./qwen_tts --list-voices ./my_voices/
./qwen_tts --delete-voice ./my_voices/old.qvoice

Voice clone samples — all generated via .qvoice delta on 0.6B CustomVoice:

| Language | Voice | Source | Output | Text |
|----------|-------|--------|--------|------|
| Italian | Pirandello Reader | LibriVox Public Domain | input, clone | Buongiorno a tutti, questa e una dimostrazione della clonazione vocale. |
| English | Sarac (F) | LibriTTS-R CC-BY | listen | Good morning everyone, this is a demonstration of voice cloning using a custom voice profile. |
| English | Peter (M) | LibriTTS-R CC-BY | listen | I love reading books aloud, there is something magical about bringing stories to life with your voice. |
| French | Baudelaire Reader | LibriVox Public Domain | listen | Bonjour a tous, ceci est une demonstration du clonage vocal avec un profil de voix personnalise. |
| Spanish | Lu | LibriVox Public Domain | listen | Buenos dias a todos, esta es una demostracion de la clonacion de voz con un perfil de voz personalizado. |

Full guide: delta vs standard, format internals, troubleshooting → docs/custom-voices.md

HTTP Server

# Start server
./qwen_tts -d qwen3-tts-0.6b --serve 8080

# Generate speech
curl -s http://localhost:8080/v1/tts \
  -d '{"text":"Hello, how are you?"}' -o output.wav

# Stream with real-time playback
curl -sN http://localhost:8080/v1/tts/stream \
  -d '{"text":"Hello, how are you?"}' | \
  play -t raw -r 24000 -e signed -b 16 -c 1 -

# OpenAI-compatible endpoint
curl -s http://localhost:8080/v1/audio/speech \
  -d '{"input":"Hello world","voice":"ryan"}' -o output.wav

Full guide: all endpoints, request body, performance → docs/server.md

Streaming

# Stream to WAV file
./qwen_tts -d qwen3-tts-0.6b --text "Hello world" --stream -o hello.wav

# Pipe raw PCM to audio player
./qwen_tts -d qwen3-tts-0.6b --text "Hello world" --stdout | \
    play -t raw -r 24000 -e signed -b 16 -c 1 -
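
A consumer of the --stdout stream just reads interleaved little-endian 16-bit samples. As an illustration (a hypothetical helper, not part of the engine), here is how a consumer might scan a block of s16le samples for peak amplitude:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>

/* Peak absolute amplitude over a block of s16le samples, as produced
 * by --stdout (24 kHz, mono). Illustrative consumer-side helper. */
int peak_s16(const int16_t *buf, size_t n) {
    int peak = 0;
    for (size_t i = 0; i < n; i++) {
        int v = abs((int)buf[i]); /* widen first: abs(INT16_MIN) fits in int */
        if (v > peak) peak = v;
    }
    return peak;
}
```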

How It Works

Text --> BPE Tokenizer --> Talker (LLM) --> Code Predictor --> Speech Decoder --> 24 kHz WAV
| Component | What it does |
|-----------|--------------|
| Talker | 28-layer Qwen3 transformer with GQA, RoPE, SwiGLU. Generates one audio frame token per step. |
| Code Predictor | 5-layer transformer running 15 sequential passes per frame. Predicts the remaining 15 codebook entries. |
| Speech Decoder | Causal ConvNet with 16-codebook RVQ dequantization and 480x upsampling. Converts codes to waveform. |

| | 0.6B | 1.7B |
|---|------|------|
| Hidden dim | 1024 | 2048 |
| Heads (Q/KV) | 16/8 | 16/8 |
| Layers | 28 | 28 |
| Code Predictor | 1024 hidden, 5 layers | 1024 hidden, 5 layers (+2048→1024 projection) |
| Memory | ~3 GB | ~8 GB |
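
The RVQ dequantization step in the speech decoder can be sketched as follows: each frame selects one entry per codebook, and the decoder input is the sum of the selected embedding vectors, because each codebook quantizes the residual left by the previous one. (Names and dimensions below are illustrative, not the engine's actual API; the real embedding dimension is much larger.)

```c
/* Residual vector quantization (RVQ) dequantization sketch:
 * out = sum over all codebooks of the embedding picked by that
 * codebook's code for this frame. */
#define N_CODEBOOKS 16
#define CB_SIZE 8   /* tiny codebook for illustration */
#define EMB_DIM 4   /* tiny embedding dim for illustration */

void rvq_dequantize(float codebooks[N_CODEBOOKS][CB_SIZE][EMB_DIM],
                    const int codes[N_CODEBOOKS],
                    float out[EMB_DIM]) {
    for (int d = 0; d < EMB_DIM; d++) out[d] = 0.0f;
    for (int q = 0; q < N_CODEBOOKS; q++)          /* residual stages */
        for (int d = 0; d < EMB_DIM; d++)
            out[d] += codebooks[q][codes[q]][d];   /* accumulate pick */
}
```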

Performance

Benchmarked on Apple M1 8-core, 16 GB RAM, 4 threads (make bench-full):

| Config | 0.6B RTF | 1.7B BF16 RTF | 1.7B INT8 RTF |
|--------|----------|---------------|---------------|
| CLI short | 1.37–1.69 | 4.10–4.40 | 3.69 |
| CLI long | 1.29–1.32 | 1.97–2.11 | 2.15 |
| CLI stream short | 1.31 | 2.59–4.01 | |
| CLI stream long | 1.30–1.33 | 2.06–2.43 | |
| Server (cold) | 1.34 | | |
| Server (warm) | 1.33 | | |

RTF = processing_time / audio_duration. Lower is better; <1.0 = faster than real-time.

Longer audio improves RTF (fixed costs amortize): 0.6B long text reaches RTF 1.29. Streaming has identical performance to normal mode. --int8 gives ~11% speedup on 1.7B (details).
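
As a concrete check of the RTF definition above: with 24 kHz output, audio duration is num_samples / 24000, so RTF reduces to a one-liner (illustrative helper, not part of the engine):

```c
/* RTF = processing_time / audio_duration.
 * Output is 24 kHz mono, so audio_duration = num_samples / 24000. */
static double rtf(double processing_seconds, unsigned long num_samples) {
    return processing_seconds / ((double)num_samples / 24000.0);
}
```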

Run benchmarks on your machine:

make bench         # Quick: short+long, normal+stream (both models)
make bench-full    # Full: + server, instruct, INT8, .qvoice (if available)

vs other implementations:

| Hardware | 0.6B RTF | Notes |
|----------|----------|-------|
| This project (C, Apple M1 CPU) | 1.26–1.39 | Pure C, no GPU |
| Python + PyTorch (Ryzen 9 7950X CPU) | 4.5–5.8 | Official Python, CPU-only |
| NVIDIA RTX 3090 | 0.52–0.68 | Python + PyTorch + FlashAttention 2 |

3–4x faster than Python on CPU. Within 2x of an RTX 3090 — on a 2020 laptop with no GPU.

Per-component breakdown, full GPU table, optimization history → docs/performance.md

Documentation

| Guide | Contents |
|-------|----------|
| Voice Cloning | Reference audio tips, ECAPA-TDNN internals, model comparison, samples |
| Custom Voices | .qvoice format, delta vs standard, managing profiles, troubleshooting |
| HTTP Server | All endpoints, request body, streaming, server performance |
| VoiceDesign | Creating voices from text descriptions |
| Quantization | INT8/INT4, comparison table, recommendations |
| Performance | RTF benchmarks, component breakdown, CPU vs GPU, optimization history |
| Building | All platforms, build targets, testing |

Blog Posts

| Post | Topic |
|------|-------|
| Voice Cloning Internals | ECAPA-TDNN architecture deep-dive |
| Cross-Model Voice Analysis | Why delta format works (weight analysis) |
| Optimization Notes | RTF 3.5 → 1.3: the full optimization story |

Credits & Acknowledgments

  • Salvatore Sanfilippo (antirez) — This project wouldn't exist without his qwen-asr, a pure C Qwen2-Audio ASR engine that proved you can do real neural inference in plain C with mmap'd safetensors, BF16 NEON kernels, and zero dependencies. The entire architecture of this TTS engine — the approach, the style, the philosophy of minimal C inference — is directly inspired by his work. If you like this project, go star qwen-asr first.
  • Michael Abrash — His Graphics Programming Black Book (1997) shaped how we think about performance. The chapters on data alignment, struct layout, and cache-friendly access patterns for the 386/486 are still relevant today — we got a 24% speedup from cache-line alignment (posix_memalign(64)), applying the same principles Abrash taught 30 years ago to modern SIMD and BLAS.
  • John Carmack — His .plan files and QuakeCon talks on micro-optimization and cache friendliness were a constant reference. Where Abrash gave you the systematic rules and benchmarks, Carmack showed you the mindset: always think about how data flows through the CPU.
  • Qwen3-TTS by the Qwen team at Alibaba — the model architecture, weights, and research. Models on Hugging Face. Paper.
  • Qwen2.5 by the Qwen team — the base LLM architecture (GQA, RoPE, SwiGLU) used in the Talker and Code Predictor.

License

MIT
