Voice cloning from a short audio clip. Powered by Qwen3-TTS — an open-source model by Alibaba.
Clone any voice from a 3 to 15 second reference audio clip and generate new speech in that voice. Run it interactively, via CLI flags, or through a browser-based UI.
| Category | Details |
|---|---|
| Voice Cloning | Clone any voice from a 3 to 15 second clean reference clip |
| Languages | English, Chinese, Japanese, Korean, German, French, Russian, Spanish, Italian, Portuguese |
| Interfaces | Interactive runner, CLI script, and Gradio web UI |
| Models | Lightweight 0.6B and higher quality 1.7B model options |
| Configuration | Fully configurable via .env — no code changes needed |
| Hardware | Runs on CUDA GPU (4 to 8GB VRAM) or CPU |
MimicTTS passes a reference clip and its transcript to Qwen3-TTS, which captures the characteristics of the reference voice and applies them to new speech.
- Python 3.10 or higher
- CUDA GPU with 4 to 8GB VRAM (or CPU for slower testing)
- A reference audio clip: .wav or .mp3, 3 to 15 seconds, clean speech, no background noise
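The 3 to 15 second requirement can be checked before running the model. The following is a minimal sketch that uses only Python's standard `wave` module, so it covers uncompressed .wav files and not .mp3. The function names and constants are hypothetical and are not part of MimicTTS.

```python
import wave

MIN_SECONDS = 3.0   # lower bound from the requirements above
MAX_SECONDS = 15.0  # upper bound from the requirements above


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed PCM .wav file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_valid_reference_length(path: str) -> bool:
    """Check that a .wav clip falls inside the 3 to 15 second window."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```

This only checks the clip length. Clean speech and the absence of background noise still have to be judged by ear.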
1. Clone the repository
git clone https://github.com/rajjitlai/MimicTTS.git
cd MimicTTS
2. Create and activate a virtual environment
python -m venv venv
# Windows
venv\Scripts\activate
# Linux / macOS
source venv/bin/activate
3. Install dependencies
pip install -r requirements.txt
4. (Optional) Install Flash Attention for faster GPU inference
pip install flash-attn --no-build-isolation
5. Configure your environment
# Windows
copy .env.example .env
# Linux / macOS
cp .env.example .env
Open .env and fill in your values. At minimum, set HF_TOKEN (a read-access token from huggingface.co/settings/tokens) to allow model downloads.
6. (Optional) Log in to HuggingFace CLI
huggingface-cli login
Drop your reference audio into reference_audio/, then run:
python runner.py
The runner guides you through every step with clear prompts:
+------------------------------------------+
| MimicTTS |
| Interactive Voice Cloner |
+------------------------------------------+
Reference audio files available:
[1] my_voice.wav
[2] sample.wav
Pick a file by number: 1
Using: reference_audio/my_voice.wav
Reference transcript
(Type out exactly what is spoken in your reference audio)
Transcript: Hello, my name is John and this is my voice.
Text to speak
(What should the cloned voice say?)
Text: Welcome to my project, thanks for watching!
Language selection:
[1] English <- default
[2] Chinese
...
Pick a language (or press Enter for English):
Using default: English
------------------------------------------
Review your inputs before generating:
------------------------------------------
Reference audio : reference_audio/my_voice.wav
Transcript : Hello, my name is John and this is my voice.
Text to speak : Welcome to my project, thanks for watching!
Language : English
Output file : outputs/result.wav
------------------------------------------
Looks good? Generate now? [Y/n]:
Output is saved to outputs/result.wav (configurable in .env).
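The file menu in the session above can be approximated in a few lines. This is a sketch of one way to list the clips and resolve a numbered answer. It is not the actual code in runner.py, and both function names are hypothetical.

```python
from pathlib import Path

AUDIO_SUFFIXES = {".wav", ".mp3"}  # formats accepted as reference audio


def list_reference_audio(folder: str) -> list[str]:
    """Return the .wav/.mp3 file names in `folder`, sorted by name."""
    return sorted(
        entry.name
        for entry in Path(folder).iterdir()
        if entry.is_file() and entry.suffix.lower() in AUDIO_SUFFIXES
    )


def pick_by_number(files: list[str], answer: str) -> str:
    """Map a 1-based menu answer such as "1" to the matching file name."""
    index = int(answer) - 1  # raises ValueError for non-numeric input
    if not 0 <= index < len(files):
        raise ValueError(f"Pick a number between 1 and {len(files)}")
    return files[index]
```

Filtering by suffix keeps transcripts.json out of the menu even though it lives in the same folder.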
For power users who prefer flags:
python voice_clone.py \
--ref_audio reference_audio/my_voice.wav \
--ref_text "This is what is spoken in the reference audio." \
--text "Hello, this is my cloned voice speaking something new!" \
--language English \
--output outputs/result.wav
Arguments:
| Argument | Required | Default | Description |
|---|---|---|---|
| --ref_audio | Yes | - | Path to reference audio (.wav or .mp3) |
| --ref_text | Yes | - | Exact transcript of the reference audio |
| --text | Yes | - | New text for the cloned voice to speak |
| --language | No | English | Output language |
| --output | No | outputs/result.wav | Output file path |
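For readers who want to wrap or extend the CLI, the argument table maps directly onto Python's `argparse`. The sketch below mirrors the flags and defaults listed above. The real voice_clone.py may define them differently, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose flags and defaults mirror the argument table."""
    parser = argparse.ArgumentParser(
        description="Clone a voice from a reference clip."
    )
    parser.add_argument("--ref_audio", required=True,
                        help="Path to reference audio (.wav or .mp3)")
    parser.add_argument("--ref_text", required=True,
                        help="Exact transcript of the reference audio")
    parser.add_argument("--text", required=True,
                        help="New text for the cloned voice to speak")
    parser.add_argument("--language", default="English",
                        help="Output language")
    parser.add_argument("--output", default="outputs/result.wav",
                        help="Output file path")
    return parser
```

Calling `build_parser().parse_args()` with only the three required flags leaves `language` and `output` at their defaults.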
python app.py
Open http://localhost:7860 in your browser. Upload your reference audio, fill in the transcript, type what you want the cloned voice to say, and click Clone Voice.
To expose the UI on your local network (for example, when running on a remote machine or in WSL), set GRADIO_SHARE=true in your .env.
MimicTTS/
├── runner.py # Interactive step-by-step prompt (recommended entry point)
├── app.py # Gradio web UI
├── voice_clone.py # CLI script with argument flags
├── model.py # Model loading and inference (shared singleton)
├── config.py # Central config — reads from .env
├── reference_audio/ # Place your reference .wav/.mp3 files here
│ └── transcripts.json # Saved transcripts per audio file (auto-managed)
├── outputs/ # Generated audio files are saved here
├── requirements.txt # Python dependencies
├── .env # Your local config (not committed to git)
├── .env.example # Config template — copy to .env to get started
├── .gitignore
└── README.md
The diagram below illustrates the modular relationship between the user interfaces and the core engine.
This specialized flow shows how reference audio, transcripts, and target text are processed by the Qwen3-TTS model to generate high-fidelity speech.
All settings are controlled via your .env file. Copy .env.example to .env to get started.
| Variable | Default | Description |
|---|---|---|
| MODEL_ID | Qwen/Qwen3-TTS-12Hz-0.6B-Base | HuggingFace model to use |
| DEVICE | Auto-detected | cuda:0, cuda:1, or cpu |
| REFERENCE_AUDIO_DIR | reference_audio | Folder for input audio files |
| OUTPUT_DIR | outputs | Folder for generated audio files |
| DEFAULT_LANGUAGE | English | Fallback language |
| DEFAULT_OUTPUT_FILE | outputs/result.wav | Where runner.py saves output |
| GRADIO_SHARE | false | Set true to expose UI on your network |
| GRADIO_PORT | 7860 | Port for the Gradio web UI |
| HF_TOKEN | - | HuggingFace read token for model downloads |
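To illustrate how these variables and defaults fit together, here is a sketch that builds a settings object from an environment-style mapping. It is not the project's config.py. The `Settings` class, `load_settings`, and the `cuda_available` parameter are hypothetical, and the device fallback (cuda:0 when a GPU is present, otherwise cpu) is an assumption about what "auto-detected" means.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    model_id: str
    device: str
    reference_audio_dir: str
    output_dir: str
    default_language: str
    default_output_file: str
    gradio_share: bool
    gradio_port: int
    hf_token: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ,
                  cuda_available: bool = False) -> Settings:
    """Build Settings from environment variables, using the table's defaults."""
    return Settings(
        model_id=env.get("MODEL_ID", "Qwen/Qwen3-TTS-12Hz-0.6B-Base"),
        # Assumption: auto-detection picks cuda:0 if a GPU exists, else cpu.
        device=env.get("DEVICE") or ("cuda:0" if cuda_available else "cpu"),
        reference_audio_dir=env.get("REFERENCE_AUDIO_DIR", "reference_audio"),
        output_dir=env.get("OUTPUT_DIR", "outputs"),
        default_language=env.get("DEFAULT_LANGUAGE", "English"),
        default_output_file=env.get("DEFAULT_OUTPUT_FILE", "outputs/result.wav"),
        # Only the literal string "true" (any case) enables sharing.
        gradio_share=env.get("GRADIO_SHARE", "false").strip().lower() == "true",
        gradio_port=int(env.get("GRADIO_PORT", "7860")),
        hf_token=env.get("HF_TOKEN") or None,
    )
```

Passing the mapping as a parameter keeps the function easy to test. A loader such as python-dotenv would typically copy the .env values into `os.environ` first, though whether MimicTTS uses it is not stated here.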
| Model | Size | VRAM | Best For |
|---|---|---|---|
| Qwen3-TTS-12Hz-0.6B-Base | 2.5 GB | ~4 GB | Quick tests, lighter hardware |
| Qwen3-TTS-12Hz-1.7B-Base | 4.5 GB | 6 to 8 GB | Better quality, production use |
Switch models by changing MODEL_ID in your .env file.
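The VRAM column can be turned into a simple rule of thumb for choosing MODEL_ID. The helper below is a hypothetical sketch. The 8 GB cutoff is a conservative assumption taken from the top of the 1.7B model's range, and the `Qwen/` prefix on the 1.7B identifier is assumed by analogy with the default 0.6B identifier.

```python
SMALL_MODEL = "Qwen/Qwen3-TTS-12Hz-0.6B-Base"
# Assumption: the 1.7B model uses the same "Qwen/" namespace as the default.
LARGE_MODEL = "Qwen/Qwen3-TTS-12Hz-1.7B-Base"


def suggest_model_id(vram_gb: float) -> str:
    """Suggest a MODEL_ID for a given amount of free GPU memory in GB.

    Assumption: the 1.7B model is only suggested at 8 GB or more, so
    cards in the 6 to 8 GB range fall back to the 0.6B model.
    """
    return LARGE_MODEL if vram_gb >= 8 else SMALL_MODEL
```

A card with 6 GB may still run the larger model. The cutoff here simply errs on the side of avoiding out-of-memory errors.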
MimicTTS includes a built-in reference transcript in reference_audio/transcripts.json to make recording your own reference clip straightforward.
Step 1 — Read the provided text aloud and record it
Open reference_audio/transcripts.json. The default transcript reads:
The quiet night gathers my scattered thoughts. Moonlight drifts across the empty road.
Streetlights hum like distant memories. And shadows stretch where silence grows.
I walk alone but never empty. Carrying questions I never chose,
Until the dawn begins its whisper. And turns my doubts to something close.
Record yourself reading this text naturally, at a comfortable pace. Aim for a 5 to 10 second clip — you do not need to read the entire passage, just enough to capture your voice clearly.
Step 2 — Save the recording
Save your recording as a .wav or .mp3 file and place it in the reference_audio/ folder:
reference_audio/
my_voice.wav <- your recording goes here
transcripts.json <- transcript is already saved
Step 3 — Run the interactive runner
python runner.py
The runner will detect your audio file, load the saved transcript automatically, and skip the manual transcript entry step. Just pick your file, press Enter to confirm the transcript, type what you want the cloned voice to say, and generate.
Recording tips:
- Record in a quiet room with no background noise or echo
- Use a decent microphone — even a phone mic works fine if the room is quiet
- Speak naturally, at your normal pace and tone
- Avoid clipping (distortion from speaking too loudly)
If you record audio with different content, the runner will prompt you to type the transcript on first use and save it automatically. On every subsequent run with that file, it loads the saved transcript — no retyping needed.
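The save-once, reuse-later behavior described above amounts to a small JSON-backed lookup keyed by file name. The sketch below shows one way to implement it, assuming transcripts.json is a flat mapping from audio file name to transcript. The function names are hypothetical and the runner's own code may differ.

```python
import json
from pathlib import Path


def load_transcripts(json_path: str) -> dict[str, str]:
    """Load the file-name to transcript mapping, or {} if the file is missing."""
    path = Path(json_path)
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def save_transcript(json_path: str, audio_name: str, transcript: str) -> None:
    """Add or replace one transcript and write the mapping back to disk."""
    transcripts = load_transcripts(json_path)
    transcripts[audio_name] = transcript
    Path(json_path).write_text(
        json.dumps(transcripts, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
```

On a later run, `load_transcripts(path).get("my_voice.wav")` returns the saved text, or `None` if the clip has not been seen before and the transcript still needs to be typed.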
You can also edit reference_audio/transcripts.json directly:
{
"my_voice.wav": "Your custom transcript text goes here.",
"another_clip.wav": "A second transcript for a different voice."
}
- Reference audio quality is the single biggest factor. Record in a quiet room with no background noise.
- A 5 to 10 second clip is the sweet spot. Too short loses voice character; too long adds no benefit.
- Always provide the reference transcript. Skipping it noticeably degrades clone quality.
- Set the language to that of the text you are generating, not the language of the reference audio.
- If you encounter GPU out-of-memory errors, set DEVICE=cpu in .env or switch to the 0.6B model.
Built and maintained by Rajjit Laishram.
Feel free to reach out via the website for collaboration, feedback, or questions.
Contributions of all kinds are welcome — bug fixes, new features, documentation improvements, and more.
| Document | Description |
|---|---|
| CONTRIBUTING.md | How to set up, branch, commit, and open a PR |
| CODE_OF_CONDUCT.md | Community standards and enforcement |
| CHANGELOG.md | Full history of changes by version |
| SECURITY.md | How to report vulnerabilities privately |
To get started: fork the repo, create a branch, make your changes, and open a pull request against main.
Copyright 2026 Rajjit Laishram
Licensed under the Apache License, Version 2.0. You may not use this project except in compliance with the License.
See the LICENSE file for the full license text, or visit: http://www.apache.org/licenses/LICENSE-2.0
Note: The underlying Qwen3-TTS model is subject to its own license on HuggingFace. Please review it before any commercial use of the model weights.



