MimicTTS

Voice cloning from a short audio clip. Powered by Qwen3-TTS — an open-source model by Alibaba.

Clone any voice from a 3 to 15 second reference audio clip and generate new speech in that voice. Run it interactively, via CLI flags, or through a browser-based UI.


Features

Category        Details
Voice Cloning   Clone any voice from a 3 to 15 second clean reference clip
Languages       English, Chinese, Japanese, Korean, German, French, Russian, Spanish, Italian, Portuguese
Interfaces      Interactive runner, CLI script, and Gradio web UI
Models          Lightweight 0.6B and higher-quality 1.7B model options
Configuration   Fully configurable via .env, with no code changes needed
Hardware        Runs on a CUDA GPU (4 to 8 GB VRAM) or CPU

How it Works

MimicTTS passes your reference clip and its transcript to the Qwen3-TTS model, which captures the characteristics of the reference voice and applies them to the new text you supply.

[Figure: MimicTTS process]


Requirements

  • Python 3.10 or higher
  • CUDA GPU with 4 to 8 GB VRAM (or CPU for slower testing)
  • A reference audio clip: .wav or .mp3, 3 to 15 seconds, clean speech, no background noise
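Before running the model, it can help to confirm that a clip falls inside the 3 to 15 second window. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and not part of MimicTTS, and the `wave` module reads uncompressed PCM .wav files only, so .mp3 clips would need a different check.

```python
import wave

MIN_SECONDS = 3.0   # lower bound stated in the requirements
MAX_SECONDS = 15.0  # upper bound stated in the requirements


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM .wav file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_valid_reference_length(path: str) -> bool:
    """Return True if the clip length is within the 3 to 15 second window."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```

This checks length only. It says nothing about background noise or clipping, which still need to be judged by ear.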

Setup

1. Clone the repository

git clone https://github.com/rajjitlai/MimicTTS.git
cd MimicTTS

2. Create and activate a virtual environment

python -m venv venv

# Windows
venv\Scripts\activate

# Linux / macOS
source venv/bin/activate

3. Install dependencies

pip install -r requirements.txt

4. (Optional) Install Flash Attention for faster GPU inference

pip install flash-attn --no-build-isolation

5. Configure your environment

# Windows
copy .env.example .env

# Linux / macOS
cp .env.example .env

Open .env and fill in your values. At minimum, set HF_TOKEN — a read-access token from huggingface.co/settings/tokens — to allow model downloads.

6. (Optional) Log in to HuggingFace CLI

huggingface-cli login

Usage

Interactive Runner (Recommended)

Drop your reference audio into reference_audio/, then run:

python runner.py

The runner guides you through every step with clear prompts:

+------------------------------------------+
|              MimicTTS                    |
|       Interactive Voice Cloner           |
+------------------------------------------+

Reference audio files available:
   [1] my_voice.wav
   [2] sample.wav

Pick a file by number: 1
   Using: reference_audio/my_voice.wav

Reference transcript
   (Type out exactly what is spoken in your reference audio)
Transcript: Hello, my name is John and this is my voice.

Text to speak
   (What should the cloned voice say?)
Text: Welcome to my project, thanks for watching!

Language selection:
   [1] English  <- default
   [2] Chinese
   ...

Pick a language (or press Enter for English):
   Using default: English

------------------------------------------
  Review your inputs before generating:
------------------------------------------
  Reference audio : reference_audio/my_voice.wav
  Transcript      : Hello, my name is John and this is my voice.
  Text to speak   : Welcome to my project, thanks for watching!
  Language        : English
  Output file     : outputs/result.wav
------------------------------------------

Looks good? Generate now? [Y/n]:

Output is saved to outputs/result.wav (configurable in .env).


CLI

For power users who prefer flags:

python voice_clone.py \
  --ref_audio reference_audio/my_voice.wav \
  --ref_text "This is what is spoken in the reference audio." \
  --text "Hello, this is my cloned voice speaking something new!" \
  --language English \
  --output outputs/result.wav

Arguments:

Argument      Required  Default             Description
--ref_audio   Yes       -                   Path to reference audio (.wav or .mp3)
--ref_text    Yes       -                   Exact transcript of the reference audio
--text        Yes       -                   New text for the cloned voice to speak
--language    No        English             Output language
--output      No        outputs/result.wav  Output file path
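For readers who want to wrap or extend the CLI, the documented flags could be declared with `argparse` roughly as follows. This is an illustrative sketch built from the argument table, not the actual contents of voice_clone.py; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the five flags listed in the argument table."""
    parser = argparse.ArgumentParser(description="Clone a voice from a reference clip.")
    parser.add_argument("--ref_audio", required=True,
                        help="Path to reference audio (.wav or .mp3)")
    parser.add_argument("--ref_text", required=True,
                        help="Exact transcript of the reference audio")
    parser.add_argument("--text", required=True,
                        help="New text for the cloned voice to speak")
    parser.add_argument("--language", default="English", help="Output language")
    parser.add_argument("--output", default="outputs/result.wav", help="Output file path")
    return parser


# Parse an explicit argument list so the defaults are visible.
args = build_parser().parse_args([
    "--ref_audio", "reference_audio/my_voice.wav",
    "--ref_text", "This is what is spoken in the reference audio.",
    "--text", "Hello, this is my cloned voice speaking something new!",
])
print(args.language, args.output)  # prints: English outputs/result.wav
```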

Web UI

python app.py

Open http://localhost:7860 in your browser. Upload your reference audio, fill in the transcript, type what you want the cloned voice to say, and click Clone Voice.

To expose the UI on your local network (for example, when running on a remote machine or under WSL), set GRADIO_SHARE=true in your .env.


Project Structure

MimicTTS/
├── runner.py               # Interactive step-by-step prompt (recommended entry point)
├── app.py                  # Gradio web UI
├── voice_clone.py          # CLI script with argument flags
├── model.py                # Model loading and inference (shared singleton)
├── config.py               # Central config — reads from .env
├── reference_audio/        # Place your reference .wav/.mp3 files here
│   └── transcripts.json    # Saved transcripts per audio file (auto-managed)
├── outputs/                # Generated audio files are saved here
├── requirements.txt        # Python dependencies
├── .env                    # Your local config (not committed to git)
├── .env.example            # Config template — copy to .env to get started
├── .gitignore
└── README.md

Logic Flows

System Architecture

The diagram below illustrates the modular relationship between the user interfaces and the core engine.

[Figure: MimicTTS architecture flow]

Voice Cloning (Inference) Flow

This specialized flow shows how reference audio, transcripts, and target text are processed by the Qwen3-TTS model to generate high-fidelity speech.

[Figure: MimicTTS inference flow]


Configuration

All settings are controlled via your .env file. Copy .env.example to .env to get started.

Variable             Default                        Description
MODEL_ID             Qwen/Qwen3-TTS-12Hz-0.6B-Base  HuggingFace model to use
DEVICE               Auto-detected                  cuda:0, cuda:1, or cpu
REFERENCE_AUDIO_DIR  reference_audio                Folder for input audio files
OUTPUT_DIR           outputs                        Folder for generated audio files
DEFAULT_LANGUAGE     English                        Fallback language
DEFAULT_OUTPUT_FILE  outputs/result.wav             Where runner.py saves output
GRADIO_SHARE         false                          Set true to expose UI on your network
GRADIO_PORT          7860                           Port for the Gradio web UI
HF_TOKEN             -                              HuggingFace read token for model downloads
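To make the relationship between .env and the defaults concrete, here is a small standard-library sketch of how such a file could be merged with the documented defaults. It is an assumption about one reasonable approach, not the real config.py; `parse_env_text` and `load_settings` are hypothetical names, and DEVICE and HF_TOKEN are left out of the defaults because the table gives them no fixed value.

```python
# Defaults copied from the configuration table.
DEFAULTS = {
    "MODEL_ID": "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
    "REFERENCE_AUDIO_DIR": "reference_audio",
    "OUTPUT_DIR": "outputs",
    "DEFAULT_LANGUAGE": "English",
    "DEFAULT_OUTPUT_FILE": "outputs/result.wav",
    "GRADIO_SHARE": "false",
    "GRADIO_PORT": "7860",
}


def parse_env_text(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_settings(env_text: str) -> dict:
    """Overlay .env values on the defaults and convert the typed fields."""
    settings = {**DEFAULTS, **parse_env_text(env_text)}
    settings["GRADIO_SHARE"] = settings["GRADIO_SHARE"].lower() == "true"
    settings["GRADIO_PORT"] = int(settings["GRADIO_PORT"])
    return settings


settings = load_settings("GRADIO_SHARE=true\nGRADIO_PORT=8080\n")
print(settings["GRADIO_SHARE"], settings["GRADIO_PORT"])  # prints: True 8080
```

Any variable missing from the file keeps its default, which matches the idea that only HF_TOKEN needs to be set by hand.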

Model Options

Model                     Size    VRAM       Best For
Qwen3-TTS-12Hz-0.6B-Base  2.5 GB  ~4 GB      Quick tests, lighter hardware
Qwen3-TTS-12Hz-1.7B-Base  4.5 GB  6 to 8 GB  Better quality, production use
Switch models by changing MODEL_ID in your .env file.
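As a rough guide, the VRAM column can be turned into a simple rule of thumb. The sketch below is hypothetical and not part of MimicTTS. It treats 6 GB as the cutoff for the 1.7B model, and it assumes the 1.7B model ID carries the same `Qwen/` prefix as the documented 0.6B default.

```python
SMALL_MODEL_ID = "Qwen/Qwen3-TTS-12Hz-0.6B-Base"  # documented default
LARGE_MODEL_ID = "Qwen/Qwen3-TTS-12Hz-1.7B-Base"  # assumed ID: "Qwen/" prefix is a guess


def suggest_model_id(vram_gb: float) -> str:
    """Suggest a MODEL_ID from available VRAM, using the table's lower bounds."""
    if vram_gb >= 6:
        return LARGE_MODEL_ID
    return SMALL_MODEL_ID


# Print a line that can be pasted into .env.
print(f"MODEL_ID={suggest_model_id(8)}")
```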

Recording Your Reference Voice

MimicTTS includes a built-in reference transcript in reference_audio/transcripts.json to make recording your own reference clip straightforward.

Step 1 — Read the provided text aloud and record it

Open reference_audio/transcripts.json. The default transcript reads:

The quiet night gathers my scattered thoughts. Moonlight drifts across the empty road.
Streetlights hum like distant memories. And shadows stretch where silence grows.
I walk alone but never empty. Carrying questions I never chose,
Until the dawn begins its whisper. And turns my doubts to something close.

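Record yourself reading this text naturally, at a comfortable pace. Aim for a 5 to 10 second clip; you do not need to read the entire passage, just enough to capture your voice clearly. If you read only part of the passage, make sure the transcript you confirm in the runner contains only the words you actually recorded, since the reference transcript should match the audio exactly.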

Step 2 — Save the recording

Save your recording as a .wav or .mp3 file and place it in the reference_audio/ folder:

reference_audio/
    my_voice.wav       <- your recording goes here
    transcripts.json   <- transcript is already saved

Step 3 — Run the interactive runner

python runner.py

The runner will detect your audio file, load the saved transcript automatically, and skip the manual transcript entry step. Just pick your file, press Enter to confirm the transcript, type what you want the cloned voice to say, and generate.

Recording tips:

  • Record in a quiet room with no background noise or echo
  • Use a decent microphone — even a phone mic works fine if the room is quiet
  • Speak naturally, at your normal pace and tone
  • Avoid clipping (distortion from speaking too loudly)

Adding Your Own Transcript

If you record audio with different content, the runner will prompt you to type the transcript on first use and save it automatically. On every subsequent run with that file, it loads the saved transcript — no retyping needed.

You can also edit reference_audio/transcripts.json directly:

{
  "my_voice.wav": "Your custom transcript text goes here.",
  "another_clip.wav": "A second transcript for a different voice."
}
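The file is a flat JSON object keyed by audio file name, so it is easy to read or update from a script. The helpers below are a minimal sketch under that assumption; `load_transcripts` and `save_transcript` are hypothetical names, not functions exported by MimicTTS.

```python
import json
from pathlib import Path


def load_transcripts(json_path: Path) -> dict:
    """Return the file-name-to-transcript mapping, or an empty dict if the file is missing."""
    if not json_path.exists():
        return {}
    return json.loads(json_path.read_text(encoding="utf-8"))


def save_transcript(json_path: Path, audio_name: str, transcript: str) -> None:
    """Add or replace one entry and write the mapping back to disk."""
    transcripts = load_transcripts(json_path)
    transcripts[audio_name] = transcript
    json_path.write_text(
        json.dumps(transcripts, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
```

A call such as `save_transcript(Path("reference_audio/transcripts.json"), "my_voice.wav", "Your custom transcript text goes here.")` would produce an entry like the first one in the example above.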

Tips for Best Results

  • Reference audio quality is the single biggest factor. Record in a quiet room with no background noise.
  • A 5 to 10 second clip is the sweet spot. Too short loses voice character; too long adds no benefit.
  • Always provide the reference transcript. Skipping it noticeably degrades clone quality.
  • Set the language to match the text you are generating, not the language spoken in the reference audio.
  • If you encounter GPU out-of-memory errors, set DEVICE=cpu in .env or switch to the 0.6B model.

Author

Built and maintained by Rajjit Laishram.

Feel free to reach out via the website for collaboration, feedback, or questions.


Contributing

Contributions of all kinds are welcome — bug fixes, new features, documentation improvements, and more.

Document            Description
CONTRIBUTING.md     How to set up, branch, commit, and open a PR
CODE_OF_CONDUCT.md  Community standards and enforcement
CHANGELOG.md        Full history of changes by version
SECURITY.md         How to report vulnerabilities privately

To get started: fork the repo, create a branch, make your changes, and open a pull request against main.


License

Copyright 2026 Rajjit Laishram

Licensed under the Apache License, Version 2.0. You may not use this project except in compliance with the License.

See the LICENSE file for the full license text, or visit: http://www.apache.org/licenses/LICENSE-2.0

Note: The underlying Qwen3-TTS model is subject to its own license on HuggingFace. Please review it before any commercial use of the model weights.
