MimicTTS

Voice cloning from a short audio clip. Powered by Qwen3-TTS — an open-source model by Alibaba.

Clone any voice from a 3 to 15 second reference audio clip and generate new speech in that voice. Run it interactively, via CLI flags, or through a browser-based UI.

Features

Category	Details
Voice Cloning	Clone any voice from a 3 to 15 second clean reference clip
Languages	English, Chinese, Japanese, Korean, German, French, Russian, Spanish, Italian, Portuguese
Interfaces	Interactive runner, CLI script, and Gradio web UI
Models	Lightweight 0.6B and higher quality 1.7B model options
Configuration	Fully configurable via `.env` — no code changes needed
Hardware	Runs on CUDA GPU (4 to 8GB VRAM) or CPU

How it Works

MimicTTS passes your reference clip and its transcript to Qwen3-TTS, which uses them to capture the speaker's vocal characteristics and then synthesizes the target text in that voice.

Requirements

Python 3.10 or higher
CUDA GPU with 4 to 8GB VRAM (or CPU for slower testing)
A reference audio clip: .wav or .mp3, 3 to 15 seconds, clean speech, no background noise
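
The clip-length requirement can be checked before any model is loaded. The following is a minimal sketch using only the standard library. It handles .wav files only, since the `wave` module cannot read .mp3, and the names `clip_duration_seconds` and `is_valid_reference` are illustrative rather than part of MimicTTS.

```python
import wave

MIN_SECONDS = 3.0   # lower bound from the requirements above
MAX_SECONDS = 15.0  # upper bound from the requirements above


def clip_duration_seconds(path: str) -> float:
    """Return the duration of a PCM .wav file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_valid_reference(path: str) -> bool:
    """True if the clip length falls inside the 3 to 15 second window."""
    return MIN_SECONDS <= clip_duration_seconds(path) <= MAX_SECONDS
```

A duration check says nothing about noise or clipping, so it complements listening to the clip rather than replacing it.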

Setup

1. Clone the repository

git clone https://github.com/rajjitlai/MimicTTS.git
cd MimicTTS

2. Create and activate a virtual environment

python -m venv venv

# Windows
venv\Scripts\activate

# Linux / macOS
source venv/bin/activate

3. Install dependencies

pip install -r requirements.txt

4. (Optional) Install Flash Attention for faster GPU inference

pip install flash-attn --no-build-isolation

5. Configure your environment

# Windows
copy .env.example .env

# Linux / macOS
cp .env.example .env

Open .env and fill in your values. At minimum, set HF_TOKEN — a read-access token from huggingface.co/settings/tokens — to allow model downloads.

6. (Optional) Log in to HuggingFace CLI

huggingface-cli login

Usage

Interactive Runner (Recommended)

Drop your reference audio into reference_audio/, then run:

python runner.py

The runner guides you through every step with clear prompts:

+------------------------------------------+
|              MimicTTS                    |
|       Interactive Voice Cloner           |
+------------------------------------------+

Reference audio files available:
   [1] my_voice.wav
   [2] sample.wav

Pick a file by number: 1
   Using: reference_audio/my_voice.wav

Reference transcript
   (Type out exactly what is spoken in your reference audio)
Transcript: Hello, my name is John and this is my voice.

Text to speak
   (What should the cloned voice say?)
Text: Welcome to my project, thanks for watching!

Language selection:
   [1] English  <- default
   [2] Chinese
   ...

Pick a language (or press Enter for English):
   Using default: English

------------------------------------------
  Review your inputs before generating:
------------------------------------------
  Reference audio : reference_audio/my_voice.wav
  Transcript      : Hello, my name is John and this is my voice.
  Text to speak   : Welcome to my project, thanks for watching!
  Language        : English
  Output file     : outputs/result.wav
------------------------------------------

Looks good? Generate now? [Y/n]:

Output is saved to outputs/result.wav (configurable in .env).
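
The language prompt in the session above can be modeled as a small lookup. This is a sketch, not the code in runner.py: `pick_language` is a hypothetical helper, the menu order past the first two entries is assumed to follow the Features table, and falling back to the default on invalid input is also an assumption.

```python
LANGUAGES = [
    "English", "Chinese", "Japanese", "Korean", "German",
    "French", "Russian", "Spanish", "Italian", "Portuguese",
]


def pick_language(choice: str, default: str = "English") -> str:
    """Map a menu entry such as "2" to a language name.

    Empty input returns the default, like pressing Enter in the runner.
    Invalid input also returns the default (an assumption of this sketch).
    """
    choice = choice.strip()
    if not choice:
        return default
    if choice.isdecimal() and 1 <= int(choice) <= len(LANGUAGES):
        return LANGUAGES[int(choice) - 1]
    return default
```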

CLI

For power users who prefer flags:

python voice_clone.py \
  --ref_audio reference_audio/my_voice.wav \
  --ref_text "This is what is spoken in the reference audio." \
  --text "Hello, this is my cloned voice speaking something new!" \
  --language English \
  --output outputs/result.wav

Arguments:

Argument	Required	Default	Description
`--ref_audio`	Yes	—	Path to reference audio (.wav or .mp3)
`--ref_text`	Yes	—	Exact transcript of the reference audio
`--text`	Yes	—	New text for the cloned voice to speak
`--language`	No	English	Output language
`--output`	No	`outputs/result.wav`	Output file path
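
The argument table maps directly onto `argparse`. The sketch below shows one way such a parser could be declared. It mirrors the flags and defaults listed above but is not copied from voice_clone.py, and `build_parser` is an illustrative name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the five flags from the argument table."""
    parser = argparse.ArgumentParser(description="Clone a voice from a reference clip.")
    parser.add_argument("--ref_audio", required=True,
                        help="Path to reference audio (.wav or .mp3)")
    parser.add_argument("--ref_text", required=True,
                        help="Exact transcript of the reference audio")
    parser.add_argument("--text", required=True,
                        help="New text for the cloned voice to speak")
    parser.add_argument("--language", default="English",
                        help="Output language")
    parser.add_argument("--output", default="outputs/result.wav",
                        help="Output file path")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args([
    "--ref_audio", "reference_audio/my_voice.wav",
    "--ref_text", "This is what is spoken in the reference audio.",
    "--text", "Hello, this is my cloned voice speaking something new!",
])
```

Because `--language` and `--output` are omitted here, `args.language` and `args.output` take the defaults from the table.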

Web UI

python app.py

Open http://localhost:7860 in your browser. Upload your reference audio, fill in the transcript, type what you want the cloned voice to say, and click Clone Voice.

To reach the UI from another machine (for example, when MimicTTS runs on a remote host or inside WSL), set GRADIO_SHARE=true in your .env.

Project Structure

MimicTTS/
├── runner.py               # Interactive step-by-step prompt (recommended entry point)
├── app.py                  # Gradio web UI
├── voice_clone.py          # CLI script with argument flags
├── model.py                # Model loading and inference (shared singleton)
├── config.py               # Central config — reads from .env
├── reference_audio/        # Place your reference .wav/.mp3 files here
│   └── transcripts.json    # Saved transcripts per audio file (auto-managed)
├── outputs/                # Generated audio files are saved here
├── requirements.txt        # Python dependencies
├── .env                    # Your local config (not committed to git)
├── .env.example            # Config template — copy to .env to get started
├── .gitignore
└── README.md

Logic Flows

System Architecture

The diagram below illustrates the modular relationship between the user interfaces and the core engine.

Voice Cloning (Inference) Flow

This specialized flow shows how reference audio, transcripts, and target text are processed by the Qwen3-TTS model to generate high-fidelity speech.

Configuration

All settings are controlled via your .env file. Copy .env.example to .env to get started.

Variable	Default	Description
`MODEL_ID`	`Qwen/Qwen3-TTS-12Hz-0.6B-Base`	HuggingFace model to use
`DEVICE`	Auto-detected	`cuda:0`, `cuda:1`, or `cpu`
`REFERENCE_AUDIO_DIR`	`reference_audio`	Folder for input audio files
`OUTPUT_DIR`	`outputs`	Folder for generated audio files
`DEFAULT_LANGUAGE`	`English`	Fallback language
`DEFAULT_OUTPUT_FILE`	`outputs/result.wav`	Where `runner.py` saves output
`GRADIO_SHARE`	`false`	Set `true` to expose UI on your network
`GRADIO_PORT`	`7860`	Port for the Gradio web UI
`HF_TOKEN`	—	HuggingFace read token for model downloads
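
As a rough illustration of how a central config module might consume these variables, the sketch below reads them from a mapping and applies the defaults from the table. It is not the project's config.py: `Settings` and `load_settings` are hypothetical names, parsing of the .env file itself is left out, and DEVICE is omitted because its default is auto-detected.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings such as "true" or "1"."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    model_id: str
    reference_audio_dir: str
    output_dir: str
    default_language: str
    default_output_file: str
    gradio_share: bool
    gradio_port: int
    hf_token: Optional[str]


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        model_id=env.get("MODEL_ID", "Qwen/Qwen3-TTS-12Hz-0.6B-Base"),
        reference_audio_dir=env.get("REFERENCE_AUDIO_DIR", "reference_audio"),
        output_dir=env.get("OUTPUT_DIR", "outputs"),
        default_language=env.get("DEFAULT_LANGUAGE", "English"),
        default_output_file=env.get("DEFAULT_OUTPUT_FILE", "outputs/result.wav"),
        gradio_share=_as_bool(env.get("GRADIO_SHARE", "false")),
        gradio_port=int(env.get("GRADIO_PORT", "7860")),
        hf_token=env.get("HF_TOKEN") or None,  # empty string counts as unset
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to test.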

Model Options

Model	Size	VRAM	Best For
`Qwen3-TTS-12Hz-0.6B-Base`	2.5 GB	~4 GB	Quick tests, lighter hardware
`Qwen3-TTS-12Hz-1.7B-Base`	4.5 GB	6 to 8 GB	Better quality, production use

Switch models by changing MODEL_ID in your .env file.
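
The table can also be read as a simple decision rule. The helper below is a hypothetical sketch that suggests a MODEL_ID from the available VRAM. The thresholds are the lower ends of the ranges in the table, and the `Qwen/` prefix on the 1.7B identifier is assumed by analogy with the default MODEL_ID.

```python
MODELS = [
    # (model id, approximate minimum VRAM in GB from the table above)
    ("Qwen/Qwen3-TTS-12Hz-1.7B-Base", 6.0),  # "Qwen/" prefix is an assumption
    ("Qwen/Qwen3-TTS-12Hz-0.6B-Base", 4.0),
]


def suggest_model(vram_gb: float) -> str:
    """Return the largest model whose minimum VRAM fits, else the smallest one."""
    for model_id, min_vram in MODELS:
        if vram_gb >= min_vram:
            return model_id
    return MODELS[-1][0]  # below 4 GB: still suggest the 0.6B model
```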

Recording Your Reference Voice

MimicTTS includes a built-in reference transcript in reference_audio/transcripts.json to make recording your own reference clip straightforward.

Step 1 — Read the provided text aloud and record it

Open reference_audio/transcripts.json. The default transcript reads:

The quiet night gathers my scattered thoughts. Moonlight drifts across the empty road.
Streetlights hum like distant memories. And shadows stretch where silence grows.
I walk alone but never empty. Carrying questions I never chose,
Until the dawn begins its whisper. And turns my doubts to something close.

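Record yourself reading this text naturally, at a comfortable pace. Aim for a 5 to 10 second clip; you do not need to read the entire passage, just enough to capture your voice clearly. If you read only part of it, trim the matching entry in reference_audio/transcripts.json to the words you actually spoke, because the transcript must match the audio exactly.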

Step 2 — Save the recording

Save your recording as a .wav or .mp3 file and place it in the reference_audio/ folder:

reference_audio/
    my_voice.wav       <- your recording goes here
    transcripts.json   <- transcript is already saved

Step 3 — Run the interactive runner

python runner.py

The runner will detect your audio file, load the saved transcript automatically, and skip the manual transcript entry step. Just pick your file, press Enter to confirm the transcript, type what you want the cloned voice to say, and generate.

Recording tips:

Record in a quiet room with no background noise or echo

Use a decent microphone — even a phone mic works fine if the room is quiet

Speak naturally, at your normal pace and tone

Avoid clipping (distortion from speaking too loudly)

Adding Your Own Transcript

If you record audio with different content, the runner will prompt you to type the transcript on first use and save it automatically. On every subsequent run with that file, it loads the saved transcript — no retyping needed.

You can also edit reference_audio/transcripts.json directly:

{
  "my_voice.wav": "Your custom transcript text goes here.",
  "another_clip.wav": "A second transcript for a different voice."
}
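
Since the file is a flat JSON object keyed by audio filename, reading and updating it takes only a few lines. The functions below are a sketch of that idea, not the runner's actual implementation. `load_transcripts` and `save_transcript` are illustrative names.

```python
import json
from pathlib import Path


def load_transcripts(path: Path) -> dict:
    """Return the filename-to-transcript mapping, or an empty dict if the file is missing."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def save_transcript(path: Path, audio_name: str, transcript: str) -> None:
    """Add or replace one entry and write the whole file back."""
    transcripts = load_transcripts(path)
    transcripts[audio_name] = transcript
    path.write_text(
        json.dumps(transcripts, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
```

Writing with `ensure_ascii=False` keeps transcripts in languages such as Chinese or Japanese readable when the file is edited by hand.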

Tips for Best Results

Reference audio quality is the single biggest factor. Record in a quiet room with no background noise.
A 5 to 10 second clip is the sweet spot. Too short loses voice character; too long adds no benefit.
Always provide the reference transcript. Skipping it noticeably degrades clone quality.
Set the language to the one you are generating, not the language of the reference audio.
If you encounter GPU out-of-memory errors, set DEVICE=cpu in .env or switch to the 0.6B model.
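
The DEVICE override mentioned in the last tip can be pictured as follows. This sketch assumes the auto-detection prefers `cuda:0` when PyTorch reports a GPU and otherwise uses the CPU. `resolve_device` is a hypothetical name, and the project's actual detection logic may differ.

```python
import os
from typing import Mapping, Optional


def resolve_device(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the DEVICE override if set, else cuda:0 when available, else cpu."""
    env = os.environ if env is None else env
    override = env.get("DEVICE", "").strip()
    if override:
        return override  # an explicit setting such as DEVICE=cpu always wins
    try:
        import torch  # assumed to be installed alongside the project
    except ImportError:
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"
```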

Author

Built and maintained by Rajjit Laishram.

Feel free to reach out via the website for collaboration, feedback, or questions.

Contributing

Contributions of all kinds are welcome — bug fixes, new features, documentation improvements, and more.

Document	Description
CONTRIBUTING.md	How to set up, branch, commit, and open a PR
CODE_OF_CONDUCT.md	Community standards and enforcement
CHANGELOG.md	Full history of changes by version
SECURITY.md	How to report vulnerabilities privately

To get started: fork the repo, create a branch, make your changes, and open a pull request against main.

License

Licensed under the Apache License, Version 2.0. You may not use this project except in compliance with the License.

See the LICENSE file for the full license text, or visit: http://www.apache.org/licenses/LICENSE-2.0

Note: The underlying Qwen3-TTS model is subject to its own license on HuggingFace. Please review it before any commercial use of the model weights.
