A multi-model, modular Gradio-based web UI for voice cloning, voice design, multi-speaker conversation, voice conversion, voice training and sound effects. One app, many engines: tinker with all of them without juggling separate repos or setups. Powered by VibeVoice, Qwen3-TTS, LuxTTS, Chatterbox, Fish Speech and MMAudio. Supports Qwen3-ASR, VibeVoice-ASR and Whisper for automatic transcription, as well as llama.cpp and Ollama for prompt generation, plus prompt saving based on ComfyUI Prompt-Manager.
Voice Clone Studio is fully modular. The main file dynamically loads self-contained tools as tabs. Each tool can be enabled or disabled from Settings without touching any code. It supports multiple engines for voice cloning, as well as model finetuning. More features are planned.
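As a rough illustration of how this kind of tab loading could work, the sketch below filters a settings dictionary for enabled tools and imports each tool module by name. The `enabled_tools` key and both helper functions are assumptions made for the example; the app's actual config layout and loader may differ.

```python
import importlib

# Hypothetical settings layout: tool name -> enabled flag.
EXAMPLE_CONFIG = {
    "enabled_tools": {
        "voice_clone": True,
        "conversation": True,
        "train_model": False,
    }
}

def enabled_tool_names(config):
    """Return the names of the tools switched on in the settings, in order."""
    tools = config.get("enabled_tools", {})
    return [name for name, enabled in tools.items() if enabled]

def load_tool(name, package="modules.core_components.tools"):
    """Import one tool module by name (the default package path is an assumption)."""
    return importlib.import_module(f"{package}.{name}")

print(enabled_tool_names(EXAMPLE_CONFIG))
```

Keeping the enabled list in configuration means a tab can be hidden by flipping a flag rather than editing the loader.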
Clone voices from your own audio samples. Provide a short reference audio clip with its transcript, and generate new speech in that voice.
- Multiple engines - Qwen3-TTS (0.6B/1.7B), VibeVoice (1.5B/Large/Large-4bit), LuxTTS, Chatterbox, and Fish Speech S2 Pro (4B)
- Fish Speech Expression Tags - Embed `[tag]` markers like `[whisper]`, `[laughing]`, `[excited]` directly in text for fine-grained delivery control (15,000+ supported tags)
- Automatic Tag Stripping - Fish Speech `[tags]` are automatically removed when using other engines, so the same text works everywhere
- Voice prompt caching - First generation processes the sample, subsequent ones are instant
- Seed control - Reproducible results with saved seeds
- Emotion presets - 40+ emotion presets with adjustable intensity
- Split by Paragraph - Generate a separate audio clip for each paragraph, with automatic naming and a combined preview
- Prompt Hub - Access saved prompts directly from the tool without switching tabs
- Metadata tracking - Each output saves generation info (sample, seed, text)
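The automatic tag stripping mentioned above can be pictured as a small text filter. This is a minimal sketch assuming tags are any text in square brackets; the function name and regular expressions are illustrative, not the app's actual implementation.

```python
import re

# Matches one bracketed tag such as [whisper] or [laughing].
TAG_PATTERN = re.compile(r"\[[^\[\]]+\]")

def strip_expression_tags(text):
    """Remove [tag] markers and tidy the whitespace they leave behind."""
    without_tags = TAG_PATTERN.sub(" ", text)
    collapsed = re.sub(r"\s+", " ", without_tags).strip()
    # Drop any space left in front of punctuation.
    return re.sub(r"\s+([,.!?;:])", r"\1", collapsed)

print(strip_expression_tags("[whisper] I have a secret. [laughing] Just kidding!"))
```

A filter this broad would also remove the `[N]:` speaker markers used in conversation scripts, so a real implementation would likely need to be more selective.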
Create multi-speaker dialogues using either Qwen's premium voices or your own custom voice samples using VibeVoice:
Choose Your Engine:
- Qwen - Fast generation with 9 preset voices, optimized for their native languages
- VibeVoice - High-quality custom voices, up to 90 minutes continuous, perfect for podcasts/audiobooks
- LuxTTS -
Unified Script Format:
Write scripts using [N]: format - works seamlessly with both engines:
[1]: Hey, how's it going?
[2]: I'm doing great, thanks for asking!
[3]: Mind if I join this conversation?
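For readers who want to pre-process scripts outside the UI, the format above is easy to parse. The sketch below is one possible parser, not the app's own; the function name and the choice to skip lines without a marker are assumptions.

```python
import re

# One script line: "[N]: spoken text", where N is the speaker number.
LINE_PATTERN = re.compile(r"^\[(\d+)\]:\s*(.+)$")

def parse_script(script):
    """Return a list of (speaker_number, text) tuples, skipping other lines."""
    turns = []
    for raw_line in script.splitlines():
        match = LINE_PATTERN.match(raw_line.strip())
        if match:
            turns.append((int(match.group(1)), match.group(2).strip()))
    return turns

example = """[1]: Hey, how's it going?
[2]: I'm doing great, thanks for asking!
[3]: Mind if I join this conversation?"""

for speaker, text in parse_script(example):
    print(speaker, text)
```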
Qwen Mode:
- Mix any of the 9 premium speakers
- Adjustable pause duration between lines
- Fast generation with cached prompts
Speaker Mapping:
- [1] = Vivian, [2] = Serena, [3] = Uncle_Fu, [4] = Dylan, [5] = Eric
- [6] = Ryan, [7] = Aiden, [8] = Ono_Anna, [9] = Sohee
VibeVoice Mode:
- Up to 90 minutes of continuous speech
- Up to 4 distinct speakers using your own voice samples
- Cross-lingual support
- May spontaneously add background music/sounds for realism
- Numbers beyond 4 wrap around (5→1, 6→2, 7→3, 8→4, etc.)
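The wrap-around rule for speaker numbers amounts to modular arithmetic on a 1-based index. A minimal sketch, with a hypothetical function name:

```python
def wrap_speaker(number, max_speakers=4):
    """Map any positive speaker number onto 1..max_speakers."""
    if number < 1:
        raise ValueError("speaker numbers start at 1")
    return (number - 1) % max_speakers + 1

for n in range(1, 10):
    print(n, "->", wrap_speaker(n))
```

Subtracting 1 before the modulo and adding it back afterwards keeps the result in the range 1 to 4 instead of 0 to 3.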
Perfect for:
- Podcasts
- Audiobooks
- Long-form conversations
- Multi-speaker narratives
Models:
- Small - Faster generation (Qwen: 0.6B, VibeVoice: 1.5B)
- Large - Best quality (Qwen: 1.7B, VibeVoice: Large model)
Change the voice in any audio using Chatterbox speech-to-speech voice conversion (Resemble AI, MIT license):
- Speech-to-Speech - Upload or record audio, select a target voice sample, and re-speak the content in the target voice
- Microphone support - Record directly from your microphone for real-time voice conversion
- Any voice sample - Use the same voice samples from Voice Clone as conversion targets
- English optimized - Best results with English speech; multilingual support available with the Multilingual model
- Multiple models - TTS (English), Multilingual (23 languages)
Generate with premium pre-built voices, trained models, and streaming speakers:
VibeVoice Trained:
- Generate with your own VibeVoice LoRA-trained voices
- Optional voice sample conditioning with adjustable LoRA strength
- Advanced params: CFG scale, DDPM steps, temperature, top_k, top_p, repetition penalty
Qwen Speakers:
- 9 premium Qwen3-TTS speakers with style instructions (emotion, tone, speed)
- Each speaker works best in its native language but supports all languages
VibeVoice Speakers:
- 7 built-in VibeVoice Streaming 0.5B preset voices (Carter, Davis, Emma, Frank, Grace, Mike, Samuel)
- Lightweight 0.5B model with fast generation
- Auto-downloads voice prompt files from GitHub, caches locally
Qwen Trained:
- Generate with Qwen3-TTS finetuned models
- ICL (In-Context Learning) mode for enhanced voice cloning
Create voices from natural language descriptions - no audio needed, using Qwen3-TTS Voice Design Model:
- Describe age, gender, emotion, accent, speaking style
- Generate unique voices matching your description
Fine-tune your own custom voice models with your training data:
- Dual Engine - Train with Qwen3-TTS or VibeVoice LoRA finetuning
- Dataset Management - Organize training samples in the `datasets/` folder
- Audio Preparation - Auto-converts to 24kHz 16-bit mono format
- Training Pipeline - Complete 3-step workflow (validation → extract codes → train)
- Epoch Selection - Compare different training checkpoints
- Live Progress - Real-time training logs and loss monitoring
- Stop Training - Terminate training mid-run with clean status
- Voice Presets Integration - Use trained models alongside premium speakers
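The app performs the 24kHz 16-bit mono conversion automatically. If you want to prepare files by hand, an equivalent conversion can be done with FFmpeg, which is already a prerequisite. The sketch below only shows one way to build such a command; the helper names are hypothetical and this is not necessarily how the app converts audio internally.

```python
import subprocess

def build_convert_command(src, dst):
    """Build an ffmpeg command for 24 kHz, 16-bit, mono WAV output."""
    return [
        "ffmpeg", "-y",          # overwrite dst if it already exists
        "-i", src,
        "-ar", "24000",          # sample rate: 24 kHz
        "-ac", "1",              # channels: mono
        "-c:a", "pcm_s16le",     # codec: 16-bit signed PCM
        dst,
    ]

def convert(src, dst):
    """Run the conversion; requires ffmpeg on PATH."""
    subprocess.run(build_convert_command(src, dst), check=True)

print(" ".join(build_convert_command("input.mp3", "output.wav")))
```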
VibeVoice Training Features:
- LoRA finetuning via subprocess with full parameter UI
- Configurable: batch size, learning rate, epochs, save interval, DDPM batch multiplier, diffusion/CE loss weights, voice prompt drop, gradient accumulation, warmup steps
- Train diffusion head toggle, EMA on/off with auto-decay calculation
- Verbose output filtering - clean per-epoch summaries instead of raw logs
Requirements:
- CUDA GPU required
- Multiple audio samples with transcripts
- Training time: ~10-30 minutes depending on dataset size
Workflow:
- Prepare audio files (WAV/MP3) and organize in `datasets/YourSpeakerName/` folder
- Use Batch Transcribe to automatically transcribe all files at once
- Review and edit individual transcripts as needed
- Configure training parameters (model size, epochs, learning rate)
- Monitor training progress in real-time
- Use trained model in Voice Presets tab
Unified audio preparation workspace for both voice samples and training datasets:
- Trim - Use waveform selection to cut audio
- Normalize - Balance audio levels
- Convert to Mono - Ensure single-channel audio
- Denoise - Clean audio with DeepFilterNet
- Extract from Video - Automatically extract audio tracks from video files
- Auto-Split Audio - Split long recordings into sentence-level clips using Qwen3-ASR-detected boundaries
- Transcribe - Qwen3 ASR, Whisper, or VibeVoice ASR automatic transcription
- Batch Transcribe - Process entire folders of audio files at once
- Save as Sample - One-click sample creation
- Dataset Management - Create, delete, and organize dataset folders directly from the UI
Generate sound effects and ambient audio using MMAudio (CVPR 2025, MIT license):
- Text-to-Audio - Describe any sound and generate high-quality 44.1kHz audio
- Video-to-Audio - Drop in a video clip and generate synchronized sound effects
- Multiple Models - Medium (2.4GB) and Large v2 (3.9GB) built-in, plus custom model support
- Custom Models - Load your own `.pth` or `.safetensors` checkpoints with automatic architecture detection
- Video Preview - Source/Result toggle to compare original video against the audio-muxed result
- Fine Controls - Adjustable duration, guidance strength, and negative prompts
Save, browse, and generate text prompts for your TTS sessions. Includes a built-in LLM generator powered by llama.cpp or Ollama:
- Saved Prompts - Store and organize prompts in a local `prompts.json` file, browse with the file lister
- Prompt Hub - Every generation tool has a built-in Prompt Loader for one-click access to saved prompts without switching tabs
- LLM Generation - Generate prompts locally using Qwen3 language models via llama.cpp or Ollama (no cloud API needed)
- Ollama Support - Use any model from your local Ollama installation as an alternative to llama.cpp
- System Prompt Presets - Built-in presets for TTS/Voice, TTS/Voice (Fish Speech) with `[tag]` instructions, and Sound Design/SFX workflows, or write your own
- Model Auto-Download - Download Qwen3-4B (~4.8GB) or Qwen3-8B (~8.5GB) GGUF models directly from HuggingFace
- Custom Models - Drop any `.gguf` file into `models/llama/` to use your own models
- Automatic Server Management - llama.cpp server starts/stops automatically, cleaned up on exit or Clear VRAM
Inspired by ComfyUI-Prompt-Manager by FranckyB.
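To give a feel for what local prompt generation through Ollama involves, the sketch below sends a system prompt and a request to Ollama's local `/api/generate` endpoint. It is a standalone illustration, not the Prompt Manager's own code: the helper names, the example prompts, and the model name are placeholders, and Ollama must be running for `generate_prompt` to work.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(model, system_prompt, user_request):
    """Assemble a non-streaming generate request."""
    return {
        "model": model,
        "system": system_prompt,
        "prompt": user_request,
        "stream": False,
    }

def generate_prompt(payload, url=OLLAMA_URL):
    """Send the request and return the generated text (needs Ollama running)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]

payload = build_payload(
    model="qwen3:4b",  # placeholder: use any model name shown by `ollama list`
    system_prompt="Write one short line of dialogue for a text-to-speech test.",
    user_request="A tired detective greets a client.",
)
print(json.dumps(payload, indent=2))
```

Setting `stream` to false asks Ollama for a single JSON reply, which keeps the client code short at the cost of waiting for the full response.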
View, play back, and manage your previously generated audio files. Multi-select for batch deletion, double-click to play.
Centralized application configuration:
- Model loading - Attention mechanism, offline mode, low CPU memory usage
- CUDA Graphs Acceleration - 5-10x faster Qwen3 inference via Faster-Qwen3-TTS (CUDA only, toggle in Settings)
- Multi-GPU Assignment - Assign TTS, ASR, and Llama.cpp to different GPUs on multi-GPU systems
- LLM Backend - Choose between llama.cpp and Ollama for prompt generation
- Folder paths - Configurable directories for samples, output, datasets, models
- Model downloads - Download models directly to local storage
- Visible Tools - Enable or disable any tool tab (restart to apply)
- Help Guide - Built-in documentation for all tools
- Python 3.10-3.12 (3.11 recommended, since DeepFilterNet lacks wheels for 3.12; 3.13+ is not supported due to dependency conflicts)
- Windows/Linux: CUDA-compatible GPU (recommended: 8GB+ VRAM)
- macOS: Apple Silicon (M1/M2/M3/M4) for MPS acceleration, or Intel Mac (CPU-only)
- SOX (Sound eXchange) - Required for audio processing
- FFMPEG - Multimedia framework required for audio format conversion
- llama.cpp (optional) - Required only for the Prompt Manager's LLM generation feature. See llama.cpp
- Ollama (optional) - Alternative LLM backend for prompt generation. See Ollama
- Flash Attention 2 (optional, CUDA only)
Note for Linux/macOS users: openai-whisper is skipped (compatibility issues). Use VibeVoice ASR or Qwen3 ASR for transcription instead.
Note for macOS users: Model training is not supported on macOS. The Train Model tab is automatically hidden.
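Before running a setup script, it can help to confirm that your interpreter falls in the supported range. The check below is a small sketch based on the version list above; the function name and the three status labels are made up for the example.

```python
import sys

SUPPORTED = [(3, 10), (3, 11), (3, 12)]
RECOMMENDED = (3, 11)

def check_python(version=None):
    """Classify a (major, minor) version as recommended, supported, or unsupported."""
    major_minor = tuple((version or sys.version_info)[:2])
    if major_minor == RECOMMENDED:
        return "recommended"
    if major_minor in SUPPORTED:
        return "supported"
    return "unsupported"

print(check_python())
```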
- Clone the repository:
git clone https://github.com/FranckyB/Voice-Clone-Studio.git
cd Voice-Clone-Studio
- Run the setup script:
setup-windows.bat
This will automatically:
- Install SOX (audio processing)
- Create virtual environment
- Install PyTorch with CUDA support
- Install all dependencies
- Display your Python version
- Show instructions for optional Flash Attention 2 installation
- Clone the repository:
git clone https://github.com/FranckyB/Voice-Clone-Studio.git
cd Voice-Clone-Studio
- Make the setup script executable and run it:
chmod +x setup-linux.sh
./setup-linux.sh
This will automatically:
- Detect your Python version
- Create virtual environment
- Install PyTorch with CUDA support
- Install all dependencies (using requirements file)
- Handle ONNX Runtime installation issues
- Warn about Whisper compatibility if needed
- Clone the repository:
git clone https://github.com/FranckyB/Voice-Clone-Studio.git
cd Voice-Clone-Studio
- Make the setup script executable and run it:
chmod +x setup-mac.sh
./setup-mac.sh
This will automatically:
- Detect Apple Silicon vs Intel Mac
- Install ffmpeg and sox via Homebrew
- Create virtual environment
- Install PyTorch with MPS support (no CUDA needed)
- Install all dependencies with macOS-compatible fallbacks
- Offer optional LuxTTS and Qwen3 ASR installation
- Clone the repository:
git clone https://github.com/FranckyB/Voice-Clone-Studio.git
cd Voice-Clone-Studio
- Create a virtual environment:
python -m venv venv
# Windows
venv\Scripts\activate
# Linux/macOS
source venv/bin/activate
- Install PyTorch:
# Windows/Linux (NVIDIA GPU)
pip install torch==2.9.1 torchaudio==2.9.1 --index-url https://download.pytorch.org/whl/cu130
# macOS (MPS support built-in)
pip install torch==2.9.1 torchaudio==2.9.1
- Install dependencies:
# All platforms (Windows, Linux, macOS)
pip install -r requirements.txt
Note: The requirements file uses platform markers to automatically install the correct packages:
- Windows: Includes `openai-whisper` for transcription
- Linux/macOS: Excludes `openai-whisper` (uses VibeVoice ASR instead)
- Install Sox
# Windows
winget install -e --id ChrisBagwell.SoX
# Linux
# Debian/Ubuntu
sudo apt install sox libsox-dev
# Fedora/RHEL
sudo dnf install sox sox-devel
# macOS
brew install sox
- Install ffmpeg
# Windows
winget install -e --id Gyan.FFmpeg
# Linux
# Debian/Ubuntu
sudo apt install ffmpeg
# Fedora/RHEL
sudo dnf install ffmpeg
# macOS
brew install ffmpeg
- (Optional) Install llama.cpp for the Prompt Manager's LLM generation feature:
# Windows
winget install llama.cpp
# Linux
brew install llama.cpp
# Or build from source: https://github.com/ggml-org/llama.cpp
- (Optional) Install FlashAttention 2 for faster generation (CUDA only):
Note: The application automatically detects and uses the best available attention mechanism. Configure in Settings tab:
`flash_attention_2` (CUDA only) → `sdpa` (CUDA/MPS) → `eager` (all devices)
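The fallback order above can be read as a simple decision function. The sketch below is an illustration of that ordering under stated assumptions, not the application's detection code: `pick_attention` and `detect_flags` are hypothetical names, and the probe treats a missing PyTorch install as "no accelerator".

```python
import importlib.util

def pick_attention(has_cuda, has_mps, has_flash_attn):
    """Return the first usable option in the order flash_attention_2, sdpa, eager."""
    if has_cuda and has_flash_attn:
        return "flash_attention_2"
    if has_cuda or has_mps:
        return "sdpa"
    return "eager"

def detect_flags():
    """Probe the environment; everything is False when PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return False, False, False
    has_cuda = torch.cuda.is_available()
    has_mps = torch.backends.mps.is_available()
    has_flash = importlib.util.find_spec("flash_attn") is not None
    return has_cuda, has_mps, has_flash

print(pick_attention(*detect_flags()))
```

Separating the decision from the probing keeps the ordering easy to test on a machine without a GPU.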
For troubleshooting solutions, see docs/troubleshooting.md.
- Install NVIDIA Drivers (Windows Side)
  - Install the latest standard NVIDIA driver (Game Ready or Studio) for Windows from the NVIDIA Drivers page.
  - Crucial: Do not try to install NVIDIA drivers inside your WSL Linux terminal. It will conflict with the host driver.
- Update WSL 2
  - Open PowerShell as Administrator and ensure your WSL kernel is up to date: `wsl --update`
  - (If you don't have WSL installed yet, run `wsl --install` and restart your computer).
- Configure Docker Desktop
  - Install the latest version of Docker Desktop for Windows.
  - Open Docker Desktop Settings (gear icon).
  - Under General, ensure "Use the WSL 2 based engine" is checked.
  - Under Resources > WSL Integration, ensure the switch is enabled for your default Linux distro (e.g., Ubuntu).
- Run with Docker Compose
  - Run the following command in the repository root: `docker-compose up --build`
  - The application will be accessible at http://127.0.0.1:7860.
To verify the installation and features (like the DeepFilterNet denoiser), run the integration tests inside the container:
# Run the Denoiser Integration Test
docker-compose exec voice-clone-studio python tests/integration_test_denoiser.py

python voice_clone_studio.py
Or use the launcher scripts:
# Windows
launch.bat
# Linux/macOS
./launch.sh
The UI will open at http://127.0.0.1:7860
- Go to the Prep Audio tab
- Upload or record audio (3-10 seconds of clear speech)
- Trim, normalize, and denoise as needed
- Transcribe or manually enter the text
- Save as a sample with a name
- Go to the Voice Clone tab
- Select your sample from the dropdown
- Enter the text you want to speak
- Click Generate
- Go to the Voice Design tab
- Enter the text to speak
- Describe the voice (e.g., "Young female, warm and friendly, slight British accent")
- Click Generate
Voice-Clone-Studio/
├── voice_clone_studio.py # Main orchestrator (~230 lines)
├── config.json # User preferences & enabled tools
├── requirements.txt # Python dependencies
├── launch.bat / launch.sh # Launcher scripts
├── setup-windows.bat / setup-linux.sh / setup-mac.sh # Platform setup scripts
├── wheel/ # Pre-built custom Gradio components
│ └── gradio_filelister-0.4.0-py3-none-any.whl
├── samples/ # Voice samples (.wav + .json)
├── output/ # Generated audio outputs
├── datasets/ # Training datasets
├── models/ # Downloaded & trained models
├── docs/ # Documentation
│ ├── updates.md # Version history
│ ├── troubleshooting.md # Troubleshooting guide
│ └── MODEL_MANAGEMENT_README.md # AI model manager docs
└── modules/
    ├── core_components/ # Core app code
    │ ├── tools/ # All UI tools (tabs)
    │ │ ├── voice_clone.py
    │ │ ├── voice_presets.py
    │ │ ├── conversation.py
    │ │ ├── voice_design.py
    │ │ ├── voice_changer.py
    │ │ ├── sound_effects.py
    │ │ ├── prep_audio.py
    │ │ ├── output_history.py
    │ │ ├── train_model.py
    │ │ └── settings.py
    │ ├── ai_models/ # TTS & ASR model managers
    │ ├── ui_components/ # Modals, theme
    │ ├── gradio_filelister/ # Custom file browser component
    │ ├── constants.py # Central constants
    │ ├── emotion_manager.py # Emotion presets
    │ ├── audio_utils.py # Audio processing
    │ └── help_page.py # Help content
    ├── deepfilternet/ # Audio denoising
    ├── qwen_finetune/ # Training scripts
    ├── chatterbox/ # Chatterbox voice conversion
    ├── vibevoice_tts/ # VibeVoice TTS
    └── vibevoice_asr/ # VibeVoice ASR
Each tab lets you choose between model sizes:
| Model | Sizes | Use Case |
|---|---|---|
| Qwen3-TTS Base | Small, Large | Voice cloning from samples |
| Qwen3-TTS CustomVoice | Small, Large | Premium speakers with style control |
| Qwen3-TTS VoiceDesign | 1.7B only | Voice design from descriptions |
| LuxTTS | Large | Voice cloning with speaker encoder |
| VibeVoice-TTS | Small, Large | Voice cloning & Long-form multi-speaker (up to 90 min) |
| Chatterbox | TTS, Multilingual | Speech-to-speech voice conversion |
| Fish Speech S2 Pro | 4B | Voice cloning with inline expression tags |
| VibeVoice-ASR | Large | Audio transcription |
| Whisper | Medium | Audio transcription |
| MMAudio | Medium, Large v2 | Sound effects generation (text & video to audio) |
- Small = Faster, less VRAM (Qwen: 0.6B ~4GB, VibeVoice: 1.5B)
- Large = Better quality, more expressive (Qwen: 1.7B ~8GB, VibeVoice: Large model)
- A 4-bit quantized version of the Large model is also included for VibeVoice.
Models are automatically downloaded on first use via HuggingFace.
- Reference Audio: Use clear, noise-free recordings (3-10 seconds)
- Transcripts: Should exactly match what's spoken in the audio
- Caching: Voice prompts are cached - first generation is slow, subsequent ones are fast
- Seeds: Use the same seed to reproduce identical outputs
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
This project is based on and uses code from:
- Qwen3-TTS - Apache 2.0 License (Alibaba)
- VibeVoice - MIT License
- LuxTTS - Apache 2.0 License
- Gradio - Apache 2.0 License
- MMAudio - MIT License
- Chatterbox - MIT License (Resemble AI)
- DeepFilterNet - MIT License
For detailed version history and release notes, see docs/updates.md.
