README.md
Lines changed: 7 additions & 4 deletions
@@ -1,8 +1,8 @@
 # Voice Clone Studio

-A multi-model, modular Gradio-based web UI for voice cloning, voice design, multi-speaker conversation, voice conversion, voice training and sound effects. In short: one app, many engines, so you can tinker with all of them without juggling separate repos or setups. Powered by [VibeVoice](https://github.com/microsoft/VibeVoice), [Qwen3-TTS](https://github.com/QwenLM/Qwen3-TTS), [LuxTTS](https://github.com/ysharma3501/LuxTTS), [Chatterbox](https://github.com/resemble-ai/chatterbox) and [MMAudio](https://github.com/hkchengrex/MMAudio). Supports [Qwen3-ASR](https://github.com/QwenLM/Qwen3-ASR), [VibeVoice-ASR](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-asr.md) and [Whisper](https://github.com/openai/whisper) for automatic transcription, as well as [llama.cpp](https://github.com/ggerganov/llama.cpp) and [Ollama](https://ollama.com/) for prompt generation and prompt saving, based on [ComfyUI Prompt-Manager](https://github.com/FranckyB/ComfyUI-Prompt-Manager).
+A multi-model, modular Gradio-based web UI for voice cloning, voice design, multi-speaker conversation, voice conversion, voice training and sound effects. In short: one app, many engines, so you can tinker with all of them without juggling separate repos or setups. Powered by [VibeVoice](https://github.com/microsoft/VibeVoice), [Qwen3-TTS](https://github.com/QwenLM/Qwen3-TTS), [LuxTTS](https://github.com/ysharma3501/LuxTTS), [Chatterbox](https://github.com/resemble-ai/chatterbox), [Fish Speech](https://github.com/fishaudio/fish-speech) and [MMAudio](https://github.com/hkchengrex/MMAudio). Supports [Qwen3-ASR](https://github.com/QwenLM/Qwen3-ASR), [VibeVoice-ASR](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-asr.md) and [Whisper](https://github.com/openai/whisper) for automatic transcription, as well as [llama.cpp](https://github.com/ggerganov/llama.cpp) and [Ollama](https://ollama.com/) for prompt generation and prompt saving, based on [ComfyUI Prompt-Manager](https://github.com/FranckyB/ComfyUI-Prompt-Manager).
 <a href="docs/preview.png"><img src="docs/preview.png" alt="Voice Clone Studio Preview" width="600"></a>
@@ -15,7 +15,9 @@ Voice Clone Studio is fully modular. The main file dynamically loads self-contai
 ### Voice Clone
 Clone voices from your own audio samples. Provide a short reference audio clip with its transcript, and generate new speech in that voice.

-- **Multiple engines** - Qwen3-TTS (0.6B/1.7B) or VibeVoice (1.5B/Large/Large-4bit)
+- **Multiple engines** - Qwen3-TTS (0.6B/1.7B), VibeVoice (1.5B/Large/Large-4bit), LuxTTS, Chatterbox, and Fish Speech S2 Pro (4B)
+- **Fish Speech Expression Tags** - Embed `[tag]` markers like `[whisper]`, `[laughing]`, `[excited]` directly in text for fine-grained delivery control (15,000+ supported tags)
+- **Automatic Tag Stripping** - Fish Speech `[tags]` are automatically removed when using other engines, so the same text works everywhere
 - **Voice prompt caching** - First generation processes the sample, subsequent ones are instant
 - **Seed control** - Reproducible results with saved seeds
 - **Emotion presets** - 40+ emotion presets with adjustable intensity
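The Automatic Tag Stripping behavior described in this hunk could be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the function name `strip_expression_tags` and the regular expression are assumptions, and the sketch treats every bracketed span without nested brackets (including closing forms such as `[/happy]`) as an expression tag.

```python
import re

# Hypothetical pattern: any [...] span with no nested brackets or newlines.
_TAG_RE = re.compile(r"\[[^\[\]\n]*\]")


def strip_expression_tags(text: str) -> str:
    """Remove [tag] markers and tidy the whitespace they leave behind."""
    # Replace each tag with a space so neighbouring words do not fuse together.
    without_tags = _TAG_RE.sub(" ", text)
    # Collapse the runs of spaces/tabs created by the removal.
    collapsed = re.sub(r"[ \t]+", " ", without_tags)
    # Drop a space that now sits directly before punctuation.
    collapsed = re.sub(r" ([,.!?;:])", r"\1", collapsed)
    # Trim every line while keeping paragraph breaks intact.
    return "\n".join(line.strip() for line in collapsed.split("\n"))
```

With this sketch, `strip_expression_tags("[whisper] Hello there. [laughing] That was fun!")` returns `"Hello there. That was fun!"`. One trade-off of such a broad pattern is that ordinary bracketed text (for example a citation like `[1]`) would be removed as well; a real implementation might restrict the match to a known tag list.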
@@ -164,7 +166,7 @@ Save, browse, and generate text prompts for your TTS sessions. Includes a built-
 - **Prompt Hub** - Every generation tool has a built-in Prompt Loader for one-click access to saved prompts without switching tabs
 - **LLM Generation** - Generate prompts locally using Qwen3 language models via llama.cpp or Ollama (no cloud API needed)
 - **Ollama Support** - Use any model from your local Ollama installation as an alternative to llama.cpp
-- **System Prompt Presets** - Built-in presets for TTS/Voice and Sound Design/SFX workflows, or write your own
+- **System Prompt Presets** - Built-in presets for TTS/Voice, TTS/Voice (Fish Speech) with `[tag]` instructions, and Sound Design/SFX workflows, or write your own
 - **Model Auto-Download** - Download Qwen3-4B (~4.8GB) or Qwen3-8B (~8.5GB) GGUF models directly from HuggingFace
 - **Custom Models** - Drop any `.gguf` file into `models/llama/` to use your own models
 - **Automatic Server Management** - llama.cpp server starts/stops automatically, cleaned up on exit or Clear VRAM
@@ -491,6 +493,7 @@ Each tab lets you choose between model sizes:
 | **LuxTTS** | Large | Voice cloning with speaker encoder |
 | **VibeVoice-TTS** | Small, Large | Voice cloning & Long-form multi-speaker (up to 90 min) |
Fish Speech uses inline tags in text: `[happy]Hello![/happy]`, `[sad]Oh no[/sad]`
+
+**Generation Parameters:**
+- `temperature` (0.7-1.0, default 0.8)
+- `top_p` (0.7-0.95, default 0.8)
+- `top_k` (1-100, default 30)
+- `repetition_penalty` (1.0-1.2, default 1.1)
+- `max_new_tokens` (0=auto, up to 4096)
+- `chunk_length` (100-512, default 300)
+
+**Note:** The protobuf version conflict between `descript-audiotools` (pins `<3.20`) and `onnxruntime` (needs `>=4.25.1`) is resolved by force-upgrading protobuf to 5.x+ after installation. Both packages work at runtime despite the declared constraint.
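The ranges and defaults listed under Generation Parameters can be captured in a small validation helper. The sketch below is only an assumption about how such limits might be enforced: `PARAM_SPECS` and `clamp_generation_params` are hypothetical names that belong to neither Fish Speech nor Voice Clone Studio, and the default of `0` for `max_new_tokens` is inferred from the "0=auto" note rather than stated in the list.

```python
# (minimum, maximum, default) per parameter, copied from the list above.
# Assumption: max_new_tokens defaults to 0 ("auto"); the list gives no default.
PARAM_SPECS = {
    "temperature": (0.7, 1.0, 0.8),
    "top_p": (0.7, 0.95, 0.8),
    "top_k": (1, 100, 30),
    "repetition_penalty": (1.0, 1.2, 1.1),
    "max_new_tokens": (0, 4096, 0),
    "chunk_length": (100, 512, 300),
}


def clamp_generation_params(overrides=None):
    """Return a full parameter dict with every value forced into its range."""
    overrides = overrides or {}
    unknown = sorted(set(overrides) - set(PARAM_SPECS))
    if unknown:
        raise ValueError(f"Unknown generation parameter(s): {unknown}")

    params = {}
    for name, (low, high, default) in PARAM_SPECS.items():
        value = overrides.get(name, default)
        clamped = min(max(value, low), high)
        # Keep integer parameters integral and float parameters floating point.
        params[name] = type(default)(clamped)
    return params
```

For example, `clamp_generation_params({"temperature": 1.5, "top_k": 0})` yields a `temperature` of `1.0` and a `top_k` of `1`, while every other parameter keeps its default. Clamping silently is one design choice; raising an error on out-of-range values would be equally reasonable.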
docs/updates.md
Lines changed: 33 additions & 0 deletions
@@ -1,5 +1,38 @@
 # Version History

+## April 5, 2026
+
+#### Version 1.12.5 - Fish Speech Compilation Speed-Up & Stability
+
+**Triton/Inductor Compilation (thanks to [Mixomo](https://github.com/Mixomo))**
+- **Compiled Kernel Caching** - Fish Speech now uses Triton/Inductor compilation with persistent caching in `models/.cache`, dramatically speeding up repeat generations
+- **Windows Compatibility Patches** - Automatic runtime patching of `torch._inductor` to fix `cluster_dims` AttributeError crashes with triton-windows
+- **First-Run Cache Warning** - Clear console message displayed just before the first generation to inform users that kernel compilation is a one-time process that may take several minutes
+- **Persistent Cache Detection** - Startup reports compiled kernel count when cache exists; skips warning on subsequent launches
+
+**Dependencies**
+- **triton-windows** - Added `triton-windows` and `ninja` packages for GPU kernel compilation on Windows
+
+## April 3, 2026
+
+#### Version 1.12.0 - Fish Speech S2 Pro Integration
+- **Inline Expression Tags** - Fish Speech supports `[tag]` syntax for fine-grained control over speech delivery: embed tags like `[whisper]`, `[laughing]`, `[excited]`, `[pause]` directly in text
+- **15,000+ Tags** - Supports both predefined tags and free-form natural-language descriptions like `[whisper in small voice]`, `[professional broadcast tone]`, or `[pitch up]`
+- **Full Parameter Controls** - Temperature, Top-P, Top-K, Repetition Penalty, Max New Tokens, and Chunk Length sliders with ranges aligned to the official Fish Speech webui
+- **Automatic Tag Stripping** - When using other TTS engines, Fish Speech `[tags]` are automatically stripped from text so other models don't read them aloud
+- **Split by Paragraph** - Fully compatible with the existing Split by Paragraph feature for batch generation
+- **Auto-Download** - Model (~8 GB) downloads automatically from HuggingFace on first use with progress indication
+
+**Prompt Manager**
+- **Fish Speech LLM Preset** - New "TTS / Voice (Fish Speech)" system prompt preset instructs the LLM to generate text with inline `[tag]` expression markers, including a full tag reference and usage examples
+
+**Dependency Management**
+- **Clean Pip Resolution** - Resolved protobuf and transformers version conflicts using `--no-deps` install pattern for `descript-audio-codec` and `qwen-asr`
+- **Setup Scripts Updated** - All three setup scripts (Windows, Linux, macOS) updated with Fish Speech codec dependencies and conflict-free installs
+
 ## March 28, 2026

 #### Version 1.11.0 - VibeVoice 7B Training, Split by Paragraph & Bug Fixes
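The Version 1.12.5 entries above mention persistent kernel caching in `models/.cache` and a startup report of the compiled kernel count. A rough sketch of how that could be wired up is shown below, under several assumptions. `TORCHINDUCTOR_CACHE_DIR` and `TRITON_CACHE_DIR` are environment variables read by PyTorch Inductor and Triton, but whether Voice Clone Studio sets them this way is not confirmed by the changelog. The function names, the `inductor`/`triton` subfolder names, and counting every cached file as a stand-in for the "compiled kernel count" are all hypothetical.

```python
import os
from pathlib import Path


def configure_kernel_cache(cache_root="models/.cache"):
    """Point Inductor and Triton at a persistent cache folder.

    Must run before the first compilation so the libraries pick up the paths.
    """
    root = Path(cache_root)
    root.mkdir(parents=True, exist_ok=True)
    # setdefault keeps any location the user has already exported.
    os.environ.setdefault("TORCHINDUCTOR_CACHE_DIR", str(root / "inductor"))
    os.environ.setdefault("TRITON_CACHE_DIR", str(root / "triton"))
    return root


def count_cached_artifacts(cache_root):
    """Count regular files under the cache root (an approximation of kernels)."""
    root = Path(cache_root)
    if not root.is_dir():
        return 0
    return sum(1 for path in root.rglob("*") if path.is_file())


def startup_message(cache_root):
    """Build the console message shown at launch."""
    count = count_cached_artifacts(cache_root)
    if count == 0:
        return ("No compiled kernels cached yet; the first generation "
                "may take several minutes while kernels are compiled.")
    return f"Found {count} cached compilation artifact(s); skipping first-run warning."
```

A launcher might call `configure_kernel_cache()` once at import time and print `startup_message("models/.cache")` before loading the model.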