Commit cff8079

Merge pull request #92 from FranckyB/dev
Dev
2 parents c188b12 + d37e92b commit cff8079

33 files changed

Lines changed: 5324 additions & 19 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -53,6 +53,7 @@ emotions.json
 prompts.json
 Qwen/
 .hf_cache/
+modules/models/.cache/
 
 # Jupyter
 .ipynb_checkpoints/

README.md

Lines changed: 7 additions & 4 deletions
@@ -1,8 +1,8 @@
 # Voice Clone Studio
 
-A multi-model, modular Gradio-based web UI for voice cloning, voice design, multi-speaker conversation, voice conversion, voice training and sound effects. Basically, One app, many engines, to tinker with all of them without juggling separate repos or setups. Powered by [VibeVoice](https://github.com/microsoft/VibeVoice), [Qwen3-TTS](https://github.com/QwenLM/Qwen3-TTS), [LuxTTS](https://github.com/ysharma3501/LuxTTS), [Chatterbox](https://github.com/resemble-ai/chatterbox) and [MMAudio](https://github.com/hkchengrex/MMAudio). Supports [Qwen3-ASR](https://github.com/QwenLM/Qwen3-ASR), [VibeVoice-ASR](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-asr.md) and [Whisper](https://github.com/openai/whisper) for automatic transcription. As well as [llama.cpp](https://github.com/ggerganov/llama.cpp) and [Ollama](https://ollama.com/) for Prompt Generation and a Prompt Saving, based on [ComfyUI Prompt-Manager](https://github.com/FranckyB/ComfyUI-Prompt-Manager)
+A multi-model, modular Gradio-based web UI for voice cloning, voice design, multi-speaker conversation, voice conversion, voice training and sound effects. Basically, One app, many engines, to tinker with all of them without juggling separate repos or setups. Powered by [VibeVoice](https://github.com/microsoft/VibeVoice), [Qwen3-TTS](https://github.com/QwenLM/Qwen3-TTS), [LuxTTS](https://github.com/ysharma3501/LuxTTS), [Chatterbox](https://github.com/resemble-ai/chatterbox), [Fish Speech](https://github.com/fishaudio/fish-speech) and [MMAudio](https://github.com/hkchengrex/MMAudio). Supports [Qwen3-ASR](https://github.com/QwenLM/Qwen3-ASR), [VibeVoice-ASR](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-asr.md) and [Whisper](https://github.com/openai/whisper) for automatic transcription. As well as [llama.cpp](https://github.com/ggerganov/llama.cpp) and [Ollama](https://ollama.com/) for Prompt Generation and a Prompt Saving, based on [ComfyUI Prompt-Manager](https://github.com/FranckyB/ComfyUI-Prompt-Manager)
 
-<img src="https://img.shields.io/badge/VibeVoice-TTS-green" alt="VibeVoice TTS"> <img src="https://img.shields.io/badge/VibeVoice-ASR-green" alt="VibeVoice ASR"> <img src="https://img.shields.io/badge/Qwen3-TTS-blue" alt="Qwen3-TTS"> <img src="https://img.shields.io/badge/Qwen3-ASR-blue" alt="Qwen3-ASR"> <img src="https://img.shields.io/badge/LuxTTS-TTS-orange" alt="LuxTTS"> <img src="https://img.shields.io/badge/Chatterbox-TTS-red" alt="Chatterbox-TTS"> <img src="https://img.shields.io/badge/Whisper-yellow" alt="Whisper"> <img src="https://img.shields.io/badge/MMAudio-SFX-purple" alt="MMAudio">
+<img src="https://img.shields.io/badge/VibeVoice-TTS-green" alt="VibeVoice TTS"> <img src="https://img.shields.io/badge/VibeVoice-ASR-green" alt="VibeVoice ASR"> <img src="https://img.shields.io/badge/Qwen3-TTS-blue" alt="Qwen3-TTS"> <img src="https://img.shields.io/badge/Qwen3-ASR-blue" alt="Qwen3-ASR"> <img src="https://img.shields.io/badge/LuxTTS-TTS-orange" alt="LuxTTS"> <img src="https://img.shields.io/badge/Chatterbox-TTS-red" alt="Chatterbox-TTS"> <img src="https://img.shields.io/badge/Fish_Speech-TTS-teal" alt="Fish Speech TTS"> <img src="https://img.shields.io/badge/Whisper-yellow" alt="Whisper"> <img src="https://img.shields.io/badge/MMAudio-SFX-purple" alt="MMAudio">
 
 <a href="docs/preview.png"><img src="docs/preview.png" alt="Voice Clone Studio Preview" width="600"></a>
 
@@ -15,7 +15,9 @@ Voice Clone Studio is fully modular. The main file dynamically loads self-contai
 ### Voice Clone
 Clone voices from your own audio samples. Provide a short reference audio clip with its transcript, and generate new speech in that voice.
 
-- **Multiple engines** - Qwen3-TTS (0.6B/1.7B) or VibeVoice (1.5B/Large/Large-4bit)
+- **Multiple engines** - Qwen3-TTS (0.6B/1.7B), VibeVoice (1.5B/Large/Large-4bit), LuxTTS, Chatterbox, and Fish Speech S2 Pro (4B)
+- **Fish Speech Expression Tags** - Embed `[tag]` markers like `[whisper]`, `[laughing]`, `[excited]` directly in text for fine-grained delivery control (15,000+ supported tags)
+- **Automatic Tag Stripping** - Fish Speech `[tags]` are automatically removed when using other engines, so the same text works everywhere
 - **Voice prompt caching** - First generation processes the sample, subsequent ones are instant
 - **Seed control** - Reproducible results with saved seeds
 - **Emotion presets** - 40+ emotion presets with adjustable intensity
@@ -164,7 +166,7 @@ Save, browse, and generate text prompts for your TTS sessions. Includes a built-
 - **Prompt Hub** - Every generation tool has a built-in Prompt Loader for one-click access to saved prompts without switching tabs
 - **LLM Generation** - Generate prompts locally using Qwen3 language models via llama.cpp or Ollama (no cloud API needed)
 - **Ollama Support** - Use any model from your local Ollama installation as an alternative to llama.cpp
-- **System Prompt Presets** - Built-in presets for TTS/Voice and Sound Design/SFX workflows, or write your own
+- **System Prompt Presets** - Built-in presets for TTS/Voice, TTS/Voice (Fish Speech) with `[tag]` instructions, and Sound Design/SFX workflows, or write your own
 - **Model Auto-Download** - Download Qwen3-4B (~4.8GB) or Qwen3-8B (~8.5GB) GGUF models directly from HuggingFace
 - **Custom Models** - Drop any `.gguf` file into `models/llama/` to use your own models
 - **Automatic Server Management** - llama.cpp server starts/stops automatically, cleaned up on exit or Clear VRAM
@@ -491,6 +493,7 @@ Each tab lets you choose between model sizes:
 | **LuxTTS** | Large | Voice cloning with speaker encoder |
 | **VibeVoice-TTS** | Small, Large | Voice cloning & Long-form multi-speaker (up to 90 min) |
 | **Chatterbox** | TTS, Multilingual | Speech-to-speech voice conversion |
+| **Fish Speech S2 Pro** | 4B | Voice cloning with inline expression tags |
 | **VibeVoice-ASR** | Large | Audio transcription |
 | **Whisper** | Medium | Audio transcription |
 | **MMAudio** | Medium, Large v2 | Sound effects generation (text & video to audio) |
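
The README hunk above says Fish Speech `[tags]` are stripped automatically for other engines, but the stripping code is not part of this diff. A minimal sketch of how that could work, assuming a tag is any bracketed span with no nested brackets; the function name and the pattern are hypothetical, not taken from the repository:

```python
import re

# Assumption: a tag is any "[...]" span with no nested brackets inside it.
_TAG_PATTERN = re.compile(r"\[[^\[\]]+\]")


def strip_expression_tags(text: str) -> str:
    """Remove [tag] markers and tidy the whitespace they leave behind."""
    # Replace each tag with a space so neighbouring words do not fuse.
    without_tags = _TAG_PATTERN.sub(" ", text)
    # Collapse the runs of spaces or tabs created by the removal.
    collapsed = re.sub(r"[ \t]+", " ", without_tags)
    # Drop a space left directly in front of punctuation.
    collapsed = re.sub(r" ([,.!?;:])", r"\1", collapsed)
    return collapsed.strip()


if __name__ == "__main__":
    print(strip_expression_tags("[whisper] Hello there. [laughing] That was fun!"))
```

A pattern this broad also removes bracketed text that was never meant as a tag, such as a literal `[1]` citation, so a real implementation may need a narrower rule.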

docs/MODEL_MANAGEMENT_README.md

Lines changed: 29 additions & 0 deletions
@@ -38,6 +38,10 @@ voice_design_model = tts_manager.get_qwen3_voice_design()
 custom_voice_model = tts_manager.get_qwen3_custom_voice(size="0.6B")
 vibevoice_model = tts_manager.get_vibevoice_tts(size="1.5B")
 
+# Fish Speech S2 Pro (4B model, ~24GB VRAM, 80+ languages)
+# Vendored in modules/fish_speech/ — Fish Audio Research License
+fish_speech = tts_manager.get_fish_speech()
+
 # Unload all when done
 tts_manager.unload_all()
 
@@ -98,6 +102,7 @@ TTS Model Manager:
 - `get_qwen3_voice_design()` - Load Qwen3 VoiceDesign
 - `get_qwen3_custom_voice(size)` - Load Qwen3 CustomVoice
 - `get_vibevoice_tts(size)` - Load VibeVoice TTS
+- `get_fish_speech()` - Load Fish Speech S2 Pro
 - `unload_all()` - Free all VRAM
 - `compute_sample_hash(wav_path, ref_text)` - Hash sample
 - `load_voice_prompt(sample_name, hash, model_size)` - Load cached prompt
@@ -197,6 +202,30 @@ User configuration (`config.json`) controls:
 **Optimized** - Smart VRAM management
 **Configurable** - User control over model behavior
 
+## Fish Speech S2 Pro
+
+Fish Speech S2 Pro is a 4B parameter TTS model supporting 80+ languages with inline emotion tags.
+
+**Details:**
+- Model: `fishaudio/s2-pro` on HuggingFace (~24GB VRAM)
+- Architecture: DualARTransformer + custom DAC audio codec
+- License: Fish Audio Research License (free for non-commercial use, commercial requires license)
+- Vendored source: `modules/fish_speech/` (inference-only subset)
+- Dependencies: `descript-audio-codec`, `hydra-core`, `loguru`
+
+**Emotion Tags:**
+Fish Speech uses inline tags in text: `[happy]Hello![/happy]`, `[sad]Oh no[/sad]`
+
+**Generation Parameters:**
+- `temperature` (0.7-1.0, default 0.8)
+- `top_p` (0.7-0.95, default 0.8)
+- `top_k` (1-100, default 30)
+- `repetition_penalty` (1.0-1.2, default 1.1)
+- `max_new_tokens` (0=auto, up to 4096)
+- `chunk_length` (100-512, default 300)
+
+**Note:** The protobuf version conflict between `descript-audiotools` (pins `<3.20`) and `onnxruntime` (needs `>=4.25.1`) is resolved by force-upgrading protobuf to 5.x+ after installation. Both packages work at runtime despite the declared constraint.
+
 ## Future Improvements
 
 - [ ] Model preloading for faster startup
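
The generation parameter ranges documented in the hunk above can be captured in a small settings object. This is only a sketch: the class name `FishSpeechParams` and the `clamped()` helper are hypothetical, and nothing here shows how the repository actually passes these values to the model. The ranges and defaults are copied from the added documentation.

```python
from dataclasses import dataclass

# (min, max) ranges copied from the documented "Generation Parameters" list.
_RANGES = {
    "temperature": (0.7, 1.0),
    "top_p": (0.7, 0.95),
    "top_k": (1, 100),
    "repetition_penalty": (1.0, 1.2),
    "max_new_tokens": (0, 4096),
    "chunk_length": (100, 512),
}


@dataclass
class FishSpeechParams:  # hypothetical name, not from the repository
    temperature: float = 0.8
    top_p: float = 0.8
    top_k: int = 30
    repetition_penalty: float = 1.1
    max_new_tokens: int = 0  # 0 means "auto" per the docs
    chunk_length: int = 300

    def clamped(self) -> "FishSpeechParams":
        """Return a copy with every field forced into its documented range."""
        values = {}
        for name, (low, high) in _RANGES.items():
            values[name] = min(max(getattr(self, name), low), high)
        return FishSpeechParams(**values)


if __name__ == "__main__":
    print(FishSpeechParams(temperature=1.5, top_k=0).clamped())
```

Clamping rather than raising keeps a UI slider or config file from crashing generation on an out-of-range value; a stricter design could raise `ValueError` instead.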

docs/updates.md

Lines changed: 33 additions & 0 deletions
@@ -1,5 +1,38 @@
 # Version History
 
+## April 5, 2026
+
+#### Version 1.12.5 - Fish Speech Compilation Speed-Up & Stability
+
+**Triton/Inductor Compilation (thanks to [Mixomo](https://github.com/Mixomo))**
+- **Compiled Kernel Caching** - Fish Speech now uses Triton/Inductor compilation with persistent caching in `models/.cache`, dramatically speeding up repeat generations
+- **Windows Compatibility Patches** - Automatic runtime patching of `torch._inductor` to fix `cluster_dims` AttributeError crashes with triton-windows
+- **First-Run Cache Warning** - Clear console message displayed just before the first generation to inform users that kernel compilation is a one-time process that may take several minutes
+- **Persistent Cache Detection** - Startup reports compiled kernel count when cache exists; skips warning on subsequent launches
+
+**Dependencies**
+- **triton-windows** - Added `triton-windows` and `ninja` packages for GPU kernel compilation on Windows
+
+## April 3, 2026
+
+#### Version 1.12.0 - Fish Speech S2 Pro Integration
+
+**Fish Speech S2 Pro (6th TTS Engine)**
+- **New TTS Engine** - Integrated [Fish Speech S2 Pro](https://huggingface.co/fishaudio/s2-pro) (4B param DualAR Transformer + custom DAC codec) as a voice cloning engine
+- **Inline Expression Tags** - Fish Speech supports `[tag]` syntax for fine-grained control over speech delivery — embed tags like `[whisper]`, `[laughing]`, `[excited]`, `[pause]` directly in text
+- **15,000+ Tags** - Supports both predefined tags and free-form natural-language descriptions like `[whisper in small voice]`, `[professional broadcast tone]`, or `[pitch up]`
+- **Full Parameter Controls** - Temperature, Top-P, Top-K, Repetition Penalty, Max New Tokens, and Chunk Length sliders with ranges aligned to the official Fish Speech webui
+- **Automatic Tag Stripping** - When using other TTS engines, Fish Speech `[tags]` are automatically stripped from text so other models don't read them aloud
+- **Split by Paragraph** - Fully compatible with the existing Split by Paragraph feature for batch generation
+- **Auto-Download** - Model (~8 GB) downloads automatically from HuggingFace on first use with progress indication
+
+**Prompt Manager**
+- **Fish Speech LLM Preset** - New "TTS / Voice (Fish Speech)" system prompt preset instructs the LLM to generate text with inline `[tag]` expression markers, including a full tag reference and usage examples
+
+**Dependency Management**
+- **Clean Pip Resolution** - Resolved protobuf and transformers version conflicts using `--no-deps` install pattern for `descript-audio-codec` and `qwen-asr`
+- **Setup Scripts Updated** - All three setup scripts (Windows, Linux, macOS) updated with Fish Speech codec dependencies and conflict-free installs
+
 ## March 28, 2026
 
 #### Version 1.11.0 - VibeVoice 7B Training, Split by Paragraph & Bug Fixes
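
The 1.12.5 notes above describe persistent kernel caching and a first-run warning that is skipped once a cache exists. One way this could be wired up is sketched below. `TORCHINDUCTOR_CACHE_DIR` and `TRITON_CACHE_DIR` are the environment variables PyTorch Inductor and Triton read for their cache locations. The function names, the `inductor` and `triton` subdirectories, the message wording, and the use of a plain file count as the "compiled kernel count" are assumptions, not the repository's code.

```python
import os
from pathlib import Path


def configure_kernel_cache(cache_dir: Path) -> int:
    """Point Inductor/Triton at cache_dir and return how many files it holds."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # These must be set before torch compiles anything to take effect.
    os.environ["TORCHINDUCTOR_CACHE_DIR"] = str(cache_dir / "inductor")
    os.environ["TRITON_CACHE_DIR"] = str(cache_dir / "triton")
    # Assumption: the number of files is a good-enough proxy for cached kernels.
    return sum(1 for path in cache_dir.rglob("*") if path.is_file())


def startup_message(cached_files: int) -> str:
    """Pick the console message based on whether a cache already exists."""
    if cached_files == 0:
        return ("First run: kernel compilation is a one-time process "
                "that may take several minutes.")
    return f"Found {cached_files} cached kernel files; reusing them."


if __name__ == "__main__":
    count = configure_kernel_cache(Path("models/.cache"))
    print(startup_message(count))
```

Setting the variables at startup, before any model import triggers compilation, is what makes the cache survive across launches.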
