FranckyB
diff --git a/.gitignore b/.gitignore
Lines changed: 1 addition & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 7 additions & 4 deletions
diff --git a/docs/MODEL_MANAGEMENT_README.md b/docs/MODEL_MANAGEMENT_README.md
Lines changed: 29 additions & 0 deletions
diff --git a/docs/updates.md b/docs/updates.md
Lines changed: 33 additions & 0 deletions
@@ -53,6 +53,7 @@ emotions.json
 prompts.json
 Qwen/
 .hf_cache/
+modules/models/.cache/
 
 # Jupyter
 .ipynb_checkpoints/
 
@@ -1,8 +1,8 @@
 # Voice Clone Studio
 
-A multi-model, modular Gradio-based web UI for voice cloning, voice design, multi-speaker conversation, voice conversion, voice training and sound effects. Basically, One app, many engines, to tinker with all of them without juggling separate repos or setups.  Powered by [VibeVoice](https://github.com/microsoft/VibeVoice), [Qwen3-TTS](https://github.com/QwenLM/Qwen3-TTS), [LuxTTS](https://github.com/ysharma3501/LuxTTS), [Chatterbox](https://github.com/resemble-ai/chatterbox) and [MMAudio](https://github.com/hkchengrex/MMAudio). Supports [Qwen3-ASR](https://github.com/QwenLM/Qwen3-ASR), [VibeVoice-ASR](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-asr.md) and [Whisper](https://github.com/openai/whisper) for automatic transcription. As well as [llama.cpp](https://github.com/ggerganov/llama.cpp) and [Ollama](https://ollama.com/) for Prompt Generation and a Prompt Saving, based on [ComfyUI Prompt-Manager](https://github.com/FranckyB/ComfyUI-Prompt-Manager)
+A multi-model, modular Gradio-based web UI for voice cloning, voice design, multi-speaker conversation, voice conversion, voice training and sound effects. One app, many engines: tinker with all of them without juggling separate repos or setups. Powered by [VibeVoice](https://github.com/microsoft/VibeVoice), [Qwen3-TTS](https://github.com/QwenLM/Qwen3-TTS), [LuxTTS](https://github.com/ysharma3501/LuxTTS), [Chatterbox](https://github.com/resemble-ai/chatterbox), [Fish Speech](https://github.com/fishaudio/fish-speech) and [MMAudio](https://github.com/hkchengrex/MMAudio). Supports [Qwen3-ASR](https://github.com/QwenLM/Qwen3-ASR), [VibeVoice-ASR](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-asr.md) and [Whisper](https://github.com/openai/whisper) for automatic transcription, as well as [llama.cpp](https://github.com/ggerganov/llama.cpp) and [Ollama](https://ollama.com/) for prompt generation and prompt saving, based on [ComfyUI Prompt-Manager](https://github.com/FranckyB/ComfyUI-Prompt-Manager).
 
-<img src="https://img.shields.io/badge/VibeVoice-TTS-green" alt="VibeVoice TTS"> <img src="https://img.shields.io/badge/VibeVoice-ASR-green" alt="VibeVoice ASR"> <img src="https://img.shields.io/badge/Qwen3-TTS-blue" alt="Qwen3-TTS"> <img src="https://img.shields.io/badge/Qwen3-ASR-blue" alt="Qwen3-ASR"> <img src="https://img.shields.io/badge/LuxTTS-TTS-orange" alt="LuxTTS"> <img src="https://img.shields.io/badge/Chatterbox-TTS-red" alt="Chatterbox-TTS"> <img src="https://img.shields.io/badge/Whisper-yellow" alt="Whisper"> <img src="https://img.shields.io/badge/MMAudio-SFX-purple" alt="MMAudio">
+<img src="https://img.shields.io/badge/VibeVoice-TTS-green" alt="VibeVoice TTS"> <img src="https://img.shields.io/badge/VibeVoice-ASR-green" alt="VibeVoice ASR"> <img src="https://img.shields.io/badge/Qwen3-TTS-blue" alt="Qwen3-TTS"> <img src="https://img.shields.io/badge/Qwen3-ASR-blue" alt="Qwen3-ASR"> <img src="https://img.shields.io/badge/LuxTTS-TTS-orange" alt="LuxTTS"> <img src="https://img.shields.io/badge/Chatterbox-TTS-red" alt="Chatterbox-TTS"> <img src="https://img.shields.io/badge/Fish_Speech-TTS-teal" alt="Fish Speech TTS"> <img src="https://img.shields.io/badge/Whisper-yellow" alt="Whisper"> <img src="https://img.shields.io/badge/MMAudio-SFX-purple" alt="MMAudio">
 
 <a href="docs/preview.png"><img src="docs/preview.png" alt="Voice Clone Studio Preview" width="600"></a>
 
@@ -15,7 +15,9 @@ Voice Clone Studio is fully modular. The main file dynamically loads self-contai
 ### Voice Clone
 Clone voices from your own audio samples. Provide a short reference audio clip with its transcript, and generate new speech in that voice.
 
-- **Multiple engines** - Qwen3-TTS (0.6B/1.7B) or VibeVoice (1.5B/Large/Large-4bit)
+- **Multiple engines** - Qwen3-TTS (0.6B/1.7B), VibeVoice (1.5B/Large/Large-4bit), LuxTTS, Chatterbox, and Fish Speech S2 Pro (4B)
+- **Fish Speech Expression Tags** - Embed `[tag]` markers like `[whisper]`, `[laughing]`, `[excited]` directly in text for fine-grained delivery control (15,000+ supported tags)
+- **Automatic Tag Stripping** - Fish Speech `[tags]` are automatically removed when using other engines, so the same text works everywhere
 - **Voice prompt caching** - First generation processes the sample, subsequent ones are instant
 - **Seed control** - Reproducible results with saved seeds
 - **Emotion presets** - 40+ emotion presets with adjustable intensity
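
The automatic tag stripping described in the list above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the function names `strip_expression_tags` and `prepare_text` and the engine key `"fish_speech"` are hypothetical, and the sketch assumes a tag is any bracketed span with no nested brackets or line breaks.

```python
import re

# Assumption: a tag is any "[...]" span without nested brackets or newlines.
_TAG_RE = re.compile(r"\[[^\[\]\n]+\]")


def strip_expression_tags(text: str) -> str:
    """Remove [tag] markers and tidy the whitespace they leave behind."""
    without_tags = _TAG_RE.sub("", text)
    # Collapse runs of spaces/tabs created where a tag was removed.
    collapsed = re.sub(r"[ \t]{2,}", " ", without_tags)
    # Drop spaces that now sit directly before punctuation.
    collapsed = re.sub(r" +([,.!?;:])", r"\1", collapsed)
    return collapsed.strip()


def prepare_text(text: str, engine: str) -> str:
    """Keep tags for Fish Speech; strip them for every other engine."""
    # "fish_speech" is a hypothetical engine identifier.
    return text if engine == "fish_speech" else strip_expression_tags(text)
```

Note that a pattern this broad also removes legitimate bracketed text such as `[sic]`, so a real implementation may want to match against a known tag list instead.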
@@ -164,7 +166,7 @@ Save, browse, and generate text prompts for your TTS sessions. Includes a built-
 - **Prompt Hub** - Every generation tool has a built-in Prompt Loader for one-click access to saved prompts without switching tabs
 - **LLM Generation** - Generate prompts locally using Qwen3 language models via llama.cpp or Ollama (no cloud API needed)
 - **Ollama Support** - Use any model from your local Ollama installation as an alternative to llama.cpp
-- **System Prompt Presets** - Built-in presets for TTS/Voice and Sound Design/SFX workflows, or write your own
+- **System Prompt Presets** - Built-in presets for TTS/Voice, TTS/Voice (Fish Speech) with `[tag]` instructions, and Sound Design/SFX workflows, or write your own
 - **Model Auto-Download** - Download Qwen3-4B (~4.8GB) or Qwen3-8B (~8.5GB) GGUF models directly from HuggingFace
 - **Custom Models** - Drop any `.gguf` file into `models/llama/` to use your own models
 - **Automatic Server Management** - llama.cpp server starts/stops automatically, cleaned up on exit or Clear VRAM
@@ -491,6 +493,7 @@ Each tab lets you choose between model sizes:
 | **LuxTTS** | Large | Voice cloning with speaker encoder |
 | **VibeVoice-TTS** | Small, Large | Voice cloning & Long-form multi-speaker (up to 90 min) |
 | **Chatterbox** | TTS, Multilingual | Speech-to-speech voice conversion |
+| **Fish Speech S2 Pro** | 4B | Voice cloning with inline expression tags |
 | **VibeVoice-ASR** | Large | Audio transcription |
 | **Whisper** | Medium | Audio transcription |
 | **MMAudio** | Medium, Large v2 | Sound effects generation (text & video to audio) |
 
@@ -38,6 +38,10 @@ voice_design_model = tts_manager.get_qwen3_voice_design()
 custom_voice_model = tts_manager.get_qwen3_custom_voice(size="0.6B")
 vibevoice_model = tts_manager.get_vibevoice_tts(size="1.5B")
 
+# Fish Speech S2 Pro (4B model, ~24GB VRAM, 80+ languages)
+# Vendored in modules/fish_speech/ — Fish Audio Research License
+fish_speech = tts_manager.get_fish_speech()
+
 # Unload all when done
 tts_manager.unload_all()
 
@@ -98,6 +102,7 @@ TTS Model Manager:
 - `get_qwen3_voice_design()` - Load Qwen3 VoiceDesign
 - `get_qwen3_custom_voice(size)` - Load Qwen3 CustomVoice
 - `get_vibevoice_tts(size)` - Load VibeVoice TTS
+- `get_fish_speech()` - Load Fish Speech S2 Pro
 - `unload_all()` - Free all VRAM
 - `compute_sample_hash(wav_path, ref_text)` - Hash sample
 - `load_voice_prompt(sample_name, hash, model_size)` - Load cached prompt
@@ -197,6 +202,30 @@ User configuration (`config.json`) controls:
 ✅ **Optimized** - Smart VRAM management  
 ✅ **Configurable** - User control over model behavior  
 
+## Fish Speech S2 Pro
+
+Fish Speech S2 Pro is a 4B-parameter TTS model that supports 80+ languages and inline emotion tags.
+
+**Details:**
+- Model: `fishaudio/s2-pro` on HuggingFace (~24GB VRAM)
+- Architecture: DualARTransformer + custom DAC audio codec
+- License: Fish Audio Research License (free for non-commercial use; commercial use requires a license)
+- Vendored source: `modules/fish_speech/` (inference-only subset)
+- Dependencies: `descript-audio-codec`, `hydra-core`, `loguru`
+
+**Emotion Tags:**
+Fish Speech uses inline tags in text: `[happy]Hello![/happy]`, `[sad]Oh no[/sad]`
+
+**Generation Parameters:**
+- `temperature` (0.7-1.0, default 0.8)
+- `top_p` (0.7-0.95, default 0.8)
+- `top_k` (1-100, default 30)
+- `repetition_penalty` (1.0-1.2, default 1.1)
+- `max_new_tokens` (0=auto, up to 4096)
+- `chunk_length` (100-512, default 300)
+
+**Note:** The protobuf version conflict between `descript-audiotools` (pins `<3.20`) and `onnxruntime` (needs `>=4.25.1`) is resolved by force-upgrading protobuf to 5.x+ after installation. Both packages work at runtime despite the declared constraint.
+
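The generation parameters listed above can be captured in a small validated container. This is a sketch only: the ranges and defaults are copied from the list, but the `FishSpeechParams` class is hypothetical and is not part of the model manager API.

```python
from dataclasses import dataclass

# Ranges copied from the parameter list; the container itself is hypothetical.
_RANGES = {
    "temperature": (0.7, 1.0),
    "top_p": (0.7, 0.95),
    "top_k": (1, 100),
    "repetition_penalty": (1.0, 1.2),
    "max_new_tokens": (0, 4096),
    "chunk_length": (100, 512),
}


@dataclass(frozen=True)
class FishSpeechParams:
    temperature: float = 0.8
    top_p: float = 0.8
    top_k: int = 30
    repetition_penalty: float = 1.1
    max_new_tokens: int = 0  # 0 means "auto"
    chunk_length: int = 300

    def __post_init__(self) -> None:
        # Reject any value outside its documented range.
        for name, (low, high) in _RANGES.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")
```

With this shape, `FishSpeechParams()` yields the documented defaults, while `FishSpeechParams(temperature=1.5)` raises a `ValueError` before anything reaches the model.
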
 ## Future Improvements
 
 - [ ] Model preloading for faster startup
 
@@ -1,5 +1,38 @@
 # Version History
 
+## April 5, 2026
+
+#### Version 1.12.5 - Fish Speech Compilation Speed-Up & Stability
+
+**Triton/Inductor Compilation (thanks to [Mixomo](https://github.com/Mixomo))**
+- **Compiled Kernel Caching** - Fish Speech now uses Triton/Inductor compilation with persistent caching in `models/.cache`, dramatically speeding up repeat generations
+- **Windows Compatibility Patches** - Automatic runtime patching of `torch._inductor` to fix `cluster_dims` AttributeError crashes with triton-windows
+- **First-Run Cache Warning** - Clear console message displayed just before the first generation to inform users that kernel compilation is a one-time process that may take several minutes
+- **Persistent Cache Detection** - Startup reports compiled kernel count when cache exists; skips warning on subsequent launches
+
+**Dependencies**
+- **triton-windows** - Added `triton-windows` and `ninja` packages for GPU kernel compilation on Windows
+
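The persistent cache detection described above could look roughly like this. It is a sketch under stated assumptions: the helper names are hypothetical, the cache directory is passed in by the caller, and every regular file under it is counted as a compiled kernel artifact, which may not match how the real startup check counts kernels.

```python
from pathlib import Path


def count_cached_kernels(cache_dir: Path) -> int:
    """Count regular files under cache_dir; return 0 if it does not exist."""
    if not cache_dir.is_dir():
        return 0
    # Assumption: every file in the cache tree is a compiled kernel artifact.
    return sum(1 for path in cache_dir.rglob("*") if path.is_file())


def startup_message(cache_dir: Path) -> str:
    """Report the cache state, warning only when nothing is cached yet."""
    count = count_cached_kernels(cache_dir)
    if count == 0:
        return (
            "No compiled kernels cached yet: the first generation will "
            "compile them, which may take several minutes."
        )
    return f"Found {count} cached kernel file(s); skipping the first-run warning."
```
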
+## April 3, 2026
+
+#### Version 1.12.0 - Fish Speech S2 Pro Integration
+
+**Fish Speech S2 Pro (6th TTS Engine)**
+- **New TTS Engine** - Integrated [Fish Speech S2 Pro](https://huggingface.co/fishaudio/s2-pro) (4B param DualAR Transformer + custom DAC codec) as a voice cloning engine
+- **Inline Expression Tags** - Fish Speech supports `[tag]` syntax for fine-grained control over speech delivery — embed tags like `[whisper]`, `[laughing]`, `[excited]`, `[pause]` directly in text
+- **15,000+ Tags** - Supports both predefined tags and free-form natural-language descriptions like `[whisper in small voice]`, `[professional broadcast tone]`, or `[pitch up]`
+- **Full Parameter Controls** - Temperature, Top-P, Top-K, Repetition Penalty, Max New Tokens, and Chunk Length sliders with ranges aligned to the official Fish Speech webui
+- **Automatic Tag Stripping** - When using other TTS engines, Fish Speech `[tags]` are automatically stripped from text so other models don't read them aloud
+- **Split by Paragraph** - Fully compatible with the existing Split by Paragraph feature for batch generation
+- **Auto-Download** - Model (~8 GB) downloads automatically from HuggingFace on first use with progress indication
+
+**Prompt Manager**
+- **Fish Speech LLM Preset** - New "TTS / Voice (Fish Speech)" system prompt preset instructs the LLM to generate text with inline `[tag]` expression markers, including a full tag reference and usage examples
+
+**Dependency Management**
+- **Clean Pip Resolution** - Resolved protobuf and transformers version conflicts by installing `descript-audio-codec` and `qwen-asr` with a `--no-deps` pattern
+- **Setup Scripts Updated** - All three setup scripts (Windows, Linux, macOS) updated with Fish Speech codec dependencies and conflict-free installs
+
 ## March 28, 2026
 
 #### Version 1.11.0 - VibeVoice 7B Training, Split by Paragraph & Bug Fixes