This repository provides a local HTTP API for voice-conditioned text-to-speech.
It is designed to be easy for AI agents, tool wrappers, and skills to use. The repo now runs a vendored legacy synthesis path locally instead of depending on a live upstream checkout at runtime.
If you are building an agent skill, the shortest useful summary is:
- call `GET /health` to verify the server is ready
- call `GET /api/v1/voices/list` to discover valid `voice_profile` names
- call `POST /api/v1/tts/synthesize` with `text` and `voice_profile`
- save the returned WAV bytes to a file
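That flow fits in a short standard-library client. A minimal sketch, assuming the default local port from the docker-compose setup below; the function names (`get_json`, `save_wav`, `speak`) are illustrative and not part of this repo:

```python
import json
import urllib.request

BASE = "http://localhost:8081"  # assumed local port from docker-compose

def get_json(path: str, token: str = "") -> dict:
    """GET a JSON endpoint, optionally with a bearer token."""
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    req = urllib.request.Request(f"{BASE}{path}", headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def save_wav(data: bytes, path: str) -> str:
    """Write returned WAV bytes to disk and return the saved path."""
    with open(path, "wb") as f:
        f.write(data)
    return path

def speak(text: str, voice_profile: str, token: str, out: str = "output.wav") -> str:
    """Run the whole flow: health check, voice discovery, synthesis, save."""
    if get_json("/health").get("status") != "healthy":
        raise RuntimeError("server not ready")
    profiles = get_json("/api/v1/voices/list", token)["profiles"]
    if voice_profile not in profiles:
        raise ValueError(f"unknown voice {voice_profile!r}; choose from {profiles}")
    req = urllib.request.Request(
        f"{BASE}/api/v1/tts/synthesize",
        data=json.dumps({"text": text, "voice_profile": voice_profile}).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return save_wav(resp.read(), out)
```

With a valid token, `speak("Hello.", "Wayne", token)` returns the path of the saved WAV file.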
This project exists to expose a stable local voice synthesis engine over HTTP so that:
- AI assistants can generate speech without shelling into an older CLI repo
- automation tools can discover available voice profiles dynamically
- local workflows can produce WAV output from a simple authenticated API
The API accepts plain text plus a named voice profile and returns a generated WAV file.
Voice profiles are stored on disk in `voice_profiles/`. Model assets are stored in `weights/`. The synthesis engine itself is local to this repo and exposed through FastAPI.
To synthesize speech, an agent needs:
- `text`: the text to speak
- `voice_profile`: the exact folder name of an available voice profile
- `Authorization` header: `Bearer <token>`
The synth endpoint returns:
- HTTP status: `200 OK`
- response body: `audio/wav`
- suggested filename: `synthesized_speech.wav`
Recommended agent flow:
- Check `GET /health`
- Get a token
- Call `GET /api/v1/voices/list`
- Pick a valid voice
- Call `POST /api/v1/tts/synthesize`
- Save the returned bytes as `.wav`
Simple readiness check.
Example response:
```json
{
  "status": "healthy"
}
```

Returns available voice profile names from the local `voice_profiles/` folder.
Example response:
```json
{
  "profiles": ["Wayne", "House", "Tony_Stark"]
}
```

Generate speech using a named voice profile.
Request body:
```json
{
  "text": "Hello from the voice API.",
  "voice_profile": "Wayne"
}
```

Successful response:

- status: `200`
- content type: `audio/wav`
Error response:
```json
{
  "detail": "error message"
}
```

The API expects a bearer token signed with the configured `SECRET_KEY`.

Local helper:

```
python scripts/generate_token.py
```

That script prints a token you can use as:

```
Authorization: Bearer <token>
```
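For a Python caller, the header can be assembled like this. The assumption that `generate_token.py` prints only the token to stdout should be verified against the script itself; both helper names here are illustrative:

```python
import subprocess

def bearer_header(token: str) -> dict:
    """Build the Authorization header the API expects."""
    return {"Authorization": f"Bearer {token}"}

def token_from_script() -> str:
    """Run the repo's helper; assumes it prints the token on stdout."""
    result = subprocess.run(
        ["python", "scripts/generate_token.py"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```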
From the project root:
```
python scripts/setup_directories.py
python scripts/verify_setup.py
docker compose build
docker compose up -d
```

Health check:

```
curl.exe -s http://localhost:8081/health
```

Included helper scripts:

- `scripts/test_health.bat`
- `scripts/test_list_voices.bat`
- `scripts/test_synthesize.bat`
Examples:
```
scripts\test_health.bat
scripts\test_list_voices.bat
scripts\test_synthesize.bat Wayne "Hello from Wayne."
scripts\test_synthesize.bat Wayne "Hello from Wayne." test_outputs\wayne.wav
```

Each voice profile is a folder under `voice_profiles/`.
Example:
```
voice_profiles/
  Wayne/
    1_Wayne.mp3
    1_Wayne.txt
    samples.txt
    generated/
```
The API reads the first line of `samples.txt`.

Required format:

```
1_Wayne.mp3|It's okay with like a quad though, like my buddy Big T's got a snorkel kit on his and that's pretty punk rock.
```
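A sketch of parsing that format, based only on the `<filename>|<transcript>` shape described here; the helper names are hypothetical, not the repo's actual loader:

```python
from pathlib import Path

def parse_sample_line(line: str) -> tuple[str, str]:
    """Split a samples.txt entry into (audio filename, transcript)."""
    filename, transcript = line.strip().split("|", 1)
    return filename, transcript

def load_reference(profile_dir: str) -> tuple[Path, str]:
    """Read the first line of samples.txt and resolve the audio path."""
    first = Path(profile_dir, "samples.txt").read_text(encoding="utf-8").splitlines()[0]
    filename, transcript = parse_sample_line(first)
    audio = Path(profile_dir, filename)
    if not audio.exists():
        raise FileNotFoundError(f"reference audio missing: {audio}")
    return audio, transcript
```

Splitting with `maxsplit=1` keeps any `|` characters inside the transcript intact.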
Important rules:
- the filename must exist inside the voice folder
- the transcript should match the spoken audio exactly
- the folder name is the `voice_profile` value agents must send
Expected files in `weights/`:

- `final_finetuned_model.pt`
- `model_1200000.pt`
- `F5TTS_Base_vocab.txt`

The service checks for `final_finetuned_model.pt` first, then falls back to `model_1200000.pt`.
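That fallback order can be mirrored in a caller-side pre-flight check. A sketch; `pick_checkpoint` is a hypothetical helper, not the service's actual code:

```python
from pathlib import Path

def pick_checkpoint(weights_dir: str = "weights") -> Path:
    """Mirror the service's order: fine-tuned checkpoint first, then base model."""
    for name in ("final_finetuned_model.pt", "model_1200000.pt"):
        candidate = Path(weights_dir) / name
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f"no checkpoint found in {weights_dir}")
```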
Model weights are intentionally not part of this repository and should not be committed or published with it.
This repo expects you to supply the model assets locally in weights/.
Use the same base model assets referenced by the original F5-TTS project:
- checkpoint: `F5TTS_Base/model_1200000.pt`
- vocab: `F5TTS_Base/vocab.txt`
You can obtain them from the model release referenced by the original F5-TTS distribution:
- Hugging Face model repo: `SWivid/F5-TTS`
After downloading:
- place `model_1200000.pt` in `weights/`
- place `vocab.txt` in `weights/` as `F5TTS_Base_vocab.txt`
- optionally duplicate or rename the checkpoint to `final_finetuned_model.pt` if you want that path to be the primary file the API picks up
```
curl -X POST "http://localhost:8081/api/v1/tts/synthesize" ^
  -H "Authorization: Bearer YOUR_TOKEN" ^
  -H "Content-Type: application/json" ^
  -d "{\"text\":\"Hello from the API.\",\"voice_profile\":\"Wayne\"}" ^
  --output output.wav
```

```powershell
$token = "YOUR_TOKEN"
$body = @{
    text = "Hello from the API."
    voice_profile = "Wayne"
} | ConvertTo-Json -Compress
Invoke-WebRequest `
    -Uri "http://localhost:8081/api/v1/tts/synthesize" `
    -Method Post `
    -Headers @{ Authorization = "Bearer $token" } `
    -ContentType "application/json" `
    -Body $body `
    -OutFile "output.wav"
```

This section is intentionally written for people building agent skills, MCP wrappers, or tool-call adapters.
A good tool wrapper should:
- validate server health before synthesis
- fetch available voices instead of hardcoding them
- surface voice names exactly as returned by the API
- store audio output to disk and return the saved path
- return meaningful errors when auth fails or the voice name is missing
Minimal arguments for an agent tool:
{
"text": "string",
"voice_profile": "string",
"output_path": "optional string"
}Optional skill behavior:
- auto-list voices when the requested one is missing
- default `output_path` to a temp WAV path
- preserve the exact text instead of rewriting it unless asked
- Call `/health`
- Get or refresh the token
- Call `/api/v1/voices/list`
- Match the requested voice name
- Call `/api/v1/tts/synthesize`
- Save the WAV
- Return the file path and selected voice
If `voice_profile` is invalid:

- call `/api/v1/voices/list`
- show the valid choices
If synthesis returns 500:
- surface the error detail
- keep the original request text and voice name in the error context
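One way to satisfy both points is to package failures with the original request attached; `error_context` is a hypothetical helper, not repo code:

```python
def error_context(status: int, detail: str, text: str, voice_profile: str) -> dict:
    """Preserve the original request alongside the API's error detail."""
    return {
        "status": status,
        "error": detail,
        "request": {"text": text, "voice_profile": voice_profile},
    }
```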
If you are creating a skill for an AI model, a prompt seed like this works well:
```
Use the local Legacy Voice API.
Always check /health first.
Discover voices from /api/v1/voices/list before choosing a voice_profile.
When synthesizing, send exact user text unless the user asked for rewriting.
Save returned WAV bytes to a file and report the final file path.
```
```
app/
  api/
  core/
  services/
  model/
scripts/
test_outputs/
voice_profiles/
weights/
docker-compose.yml
Dockerfile
requirements.txt
```
Important implementation files:
- `app/services/tts_service.py`
- `app/services/legacy_f5_infer.py`
- `app/api/routes/voices.py`
- `app/api/routes/tts.py`
This repo intentionally prioritizes reproducing a known-good local synthesis path over tracking newer upstream behavior.
If you change synthesis logic:
```
docker compose build
docker compose up -d --force-recreate
docker compose logs --tail 200 api
```

This repo is a local API layer plus vendored legacy synthesis logic. You are responsible for ensuring your use of the model assets and voice material is appropriate for your environment and use case.