helix4u/f5tts-api
Legacy Voice API

This repository provides a local HTTP API for voice-conditioned text-to-speech.

It is designed to be easy for AI agents, tool wrappers, and skills to use. The repo now runs a vendored legacy synthesis path locally instead of depending on a live upstream checkout at runtime.

If you are building an agent skill, the shortest useful summary is:

  • call GET /health to verify the server is ready
  • call GET /api/v1/voices/list to discover valid voice_profile names
  • call POST /api/v1/tts/synthesize with text and voice_profile
  • save the returned WAV bytes to a file

Purpose

This project exists to expose a stable local voice synthesis engine over HTTP so that:

  • AI assistants can generate speech without shelling into an older CLI repo
  • automation tools can discover available voice profiles dynamically
  • local workflows can produce WAV output from a simple authenticated API

High-Level Behavior

The API accepts plain text plus a named voice profile and returns a generated WAV file.

Voice profiles are stored on disk in voice_profiles/. Model assets are stored in weights/. The synthesis engine itself is local to this repo and exposed through FastAPI.

Agent-Friendly Summary

Input contract

To synthesize speech, an agent needs:

  • text: the text to speak
  • voice_profile: the exact folder name of an available voice profile
  • Authorization header: Bearer <token>

Output contract

The synth endpoint returns:

  • HTTP 200 OK
  • response body as audio/wav
  • suggested filename synthesized_speech.wav

Discovery flow for tools or skills

Recommended agent flow:

  1. Check GET /health
  2. Get token
  3. Call GET /api/v1/voices/list
  4. Pick a valid voice
  5. Call POST /api/v1/tts/synthesize
  6. Save returned bytes as .wav

Endpoints

GET /health

Simple readiness check.

Example response:

{
  "status": "healthy"
}

GET /api/v1/voices/list

Returns available voice profile names from the local voice_profiles/ folder.

Example response:

{
  "profiles": ["Wayne", "House", "Tony_Stark"]
}

POST /api/v1/tts/synthesize

Generate speech using a named voice profile.

Request body:

{
  "text": "Hello from the voice API.",
  "voice_profile": "Wayne"
}

Successful response:

  • status: 200
  • content type: audio/wav

Error response:

{
  "detail": "error message"
}

Authentication

The API expects a bearer token signed with the configured SECRET_KEY.

Local helper:

python scripts/generate_token.py

That script prints a token you can use as:

Authorization: Bearer <token>

Quick Start

From the project root:

python scripts/setup_directories.py
python scripts/verify_setup.py
docker compose build
docker compose up -d

Health check:

curl.exe -s http://localhost:8081/health

Windows Test Scripts

Included helper scripts:

  • scripts/test_health.bat
  • scripts/test_list_voices.bat
  • scripts/test_synthesize.bat

Examples:

scripts\test_health.bat
scripts\test_list_voices.bat
scripts\test_synthesize.bat Wayne "Hello from Wayne."
scripts\test_synthesize.bat Wayne "Hello from Wayne." test_outputs\wayne.wav

Voice Profile Format

Each voice profile is a folder under voice_profiles/.

Example:

voice_profiles/
  Wayne/
    1_Wayne.mp3
    1_Wayne.txt
    samples.txt
    generated/

The API reads the first line of samples.txt.

Required format:

1_Wayne.mp3|It's okay with like a quad though, like my buddy Big T's got a snorkel kit on his and that's pretty punk rock.

Important rules:

  • the filename must exist inside the voice folder
  • the transcript should match the spoken audio exactly
  • the folder name is the voice_profile value agents must send
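Validating the first line of samples.txt against the rules above can be sketched as follows (the helper names are mine, not the service's):

```python
from pathlib import Path


def parse_sample_line(line: str) -> tuple[str, str]:
    # Expected format: "<audio filename>|<exact transcript of the audio>"
    filename, sep, transcript = line.strip().partition("|")
    if not sep or not filename or not transcript:
        raise ValueError("expected '<filename>|<transcript>'")
    return filename, transcript


def load_reference_sample(voice_dir: Path) -> tuple[Path, str]:
    # The API reads only the first line of samples.txt.
    first_line = (voice_dir / "samples.txt").read_text(encoding="utf-8").splitlines()[0]
    filename, transcript = parse_sample_line(first_line)
    audio_path = voice_dir / filename
    if not audio_path.exists():
        raise FileNotFoundError(f"{filename} not found in {voice_dir}")
    return audio_path, transcript
```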

Model Assets

Expected files in weights/:

  • final_finetuned_model.pt
  • model_1200000.pt
  • F5TTS_Base_vocab.txt

The service checks for final_finetuned_model.pt first, then falls back to model_1200000.pt.
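That fallback order can be sketched as a small resolver (the function name is mine; only the preference order comes from this README):

```python
from pathlib import Path

# Checkpoints in preference order, per this README:
# final_finetuned_model.pt first, then model_1200000.pt.
CHECKPOINT_PREFERENCE = ["final_finetuned_model.pt", "model_1200000.pt"]


def resolve_checkpoint(weights_dir: Path) -> Path:
    for name in CHECKPOINT_PREFERENCE:
        candidate = weights_dir / name
        if candidate.exists():
            return candidate
    raise FileNotFoundError(
        f"no checkpoint found in {weights_dir}; expected one of {CHECKPOINT_PREFERENCE}"
    )
```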

Important

Model weights are intentionally not part of this repository and should not be committed or published with it.

This repo expects you to supply the model assets locally in weights/.

Where to get the weights

Use the same base model assets referenced by the original F5-TTS project:

  • checkpoint: F5TTS_Base/model_1200000.pt
  • vocab: F5TTS_Base/vocab.txt

You can obtain them from the model release referenced by the original F5-TTS distribution:

  • Hugging Face model repo: SWivid/F5-TTS

After downloading:

  • place model_1200000.pt in weights/
  • place vocab.txt in weights/ as F5TTS_Base_vocab.txt
  • optionally duplicate or rename the checkpoint to final_finetuned_model.pt if you want that path to be the primary file the API picks up

Example Requests

Curl

curl -X POST "http://localhost:8081/api/v1/tts/synthesize" ^
  -H "Authorization: Bearer YOUR_TOKEN" ^
  -H "Content-Type: application/json" ^
  -d "{\"text\":\"Hello from the API.\",\"voice_profile\":\"Wayne\"}" ^
  --output output.wav

PowerShell

$token = "YOUR_TOKEN"
$body = @{
  text = "Hello from the API."
  voice_profile = "Wayne"
} | ConvertTo-Json -Compress

Invoke-WebRequest `
  -Uri "http://localhost:8081/api/v1/tts/synthesize" `
  -Method Post `
  -Headers @{ Authorization = "Bearer $token" } `
  -ContentType "application/json" `
  -Body $body `
  -OutFile "output.wav"

Skill / Tool Builder Notes

This section is intentionally written for people building agent skills, MCP wrappers, or tool-call adapters.

Recommended tool behavior

A good tool wrapper should:

  • validate server health before synthesis
  • fetch available voices instead of hardcoding them
  • surface voice names exactly as returned by the API
  • store audio output to disk and return the saved path
  • return meaningful errors when auth fails or the voice name is missing

Suggested tool schema

Minimal arguments for an agent tool:

{
  "text": "string",
  "voice_profile": "string",
  "output_path": "optional string"
}

Optional skill behavior:

  • auto-list voices when the requested one is missing
  • default output_path to a temp wav path
  • preserve exact text instead of rewriting it unless asked
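Filling in the schema's defaults might look like this. Everything here is an illustration of the suggested behavior, not part of the API; the temp-path default matches the "default output_path to a temp wav path" suggestion above.

```python
import os
import tempfile


def normalize_tool_args(args: dict) -> dict:
    # Required arguments per the minimal schema above.
    for key in ("text", "voice_profile"):
        if not args.get(key):
            raise ValueError(f"missing required argument: {key}")
    out = dict(args)
    if not out.get("output_path"):
        # Default to a temporary .wav path the tool can report back.
        fd, path = tempfile.mkstemp(suffix=".wav")
        os.close(fd)
        out["output_path"] = path
    return out
```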

Suggested skill workflow

  1. Call /health
  2. Get or refresh token
  3. Call /api/v1/voices/list
  4. Match requested voice name
  5. Call /api/v1/tts/synthesize
  6. Save WAV
  7. Return file path and selected voice

Suggested failure handling

If voice_profile is invalid:

  • call /api/v1/voices/list
  • show the valid choices

If synthesis returns 500:

  • surface the error detail
  • keep the original request text and voice name in the error context
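Both failure branches can be sketched as small helpers (names are mine): one surfaces the valid voice choices, the other keeps the original request text and voice name in the error context.

```python
from typing import Callable


def require_valid_voice(requested: str, available: list[str]) -> str:
    # Return the exact profile name the API reported, or fail with
    # the valid choices so the caller can retry.
    for name in available:
        if name.lower() == requested.lower():
            return name
    raise ValueError(
        f"unknown voice_profile {requested!r}; valid choices: {sorted(available)}"
    )


def synthesize_with_context(synth: Callable, text: str, voice_profile: str):
    # Wrap a synthesis callable so failures (e.g. a 500 with an error
    # detail) keep the original request in the raised error.
    try:
        return synth(text, voice_profile)
    except Exception as exc:
        raise RuntimeError(
            f"synthesis failed for voice_profile={voice_profile!r}, "
            f"text={text!r}: {exc}"
        ) from exc
```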

Suggested Skill Prompt Snippet

If you are creating a skill for an AI model, a prompt seed like this works well:

Use the local Legacy Voice API.
Always check /health first.
Discover voices from /api/v1/voices/list before choosing a voice_profile.
When synthesizing, send exact user text unless the user asked for rewriting.
Save returned WAV bytes to a file and report the final file path.

Repo Structure

app/
  api/
  core/
  services/
model/
scripts/
test_outputs/
voice_profiles/
weights/
docker-compose.yml
Dockerfile
requirements.txt

Important implementation files:

  • app/services/tts_service.py
  • app/services/legacy_f5_infer.py
  • app/api/routes/voices.py
  • app/api/routes/tts.py

Development Notes

This repo intentionally prioritizes reproducing a known-good local synthesis path over tracking newer upstream behavior.

If you change synthesis logic:

docker compose build
docker compose up -d --force-recreate
docker compose logs --tail 200 api

License / Model Responsibility

This repo is a local API layer plus vendored legacy synthesis logic. You are responsible for ensuring your use of the model assets and voice material is appropriate for your environment and use case.
