Description
Is your feature request related to a problem? Please describe.
The current AudioLDM 2 pipeline does not accept a "transcription" field.
Hence it does not create phonemes, and therefore does not support text-to-speech (TTS) generation.
https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2Pipeline
Currently, only text-to-music and text-to-audio are supported. The latent diffusion model is not guided by phonemes as it is in the original implementation with these two checkpoints:
- audioldm2-speech-ljspeech
- audioldm2-speech-gigaspeech
Here:
https://github.com/haoheliu/AudioLDM2/blob/main/audioldm2/pipeline.py#L78
Command line from the original repo:
`audioldm2 -t "A female reporter is speaking full of emotion" --transcription "Wish you have a good day"`
These two checkpoints natively consume a "phoneme" field as part of the conditioning batch.
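As a rough sketch of what "consuming phonemes in the batch" means, the snippet below turns a transcription into a phoneme-ID field alongside the style prompt. The character-level mapping is a toy stand-in for a real grapheme-to-phoneme front end, and the field names are assumptions, not the checkpoints' actual keys:

```python
# Toy sketch only: the character-to-ID vocab stands in for a real
# grapheme-to-phoneme (G2P) model; "text"/"phoneme" keys are illustrative.

def to_phoneme_ids(transcription: str, vocab: dict, pad_id: int = 0, max_len: int = 32) -> list:
    """Map each non-space character to an ID, then pad/truncate to max_len."""
    ids = [vocab.get(ch, pad_id) for ch in transcription.lower() if not ch.isspace()]
    return (ids + [pad_id] * max_len)[:max_len]

def build_batch(prompt: str, transcription: str, vocab: dict) -> dict:
    """Assemble a conditioning batch carrying both the style prompt
    and the phoneme IDs derived from the transcription."""
    return {
        "text": prompt,
        "phoneme": to_phoneme_ids(transcription, vocab),
    }

vocab = {ch: i + 1 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
batch = build_batch(
    "A female reporter is speaking full of emotion",
    "Wish you have a good day",
    vocab,
)
```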
Describe the solution you'd like
Add the "transcription" input param to allow to choose a TTS model from the two checkpoints above and hence allow for TTS task.
Describe alternatives you've considered
Using the original repo implementation directly, but it is very slow and unoptimized.
Additional context
I believe the already implemented AudioLDM2 pipeline could be updated to take a transcription field, update the conditioning batch, and load the two additional checkpoints trained on the TTS task. However, I currently don't have enough knowledge to assess which parts of the pipeline need to be updated relative to the original implementation in https://github.com/haoheliu/AudioLDM2/blob/main/audioldm2/latent_diffusion/models/ddpm.py#L482