Description
Is your feature request related to a problem? Please describe.
The current AudioLDM 2 pipeline does not accept a "transcription" field.
Hence it does not create phonemes, and therefore does not support text-to-speech (TTS) generation.
https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2Pipeline
Currently, only text-to-music and text-to-audio are supported. The latent diffusion model is not guided by phonemes as it is in the original implementation with these two checkpoints:
- audioldm2-speech-ljspeech
- audioldm2-speech-gigaspeech
Here:
https://github.com/haoheliu/AudioLDM2/blob/main/audioldm2/pipeline.py#L78
Command line from the original repo:
`audioldm2 -t "A female reporter is speaking full of emotion" --transcription "Wish you have a good day"`
These two checkpoints natively consume a "phoneme" field as part of the conditioning batch.
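As a rough sketch of what "consuming phonemes in the batch" means, the snippet below turns a transcription into a phoneme-ID field alongside the style prompt. The character-level mapping is a toy stand-in for a real grapheme-to-phoneme front end, and the field names are assumptions, not the checkpoints' actual keys:

```python
# Toy sketch only: the character-to-ID vocab stands in for a real
# grapheme-to-phoneme (G2P) model; "text"/"phoneme" keys are illustrative.

def to_phoneme_ids(transcription: str, vocab: dict, pad_id: int = 0, max_len: int = 32) -> list:
    """Map each non-space character to an ID, then pad/truncate to max_len."""
    ids = [vocab.get(ch, pad_id) for ch in transcription.lower() if not ch.isspace()]
    return (ids + [pad_id] * max_len)[:max_len]

def build_batch(prompt: str, transcription: str, vocab: dict) -> dict:
    """Assemble a conditioning batch carrying both the style prompt
    and the phoneme IDs derived from the transcription."""
    return {
        "text": prompt,
        "phoneme": to_phoneme_ids(transcription, vocab),
    }

vocab = {ch: i + 1 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
batch = build_batch(
    "A female reporter is speaking full of emotion",
    "Wish you have a good day",
    vocab,
)
```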
Describe the solution you'd like
Add the "transcription" input param to allow to choose a TTS model from the two checkpoints above and hence allow for TTS task.
Describe alternatives you've considered
Using the original repo implementation directly, but it is very slow and unoptimized.
Additional context
I believe the already implemented AudioLDM2 pipeline could be updated to take a transcription field, update the conditioning batch, and load the two additional checkpoints trained on the TTS task. However, I currently don't have enough knowledge to assess which parts of the pipeline need to be updated relative to the original implementation in https://github.com/haoheliu/AudioLDM2/blob/main/audioldm2/latent_diffusion/models/ddpm.py#L482