Feature request: Update the pipeline for AudioLDM 2 so that 'transcript' can be consumed and text to speech created #4923

Open
@filip-michalsky

Description

Is your feature request related to a problem? Please describe.

The current AudioLDM 2 pipeline does not accept a "transcript" field.
As a result, it does not create phonemes and therefore does not support text-to-speech generation.

https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2Pipeline

Currently, only text-to-music and text-to-audio are supported. The latent diffusion model is not guided to create phonemes, as it is in the original implementation with these two checkpoints:

  • audioldm2-speech-ljspeech
  • audioldm2-speech-gigaspeech

Here:
https://github.com/haoheliu/AudioLDM2/blob/main/audioldm2/pipeline.py#L78

and here:
https://github.com/haoheliu/AudioLDM2/blob/main/audioldm2/latent_diffusion/models/ddpm.py#L482

Command line from the original repo:
audioldm2 -t "A female reporter is speaking full of emotion" --transcription "Wish you have a good day"

These two checkpoints natively consume "phoneme" as one of the fields in the batch.

Describe the solution you'd like

Add a "transcription" input parameter that lets the user select one of the two TTS checkpoints above and thereby enables the TTS task.
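To make the request more concrete, here is a rough sketch of the behavior I have in mind. It is not diffusers code and not the original repo's code: `GenerationRequest`, `resolve_checkpoint`, `build_batch`, and the `to_phonemes` callable are hypothetical names made up for illustration, and the "text" batch key is an assumption. Only the two checkpoint names and the "phoneme" batch field come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Checkpoint names as listed in this issue.
TTS_CHECKPOINTS = ("audioldm2-speech-ljspeech", "audioldm2-speech-gigaspeech")


@dataclass
class GenerationRequest:
    """Hypothetical container for one generation call (not a diffusers class)."""
    prompt: str
    transcription: Optional[str] = None
    checkpoint: Optional[str] = None


def resolve_checkpoint(request: GenerationRequest, default_checkpoint: str) -> str:
    """Return the checkpoint to load; require a TTS one when a transcription is given."""
    if not request.transcription:
        # No transcript: keep the existing text-to-audio / text-to-music behavior.
        return request.checkpoint or default_checkpoint
    if request.checkpoint is None:
        # Assumption: fall back to the first TTS checkpoint listed above.
        return TTS_CHECKPOINTS[0]
    if request.checkpoint not in TTS_CHECKPOINTS:
        raise ValueError(
            f"transcription requires one of {TTS_CHECKPOINTS}, got {request.checkpoint!r}"
        )
    return request.checkpoint


def build_batch(
    request: GenerationRequest,
    to_phonemes: Callable[[str], List[str]],
) -> Dict[str, list]:
    """Build a batch dict; add the "phoneme" field only when a transcription is present."""
    batch: Dict[str, list] = {"text": [request.prompt]}  # "text" key is an assumption
    if request.transcription:
        # to_phonemes stands in for whatever text-to-phoneme step the model expects.
        batch["phoneme"] = [to_phonemes(request.transcription)]
    return batch
```

With the prompt and transcription from the command line above, `resolve_checkpoint` would pick a speech checkpoint and `build_batch` would add a "phoneme" entry; without a transcription, both fall back to the current behavior.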

Describe alternatives you've considered

The original repo implementation, but it is very slow and unoptimized.

Additional context

I believe the existing AudioLDM2 pipeline could be updated to take in the transcript field, update the batch, and load the two additional checkpoints trained on the TTS task. However, I currently don't know enough to assess which parts of the pipeline need to change relative to the original implementation in https://github.com/haoheliu/AudioLDM2/blob/main/audioldm2/latent_diffusion/models/ddpm.py#L482
