
[New Pipeline]: Audio-Journey: Visual+LLM-aided Audio Encodec Diffusion #3826


Open · 1 of 2 tasks
lijuncheng16 opened this issue Jun 19, 2023 · 3 comments

lijuncheng16 commented Jun 19, 2023

Model/Pipeline/Scheduler description

We efficiently trained an audio diffusion model with the aid of Alpaca-augmented audio captions generated from AudioSet labels.
  • website
  • preprint
  • Appendix
  • Implementation
Weights will be released soon!

Open source status

  • [x] The model implementation is available
  • [ ] The model weights are available (only relevant if the addition is not a scheduler)

Provide useful links for the implementation

@jacksonmichaels

No response

lijuncheng16 (Author) commented:

@sanchit-gandhi
We are interested in making a pull request and getting this integrated. What checks do we need to perform before we open a pull request?
Baseline: we will make it a separate audio_journey_pipeline.py, making sure it works without touching existing modules in diffusers.
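
For reference, a minimal sketch of what such a standalone pipeline file might look like, following the standard diffusers `DiffusionPipeline` pattern (subclass, `register_modules`, a `__call__` that runs the denoising loop). The component names (`text_encoder`, `tokenizer`, `unet`, `scheduler`, `vocoder`) and the latent shape are placeholder assumptions for illustration, not the actual Audio-Journey implementation:

```python
import torch
from diffusers import DiffusionPipeline


class AudioJourneyPipeline(DiffusionPipeline):
    def __init__(self, text_encoder, tokenizer, unet, scheduler, vocoder):
        super().__init__()
        # register_modules lets diffusers handle saving/loading of the components
        self.register_modules(
            text_encoder=text_encoder,
            tokenizer=tokenizer,
            unet=unet,
            scheduler=scheduler,
            vocoder=vocoder,
        )

    @torch.no_grad()
    def __call__(self, prompt, num_inference_steps=50):
        # 1. encode the text prompt into conditioning embeddings
        inputs = self.tokenizer(prompt, return_tensors="pt", padding=True)
        text_embeds = self.text_encoder(
            inputs.input_ids.to(self.device)
        ).last_hidden_state

        # 2. start from Gaussian noise in the latent space
        #    (the latent geometry here is an assumed placeholder)
        latents = torch.randn(
            (1, self.unet.config.in_channels, 256, 16), device=self.device
        )

        # 3. iterative denoising driven by the scheduler
        self.scheduler.set_timesteps(num_inference_steps)
        for t in self.scheduler.timesteps:
            noise_pred = self.unet(
                latents, t, encoder_hidden_states=text_embeds
            ).sample
            latents = self.scheduler.step(noise_pred, t, latents).prev_sample

        # 4. decode the denoised latents to a waveform
        return self.vocoder(latents)
```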

sanchit-gandhi (Contributor) commented:

Hey @lijuncheng16 - just to understand better, Audio-Journey is an LDM trained on audio-text data, where the text captions are generated as pseudo-labels from an LLM?

The eval results look super promising (beats AudioLDM by quite some margin) - congrats! Are the samples on the website those obtained from the most performant T5 variant of the model? They seem to be super noisy compared to the equivalent AudioLDM samples.

The first thing to check regarding an integration is the interest from the community around this pipeline - AudioLDM ended up not being super popular (see downloads, which are only a fraction of those for say Stable Diffusion), so we'd want to have some really strong evidence that the community are interested in using this model (e.g. through a super high visibility announcement post). Unfortunately we can't add every new model to the library, since this adds a huge overhead for maintenance. Thus, we've got to focus on the ones deemed to be super in-demand from the community!

If Audio-Journey doesn't fit this category, then a community pipeline may be an excellent way of integrating the pipeline with diffusers without adding code to the main branch. LMK what you think!
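
To illustrate the community-pipeline route: diffusers can load custom pipeline code via the `custom_pipeline` argument of `DiffusionPipeline.from_pretrained`, so no code needs to land on the main branch. The argument is real; the repository id and pipeline name below are hypothetical placeholders:

```python
from diffusers import DiffusionPipeline

# Both identifiers below are hypothetical placeholders for illustration.
pipe = DiffusionPipeline.from_pretrained(
    "lijuncheng16/audio-journey",              # hypothetical checkpoint on the Hub
    custom_pipeline="audio_journey_pipeline",  # hypothetical community pipeline module
)
audio = pipe("a dog barking while rain falls")
```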

lijuncheng16 (Author) commented:

@sanchit-gandhi Yes, that's the direction. We believe that more diverse textual representations, with better attention control, bring better quality and will ultimately benefit the audio research community.
Thank you for the comments! Our samples are from earlier iterations; we are continuing to refine this model and will refresh the demo with much better quality and much better control.
And yes, we understand the community aspect, and we will keep improving and innovating on this pipeline.
We will keep you posted :)
Kudos to you guys for providing super valuable resources for diffusion!
