
[New Pipeline]: Audio-Journey: Visual+LLM-aided Audio Encodec Diffusion #3826


Open · 1 of 2 tasks
lijuncheng16 opened this issue Jun 19, 2023 · 3 comments

lijuncheng16 commented Jun 19, 2023

Model/Pipeline/Scheduler description

We efficiently trained an audio diffusion model with the aid of Alpaca-augmented audio captions generated from AudioSet labels.
  • website
  • preprint
  • Appendix
  • Implementation
Weights will be released soon!

Open source status

  • [x] The model implementation is available
  • [ ] The model weights are available (only relevant if the addition is not a scheduler)

Provide useful links for the implementation

@jacksonmichaels

No response

lijuncheng16 (Author) commented:

@sanchit-gandhi
We are interested in making a pull request and getting this integrated. What checks do we need to perform before we open a pull request?
Baseline: we will make it a separate audio_journey_pipeline.py, making sure it works without touching existing modules in diffusers.
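
For reference, a minimal sketch of what such a standalone pipeline file might look like, following the standard diffusers `DiffusionPipeline` pattern (subclass, `register_modules`, a `__call__` that runs the denoising loop). The component names (`text_encoder`, `tokenizer`, `unet`, `scheduler`, `vocoder`) and the latent shape are placeholder assumptions for illustration, not the actual Audio-Journey implementation:

```python
import torch
from diffusers import DiffusionPipeline


class AudioJourneyPipeline(DiffusionPipeline):
    def __init__(self, text_encoder, tokenizer, unet, scheduler, vocoder):
        super().__init__()
        # register_modules lets diffusers handle saving/loading of the components
        self.register_modules(
            text_encoder=text_encoder,
            tokenizer=tokenizer,
            unet=unet,
            scheduler=scheduler,
            vocoder=vocoder,
        )

    @torch.no_grad()
    def __call__(self, prompt, num_inference_steps=50):
        # 1. encode the text prompt into conditioning embeddings
        inputs = self.tokenizer(prompt, return_tensors="pt", padding=True)
        text_embeds = self.text_encoder(
            inputs.input_ids.to(self.device)
        ).last_hidden_state

        # 2. start from Gaussian noise in the latent space
        #    (the latent geometry here is an assumed placeholder)
        latents = torch.randn(
            (1, self.unet.config.in_channels, 256, 16), device=self.device
        )

        # 3. iterative denoising driven by the scheduler
        self.scheduler.set_timesteps(num_inference_steps)
        for t in self.scheduler.timesteps:
            noise_pred = self.unet(
                latents, t, encoder_hidden_states=text_embeds
            ).sample
            latents = self.scheduler.step(noise_pred, t, latents).prev_sample

        # 4. decode the denoised latents to a waveform
        return self.vocoder(latents)
```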

sanchit-gandhi (Contributor) commented:

Hey @lijuncheng16 - just to understand better, Audio-Journey is an LDM trained on audio-text data, where the text captions are generated as pseudo-labels from an LLM?

The eval results look super promising (beats AudioLDM by quite some margin) - congrats! Are the samples on the website those obtained from the most performant T5 variant of the model? They seem to be super noisy compared to the equivalent AudioLDM samples.

The first thing to check regarding an integration is the interest from the community around this pipeline - AudioLDM ended up not being super popular (see downloads, which are only a fraction of those for say Stable Diffusion), so we'd want to have some really strong evidence that the community are interested in using this model (e.g. through a super high visibility announcement post). Unfortunately we can't add every new model to the library, since this adds a huge overhead for maintenance. Thus, we've got to focus on the ones deemed to be super in-demand from the community!

If Audio-Journey doesn't fit this category, then a community pipeline may be an excellent way of integrating the pipeline with diffusers without adding code to the main branch. LMK what you think!
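
To illustrate the community-pipeline route: diffusers can load custom pipeline code via the `custom_pipeline` argument of `DiffusionPipeline.from_pretrained`, so no code needs to land on the main branch. The argument is real; the repository id and pipeline name below are hypothetical placeholders:

```python
from diffusers import DiffusionPipeline

# Both identifiers below are hypothetical placeholders for illustration.
pipe = DiffusionPipeline.from_pretrained(
    "lijuncheng16/audio-journey",              # hypothetical checkpoint on the Hub
    custom_pipeline="audio_journey_pipeline",  # hypothetical community pipeline module
)
audio = pipe("a dog barking while rain falls")
```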

lijuncheng16 (Author) commented:

@sanchit-gandhi Yes, that's the direction. We believe that more diverse textual representations, with better attention control, bring better quality and will ultimately benefit the audio research community.
Thank you for the comments! Our samples are from earlier iterations; we are continuing to refine this model and will refresh the demo with much better quality and much better control.
And yes, we understand the community aspect, and we will keep improving and innovating on this pipeline.
We will keep you posted :)
Kudos to you guys for providing super valuable resources for diffusion!
