Stable Audio integration #8716


Merged: 79 commits, Jul 30, 2024
Commits
6151db5
WIP modeling code and pipeline
ylacombe Jun 26, 2024
656561b
add custom attention processor + custom activation + add to init
ylacombe Jul 1, 2024
819d746
correct ProjectionModel forward
ylacombe Jul 2, 2024
8a1a9d8
add stable audio to __init__
ylacombe Jul 9, 2024
960339d
add autoencoder and update pipeline and modeling code
ylacombe Jul 9, 2024
51c838f
add half Rope
ylacombe Jul 9, 2024
87f1e26
add partial rotary v2
ylacombe Jul 9, 2024
2f2bb8a
add temporary modifs to scheduler
ylacombe Jul 9, 2024
dc3f0eb
add EDM DPM Solver
ylacombe Jul 10, 2024
07fc3c3
remove TODOs
ylacombe Jul 10, 2024
b49a3d5
clean GLU
ylacombe Jul 10, 2024
d1b3e20
remove att.group_norm to attn processor
ylacombe Jul 10, 2024
23be1a3
revert back src/diffusers/schedulers/scheduling_dpmsolver_multistep.py
ylacombe Jul 10, 2024
9d32408
refactor GLU -> SwiGLU
ylacombe Jul 15, 2024
661d4f1
Merge branch 'main' into add-stable-audio
ylacombe Jul 15, 2024
3689af0
remove redundant args
ylacombe Jul 15, 2024
282e478
add channel multiples in autoencoder docstrings
ylacombe Jul 15, 2024
c9fef25
changes in docstrings and copyright headers
ylacombe Jul 15, 2024
e51ffb2
clean pipeline
ylacombe Jul 15, 2024
ab6824c
further cleaning
ylacombe Jul 15, 2024
eeb19fe
remove peft and lora and fromoriginalmodel
ylacombe Jul 15, 2024
a43dfc5
Delete src/diffusers/pipelines/stable_audio/diffusers.code-workspace
ylacombe Jul 15, 2024
e7185e5
make style
ylacombe Jul 15, 2024
3c6715e
dummy models
ylacombe Jul 15, 2024
14fa2bf
fix copied from
ylacombe Jul 15, 2024
21d0171
add fast oobleck tests
ylacombe Jul 15, 2024
9cc7c02
add brownian tree
ylacombe Jul 16, 2024
c5eeafe
oobleck autoencoder slow tests
ylacombe Jul 17, 2024
0a2d065
remove TODO
ylacombe Jul 17, 2024
29e794b
fast stable audio pipeline tests
ylacombe Jul 17, 2024
1bad287
add slow tests
ylacombe Jul 17, 2024
cf15409
make style
ylacombe Jul 17, 2024
dec61b3
add first version of docs
ylacombe Jul 17, 2024
1961cc9
wrap is_torchsde_available to the scheduler
ylacombe Jul 18, 2024
3c7df74
fix slow test
ylacombe Jul 18, 2024
92392fd
test with input waveform
ylacombe Jul 18, 2024
d826f0f
add input waveform
ylacombe Jul 18, 2024
94c2a25
remove some todos
ylacombe Jul 18, 2024
ad8660e
create stableaudio gaussian projection + make style
ylacombe Jul 18, 2024
55b2a14
add pipeline to toctree
ylacombe Jul 18, 2024
42a05c5
fix copied from
ylacombe Jul 18, 2024
8919ba0
Merge branch 'huggingface:main' into add-stable-audio
ylacombe Jul 18, 2024
2df8e41
make quality
ylacombe Jul 18, 2024
68a5b56
refactor timestep_features->time_proj
ylacombe Jul 24, 2024
a81f46d
refactor joint_attention_kwargs->cross_attention_kwargs
ylacombe Jul 24, 2024
8e910d3
remove forward_chunk
ylacombe Jul 24, 2024
406f02a
move StableAudioDitModel to transformers folder
ylacombe Jul 24, 2024
3a1dddb
correct convert + remove partial rotary embed
ylacombe Jul 24, 2024
c44d0a4
apply suggestions from yiyixuxu -> removing attn.kv_heads
ylacombe Jul 24, 2024
e5859f1
remove temb
ylacombe Jul 24, 2024
d35451d
remove cross_attention_kwargs
ylacombe Jul 24, 2024
76debd5
further removal of cross_attention_kwargs
ylacombe Jul 24, 2024
acde6d5
remove text encoder autocast to fp16
ylacombe Jul 24, 2024
566972d
continue removing autocast
ylacombe Jul 24, 2024
f187d65
make style
ylacombe Jul 24, 2024
af4f2ab
Merge branch 'huggingface:main' into add-stable-audio
ylacombe Jul 24, 2024
8aa2e11
refactor how text and audio are embedded
ylacombe Jul 24, 2024
58ca32c
add paper
ylacombe Jul 24, 2024
a4b6930
update example code
ylacombe Jul 24, 2024
c0873dc
make style
ylacombe Jul 24, 2024
bc36933
unify projection model forward + fix device placement
ylacombe Jul 25, 2024
f318e15
make style
ylacombe Jul 25, 2024
8382156
remove fuse qkv
ylacombe Jul 25, 2024
6ff9cf6
Merge branch 'huggingface:main' into add-stable-audio
ylacombe Jul 25, 2024
f91b084
apply suggestions from review
ylacombe Jul 25, 2024
29dc552
Update src/diffusers/pipelines/stable_audio/pipeline_stable_audio.py
ylacombe Jul 26, 2024
ff62035
make style
ylacombe Jul 26, 2024
d61a1a9
smaller models in fast tests
ylacombe Jul 26, 2024
f1c9585
pass sequential offloading fast tests
ylacombe Jul 26, 2024
8893373
add docs for vae and autoencoder
ylacombe Jul 26, 2024
0b93804
Merge branch 'main' into add-stable-audio
ylacombe Jul 26, 2024
264dd6d
make style and update example
ylacombe Jul 26, 2024
0277c7f
remove useless import
ylacombe Jul 29, 2024
1565d8a
add cosine scheduler
ylacombe Jul 29, 2024
d820e68
dummy classes
ylacombe Jul 29, 2024
fea9f8e
cosine scheduler docs
ylacombe Jul 29, 2024
8abdb61
Merge branch 'main' into add-stable-audio
ylacombe Jul 29, 2024
81dedd9
better description of scheduler
ylacombe Jul 30, 2024
6d5d663
Merge branch 'huggingface:main' into add-stable-audio
ylacombe Jul 30, 2024
8 changes: 8 additions & 0 deletions docs/source/en/_toctree.yml
@@ -239,6 +239,8 @@
title: AsymmetricAutoencoderKL
- local: api/models/autoencoder_tiny
title: Tiny AutoEncoder
- local: api/models/autoencoder_oobleck
title: Oobleck AutoEncoder
- local: api/models/consistency_decoder_vae
title: ConsistencyDecoderVAE
- local: api/models/transformer2d
@@ -259,6 +261,8 @@
title: TransformerTemporalModel
- local: api/models/sd3_transformer2d
title: SD3Transformer2DModel
- local: api/models/stable_audio_transformer
title: StableAudioDiTModel
- local: api/models/prior_transformer
title: PriorTransformer
- local: api/models/controlnet
@@ -362,6 +366,8 @@
title: Semantic Guidance
- local: api/pipelines/shap_e
title: Shap-E
- local: api/pipelines/stable_audio
title: Stable Audio
- local: api/pipelines/stable_cascade
title: Stable Cascade
- sections:
@@ -425,6 +431,8 @@
title: CMStochasticIterativeScheduler
- local: api/schedulers/consistency_decoder
title: ConsistencyDecoderScheduler
- local: api/schedulers/cosine_dpm
title: CosineDPMSolverMultistepScheduler
- local: api/schedulers/ddim_inverse
title: DDIMInverseScheduler
- local: api/schedulers/ddim
38 changes: 38 additions & 0 deletions docs/source/en/api/models/autoencoder_oobleck.md
@@ -0,0 +1,38 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# AutoencoderOobleck

The Oobleck variational autoencoder (VAE) model with KL loss was introduced in [Stability-AI/stable-audio-tools](https://github.com/Stability-AI/stable-audio-tools) and [Stable Audio Open](https://huggingface.co/papers/2407.14358) by Stability AI. The model is used in 🤗 Diffusers to encode audio waveforms into latents and to decode latent representations into audio waveforms.

The abstract from the paper is:

*Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FDopenl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1kHz.*

## AutoencoderOobleck

[[autodoc]] AutoencoderOobleck
- decode
- encode
- all

## OobleckDecoderOutput

[[autodoc]] models.autoencoders.autoencoder_oobleck.OobleckDecoderOutput

## AutoencoderOobleckOutput

[[autodoc]] models.autoencoders.autoencoder_oobleck.AutoencoderOobleckOutput
19 changes: 19 additions & 0 deletions docs/source/en/api/models/stable_audio_transformer.md
@@ -0,0 +1,19 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# StableAudioDiTModel

A Transformer model for audio waveforms from [Stable Audio Open](https://huggingface.co/papers/2407.14358).

## StableAudioDiTModel

[[autodoc]] StableAudioDiTModel
1 change: 1 addition & 0 deletions docs/source/en/api/pipelines/overview.md
@@ -71,6 +71,7 @@ The table below lists all the pipelines currently available in 🤗 Diffusers an
| [Semantic Guidance](semantic_stable_diffusion) | text2image |
| [Shap-E](shap_e) | text-to-3D, image-to-3D |
| [Spectrogram Diffusion](spectrogram_diffusion) | |
| [Stable Audio](stable_audio) | text2audio |
| [Stable Diffusion](stable_diffusion/overview) | text2image, image2image, depth2image, inpainting, image variation, latent upscaler, super-resolution |
| [Stable Diffusion Model Editing](model_editing) | model editing |
| [Stable Diffusion XL](stable_diffusion/stable_diffusion_xl) | text2image, image2image, inpainting |
42 changes: 42 additions & 0 deletions docs/source/en/api/pipelines/stable_audio.md
@@ -0,0 +1,42 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Stable Audio

Stable Audio was proposed in [Stable Audio Open](https://arxiv.org/abs/2407.14358) by Zach Evans et al. It takes a text prompt as input and predicts the corresponding sound or music sample.

Stable Audio Open generates variable-length (up to 47s) stereo audio at 44.1kHz from text prompts. It comprises three components: an autoencoder that compresses waveforms into a manageable sequence length, a T5-based text embedding for text conditioning, and a transformer-based diffusion (DiT) model that operates in the latent space of the autoencoder.

Stable Audio is trained on a corpus of around 48k audio recordings, where around 47k are from Freesound and the rest are from the Free Music Archive (FMA). All audio files are licensed under CC0, CC BY, or CC Sampling+. This data is used to train the autoencoder and the DiT.

The abstract of the paper is the following:
*Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FDopenl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1kHz.*

This pipeline was contributed by [Yoach Lacombe](https://huggingface.co/ylacombe). The original codebase can be found at [Stability-AI/stable-audio-tools](https://github.com/Stability-AI/stable-audio-tools).

## Tips

When constructing a prompt, keep in mind:

* Descriptive prompt inputs work best; use adjectives to describe the sound (for example, "high quality" or "clear") and make the prompt context specific where possible (e.g. "melodic techno with a fast beat and synths" works better than "techno").
* Using a *negative prompt* can significantly improve the quality of the generated audio. Try using a negative prompt of "low quality, average quality".

During inference:

* The _quality_ of the generated audio sample can be controlled by the `num_inference_steps` argument; more steps give higher-quality audio at the expense of slower inference.
* Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1. Automatic scoring is performed between the generated waveforms and the prompt text, and the audios are ranked from best to worst accordingly.


## StableAudioPipeline
[[autodoc]] StableAudioPipeline
- all
- __call__
24 changes: 24 additions & 0 deletions docs/source/en/api/schedulers/cosine_dpm.md
@@ -0,0 +1,24 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# CosineDPMSolverMultistepScheduler

The [`CosineDPMSolverMultistepScheduler`] is a variant of [`DPMSolverMultistepScheduler`] with a cosine schedule, proposed by Nichol and Dhariwal (2021).
It is used in the [Stable Audio Open](https://arxiv.org/abs/2407.14358) paper and the [Stability-AI/stable-audio-tools](https://github.com/Stability-AI/stable-audio-tools) codebase.

This scheduler was contributed by [Yoach Lacombe](https://huggingface.co/ylacombe).

## CosineDPMSolverMultistepScheduler
[[autodoc]] CosineDPMSolverMultistepScheduler

## SchedulerOutput
[[autodoc]] schedulers.scheduling_utils.SchedulerOutput