Stable Audio integration #8716
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
thanks for the PR! Overall, it looks pretty aligned with diffusers' design! Here is my initial feedback:
Hey @yiyixuxu, thanks for the feedback here! I think the main reason for the separate projection model is that
Is this something that we'd want to do in the transformers? IMO, no, but happy to change the way it's implemented!
Thanks for quickly making it ready for reviews.
        seconds_end_hidden_states=seconds_end_hidden_states,
    )

@maybe_allow_in_graph
Nice ❤️
# Copied from diffusers.models.unets.unet_2d_condition.UNet2DConditionModel.fuse_qkv_projections
def fuse_qkv_projections(self):
    """
    Enables fused QKV projections. For self-attention modules, all projection matrices (i.e., query, key, value)
    are fused. For cross-attention modules, key and value projection matrices are fused.

    <Tip warning={true}>

    This API is 🧪 experimental.

    </Tip>
    """
    self.original_attn_processors = None

    for _, attn_processor in self.attn_processors.items():
        if "Added" in str(attn_processor.__class__.__name__):
            raise ValueError("`fuse_qkv_projections()` is not supported for models having added KV projections.")

    self.original_attn_processors = self.attn_processors

    for module in self.modules():
        if isinstance(module, Attention):
            module.fuse_projections(fuse=True)

# Copied from diffusers.models.unets.unet_2d_condition.UNet2DConditionModel.unfuse_qkv_projections
def unfuse_qkv_projections(self):
All of these optimizations can come after we have a basic implementation ready that matches the original outputs. It's easier to review as well. WDYT?
""" | ||
self.vae.disable_slicing() | ||
|
||
def enable_model_cpu_offload(self, gpu_id=0): |
Why do we need to implement it separately here? Can't we just specify a model_cpu_offload_seq like this?
I copied this snippet from another pipeline; I'm removing it and adding model_cpu_offload_seq as you proposed.
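For reference, a minimal sketch of what that could look like — the exact component order here is my assumption, not the final pipeline code:

from diffusers import DiffusionPipeline

class StableAudioPipelineSketch(DiffusionPipeline):
    # Declaring the offload order lets DiffusionPipeline.enable_model_cpu_offload()
    # handle device placement, so no custom implementation is needed.
    # The component order below is assumed for illustration only.
    model_cpu_offload_seq = "text_encoder->projection_model->transformer->vae"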
encoder_hidden_states: torch.FloatTensor = None,
global_hidden_states: torch.FloatTensor = None,
How are these two different?
The decoder here has self-attention and cross-attention layers: encoder_hidden_states is used in the cross-attention layer, whereas global_hidden_states is simply prepended to the hidden_states before being passed to the attention layer.
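A rough sketch of that distinction, with made-up shapes (not the actual modeling code):

import torch

batch, seq_len, dim = 2, 128, 64
hidden_states = torch.randn(batch, seq_len, dim)        # latent audio tokens
global_hidden_states = torch.randn(batch, 1, dim)       # e.g. pooled global conditioning
encoder_hidden_states = torch.randn(batch, 77, dim)     # text conditioning

# global_hidden_states is simply prepended along the sequence axis before self-attention
hidden_states = torch.cat([global_hidden_states, hidden_states], dim=1)  # (batch, 1 + seq_len, dim)

# encoder_hidden_states, by contrast, only enters as the key/value input of cross-attention:
# cross_attn(query=hidden_states, key=encoder_hidden_states, value=encoder_hidden_states)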
@ylacombe IMO, ideally, we do want to move the projection layers into the transformer, but since the original implementation is done this way, let's keep it as-is for now. I can help look into this later. One way I think we could go about this is to make the pipeline do something like:

if self.do_classifier_free_guidance:
    audio_end_in_s = torch.cat([audio_end_in_s, audio_end_in_s], dim=0)
elif ..:
    neg_audio_end_in_s = torch.tensor([0])
    audio_end_in_s = torch.cat([neg_audio_end_in_s, audio_end_in_s], dim=0)

This way, when these arguments reach the transformer, they already contain the CFG information. But I'm just making this up here; I don't know if it would work, so let's not worry about it and continue with your implementation :)
Hey @sayakpaul and @yiyixuxu!
While I figure out how to reconcile schedulers, I've left some comments on implementation choices for reference.
I'll address the current comments once we agree on the scheduler!
Thanks for your help!
scripts/convert_stable_audio.py
# scheduler
scheduler = DPMSolverMultistepScheduler(solver_order=2, algorithm_type="sde-dpmsolver++", use_exponential_sigmas=True)
I still need to find the right scheduler for this!
I had to add a new variant of GEGLU here
kv_heads (`int`, *optional*, defaults to `None`):
    The number of key and value heads to use for multi-head attention. Defaults to `heads`.
    If `kv_heads=heads`, the model will use Multi Head Attention (MHA), if `kv_heads=1` the model will use
    Multi Query Attention (MQA) otherwise GQA is used.
I've also added kv_heads for Grouped-Query Attention (GQA) support; it only works with StableAudioAttnProcessor2_0 for now.
Let me know if that works for you!
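For context, here is roughly what grouped-query attention does with kv_heads — an illustrative sketch, not the actual StableAudioAttnProcessor2_0 code:

import torch
import torch.nn.functional as F

batch, seq, heads, kv_heads, head_dim = 2, 16, 8, 2, 32
query = torch.randn(batch, heads, seq, head_dim)
key = torch.randn(batch, kv_heads, seq, head_dim)
value = torch.randn(batch, kv_heads, seq, head_dim)

# Each group of heads // kv_heads query heads shares one key/value head:
# kv_heads == heads gives MHA, kv_heads == 1 gives MQA, anything in between is GQA.
key = key.repeat_interleave(heads // kv_heads, dim=1)
value = value.repeat_interleave(heads // kv_heads, dim=1)

out = F.scaled_dot_product_attention(query, key, value)  # (batch, heads, seq, head_dim)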
We already have kv_heads:
kv_heads: Optional[int] = None,
Possibly leverage that? StableAudioAttnProcessor2_0 could perhaps be renamed to GroupedQueryAttnProcessor2_0?
Also, could we leverage this one?
class LuminaAttnProcessor2_0:
Oh right, kv_heads was just added 3 days ago in #8652! Contrary to Lumina, we don't use a qk normalization layer, which is something I've never seen before! (I can rename the Stable Audio one to GroupedQueryAttnProcessor2_0 if necessary.)
StableAudioAttnProcessor2_0 should be the way to go. Even with grouped-query attention, you can have different configurations (Lumina has qk norm and an unusual way of applying it, you can have it with or without rotary embeddings, etc.), and there's no need to complicate things by forcing them to use the same attention processor.
Stable Audio uses a partial rotary position embedding and also performs some operations differently
src/diffusers/models/activations.py
class GLU(nn.Module):
    r"""
    A [variant](https://arxiv.org/abs/2002.05202) of the gated linear unit activation function.
    It's similar to `GEGLU` but uses SiLU / Swish instead of GeLU.

    Parameters:
        dim_in (`int`): The number of channels in the input.
        dim_out (`int`): The number of channels in the output.
        act_fn (str): Name of activation function used.
        bias (`bool`, defaults to True): Whether to use a bias in the linear layer.
    """
I'm still using this in the FeedForward, but happy to move everything to the modeling file if @yiyixuxu agrees!
Sure. Since that FeedForward is very focused on GELU and GEGLU, I thought it might be a better option to have its own class.
I think it is very reasonable to use this in FeedForward, no need to move
The feedforward does not have GeLU though.
src/diffusers/models/embeddings.py
Apply partial rotary embeddings (Wang et al. GPT-J) to input tensors using the given frequency tensor. This function applies rotary embeddings
to the given query or key 'x' tensors using the provided frequency tensor 'freqs_cis'. The input tensors are
reshaped as complex numbers, and the frequency tensor is reshaped for broadcasting compatibility. The resulting
tensors contain rotary embeddings and are returned as real tensors.
(nit): Provide some references that make use of partial rotary embeddings?
thanks! looking great!
I did an initial review and left some comments, mainly focused on the changes introduced into non-Stable-Audio files for this review; will do a second round next week!
src/diffusers/models/embeddings.py
else:
    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)  # complex64 # [S, D/2]
return freqs_cis


def apply_partial_rotary_emb(
Let's use apply_rotary_emb to support this; we can do the "apply partial" part inside the attention processor:
- split x into x_to_rotate and x_unrotated before calling apply_rotary_emb
- only pass x_to_rotate to apply_rotary_emb
- on the output of apply_rotary_emb, do out = torch.cat((out, x_unrotated), dim=-1)
There's a small nuance in both methods:
- apply_rotary_emb creates x_real and x_imag by reshaping like this: x.reshape(*x.shape[:-1], -1, 2).unbind(-1)
- apply_partial_rotary_emb reshapes in a different way (the last two dimensions are swapped): x_to_rotate.reshape(*x_to_rotate.shape[:-1], 2, -1).unbind(dim=-2)
The resulting tensors are totally different. Happy to solve this a different way; I could add a boolean to reshape one way or the other, WDYT?
Sounds good, maybe add a use_real_unbind_dim that defaults to -1?
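To make the difference concrete, here is a toy illustration of the two unbind conventions (the use_real_unbind_dim names mirror the suggestion above; this is not the library code):

import torch

x = torch.arange(8.0)

# use_real_unbind_dim=-1 style: interleaved pairs -> [0, 2, 4, 6] and [1, 3, 5, 7]
x_real, x_imag = x.reshape(*x.shape[:-1], -1, 2).unbind(-1)

# use_real_unbind_dim=-2 style: first half / second half -> [0, 1, 2, 3] and [4, 5, 6, 7]
x_real, x_imag = x.reshape(*x.shape[:-1], 2, -1).unbind(-2)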
@@ -107,6 +110,7 @@ def __init__(
    lower_order_final: bool = True,
    euler_at_final: bool = False,
    final_sigmas_type: Optional[str] = "zero",  # "zero", "sigma_min"
    noise_preconditioning_strategy: str = "log",
Let me know what you prefer!
it seems that it's not easy to make it work with the non-EDM version?
None that I've seen!
Thanks for your hard work. I know you have been working quite hard on this. My comments should feel minor. I think we're nearing merge.
if "snake" in new_key: | ||
value = value.unsqueeze(0).unsqueeze(-1) |
🐍
nice param name
@@ -123,6 +123,28 @@ def forward(self, hidden_states, *args, **kwargs):
    return hidden_states * self.gelu(gate)


class SwiGLU(nn.Module):
Finally the OG FF.
elif activation_fn == "swiglu":
    act_fn = SwiGLU(dim, inner_dim, bias=bias)
I think this should be okay as we don't name our FeedForward in a way that indicates that it's restrictive toward SwiGLU.
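For reference, SwiGLU in this style is essentially one linear projection split into a value and a gate, with SiLU applied to the gate — a sketch, not the exact diffusers class:

import torch
import torch.nn as nn

class SwiGLUSketch(nn.Module):
    def __init__(self, dim_in: int, dim_out: int, bias: bool = True):
        super().__init__()
        # one projection produces both the value and the gate halves
        self.proj = nn.Linear(dim_in, dim_out * 2, bias=bias)
        self.activation = nn.SiLU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        hidden_states, gate = self.proj(hidden_states).chunk(2, dim=-1)
        return hidden_states * self.activation(gate)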
kv_heads (`int`, *optional*, defaults to `None`):
    The number of key and value heads to use for multi-head attention. Defaults to `heads`. If
    `kv_heads=heads`, the model will use Multi Head Attention (MHA), if `kv_heads=1` the model will use Multi
    Query Attention (MQA) otherwise GQA is used.
Okay this is a doc-related change. Specifically, you are adding this entry to the doc-string. Thanks for doing that!
return float_embeds


class StableAudioProjectionModel(ModelMixin, ConfigMixin):
@yiyixuxu sorry for the re-iteration, but what are some strong reasons to add it as a ModelMixin and not simply an nn.Module?
I summarized them here.
To re-iterate, CFG is applied on the output of the projection model, in a non-straightforward way. I'd rather keep the CFG logic in the pipeline.
>>> # Peak normalize, clip, convert to int16, and save to file
>>> output = (
...     audio[0]
...     .to(torch.float32)
...     .div(torch.max(torch.abs(audio[0])))
...     .clamp(-1, 1)
...     .mul(32767)
...     .to(torch.int16)
...     .cpu()
... )
>>> torchaudio.save("hammer.wav", output, pipe.vae.sampling_rate)
@yiyixuxu does it not make sense to have a postprocessing util for this stuff? This code block is large enough to warrant such a block IMO.
I've removed that post-processing; it turns out we don't need it.
if output_type == "np": | ||
audio = audio.cpu().float().numpy() | ||
|
||
if not return_dict: | ||
return (audio,) |
We should really post-process the audio here like we do in the other image pipelines.
I'm not entirely sure whether the post-processing done in the code snippet is necessary, or whether it's just a way to prepare the audio for saving with torchaudio. In the former case, we'd add it to a post-processing function; in the latter, we don't need to. Let me verify.
So, if I understand correctly, the post-processing depends on which library you want to save the generated audio with?
Exactly! It turns out we don't need that post-processing if we save with something else; I've updated the example snippet accordingly.
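For example, with soundfile (assuming audio and pipe come from the pipeline call as in the earlier snippet, with audio[0] a float tensor of shape (channels, samples)), saving could look roughly like:

import soundfile as sf

# soundfile accepts float waveforms in (samples, channels) layout,
# so no int16 conversion is needed
output = audio[0].T.float().cpu().numpy()
sf.write("output.wav", output, pipe.vae.sampling_rate)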
Co-authored-by: YiYi Xu <[email protected]>
Thanks! Looking great! I left one comment about making a new scheduler; it should be pretty easy to do, no? Let me know. We can merge this after that!
def apply_partial_rotary_emb(
    self,
    x: torch.Tensor,
    freqs_cis: Tuple[torch.Tensor],
) -> torch.Tensor:
    from .embeddings import apply_rotary_emb

    rot_dim = freqs_cis[0].shape[-1]
    x_to_rotate, x_unrotated = x[..., :rot_dim], x[..., rot_dim:]

    x_rotated = apply_rotary_emb(x_to_rotate, freqs_cis, use_real=True, use_real_unbind_dim=-2)

    out = torch.cat((x_rotated, x_unrotated), dim=-1)
    return out
Can we move it to where it's used inside __call__? It is just two lines wrapped around the apply_rotary_emb method.
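i.e., inlined at the call site it would be roughly this (a sketch, shown for the query only; rotary_emb is assumed to be the (cos, sin) tuple, and the key would be handled the same way):

# inside the processor's __call__
rot_dim = rotary_emb[0].shape[-1]
query_rotated = apply_rotary_emb(query[..., :rot_dim], rotary_emb, use_real=True, use_real_unbind_dim=-2)
query = torch.cat((query_rotated, query[..., rot_dim:]), dim=-1)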
if is_torchsde_available():
    from .scheduling_dpmsolver_sde import BrownianTreeNoiseSampler


class EDMDPMSolverMultistepScheduler(SchedulerMixin, ConfigMixin):
Can we make a CosineDPMSolverMultistepScheduler and move all the changes there? It's OK to only support BrownianTree there. It is not EDM, and the DPMSolverMultistepScheduler is super bloated already, so I think the easiest way is to make a new one!
Of course, I'm not sure about how to best describe it in docstrings and docs though!
Thanks! I updated it here: #8716 (comment).
The scheduler itself is not entirely new, but this combination hasn't been used in any other model, at least in diffusers I think, and the "cosine schedule" part is the only part that's not in the DPM scheduler, so let's just make a simple note of that.
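For that note, a "cosine schedule" in the continuous-time / v-diffusion sense usually means alpha_t = cos(t·π/2) and sigma_t = sin(t·π/2) for t in [0, 1] — a tiny illustration only, not the actual scheduler implementation, which should be checked against the original code:

import math
import torch

t = torch.linspace(0, 1, 5)
alphas = torch.cos(t * math.pi / 2)  # signal scale, goes 1 -> 0
sigmas = torch.sin(t * math.pi / 2)  # noise scale, goes 0 -> 1
print(list(zip(alphas.tolist(), sigmas.tolist())))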
src/diffusers/schedulers/scheduling_cosine_dpmsolver_multistep.py
Thank you so much @ylacombe for your hard work here. Navigating through 1e14 comments and addressing them like you did is NO SMALL FEAT. Thank you once again!
Thank you @ylacombe! Is there an example of how to use initial_audio_waveforms somewhere? Is that for extending or zero-shot generation?
* WIP modeling code and pipeline
* add custom attention processor + custom activation + add to init
* correct ProjectionModel forward
* add stable audio to __init__
* add autoencoder and update pipeline and modeling code
* add half Rope
* add partial rotary v2
* add temporary modfis to scheduler
* add EDM DPM Solver
* remove TODOs
* clean GLU
* remove att.group_norm to attn processor
* revert back src/diffusers/schedulers/scheduling_dpmsolver_multistep.py
* refactor GLU -> SwiGLU
* remove redundant args
* add channel multiples in autoencoder docstrings
* changes in docsrtings and copyright headers
* clean pipeline
* further cleaning
* remove peft and lora and fromoriginalmodel
* Delete src/diffusers/pipelines/stable_audio/diffusers.code-workspace
* make style
* dummy models
* fix copied from
* add fast oobleck tests
* add brownian tree
* oobleck autoencoder slow tests
* remove TODO
* fast stable audio pipeline tests
* add slow tests
* make style
* add first version of docs
* wrap is_torchsde_available to the scheduler
* fix slow test
* test with input waveform
* add input waveform
* remove some todos
* create stableaudio gaussian projection + make style
* add pipeline to toctree
* fix copied from
* make quality
* refactor timestep_features->time_proj
* refactor joint_attention_kwargs->cross_attention_kwargs
* remove forward_chunk
* move StableAudioDitModel to transformers folder
* correct convert + remove partial rotary embed
* apply suggestions from yiyixuxu -> removing attn.kv_heads
* remove temb
* remove cross_attention_kwargs
* further removal of cross_attention_kwargs
* remove text encoder autocast to fp16
* continue removing autocast
* make style
* refactor how text and audio are embedded
* add paper
* update example code
* make style
* unify projection model forward + fix device placement
* make style
* remove fuse qkv
* apply suggestions from review
* Update src/diffusers/pipelines/stable_audio/pipeline_stable_audio.py (Co-authored-by: YiYi Xu <[email protected]>)
* make style
* smaller models in fast tests
* pass sequential offloading fast tests
* add docs for vae and autoencoder
* make style and update example
* remove useless import
* add cosine scheduler
* dummy classes
* cosine scheduler docs
* better description of scheduler
---------
Co-authored-by: YiYi Xu <[email protected]>
What does this PR do?
Stability AI recently open-sourced Stable Audio 1.0, which can be run using their toolkit library.
Contrary to most diffusion models, the diffusion process here operates on a 1D latent signal, so I had to depart a bit from other models.
For now, I've sketched out how the pipeline will work, namely:
For this to work, I'm waiting for DAC to be integrated into transformers in this PR, in order to use the encoder and decoder code for the VAE.
Left TODO
cc @sayakpaul and @yiyixuxu !
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.