@@ -14,17 +14,20 @@ specific language governing permissions and limitations under the License.

We present some techniques and ideas to optimize 🤗 Diffusers _inference_ for memory or speed.

-
| | Latency | Speedup |
-|------------------| ---------| --------- |
+| ---------------- | ------- | ------- |
| original | 9.50s | x1 |
| cuDNN auto-tuner | 9.37s | x1.01 |
| autocast (fp16) | 5.47s | x1.91 |
| fp16 | 3.61s | x2.91 |
| channels last | 3.30s | x2.87 |
| traced UNet | 3.21s | x2.96 |

-<em>obtained on NVIDIA TITAN RTX by generating a single image of size 512x512 from the prompt "a photo of an astronaut riding a horse on mars" with 50 DDIM steps.</em>
+<em>
+obtained on NVIDIA TITAN RTX by generating a single image of size 512x512 from
+the prompt "a photo of an astronaut riding a horse on mars" with 50 DDIM
+steps.
+</em>

## Enable cuDNN auto-tuner

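Enabling the auto-tuner typically comes down to a single PyTorch flag; a minimal sketch, assuming the convolution input shapes stay constant across calls:

```Python
import torch

# Let cuDNN benchmark the available convolution algorithms for the input
# shapes it encounters and cache the fastest choice for reuse.
torch.backends.cudnn.benchmark = True
```

The benchmarking itself has a cost, so this helps most when input sizes do not change between calls; with frequently varying shapes it can even slow things down.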
@@ -61,7 +64,7 @@ pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
with autocast("cuda"):
-    image = pipe(prompt).images[0]
+    image = pipe(prompt).images[0]
```

Despite the precision loss, in our experience the final image results look the same as the `float32` versions. Feel free to experiment and report back!
@@ -79,15 +82,18 @@ pipe = StableDiffusionPipeline.from_pretrained(
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
-image = pipe(prompt).images[0]
+image = pipe(prompt).images[0]
```

## Sliced attention for additional memory savings

For even more memory savings, you can use a sliced version of attention that performs the computation in steps instead of all at once.

<Tip>
-Attention slicing is useful even if a batch size of just 1 is used - as long as the model uses more than one attention head. If there is more than one attention head the *QK^T* attention matrix can be computed sequentially for each head which can save a significant amount of memory.
+Attention slicing is useful even if a batch size of just 1 is used - as long
+as the model uses more than one attention head. If there is more than one
+attention head the *QK^T* attention matrix can be computed sequentially for
+each head which can save a significant amount of memory.
</Tip>
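To make the per-head computation described in the tip concrete, here is a rough sketch of slicing attention over heads with plain tensors; the names `q`, `k`, `v` and their `[heads, tokens, dim]` shape are assumptions for illustration, not the Diffusers implementation:

```Python
import torch

def sliced_attention(q, k, v):
    # q, k, v: [heads, tokens, dim]. Instead of materializing QK^T for all
    # heads at once, process one head at a time so that only a single
    # [tokens, tokens] attention matrix is held in memory at any moment.
    out = torch.empty_like(q)
    scale = q.shape[-1] ** -0.5
    for head in range(q.shape[0]):
        attn = torch.softmax(q[head] @ k[head].transpose(-1, -2) * scale, dim=-1)
        out[head] = attn @ v[head]
    return out
```

The result matches computing all heads at once; only the peak memory changes.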

To perform the attention computation sequentially over each head, you only need to invoke [`~StableDiffusionPipeline.enable_attention_slicing`] in your pipeline before inference, like here:
@@ -105,11 +111,55 @@ pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_attention_slicing()
-image = pipe(prompt).images[0]
+image = pipe(prompt).images[0]
```

There's a small performance penalty (inference is about 10% slower), but this method allows you to use Stable Diffusion in as little as 3.2 GB of VRAM!

+## Offloading to CPU with accelerate for memory savings
+
+For additional memory savings, you can offload the weights to CPU and only load them to GPU when performing the forward pass.
+
+To perform CPU offloading, all you have to do is invoke [`~StableDiffusionPipeline.enable_sequential_cpu_offload`]:
+
+```Python
+import torch
+from diffusers import StableDiffusionPipeline
+
+pipe = StableDiffusionPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5",
+    revision="fp16",
+    torch_dtype=torch.float16,
+)
+pipe = pipe.to("cuda")
+
+prompt = "a photo of an astronaut riding a horse on mars"
+pipe.enable_sequential_cpu_offload()
+image = pipe(prompt).images[0]
+```
+
+With that, you can get memory consumption down to under 2 GB.
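The saving comes from keeping submodules on the CPU and moving each one to the GPU only while it is needed. As a rough sketch of that idea (a hypothetical helper, not accelerate's actual mechanism):

```Python
import torch

def run_offloaded(module: torch.nn.Module, *inputs, device="cuda"):
    # Hypothetical helper: move the module's weights to the GPU just for this
    # forward pass, then move them back to the CPU to free the memory again.
    module.to(device)
    with torch.no_grad():
        output = module(*inputs)
    module.to("cpu")
    return output
```

The repeated host-to-device transfers are what make offloaded inference slower, in exchange for a much smaller peak VRAM footprint.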
+
+It is also possible to chain it with attention slicing for minimal memory consumption, running in under 800 MB of GPU VRAM:
+
+```Python
+import torch
+from diffusers import StableDiffusionPipeline
+
+pipe = StableDiffusionPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5",
+    revision="fp16",
+    torch_dtype=torch.float16,
+)
+pipe = pipe.to("cuda")
+
+prompt = "a photo of an astronaut riding a horse on mars"
+pipe.enable_sequential_cpu_offload()
+pipe.enable_attention_slicing(1)
+
+image = pipe(prompt).images[0]
+```
+
## Using Channels Last memory format

Channels last memory format is an alternative way of ordering NCHW tensors in memory while preserving the dimension ordering. Channels last tensors are ordered in such a way that the channels become the densest dimension (aka storing images pixel-per-pixel). Since not all operators currently support the channels last format, it may result in worse performance, so it's better to try it and see if it works for your model.
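As a small sketch of trying it out on the UNet of the pipeline used in the snippets above (whether it pays off depends on the operators your model hits):

```Python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("cuda")

# Reorder the UNet's weights so that the channel dimension becomes the densest one.
pipe.unet.to(memory_format=torch.channels_last)

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
```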