Commit a39d42b

[docs] update torch 2 benchmark (#2764)
* Update benchmark for A100, 3090, 3090 Ti, 4090.
* Link to PyTorch blog.
* Update install instructions.
1 parent ca1e407 commit a39d42b

File tree: 1 file changed, +57 / -53 lines

docs/source/en/optimization/torch2.0.mdx (57 additions, 53 deletions)
@@ -18,11 +18,10 @@ Starting from version `0.13.0`, Diffusers supports the latest optimization from
 
 
 ## Installation
-To benefit from the accelerated transformers implementation and `torch.compile`, we will need to install the nightly version of PyTorch, as the stable version is yet to be released. The first step is to install CUDA 11.7 or CUDA 11.8,
-as PyTorch 2.0 does not support the previous versions. Once CUDA is installed, torch nightly can be installed using:
+To benefit from the accelerated attention implementation and `torch.compile`, you just need to install the latest versions of PyTorch 2.0 from `pip`, and make sure you are on diffusers 0.13.0 or later. As explained below, `diffusers` automatically uses the attention optimizations (but not `torch.compile`) when available.
 
 ```bash
-pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu117
+pip install --upgrade torch torchvision diffusers
 ```
 
 ## Using accelerated transformers and torch.compile.
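
As a quick illustration of the pattern the updated docs describe, here is a minimal sketch (assuming a CUDA GPU; the model ID and prompt are examples, not taken from this commit). On PyTorch 2.0, `diffusers` picks up the accelerated attention automatically, while `torch.compile` must be opted into explicitly:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the pipeline in fp16 on a CUDA device.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Accelerated attention (scaled_dot_product_attention) is used automatically
# on PyTorch 2.0; torch.compile is enabled explicitly, here on the UNet.
pipe.unet = torch.compile(pipe.unet)

image = pipe("a photo of an astronaut riding a horse on mars").images[0]
```
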
@@ -91,8 +90,7 @@ pip install --pre torch torchvision --index-url https://download.pytorch.org/whl
 We conducted a simple benchmark on different GPUs to compare vanilla attention, xFormers, `torch.nn.functional.scaled_dot_product_attention` and `torch.compile+torch.nn.functional.scaled_dot_product_attention`.
 For the benchmark we used the [stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4) model with 50 steps. The `xFormers` benchmark is done using the `torch==1.13.1` version, while the accelerated transformers optimizations are tested using nightly versions of PyTorch 2.0. The tables below summarize the results we got.
 
-The `Speed over xformers` columns denote the speed-up gained over `xFormers` using the `torch.compile+torch.nn.functional.scaled_dot_product_attention`.
-
+Please refer to [our featured blog post in the PyTorch site](https://pytorch.org/blog/accelerated-diffusers-pt-20/) for more details.
 
 ### FP16 benchmark
 
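For reference, the PyTorch 2.0 primitive these columns benchmark can also be called directly; a self-contained sketch with arbitrary example shapes:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim): arbitrary shapes for illustration.
q = torch.randn(2, 8, 64, 40)
k = torch.randn(2, 8, 64, 40)
v = torch.randn(2, 8, 64, 40)

# PyTorch 2.0 dispatches to a fused kernel (e.g. FlashAttention) when one
# is available for the input dtype, device, and shapes.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 64, 40])
```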

@@ -103,10 +101,14 @@ ___The time reported is in seconds.___
 
 | GPU | Batch Size | Vanilla Attention | xFormers | PyTorch2.0 SDPA | SDPA + torch.compile | Speed over xformers (%) |
 | --- | --- | --- | --- | --- | --- | --- |
-| A100 | 10 | 12.02 | 8.7 | 8.79 | 7.89 | 9.31 |
-| A100 | 16 | 18.95 | 13.57 | 13.67 | 12.25 | 9.73 |
-| A100 | 32 (1) | OOM | 26.56 | 26.68 | 24.08 | 9.34 |
-| A100 | 64 | | 52.51 | 53.03 | 47.81 | 8.95 |
+| A100 | 1 | 2.69 | 2.7 | 1.98 | 2.47 | 8.52 |
+| A100 | 2 | 3.21 | 3.04 | 2.38 | 2.78 | 8.55 |
+| A100 | 4 | 5.27 | 3.91 | 3.89 | 3.53 | 9.72 |
+| A100 | 8 | 9.74 | 7.03 | 7.04 | 6.62 | 5.83 |
+| A100 | 10 | 12.02 | 8.7 | 8.67 | 8.45 | 2.87 |
+| A100 | 16 | 18.95 | 13.57 | 13.55 | 13.20 | 2.73 |
+| A100 | 32 (1) | OOM | 26.56 | 26.68 | 25.85 | 2.67 |
+| A100 | 64 | | 52.51 | 53.03 | 50.93 | 3.01 |
 | | | | | | | |
 | A10 | 4 | 13.94 | 9.81 | 10.01 | 9.35 | 4.69 |
 | A10 | 8 | 27.09 | 19 | 19.53 | 18.33 | 3.53 |
@@ -125,25 +127,28 @@ ___The time reported is in seconds.___
 | V100 | 10 | OOM | 19.52 | 19.28 | 18.18 | 6.86 |
 | V100 | 16 | OOM | 30.29 | 29.84 | 28.22 | 6.83 |
 | | | | | | | |
-| 3090 | 4 | 10.04 | 7.82 | 7.89 | 7.47 | 4.48 |
-| 3090 | 8 | 19.27 | 14.97 | 15.04 | 14.22 | 5.01 |
-| 3090 | 10| 24.08 | 18.7 | 18.7 | 17.69 | 5.40 |
-| 3090 | 16 | OOM | 29.06 | 29.06 | 28.2 | 2.96 |
-| 3090 | 32 (1) | | 58.05 | 58 | 54.88 | 5.46 |
-| 3090 | 64 (1) | | 126.54 | 126.03 | 117.33 | 7.28 |
+| 3090 | 1 | 2.94 | 2.5 | 2.42 | 2.33 | 6.80 |
+| 3090 | 4 | 10.04 | 7.82 | 7.72 | 7.38 | 5.63 |
+| 3090 | 8 | 19.27 | 14.97 | 14.88 | 14.15 | 5.48 |
+| 3090 | 10| 24.08 | 18.7 | 18.62 | 18.12 | 3.10 |
+| 3090 | 16 | OOM | 29.06 | 28.88 | 28.2 | 2.96 |
+| 3090 | 32 (1) | | 58.05 | 57.42 | 56.28 | 3.05 |
+| 3090 | 64 (1) | | 126.54 | 114.27 | 112.21 | 11.32 |
 | | | | | | | |
-| 3090 Ti | 4 | 9.07 | 7.14 | 7.15 | 6.81 | 4.62 |
-| 3090 Ti | 8 | 17.51 | 13.65 | 13.72 | 12.99 | 4.84 |
-| 3090 Ti | 10 (2) | 21.79 | 16.85 | 16.93 | 16.02 | 4.93 |
-| 3090 Ti | 16 | OOM | 26.1 | 26.28 | 25.46 | 2.45 |
-| 3090 Ti | 32 (1) | | 51.78 | 52.04 | 49.15 | 5.08 |
-| 3090 Ti | 64 (1) | | 112.02 | 112.33 | 103.91 | 7.24 |
+| 3090 Ti | 1 | 2.7 | 2.26 | 2.19 | 2.12 | 6.19 |
+| 3090 Ti | 4 | 9.07 | 7.14 | 7.00 | 6.71 | 6.02 |
+| 3090 Ti | 8 | 17.51 | 13.65 | 13.53 | 12.94 | 5.20 |
+| 3090 Ti | 10 (2) | 21.79 | 16.85 | 16.77 | 16.44 | 2.43 |
+| 3090 Ti | 16 | OOM | 26.1 | 26.04 | 25.53 | 2.18 |
+| 3090 Ti | 32 (1) | | 51.78 | 51.71 | 50.91 | 1.68 |
+| 3090 Ti | 64 (1) | | 112.02 | 102.78 | 100.89 | 9.94 |
 | | | | | | | |
-| 4090 | 4 | 10.48 | 8.37 | 8.32 | 8.01 | 4.30 |
-| 4090 | 8 | 14.33 | 10.22 | 10.42 | 9.78 | 4.31 |
-| 4090 | 16 | | 17.07 | 17.46 | 17.15 | -0.47 |
-| 4090 | 32 (1) | | 39.03 | 39.86 | 37.97 | 2.72 |
-| 4090 | 64 (1) | | 77.29 | 79.44 | 77.67 | -0.49 |
+| 4090 | 1 | 4.47 | 3.98 | 1.28 | 1.21 | 69.60 |
+| 4090 | 4 | 10.48 | 8.37 | 3.76 | 3.56 | 57.47 |
+| 4090 | 8 | 14.33 | 10.22 | 7.43 | 6.99 | 31.60 |
+| 4090 | 16 | | 17.07 | 14.98 | 14.58 | 14.59 |
+| 4090 | 32 (1) | | 39.03 | 30.18 | 29.49 | 24.44 |
+| 4090 | 64 (1) | | 77.29 | 61.34 | 59.96 | 22.42 |
 
 
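A timing loop along these lines reproduces the kind of numbers shown above; this is a simplified sketch (warmup run, CUDA synchronization around the timed call), not the exact benchmark script, which is linked at the end of the diff:

```python
import time

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
batch_size = 4  # one of the batch sizes from the tables

# Warmup, so one-time costs (kernel selection, compilation) don't skew timing.
pipe([prompt] * batch_size, num_inference_steps=50)

torch.cuda.synchronize()
start = time.time()
pipe([prompt] * batch_size, num_inference_steps=50)
torch.cuda.synchronize()
print(f"batch {batch_size}: {time.time() - start:.2f} s")
```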

@@ -155,11 +160,13 @@ Using `torch.compile` in addition to the accelerated transformers implementation
 
 | GPU | Batch Size | Vanilla Attention | xFormers | PyTorch2.0 SDPA | SDPA + torch.compile | Speed over xformers (%) | Speed over vanilla (%) |
 | --- | --- | --- | --- | --- | --- | --- | --- |
-| A100 | 4 | 16.56 | 12.42 | 12.2 | 11.84 | 4.67 | 28.50 |
-| A100 | 10 | OOM | 29.93 | 29.44 | 28.5 | 4.78 | |
-| A100 | 16 | | 47.08 | 46.27 | 44.8 | 4.84 | |
-| A100 | 32 | | 92.89 | 91.34 | 88.35 | 4.89 | |
-| A100 | 64 | | 185.3 | 182.71 | 176.48 | 4.76 | |
+| A100 | 1 | 4.97 | 3.86 | 2.6 | 2.86 | 25.91 | 42.45 |
+| A100 | 2 | 9.03 | 6.76 | 4.41 | 4.21 | 37.72 | 53.38 |
+| A100 | 4 | 16.70 | 12.42 | 7.94 | 7.54 | 39.29 | 54.85 |
+| A100 | 10 | OOM | 29.93 | 18.70 | 18.46 | 38.32 | |
+| A100 | 16 | | 47.08 | 29.41 | 29.04 | 38.32 | |
+| A100 | 32 | | 92.89 | 57.55 | 56.67 | 38.99 | |
+| A100 | 64 | | 185.3 | 114.8 | 112.98 | 39.03 | |
 | | | | | | | |
 | A10 | 1 | 10.59 | 8.81 | 7.51 | 7.35 | 16.57 | 30.59 |
 | A10 | 4 | 34.77 | 27.63 | 22.77 | 22.07 | 20.12 | 36.53 |
@@ -179,30 +186,27 @@ Using `torch.compile` in addition to the accelerated transformers implementation
 | V100 | 8 | | 43.95 | 43.37 | 42.25 | 3.87 | |
 | V100 | 16 | | 84.99 | 84.73 | 82.55 | 2.87 | |
 | | | | | | | |
-| 3090 | 1 | 7.09 | 6.78 | 6.11 | 6.03 | 11.06 | 14.95 |
-| 3090 | 4 | 22.69 | 21.45 | 18.67 | 18.09 | 15.66 | 20.27 |
-| 3090 | 8 | | 42.59 | 36.75 | 35.59 | 16.44 | |
-| 3090 | 16 | | 85.35 | 72.37 | 70.25 | 17.69 | |
-| 3090 | 32 (1) | | 162.05 | 138.99 | 134.53 | 16.98 | |
-| 3090 | 48 | | 241.91 | 207.75 | | 14.12 | |
+| 3090 | 1 | 7.09 | 6.78 | 5.34 | 5.35 | 21.09 | 24.54 |
+| 3090 | 4 | 22.69 | 21.45 | 18.56 | 18.18 | 15.24 | 19.88 |
+| 3090 | 8 | | 42.59 | 36.68 | 35.61 | 16.39 | |
+| 3090 | 16 | | 85.35 | 72.93 | 70.18 | 17.77 | |
+| 3090 | 32 (1) | | 162.05 | 143.46 | 138.67 | 14.43 | |
 | | | | | | | |
-| 3090 Ti | 1 | 6.45 | 6.19 | 5.64 | 5.49 | 11.31 | 14.88 |
-| 3090 Ti | 4 | 20.32 | 19.31 | 16.9 | 16.37 | 15.23 | 19.44 |
-| 3090 Ti | 8 (2) | | 37.93 | 33.05 | 31.99 | 15.66 | |
-| 3090 Ti | 16 | | 75.37 | 65.25 | 64.32 | 14.66 | |
-| 3090 Ti | 32 (1) | | 142.55 | 124.44 | 120.74 | 15.30 | |
-| 3090 Ti | 48 | | 213.19 | 186.55 | | 12.50 | |
+| 3090 Ti | 1 | 6.45 | 6.19 | 4.99 | 4.89 | 21.00 | 24.19 |
+| 3090 Ti | 4 | 20.32 | 19.31 | 17.02 | 16.48 | 14.66 | 18.90 |
+| 3090 Ti | 8 | | 37.93 | 33.21 | 32.24 | 15.00 | |
+| 3090 Ti | 16 | | 75.37 | 66.63 | 64.5 | 14.42 | |
+| 3090 Ti | 32 (1) | | 142.55 | 128.89 | 124.92 | 12.37 | |
 | | | | | | | |
-| 4090 | 1 | 5.54 | 4.99 | 4.51 | 4.44 | 11.02 | 19.86 |
-| 4090 | 4 | 13.67 | 11.4 | 10.3 | 9.84 | 13.68 | 28.02 |
-| 4090 | 8 | | 19.79 | 17.13 | 16.19 | 18.19 | |
-| 4090 | 16 | | 38.62 | 33.14 | 32.31 | 16.34 | |
-| 4090 | 32 (1) | | 76.57 | 65.96 | 62.05 | 18.96 | |
-| 4090 | 48 | | 114.44 | 98.78 | | 13.68 | |
-
+| 4090 | 1 | 5.54 | 4.99 | 2.66 | 2.58 | 48.30 | 53.43 |
+| 4090 | 4 | 13.67 | 11.4 | 8.81 | 8.46 | 25.79 | 38.11 |
+| 4090 | 8 | | 19.79 | 17.55 | 16.62 | 16.02 | |
+| 4090 | 16 | | 38.62 | 35.65 | 34.07 | 11.78 | |
+| 4090 | 32 (1) | | 76.57 | 69.48 | 65.35 | 14.65 | |
+| 4090 | 48 | | 114.44 | 106.3 | | 7.11 | |
 
 
-(1) Batch Size >= 32 requires enable_vae_slicing() because of https://github.com/pytorch/pytorch/issues/81665.
-This is required for PyTorch 1.13.1, and also for PyTorch 2.0 and batch size of 64.
+(1) Batch Size >= 32 requires enable_vae_slicing() because of https://github.com/pytorch/pytorch/issues/81665.
+This is required for PyTorch 1.13.1, and also for PyTorch 2.0 and large batch sizes.
 
-For more details about how this benchmark was run, please refer to [this PR](https://github.com/huggingface/diffusers/pull/2303).
+For more details about how this benchmark was run, please refer to [this PR](https://github.com/huggingface/diffusers/pull/2303) and to [the blog post](https://pytorch.org/blog/accelerated-diffusers-pt-20/).
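
The enable_vae_slicing() workaround from footnote (1) is a single pipeline call; a minimal sketch, assuming the `pipe` and `prompt` objects from the sketches above:

```python
# Decode latents one image at a time to work around
# https://github.com/pytorch/pytorch/issues/81665 at batch sizes >= 32.
pipe.enable_vae_slicing()
images = pipe([prompt] * 32, num_inference_steps=50).images
```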
