# Writing efficient shaders

When it comes to optimizing shaders for a wide range of devices, there is no
perfect strategy. The reality of different drivers written by different vendors
targeting different hardware is that they will vary in behavior. Any attempt at
optimizing against a specific driver will likely result in a performance loss
for some other drivers that end users will run Flutter apps against.

That being said, newer graphics devices have architectures that allow for both
simpler shader compilation and better handling of traditionally slow shader
code. In fact, straightforward "unoptimized" shader code filled with branches
may significantly outperform the equivalent branchless optimized shader code
when targeting newer GPU architectures.

Flutter actively supports devices that are more than a decade old, which
requires us to write shaders that perform well across multiple generations of
GPU architectures featuring radically different behavior. Most optimization
choices are direct tradeoffs between GPU architectures, and having an accurate
mental model for how these common architectures maximize parallelism is
essential for making good tradeoff decisions while writing shaders.

For these reasons, it's important to profile shaders against some of the older
devices that Flutter can target (such as the iPhone 4s) when making changes to
shaders that are intended to improve performance.

## GPU architecture primer

GPUs are designed to have functional units running single instructions over
many elements (the "data path") each clock cycle. This is the fundamental
aspect of GPUs that makes them work well for massively parallel compute work;
they're essentially specialized SIMD engines.

GPU parallelism generally comes in two broad architectural flavors:
**Instruction-level parallelism** and **Thread-level parallelism** -- these
architecture designs handle shader branching very differently and are covered
in great detail in the sections below. In general, older GPU architectures
(before ~2015) leverage instruction-level parallelism, while most if not all
newer GPUs leverage thread-level parallelism.

Early GPU architectures often had no runtime control flow primitives at all
(i.e. jump instructions), and compilers for these architectures needed to
handle branches ahead of time by unrolling loops, compiling a different program
for every possible branch combination, and executing all of them. However,
virtually all GPU architectures in use today have instruction-level support for
dynamic branching, and it's quite unlikely that we'll come across a mobile
device capable of running Flutter that doesn't. For example, the oldest devices
we test against in CI (iPhone 4s and Moto G4) run GPUs that support dynamic
runtime branching. For these reasons, the optimization advice in this document
isn't aimed at such devices.

### Instruction-level parallelism

Some older GPUs (including the PowerVR SGX543MP2 GPU on the iPhone 4s SOC) rely
on SIMD vector or array instructions to maximize the number of computations
performed per clock cycle on each functional unit. This means that the shader
compiler must figure out which parts of the program are safe to parallelize
ahead of time and emit appropriate instructions. This presents a problem for
certain kinds of branches: if the compiler doesn't know that the same decision
will always be taken for all data lanes in the data path at runtime (meaning
the branch is not _uniform_), it can't safely emit SIMD instructions when
compiling the branch. The result is that instructions within non-uniform
branches incur a `1/[data width]` performance penalty when compared to
non-branched instructions because they can't be parallelized.
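
For illustration, here's a minimal sketch of a non-uniform branch (a
hypothetical shader, not taken from Impeller's sources): the condition depends
on a varying input, so a SIMD compiler can't prove ahead of time that all data
lanes will take the same path.

```glsl
in vec4 color;       // Varying: may differ across the lanes of a data path.
out vec4 frag_color;

void main() {
  // Non-uniform branch: lanes within one SIMD data path may disagree on the
  // condition, so the instructions inside each path can't be vectorized and
  // must run one lane at a time.
  if (color.a > 0.5) {
    frag_color = color;
  } else {
    frag_color = vec4(color.rgb * color.a, color.a);
  }
}
```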

VLIW ("Very Long Instruction Word") is another common instruction-level
parallelism design that suffers from the same compile-time reasoning
disadvantage that SIMD does.

### Thread-level parallelism

Newer GPUs (but also some older hardware such as the Adreno 306 GPU found on
the Moto G4's Snapdragon SOC) use scalar functional units (no SIMD/VLIW/MIMD)
and parallelize instructions at runtime by running the same instruction over
many threads in groups often referred to as "warps" (Nvidia terminology) or
"wavefronts" (AMD terminology), usually consisting of 32 or 64 threads per
warp/wavefront. This design is also commonly referred to as SIMT ("Single
Instruction Multiple Thread").

To handle branching, SIMT programs use special instructions to write a thread
mask that determines which threads are activated/deactivated in the warp; only
the warp's activated threads will actually execute instructions. Given this
setup, the program can first deactivate the threads that failed the branch
condition, run the positive path, invert the mask, run the negative path, and
finally restore the mask to its original state prior to the branch. The
compiler may also insert mask checks to skip over branches when all of the
threads have been deactivated.
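
To illustrate, the comments in this sketch (hypothetical GLSL; the mask
operations are described conceptually, not as real instructions) walk through
how a SIMT GPU might execute a simple varying branch:

```glsl
in vec4 color;
out vec4 frag_color;

void main() {
  if (color.a > 0.5) {     // 1. Deactivate threads where color.a <= 0.5.
    frag_color = vec4(1);  // 2. Remaining active threads run the positive path.
  } else {                 // 3. Invert the thread mask.
    frag_color = vec4(0);  // 4. Newly active threads run the negative path.
  }
                           // 5. Restore the mask to its pre-branch state.
}
```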

Therefore, the best case scenario for a SIMT branch is that it only incurs the
cost of the conditional. The worst case scenario is that some of the warp's
threads fail the conditional and the rest succeed, requiring the program to
execute both paths of the branch back-to-back in the warp. Note that this
compares very favorably to the SIMD scenario with non-uniform/varying branches,
as SIMT is able to retain significant parallelism in all cases, whereas SIMD
cannot.

## Recommendations

### Don't flatten uniform or constant branches

Uniforms are pipeline variables accessible within a shader which are guaranteed
to not vary during a GPU program's invocation.

Example of a uniform branch in action:

```glsl
uniform struct FrameInfo {
  mat4 mvp;
  bool invert_y;
} frame_info;

in vec2 position;

void main() {
  gl_Position = frame_info.mvp * vec4(position, 0, 1);
  if (frame_info.invert_y) {
    // Flip the Y axis for render targets with an inverted coordinate space.
    gl_Position.y *= -1.0;
  }
}
```

While it's true that driver stacks have the opportunity to generate multiple
pipeline variants ahead of time to handle these branches, this advanced
functionality isn't actually necessary to achieve good runtime performance of
uniform branches on widely used mobile architectures:

* On SIMT architectures, branching on a uniform means that every thread in
  every warp will resolve to the same path, so only one path in the branch will
  ever execute.
* On VLIW/SIMD architectures, the compiler can be certain that all of the
  elements in the data path for every functional unit will resolve to the same
  path, and so it can safely emit fully parallelized instructions for the
  contents of the branch!

### Don't flatten simple varying branches

Widely used mobile GPU architectures generally don't benefit from flattening
simple varying branches. While it's true that compilers for VLIW/SIMD-based
architectures can't emit efficient instructions for these branches, the
detrimental effects of this are minimal with small branches. For modern SIMT
architectures, flattened branches can actually perform measurably worse than
straightforward branched solutions. Also, some shader compilers can collapse
small branches automatically.

Instead of this:

```glsl
vec3 ColorBurn(vec3 dst, vec3 src) {
  vec3 color = 1 - min(vec3(1), (1 - dst) / src);
  color = mix(color, vec3(1), 1 - abs(sign(dst - 1)));
  color = mix(color, vec3(0), 1 - abs(sign(src - 0)));
  return color;
}
```

...just do this:

```glsl
vec3 ColorBurn(vec3 dst, vec3 src) {
  vec3 color = 1 - min(vec3(1), (1 - dst) / src);
  if (1 - dst.r < kEhCloseEnough) {
    color.r = 1;
  }
  if (1 - dst.g < kEhCloseEnough) {
    color.g = 1;
  }
  if (1 - dst.b < kEhCloseEnough) {
    color.b = 1;
  }
  if (src.r < kEhCloseEnough) {
    color.r = 0;
  }
  if (src.g < kEhCloseEnough) {
    color.g = 0;
  }
  if (src.b < kEhCloseEnough) {
    color.b = 0;
  }
  return color;
}
```

It's easier to understand, doesn't prevent compiler optimizations, runs
measurably faster on SIMT devices, and works out to be at most marginally
slower on older VLIW devices.

### Avoid complex varying branches

Consider the following fragment shader:

```glsl
in vec4 color;
out vec4 frag_color;

void main() {
  vec4 result;

  if (color.a == 0) {
    result = vec4(0);
  } else {
    result = DoExtremelyExpensiveThing(color);
  }

  frag_color = result;
}
```

Note that `color` is _varying_. Specifically, it's an interpolated output from
a vertex shader -- so the value may change from fragment to fragment (as
opposed to a _uniform_ or _constant_, which will remain the same for the whole
draw call).

On SIMT architectures, this branch incurs very little overhead because
`DoExtremelyExpensiveThing` will be skipped over if `color.a == 0` across all
the threads in a given warp. However, architectures that use instruction-level
parallelism (VLIW or SIMD) can't handle this branch efficiently because the
compiler can't safely emit parallelized instructions on either side of the
branch.

To achieve maximum parallelism across all of these architectures, one possible
solution is to unbranch the more complex path:

```glsl
in vec4 color;
out vec4 frag_color;

void main() {
  frag_color = DoExtremelyExpensiveThing(color);

  if (color.a == 0) {
    frag_color = vec4(0);
  }
}
```

However, this may be a big tradeoff depending on how this shader is used --
this solution will perform worse on SIMT devices in cases where `color.a == 0`
across all threads in a given warp, since `DoExtremelyExpensiveThing` will no
longer be skipped! So if the cheap branch path covers a large solid portion of
a draw call's coverage area, alternative designs may be favorable.
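
As a hypothetical sketch of one such alternative: if the application can
determine on the CPU that an entire draw call only needs the cheap path, the
decision can be hoisted into a uniform (`needs_expensive_path` below is an
assumed name, not an Impeller API), turning the varying branch into a uniform
branch that both SIMD and SIMT architectures handle efficiently -- at the cost
of a per-draw decision that only works when coverage isn't mixed.

```glsl
// Hypothetical: set per draw call on the CPU when any fragment may need the
// expensive path.
uniform bool needs_expensive_path;

in vec4 color;
out vec4 frag_color;

void main() {
  // Uniform branch: every invocation takes the same path, so SIMD compilers
  // can fully vectorize it and SIMT warps never diverge.
  if (needs_expensive_path) {
    frag_color = DoExtremelyExpensiveThing(color);
  } else {
    frag_color = vec4(0);
  }
}
```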

### Beware of return branching

Consider the following GLSL function:

```glsl
vec4 FrobnicateColor(vec4 color) {
  if (color.a == 0) {
    return vec4(0);
  }

  return DoExtremelyExpensiveThing(color);
}
```

At first glance, this may appear cheap due to its simple contents, but this
branch has two exclusive paths in practice, and the generated shader assembly
will reflect the same behavior as this code:

```glsl
vec4 FrobnicateColor(vec4 color) {
  vec4 result;

  if (color.a == 0) {
    result = vec4(0);
  } else {
    result = DoExtremelyExpensiveThing(color);
  }

  return result;
}
```

The same concerns and advice apply to this branch as the scenario under "Avoid
complex varying branches".

### Use lower precision whenever possible

Most desktop GPUs don't support 16 bit (mediump) or 8 bit (lowp) floating point
operations. But many mobile GPUs (such as the Qualcomm Adreno series) do, and
according to the
[Adreno documentation](https://developer.qualcomm.com/sites/default/files/docs/adreno-gpu/developer-guide/gpu/best_practices_shaders.html#use-medium-precision-where-possible),
using lower precision floating point operations is more efficient on these
devices.
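
For example, here's a minimal sketch of applying precision qualifiers in a
GLSL ES fragment shader; on hardware without native lower-precision support,
the qualifiers act only as hints and the driver may compute at full precision
anyway.

```glsl
// Request medium precision as the default for floats in this shader.
precision mediump float;

in mediump vec4 color;
out mediump vec4 frag_color;

void main() {
  // On GPUs with native fp16 ALUs, lower precision math can increase
  // arithmetic throughput and reduce register pressure.
  frag_color = color * 0.5;
}
```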
