
Commit efc604e

[Impeller] Add guidance for writing shaders (flutter#34634)
1 parent 45e92fb commit efc604e

File tree: 2 files changed, +290 -0 lines changed

impeller/README.md

Lines changed: 1 addition & 0 deletions
@@ -185,3 +185,4 @@ To your `AndroidManifest.xml` file, add under the `<application>` tag:
 * [Learning to Read GPU Frame Captures](docs/read_frame_captures.md)
 * [How to Enable Metal Validation for Command Line Apps.](docs/metal_validation.md)
 * [How Impeller Works Around The Lack of Uniform Buffers in Open GL ES 2.0.](docs/ubo_gles2.md)
+* [Guidance for writing efficient shaders](docs/shader_optimization.md)

impeller/docs/shader_optimization.md

Lines changed: 289 additions & 0 deletions
@@ -0,0 +1,289 @@
# Writing efficient shaders

When it comes to optimizing shaders for a wide range of devices, there is no
perfect strategy. The reality of different drivers written by different vendors
targeting different hardware is that they will vary in behavior. Any attempt at
optimizing against a specific driver will likely result in a performance loss
for some other drivers that end users will run Flutter apps against.

That being said, newer graphics devices have architectures that allow for both
simpler shader compilation and better handling of traditionally slow shader
code. In fact, ostensibly "unoptimized" shader code filled with branches may
significantly outperform the equivalent branchless optimized shader code when
targeting newer GPU architectures. (See the "Don't flatten simple varying
branches" recommendation for an explanation of this with respect to different
architectures.)

Flutter actively supports mobile devices that are more than a decade old, which
requires us to write shaders that perform well across multiple generations of
GPU architectures featuring radically different behavior. Most optimization
choices are direct tradeoffs between these GPU architectures, and so having an
accurate mental model for how these common architectures maximize parallelism
is essential for making good decisions while authoring shaders.

For these reasons, it's also important to profile shaders against some of the
older devices that Flutter can target (such as the iPhone 6s) when making
changes intended to improve shader performance.

Also, even though branching behavior is largely architecture dependent and
should remain the same when using different graphics APIs, it's still a good
idea to test changes against the different backends supported by Impeller
(Metal and GLES). Early stage shader compilation (as well as the high level
shader code generated by ImpellerC) may vary quite a bit between APIs.

## GPU architecture primer

GPUs are designed to have functional units running single instructions over
many elements (the "data path") each clock cycle. This is the fundamental
aspect of GPUs that makes them work well for massively parallel compute work;
they're essentially specialized SIMD engines.

GPU parallelism generally comes in two broad architectural flavors:
**Instruction-level parallelism** and **Thread-level parallelism**. These
architecture designs handle shader branching very differently and are covered
in the sections below. In general, older GPU architectures (on some products
released before ~2015) leverage instruction-level parallelism, while most if
not all newer GPUs leverage thread-level parallelism.

Some of the earliest GPU architectures had no runtime control flow primitives
at all (i.e. jump instructions), and compilers for these architectures needed
to handle branches ahead of time by unrolling loops, compiling a different
program for every possible branch combination, and then executing all of them.
However, virtually all GPU architectures in use today have instruction-level
support for dynamic branching, and it's quite unlikely that we'll come across a
mobile device capable of running Flutter that doesn't. For example, the old
devices we test against in CI (iPhone 6s and Moto G4) run GPUs that support
dynamic runtime branching. For these reasons, the optimization advice in this
document isn't aimed at branchless architectures.

### Instruction-level parallelism

Some older GPUs (including the PowerVR GT7600 GPU on the iPhone 6s SoC) rely on
SIMD vector or array instructions to maximize the number of computations
performed per clock cycle on each functional unit. This means that the shader
compiler must figure out which parts of the program are safe to parallelize
ahead of time and emit appropriate instructions. This presents a problem for
certain kinds of branches: if the compiler doesn't know that the same decision
will always be taken by all of the data lanes at runtime (meaning the branch is
_varying_), it can't safely emit SIMD instructions while compiling the branch.
The result is that instructions within non-uniform branches incur a
`1/[data width]` performance penalty when compared to non-branched instructions
because they can't be parallelized.

VLIW ("Very Long Instruction Word") is another common instruction-level
parallelism design that suffers from the same compile-time reasoning
disadvantage that SIMD does.

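To make the distinction concrete, below is a hypothetical GLSL sketch (the
`threshold` and `value` names are illustrative, not taken from Impeller). A
SIMD/VLIW compiler can fully vectorize the body of the first branch, but not
the second:

```glsl
uniform float threshold;  // Uniform: identical for every invocation in a draw.

in float value;  // Varying: interpolated, so it may differ per fragment.
out vec4 frag_color;

void main() {
  vec4 result = vec4(0);

  // Uniform condition: every data lane takes the same path, so the compiler
  // can emit fully parallelized instructions for the body.
  if (threshold > 0.5) {
    result.r = 1.0;
  }

  // Varying condition: lanes may disagree at runtime, so the body can't be
  // vectorized ahead of time and pays the 1/[data width] penalty.
  if (value > 0.5) {
    result.g = 1.0;
  }

  frag_color = result;
}
```
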
### Thread-level parallelism

Newer GPUs (but also some older hardware such as the Adreno 306 GPU found on
the Moto G4's Snapdragon SoC) use scalar functional units (no SIMD/VLIW/MIMD)
and parallelize instructions at runtime by running the same instruction over
many threads in groups often referred to as "warps" (Nvidia terminology) or
"wavefronts" (AMD terminology), usually consisting of 32 or 64 threads per
warp/wavefront. This design is also commonly referred to as SIMT ("Single
Instruction, Multiple Thread").

To handle branching, SIMT programs use special instructions to write a thread
mask that determines which threads are activated/deactivated in the warp; only
the warp's activated threads will actually execute instructions. Given this
setup, the program can first deactivate the threads that failed the branch
condition, run the positive path, invert the mask, run the negative path, and
finally restore the mask to its original state prior to the branch. The
compiler may also insert mask checks to skip over branches when all of the
threads have been deactivated.

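As an illustration of this masking scheme, the comments in the following sketch
walk through how a SIMT device might conceptually execute a varying branch
(these are conceptual steps, not a real instruction set):

```glsl
in float value;
out vec4 frag_color;

void main() {
  vec4 result;

  if (value > 0.5) {   // Write the thread mask: only passing threads stay active.
    result = vec4(1);  // Executed with failing threads deactivated.
  } else {             // Invert the thread mask.
    result = vec4(0);  // Executed with passing threads deactivated.
  }                    // Restore the mask: all threads active again.

  // If every thread in the warp agreed on the condition, inserted mask
  // checks let the untaken path be skipped entirely.
  frag_color = result;
}
```
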
Therefore, the best case scenario for a SIMT branch is that it only incurs the
cost of the conditional. The worst case scenario is that some of the warp's
threads fail the conditional and the rest succeed, requiring the program to
execute both paths of the branch back-to-back in the warp. Note that this
compares very favorably to the SIMD scenario with non-uniform/varying branches,
as SIMT is able to retain significant parallelism in all cases, whereas SIMD
cannot.

## Recommendations

### Don't flatten uniform or constant branches

Uniforms are pipeline variables accessible within a shader which are guaranteed
to not vary during a GPU program's invocation.

Example of a uniform branch in action:

```glsl
uniform struct FrameInfo {
  mat4 mvp;
  bool invert_y;
} frame_info;

in vec2 position;

void main() {
  gl_Position = frame_info.mvp * vec4(position, 0, 1);

  if (frame_info.invert_y) {
    gl_Position *= vec4(1, -1, 1, 1);
  }
}
```

While it's true that driver stacks have the opportunity to generate multiple
pipeline variants ahead of time to handle these branches, this advanced
functionality isn't actually necessary for good runtime performance of uniform
branches on widely used mobile architectures:
* On SIMT architectures, branching on a uniform means that every thread in
  every warp will resolve to the same path, so only one path in the branch will
  ever execute.
* On VLIW/SIMD architectures, the compiler can be certain that all of the
  elements in the data path for every functional unit will resolve to the same
  path, and so it can safely emit fully parallelized instructions for the
  contents of the branch!

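For contrast, a flattened version of the `invert_y` branch above might look
like the sketch below. Per the guidance above, this buys nothing on common
mobile architectures while paying for the select on every vertex:

```glsl
// Hypothetical flattened form of the uniform branch above: mix() selects
// between the identity scale and the Y-flip scale on every vertex.
gl_Position *= mix(vec4(1), vec4(1, -1, 1, 1), float(frame_info.invert_y));
```
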
### Don't flatten simple varying branches

Widely used mobile GPU architectures generally don't benefit from flattening
simple varying branches. While it's true that compilers for VLIW/SIMD-based
architectures can't emit efficient instructions for these branches, the
detrimental effects of this are minimal with small branches. For modern SIMT
architectures, flattened branches can actually perform measurably worse than
straightforward branch solutions. Also, some shader compilers can collapse
small branches automatically.

Instead of this:

```glsl
vec3 ColorBurn(vec3 dst, vec3 src) {
  vec3 color = 1 - min(vec3(1), (1 - dst) / src);
  color = mix(color, vec3(1), 1 - abs(sign(dst - 1)));
  color = mix(color, vec3(0), 1 - abs(sign(src - 0)));
  return color;
}
```

...just do this:

```glsl
vec3 ColorBurn(vec3 dst, vec3 src) {
  vec3 color = 1 - min(vec3(1), (1 - dst) / src);
  if (1 - dst.r < kEhCloseEnough) {
    color.r = 1;
  }
  if (1 - dst.g < kEhCloseEnough) {
    color.g = 1;
  }
  if (1 - dst.b < kEhCloseEnough) {
    color.b = 1;
  }
  if (src.r < kEhCloseEnough) {
    color.r = 0;
  }
  if (src.g < kEhCloseEnough) {
    color.g = 0;
  }
  if (src.b < kEhCloseEnough) {
    color.b = 0;
  }
  return color;
}
```

It's easier to understand, doesn't prevent compiler optimizations, runs
measurably faster on SIMT devices, and works out to be at most marginally
slower on older VLIW devices.

### Avoid complex varying branches

Consider the following fragment shader:

```glsl
in vec4 color;
out vec4 frag_color;

void main() {
  vec4 result;

  if (color.a == 0) {
    result = vec4(0);
  } else {
    result = DoExtremelyExpensiveThing(color);
  }

  frag_color = result;
}
```

Note that `color` is _varying_. Specifically, it's an interpolated output from
a vertex shader, so the value may change from fragment to fragment (as opposed
to a _uniform_ or _constant_, which will remain the same for the whole draw
call).

On SIMT architectures, this branch incurs very little overhead because
`DoExtremelyExpensiveThing` will be skipped over if `color.a == 0` across all
the threads in a given warp. However, architectures that use instruction-level
parallelism (VLIW or SIMD) can't handle this branch efficiently because the
compiler can't safely emit parallelized instructions on either side of the
branch.

To achieve maximum parallelism across all of these architectures, one possible
solution is to unbranch the more complex path:

```glsl
in vec4 color;
out vec4 frag_color;

void main() {
  frag_color = DoExtremelyExpensiveThing(color);

  if (color.a == 0) {
    frag_color = vec4(0);
  }
}
```

However, this may be a big tradeoff depending on how the shader is used: this
solution will perform worse on SIMT devices in cases where `color.a == 0`
across all threads in a given warp, since `DoExtremelyExpensiveThing` will no
longer be skipped! So if the cheap branch path covers a large solid portion of
a draw call's coverage area, alternative designs may be favorable.

### Beware of return branching

Consider the following GLSL function:

```glsl
vec4 FrobnicateColor(vec4 color) {
  if (color.a == 0) {
    return vec4(0);
  }

  return DoExtremelyExpensiveThing(color);
}
```

At first glance, this may appear cheap due to its simple contents, but this
branch has two exclusive paths in practice, and the generated shader assembly
will reflect the same behavior as this code:

```glsl
vec4 FrobnicateColor(vec4 color) {
  vec4 result;

  if (color.a == 0) {
    result = vec4(0);
  } else {
    result = DoExtremelyExpensiveThing(color);
  }

  return result;
}
```

The same concerns and advice apply to this branch as to the scenario under
"Avoid complex varying branches".

### Use lower precision whenever possible

Most desktop GPUs don't support 16-bit (`mediump`) or 8-bit (`lowp`) floating
point operations. But many mobile GPUs (such as the Qualcomm Adreno series) do,
and according to the
[Adreno documentation](https://developer.qualcomm.com/sites/default/files/docs/adreno-gpu/developer-guide/gpu/best_practices_shaders.html#use-medium-precision-where-possible),
using lower precision floating point operations is more efficient on these
devices.
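
For instance, precision can be lowered with GLSL precision qualifiers. The
following is a minimal GLSL ES-style sketch, not code taken from Impeller:

```glsl
// Request reduced (at least 16-bit) precision for float operations.
// Hardware without native support simply computes at higher precision.
precision mediump float;

in vec4 color;
out vec4 frag_color;

void main() {
  // Simple color math rarely needs highp.
  frag_color = color * 0.5;
}
```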
