@@ -8,21 +8,29 @@ for some other drivers that end users will run Flutter apps against.
 
 That being said, newer graphics devices have architectures that allow for both
 simpler shader compilation and better handling of traditionally slow shader
-code. In fact, straightforward "unoptimized" shader code filled with branches
-may significantly outperform the equivalent branchless optimized shader code
-when targeting newer GPU architectures.
+code. In fact, ostensibly "unoptimized" shader code filled with branches may
+significantly outperform the equivalent branchless optimized shader code when
+targeting newer GPU architectures. (See the "Don't flatten simple varying
+branches" recommendation for an explanation of this with respect to different
+architectures.)
 
-Flutter actively supports devices that are more than a decade old, which
+Flutter actively supports mobile devices that are more than a decade old, which
 requires us to write shaders that perform well across multiple generations of
 GPU architectures featuring radically different behavior. Most optimization
-choices are direct tradeoffs between GPU architectures, and having an accurate
-mental model for how these common architectures maximize parallelism is
-essential for making good tradeoff decisions while writing shaders.
+choices are direct tradeoffs between these GPU architectures, and so having an
+accurate mental model for how these common architectures maximize parallelism is
+essential for making good decisions while authoring shaders.
 
 For these reasons, it's also important to profile shaders against some of the
-older devices that Flutter can target (such as the iPhone 4s) when making
+older devices that Flutter can target (such as the iPhone 6s) when making
 changes intended to improve shader performance.
 
+Also, even though the branching behavior is largely architecture dependent and
+should remain the same when using different graphics APIs, it's still a
+good idea to test changes against the different backends supported by Impeller
+(Metal and GLES). Early stage shader compilation (as well as the high level
+shader code generated by ImpellerC) may vary quite a bit between APIs.
+
 ## GPU architecture primer
 
 GPUs are designed to have functional units running single instructions over many
@@ -33,25 +41,25 @@ essentially specialized SIMD engines.
 GPU parallelism generally comes in two broad architectural flavors:
 **Instruction-level parallelism** and **Thread-level parallelism** -- these
 architecture designs handle shader branching very differently and are covered
-in great detail in sections below. In general, older GPU architectures (on some
-products released before ~2015) leverage instruction-level parallelism, while
-most if not all newer GPUs leverage thread-level parallelism.
+in the sections below. In general, older GPU architectures (on some products
+released before ~2015) leverage instruction-level parallelism, while most if not
+all newer GPUs leverage thread-level parallelism.
 
 Some of the earliest GPU architectures had no runtime control flow primitives at
 all (i.e. jump instructions), and compilers for these architectures needed to
 handle branches ahead of time by unrolling loops, compiling a different program
 for every possible branch combination, and then executing all of them. However,
 virtually all GPU architectures in use today have instruction-level support for
 dynamic branching, and it's quite unlikely that we'll come across a mobile
-device capable of running Flutter that doesn't. For example, the oldest devices
-we test against in CI (iPhone 4s and Moto G4) run GPUs that support dynamic
+device capable of running Flutter that doesn't. For example, the old devices we
+test against in CI (iPhone 6s and Moto G4) run GPUs that support dynamic
 runtime branching. For these reasons, the optimization advice in this document
 isn't aimed at branchless architectures.
 
 ### Instruction-level parallelism
 
-Some older GPUs (including the PowerVR SGX543MP2 GPU on the iPhone 4s SOC) rely
-on SIMD vector or array instructions to maximize the number of computations
+Some older GPUs (including the PowerVR GT7600 GPU on the iPhone 6s SoC) rely on
+SIMD vector or array instructions to maximize the number of computations
 performed per clock cycle on each functional unit. This means that the shader
 compiler must figure out which parts of the program are safe to parallelize
 ahead of time and emit appropriate instructions. This presents a problem for
@@ -69,7 +77,7 @@ disadvantage that SIMD does.
 ### Thread-level parallelism
 
 Newer GPUs (but also some older hardware such as the Adreno 306 GPU found on the
-Moto G4's Snapdragon SOC) use scalar functional units (no SIMD/VLIW/MIMD) and
+Moto G4's Snapdragon SoC) use scalar functional units (no SIMD/VLIW/MIMD) and
 parallelize instructions at runtime by running the same instruction over many
 threads in groups often referred to as "warps" (Nvidia terminology) or
 "wavefronts" (AMD terminology), usually consisting of 32 or 64 threads per
@@ -110,9 +118,10 @@ uniform struct FrameInfo {
 in vec2 position;
 
 void main() {
-  gl_Position = mvp * vec4(position, 0, 1)
-  if (invert_y) {
-    gl_Position *= vec2(1, -1);
+  gl_Position = frame_info.mvp * vec4(position, 0, 1);
+
+  if (frame_info.invert_y) {
+    gl_Position *= vec4(1, -1, 1, 1);
   }
 }
 ```
@@ -207,7 +216,7 @@ vertex shader -- so the value may change from fragment to fragment (as opposed
 to a _uniform_ or _constant_, which will remain the same for the whole draw
 call).
 
-On SIMT architectures, this branch incurs very little overhead because, and
+On SIMT architectures, this branch incurs very little overhead because
 `DoExtremelyExpensiveThing` will be skipped over if `color.a == 0` across all
 the threads in a given warp.
 However, architectures that use instruction-level parallelism (VLIW or SIMD)