# Writing efficient shaders

When it comes to optimizing shaders for a wide range of devices, there is no
perfect strategy. The reality of different drivers written by different vendors
targeting different hardware is that they will vary in behavior. Any attempt at
optimizing against a specific driver will likely result in a performance loss
for some other drivers that end users will run Flutter apps against.

That being said, newer graphics devices have architectures that allow for both
simpler shader compilation and better handling of traditionally slow shader
code. In fact, straightforward "unoptimized" shader code filled with branches
may significantly outperform the equivalent branchless optimized shader code
when targeting newer GPU architectures.

Flutter actively supports devices that are more than a decade old, which
requires us to write shaders that perform well across multiple generations of
GPU architectures featuring radically different behavior. Most optimization
choices are direct tradeoffs between GPU architectures, and having an accurate
mental model for how these common architectures maximize parallelism is
essential for making good tradeoff decisions while writing shaders.

For these reasons, it's important to profile shaders against some of the older
devices that Flutter can target (such as the iPhone 4s) when making changes to
shaders that are intended to improve performance.

## GPU architecture primer

GPUs are designed to have functional units running single instructions over many
elements (the "data path") each clock cycle. This is the fundamental aspect of
GPUs that makes them work well for massively parallel compute work; they're
essentially specialized SIMD engines.

GPU parallelism generally comes in two broad architectural flavors:
**Instruction-level parallelism** and **Thread-level parallelism** -- these
architecture designs handle shader branching very differently and are covered
in great detail in the sections below. In general, older GPU architectures (before
~2015) leverage instruction-level parallelism, while most if not all newer GPUs
leverage thread-level parallelism.

Early GPU architectures often had no runtime control flow primitives at all
(such as jump instructions), and compilers for these architectures needed to
handle branches ahead of time by unrolling loops, compiling a different program
for every possible branch combination, and executing all of them. However,
virtually all GPU architectures in use today have instruction-level support for
dynamic branching, and it's quite unlikely that we'll come across a mobile
device capable of running Flutter that doesn't. For example, the oldest devices
we test against in CI (the iPhone 4s and Moto G4) run GPUs that support dynamic
runtime branching. For these reasons, the optimization advice in this document
isn't aimed at such devices.

### Instruction-level parallelism

Some older GPUs (including the PowerVR SGX543MP2 GPU on the iPhone 4s SOC) rely
on SIMD vector or array instructions to maximize the number of computations
performed per clock cycle on each functional unit. This means that the shader
compiler must figure out which parts of the program are safe to parallelize
ahead of time and emit appropriate instructions. This presents a problem for
certain kinds of branches: If the compiler doesn't know that the same decision
will always be taken for all data lanes in the data path at runtime (meaning the
branch is not _uniform_), it can't safely emit SIMD instructions when compiling
the branch. The result is that instructions within non-uniform branches incur a
`1/[data width]` performance penalty when compared to non-branched instructions
because they can't be parallelized.

VLIW ("Very Long Instruction Word") is another common instruction-level
parallelism design that suffers from the same compile-time reasoning
disadvantage that SIMD does.

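To make this concrete, here's a minimal illustrative sketch (not a real shader
from the Flutter tree) of a branch that defeats compile-time parallelization,
because the condition depends on a varying input:

```glsl
in float t;
out vec4 frag_color;

void main() {
  // Non-uniform branch: `t` may differ across the lanes of a single data
  // path, so a SIMD/VLIW compiler can't safely emit vectorized instructions
  // for either side of the branch and must fall back to slower scalar code.
  if (t > 0.5) {
    frag_color = vec4(1);
  } else {
    frag_color = vec4(0);
  }
}
```
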
### Thread-level parallelism

Newer GPUs (but also some older hardware such as the Adreno 306 GPU found on the
Moto G4's Snapdragon SOC) use scalar functional units (no SIMD/VLIW/MIMD) and
parallelize instructions at runtime by running the same instruction over many
threads in groups often referred to as "warps" (Nvidia terminology) or
"wavefronts" (AMD terminology), usually consisting of 32 or 64 threads per
warp/wavefront. This design is also commonly referred to as SIMT ("Single
Instruction Multiple Thread").

To handle branching, SIMT programs use special instructions to write a thread
mask that determines which threads are activated/deactivated in the warp; only
the warp's activated threads will actually execute instructions. Given this
setup, the program can first deactivate threads that failed the branch
condition, run the positive path, invert the mask, run the negative path, and
finally restore the mask to its original state prior to the branch. The compiler
may also insert mask checks to skip over branches when all of the threads have
been deactivated.

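As a rough illustration of this masking scheme (the real instruction sequence
is architecture-specific), the comments below trace how a SIMT GPU might
execute a simple varying branch:

```glsl
in vec4 color;
out vec4 frag_color;

void main() {
  if (color.a > 0.5) {    // 1. Write the thread mask from the condition,
                          //    deactivating threads where it's false.
    frag_color = vec4(1); // 2. Only active threads execute this path; it can
                          //    be skipped entirely if the mask is all zeros.
  } else {                // 3. Invert the thread mask.
    frag_color = vec4(0); // 4. The remaining threads execute this path.
  }                       // 5. Restore the mask to its pre-branch state.
}
```
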
Therefore, the best case scenario for a SIMT branch is that it only incurs the
cost of the conditional. The worst case scenario is that some of the warp's
threads fail the conditional and the rest succeed, requiring the program to
execute both paths of the branch back-to-back in the warp. Note that this still
compares very favorably to the SIMD scenario with non-uniform/varying branches,
as SIMT is able to retain significant parallelism in all cases, whereas SIMD
cannot.

## Recommendations

### Don't flatten uniform or constant branches

Uniforms are pipeline variables accessible within a shader which are guaranteed
to not vary during a GPU program's invocation.

Example of a uniform branch in action:

```glsl
uniform struct FrameInfo {
  mat4 mvp;
  bool invert_y;
} frame_info;

in vec2 position;

void main() {
  gl_Position = frame_info.mvp * vec4(position, 0, 1);
  if (frame_info.invert_y) {
    gl_Position *= vec4(1, -1, 1, 1);
  }
}
```

While it's true that driver stacks have the opportunity to generate multiple
pipeline variants ahead of time to handle these branches, this advanced
functionality isn't actually necessary to achieve good runtime performance
for uniform branches on widely used mobile architectures:
* On SIMT architectures, branching on a uniform means that every thread in every
  warp will resolve to the same path, so only one path in the branch will ever
  execute.
* On VLIW/SIMD architectures, the compiler can be certain that all of the
  elements in the data path for every functional unit will resolve to the same
  path, and so it can safely emit fully parallelized instructions for the
  contents of the branch!

### Don't flatten simple varying branches

Widely used mobile GPU architectures generally don't benefit from flattening
simple varying branches. While it's true that compilers for VLIW/SIMD-based
architectures can't emit efficient instructions for these branches, the
detrimental effects of this are minimal with small branches. For modern SIMT
architectures, flattened branches can actually perform measurably worse than
straightforward branch solutions. Also, some shader compilers can collapse
small branches automatically.

Instead of this:

```glsl
vec3 ColorBurn(vec3 dst, vec3 src) {
  vec3 color = 1 - min(vec3(1), (1 - dst) / src);
  color = mix(color, vec3(1), 1 - abs(sign(dst - 1)));
  color = mix(color, vec3(0), 1 - abs(sign(src - 0)));
  return color;
}
```

...just do this:

```glsl
vec3 ColorBurn(vec3 dst, vec3 src) {
  // Note: kEhCloseEnough is a small epsilon constant defined elsewhere.
  vec3 color = 1 - min(vec3(1), (1 - dst) / src);
  if (1 - dst.r < kEhCloseEnough) {
    color.r = 1;
  }
  if (1 - dst.g < kEhCloseEnough) {
    color.g = 1;
  }
  if (1 - dst.b < kEhCloseEnough) {
    color.b = 1;
  }
  if (src.r < kEhCloseEnough) {
    color.r = 0;
  }
  if (src.g < kEhCloseEnough) {
    color.g = 0;
  }
  if (src.b < kEhCloseEnough) {
    color.b = 0;
  }
  return color;
}
```

It's easier to understand, doesn't prevent compiler optimizations, runs
measurably faster on SIMT devices, and works out to be at most marginally slower
on older VLIW devices.

### Avoid complex varying branches

Consider the following fragment shader:

```glsl
in vec4 color;
out vec4 frag_color;

void main() {
  vec4 result;

  if (color.a == 0) {
    result = vec4(0);
  } else {
    result = DoExtremelyExpensiveThing(color);
  }

  frag_color = result;
}
```

Note that `color` is _varying_. Specifically, it's an interpolated output from a
vertex shader -- so the value may change from fragment to fragment (as opposed
to a _uniform_ or _constant_, which will remain the same for the whole draw
call).

On SIMT architectures, this branch incurs very little overhead because
`DoExtremelyExpensiveThing` will be skipped over if `color.a == 0` across all
the threads in a given warp.
However, architectures that use instruction-level parallelism (VLIW or SIMD)
can't handle this branch efficiently because the compiler can't safely emit
parallelized instructions on either side of the branch.

To achieve maximum parallelism across all of these architectures, one possible
solution is to unbranch the more complex path:

```glsl
in vec4 color;
out vec4 frag_color;

void main() {
  frag_color = DoExtremelyExpensiveThing(color);

  if (color.a == 0) {
    frag_color = vec4(0);
  }
}
```

However, this may be a big tradeoff depending on how this shader is used -- this
solution will perform worse on SIMT devices in cases where `color.a == 0` across
all threads in a given warp, since `DoExtremelyExpensiveThing` will no longer be
skipped! So if the cheap branch path covers a large solid portion of a draw
call's coverage area, alternative designs may be favorable.

### Beware of return branching

Consider the following GLSL function:

```glsl
vec4 FrobnicateColor(vec4 color) {
  if (color.a == 0) {
    return vec4(0);
  }

  return DoExtremelyExpensiveThing(color);
}
```

At first glance, this may appear cheap due to its simple contents, but this
branch has two exclusive paths in practice, and the generated shader assembly
will reflect the same behavior as this code:

```glsl
vec4 FrobnicateColor(vec4 color) {
  vec4 result;

  if (color.a == 0) {
    result = vec4(0);
  } else {
    result = DoExtremelyExpensiveThing(color);
  }

  return result;
}
```

The same concerns and advice apply to this branch as the scenario under "Avoid
complex varying branches".

### Use lower precision whenever possible

Most desktop GPUs don't support 16 bit (mediump) or 8 bit (lowp) floating point
operations. But many mobile GPUs (such as the Qualcomm Adreno series) do, and
according to the
[Adreno documentation](https://developer.qualcomm.com/sites/default/files/docs/adreno-gpu/developer-guide/gpu/best_practices_shaders.html#use-medium-precision-where-possible),
using lower precision floating point operations is more efficient on these
devices.
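
For example, here's a minimal sketch (assuming a GLSL ES fragment shader, where
default precision declarations apply) of opting a shader's color math into
medium precision:

```glsl
// Default all floats in this shader to 16 bit (mediump) precision. This is
// sufficient for normalized color values, which fit comfortably within
// mediump's guaranteed range.
precision mediump float;

in vec4 color;
out vec4 frag_color;

void main() {
  frag_color = color * 0.5;
}
```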