# Writing efficient shaders

When it comes to optimizing shaders for a wide range of devices, there is no
perfect strategy. The reality of different drivers written by different vendors
targeting different hardware is that they will vary in behavior. Any attempt at
optimizing against a specific driver will likely result in a performance loss
for some other drivers that end users will run Flutter apps against.

That being said, newer graphics devices have architectures that allow for both
simpler shader compilation and better handling of traditionally slow shader
code. In fact, straightforward "unoptimized" shader code filled with branches
may significantly outperform the equivalent branchless optimized shader code
when targeting newer GPU architectures.

Flutter actively supports devices that are more than a decade old, which
requires us to write shaders that perform well across multiple generations of
GPU architectures featuring radically different behavior. Most optimization
choices are direct tradeoffs between GPU architectures, and having an accurate
mental model for how these common architectures maximize parallelism is
essential for making good tradeoff decisions while writing shaders.

For these reasons, it's important to profile shaders against some of the older
devices that Flutter can target (such as the iPhone 4s) when making changes to
shaders that are intended to improve performance.

## GPU architecture primer

GPUs are designed to have functional units running single instructions over
many elements (the "data path") each clock cycle. This is the fundamental
aspect of GPUs that makes them work well for massively parallel compute work;
they're essentially specialized SIMD engines.

GPU parallelism generally comes in two broad architectural flavors:
**Instruction-level parallelism** and **Thread-level parallelism** -- these
architecture designs handle shader branching very differently and are covered
in great detail in the sections below. In general, older GPU architectures
(before ~2015) leverage instruction-level parallelism, while most if not all
newer GPUs leverage thread-level parallelism.

Early GPU architectures often had no runtime control flow primitives at all
(i.e. jump instructions), and compilers for these architectures needed to
handle branches ahead of time by unrolling loops, compiling a different program
for every possible branch combination, and executing all of them. However,
virtually all GPU architectures in use today have instruction-level support for
dynamic branching, and it's quite unlikely that we'll come across a mobile
device capable of running Flutter that doesn't. For example, the oldest devices
we test against in CI (iPhone 4s and Moto G4) run GPUs that support dynamic
runtime branching. For these reasons, the optimization advice in this document
isn't aimed at such devices.

### Instruction-level parallelism

Some older GPUs (including the PowerVR SGX543MP2 GPU on the iPhone 4s SOC) rely
on SIMD vector or array instructions to maximize the number of computations
performed per clock cycle on each functional unit. This means that the shader
compiler must figure out which parts of the program are safe to parallelize
ahead of time and emit appropriate instructions. This presents a problem for
certain kinds of branches: if the compiler doesn't know that the same decision
will always be taken for all data lanes in the data path at runtime (meaning
the branch is not _uniform_), it can't safely emit SIMD instructions when
compiling the branch. The result is that instructions within non-uniform
branches incur a `1/[data width]` performance penalty when compared to
non-branched instructions because they can't be parallelized.
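
For illustration, here's a minimal sketch of a non-uniform branch (a
hypothetical shader, not taken from Impeller's sources): the condition depends
on a varying input, so a SIMD compiler can't prove ahead of time that all data
lanes will take the same path.

```glsl
in vec4 color;       // Varying: may differ across the lanes of a data path.
out vec4 frag_color;

void main() {
  // Non-uniform branch: lanes within one SIMD data path may disagree on the
  // condition, so the instructions inside each path can't be vectorized and
  // must run one lane at a time.
  if (color.a > 0.5) {
    frag_color = color;
  } else {
    frag_color = vec4(color.rgb * color.a, color.a);
  }
}
```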

VLIW ("Very Long Instruction Word") is another common instruction-level
parallelism design that suffers from the same compile-time reasoning
disadvantage that SIMD does.

### Thread-level parallelism

Newer GPUs (but also some older hardware such as the Adreno 306 GPU found on
the Moto G4's Snapdragon SOC) use scalar functional units (no SIMD/VLIW/MIMD)
and parallelize instructions at runtime by running the same instruction over
many threads in groups often referred to as "warps" (Nvidia terminology) or
"wavefronts" (AMD terminology), usually consisting of 32 or 64 threads per
warp/wavefront. This design is also commonly referred to as SIMT ("Single
Instruction Multiple Thread").

To handle branching, SIMT programs use special instructions to write a thread
mask that determines which threads are activated/deactivated in the warp; only
the warp's activated threads will actually execute instructions. Given this
setup, the program can first deactivate the threads that failed the branch
condition, run the positive path, invert the mask, run the negative path, and
finally restore the mask to its original state prior to the branch. The
compiler may also insert mask checks to skip over branches when all of the
threads have been deactivated.
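
To illustrate, the comments in this sketch (hypothetical GLSL; the mask
operations are described conceptually, not as real instructions) walk through
how a SIMT GPU might execute a simple varying branch:

```glsl
in vec4 color;
out vec4 frag_color;

void main() {
  if (color.a > 0.5) {     // 1. Deactivate threads where color.a <= 0.5.
    frag_color = vec4(1);  // 2. Remaining active threads run the positive path.
  } else {                 // 3. Invert the thread mask.
    frag_color = vec4(0);  // 4. Newly active threads run the negative path.
  }
                           // 5. Restore the mask to its pre-branch state.
}
```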

Therefore, the best case scenario for a SIMT branch is that it only incurs the
cost of the conditional. The worst case scenario is that some of the warp's
threads fail the conditional and the rest succeed, requiring the program to
execute both paths of the branch back-to-back in the warp. Note that this
compares very favorably to the SIMD scenario with non-uniform/varying branches,
as SIMT is able to retain significant parallelism in all cases, whereas SIMD
cannot.

## Recommendations

### Don't flatten uniform or constant branches

Uniforms are pipeline variables accessible within a shader which are guaranteed
to not vary during a GPU program's invocation.

Example of a uniform branch in action:

```glsl
uniform struct FrameInfo {
  mat4 mvp;
  bool invert_y;
} frame_info;

in vec2 position;

void main() {
  gl_Position = frame_info.mvp * vec4(position, 0, 1);
  if (frame_info.invert_y) {
    // Flip the Y axis for render targets with an inverted coordinate space.
    gl_Position.y *= -1.0;
  }
}
```

While it's true that driver stacks have the opportunity to generate multiple
pipeline variants ahead of time to handle these branches, this advanced
functionality isn't actually necessary to achieve good runtime performance of
uniform branches on widely used mobile architectures:

* On SIMT architectures, branching on a uniform means that every thread in
  every warp will resolve to the same path, so only one path in the branch will
  ever execute.
* On VLIW/SIMD architectures, the compiler can be certain that all of the
  elements in the data path for every functional unit will resolve to the same
  path, and so it can safely emit fully parallelized instructions for the
  contents of the branch!

### Don't flatten simple varying branches

Widely used mobile GPU architectures generally don't benefit from flattening
simple varying branches. While it's true that compilers for VLIW/SIMD-based
architectures can't emit efficient instructions for these branches, the
detrimental effects of this are minimal with small branches. For modern SIMT
architectures, flattened branches can actually perform measurably worse than
straightforward branched solutions. Also, some shader compilers can collapse
small branches automatically.

Instead of this:

```glsl
vec3 ColorBurn(vec3 dst, vec3 src) {
  vec3 color = 1 - min(vec3(1), (1 - dst) / src);
  color = mix(color, vec3(1), 1 - abs(sign(dst - 1)));
  color = mix(color, vec3(0), 1 - abs(sign(src - 0)));
  return color;
}
```

...just do this:

```glsl
vec3 ColorBurn(vec3 dst, vec3 src) {
  vec3 color = 1 - min(vec3(1), (1 - dst) / src);
  if (1 - dst.r < kEhCloseEnough) {
    color.r = 1;
  }
  if (1 - dst.g < kEhCloseEnough) {
    color.g = 1;
  }
  if (1 - dst.b < kEhCloseEnough) {
    color.b = 1;
  }
  if (src.r < kEhCloseEnough) {
    color.r = 0;
  }
  if (src.g < kEhCloseEnough) {
    color.g = 0;
  }
  if (src.b < kEhCloseEnough) {
    color.b = 0;
  }
  return color;
}
```

It's easier to understand, doesn't prevent compiler optimizations, runs
measurably faster on SIMT devices, and works out to be at most marginally
slower on older VLIW devices.

### Avoid complex varying branches

Consider the following fragment shader:

```glsl
in vec4 color;
out vec4 frag_color;

void main() {
  vec4 result;

  if (color.a == 0) {
    result = vec4(0);
  } else {
    result = DoExtremelyExpensiveThing(color);
  }

  frag_color = result;
}
```

Note that `color` is _varying_. Specifically, it's an interpolated output from
a vertex shader -- so the value may change from fragment to fragment (as
opposed to a _uniform_ or _constant_, which will remain the same for the whole
draw call).

On SIMT architectures, this branch incurs very little overhead because
`DoExtremelyExpensiveThing` will be skipped over if `color.a == 0` across all
the threads in a given warp. However, architectures that use instruction-level
parallelism (VLIW or SIMD) can't handle this branch efficiently because the
compiler can't safely emit parallelized instructions on either side of the
branch.

To achieve maximum parallelism across all of these architectures, one possible
solution is to unbranch the more complex path:

```glsl
in vec4 color;
out vec4 frag_color;

void main() {
  frag_color = DoExtremelyExpensiveThing(color);

  if (color.a == 0) {
    frag_color = vec4(0);
  }
}
```

However, this may be a big tradeoff depending on how this shader is used --
this solution will perform worse on SIMT devices in cases where `color.a == 0`
across all threads in a given warp, since `DoExtremelyExpensiveThing` will no
longer be skipped! So if the cheap branch path covers a large solid portion of
a draw call's coverage area, alternative designs may be favorable.
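
As a hypothetical sketch of one such alternative: if the application can
determine on the CPU that an entire draw call only needs the cheap path, the
decision can be hoisted into a uniform (`needs_expensive_path` below is an
assumed name, not an Impeller API), turning the varying branch into a uniform
branch that both SIMD and SIMT architectures handle efficiently -- at the cost
of a per-draw decision that only works when coverage isn't mixed.

```glsl
// Hypothetical: set per draw call on the CPU when any fragment may need the
// expensive path.
uniform bool needs_expensive_path;

in vec4 color;
out vec4 frag_color;

void main() {
  // Uniform branch: every invocation takes the same path, so SIMD compilers
  // can fully vectorize it and SIMT warps never diverge.
  if (needs_expensive_path) {
    frag_color = DoExtremelyExpensiveThing(color);
  } else {
    frag_color = vec4(0);
  }
}
```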

### Beware of return branching

Consider the following GLSL function:

```glsl
vec4 FrobnicateColor(vec4 color) {
  if (color.a == 0) {
    return vec4(0);
  }

  return DoExtremelyExpensiveThing(color);
}
```

At first glance, this may appear cheap due to its simple contents, but this
branch has two exclusive paths in practice, and the generated shader assembly
will reflect the same behavior as this code:

```glsl
vec4 FrobnicateColor(vec4 color) {
  vec4 result;

  if (color.a == 0) {
    result = vec4(0);
  } else {
    result = DoExtremelyExpensiveThing(color);
  }

  return result;
}
```

The same concerns and advice apply to this branch as the scenario under "Avoid
complex varying branches".

### Use lower precision whenever possible

Most desktop GPUs don't support 16 bit (mediump) or 8 bit (lowp) floating point
operations. But many mobile GPUs (such as the Qualcomm Adreno series) do, and
according to the
[Adreno documentation](https://developer.qualcomm.com/sites/default/files/docs/adreno-gpu/developer-guide/gpu/best_practices_shaders.html#use-medium-precision-where-possible),
using lower precision floating point operations is more efficient on these
devices.
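
For example, here's a minimal sketch of applying precision qualifiers in a
GLSL ES fragment shader; on hardware without native lower-precision support,
the qualifiers act only as hints and the driver may compute at full precision
anyway.

```glsl
// Request medium precision as the default for floats in this shader.
precision mediump float;

in mediump vec4 color;
out mediump vec4 frag_color;

void main() {
  // On GPUs with native fp16 ALUs, lower precision math can increase
  // arithmetic throughput and reduce register pressure.
  frag_color = color * 0.5;
}
```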
