
Commit efc604e

[Impeller] Add guidance for writing shaders (flutter#34634)
1 parent 45e92fb commit efc604e

File tree: 2 files changed, +290 -0 lines changed

impeller/README.md

Lines changed: 1 addition & 0 deletions
@@ -185,3 +185,4 @@ To your `AndroidManifest.xml` file, add under the `<application>` tag:
 * [Learning to Read GPU Frame Captures](docs/read_frame_captures.md)
 * [How to Enable Metal Validation for Command Line Apps.](docs/metal_validation.md)
 * [How Impeller Works Around The Lack of Uniform Buffers in Open GL ES 2.0.](docs/ubo_gles2.md)
+* [Guidance for writing efficient shaders](docs/shader_optimization.md)

impeller/docs/shader_optimization.md

Lines changed: 289 additions & 0 deletions
@@ -0,0 +1,289 @@
# Writing efficient shaders

When it comes to optimizing shaders for a wide range of devices, there is no
perfect strategy. The reality of different drivers written by different vendors
targeting different hardware is that they will vary in behavior. Any attempt at
optimizing against a specific driver will likely result in a performance loss
for some other drivers that end users will run Flutter apps against.

That being said, newer graphics devices have architectures that allow for both
simpler shader compilation and better handling of traditionally slow shader
code. In fact, ostensibly "unoptimized" shader code filled with branches may
significantly outperform the equivalent branchless optimized shader code when
targeting newer GPU architectures. (See the "Don't flatten simple varying
branches" recommendation for an explanation of this with respect to different
architectures.)

Flutter actively supports mobile devices that are more than a decade old, which
requires us to write shaders that perform well across multiple generations of
GPU architectures featuring radically different behavior. Most optimization
choices are direct tradeoffs between these GPU architectures, and so having an
accurate mental model for how these common architectures maximize parallelism
is essential for making good decisions while authoring shaders.

For these reasons, it's also important to profile shaders against some of the
older devices that Flutter can target (such as the iPhone 6s) when making
changes intended to improve shader performance.

Also, even though branching behavior is largely architecture dependent and
should remain the same when using different graphics APIs, it's still a good
idea to test changes against the different backends supported by Impeller
(Metal and GLES). Early stage shader compilation (as well as the high level
shader code generated by ImpellerC) may vary quite a bit between APIs.

## GPU architecture primer

GPUs are designed to have functional units running single instructions over
many elements (the "data path") each clock cycle. This is the fundamental
aspect of GPUs that makes them work well for massively parallel compute work;
they're essentially specialized SIMD engines.

GPU parallelism generally comes in two broad architectural flavors:
**Instruction-level parallelism** and **Thread-level parallelism**. These
architecture designs handle shader branching very differently and are covered
in the sections below. In general, older GPU architectures (on some products
released before ~2015) leverage instruction-level parallelism, while most if
not all newer GPUs leverage thread-level parallelism.

Some of the earliest GPU architectures had no runtime control flow primitives
at all (i.e. jump instructions), and compilers for these architectures needed
to handle branches ahead of time by unrolling loops, compiling a different
program for every possible branch combination, and then executing all of them.
However, virtually all GPU architectures in use today have instruction-level
support for dynamic branching, and it's quite unlikely that we'll come across a
mobile device capable of running Flutter that doesn't. For example, the old
devices we test against in CI (iPhone 6s and Moto G4) run GPUs that support
dynamic runtime branching. For these reasons, the optimization advice in this
document isn't aimed at branchless architectures.

### Instruction-level parallelism

Some older GPUs (including the PowerVR GT7600 GPU on the iPhone 6s SoC) rely on
SIMD vector or array instructions to maximize the number of computations
performed per clock cycle on each functional unit. This means that the shader
compiler must figure out which parts of the program are safe to parallelize
ahead of time and emit appropriate instructions. This presents a problem for
certain kinds of branches: if the compiler doesn't know that the same decision
will always be taken by all of the data lanes at runtime (meaning the branch is
_varying_), it can't safely emit SIMD instructions while compiling the branch.
The result is that instructions within non-uniform branches incur a
`1/[data width]` performance penalty when compared to non-branched instructions
because they can't be parallelized.

VLIW ("Very Long Instruction Word") is another common instruction-level
parallelism design that suffers from the same compile-time reasoning
disadvantage that SIMD does.

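To make the distinction concrete, below is a hypothetical GLSL sketch (the
`threshold` and `value` names are illustrative, not taken from Impeller). A
SIMD/VLIW compiler can fully vectorize the body of the first branch, but not
the second:

```glsl
uniform float threshold;  // Uniform: identical for every invocation in a draw.

in float value;  // Varying: interpolated, so it may differ per fragment.
out vec4 frag_color;

void main() {
  vec4 result = vec4(0);

  // Uniform condition: every data lane takes the same path, so the compiler
  // can emit fully parallelized instructions for the body.
  if (threshold > 0.5) {
    result.r = 1.0;
  }

  // Varying condition: lanes may disagree at runtime, so the body can't be
  // vectorized ahead of time and pays the 1/[data width] penalty.
  if (value > 0.5) {
    result.g = 1.0;
  }

  frag_color = result;
}
```
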
### Thread-level parallelism

Newer GPUs (but also some older hardware such as the Adreno 306 GPU found on
the Moto G4's Snapdragon SoC) use scalar functional units (no SIMD/VLIW/MIMD)
and parallelize instructions at runtime by running the same instruction over
many threads in groups often referred to as "warps" (Nvidia terminology) or
"wavefronts" (AMD terminology), usually consisting of 32 or 64 threads per
warp/wavefront. This design is also commonly referred to as SIMT ("Single
Instruction, Multiple Thread").

To handle branching, SIMT programs use special instructions to write a thread
mask that determines which threads are activated/deactivated in the warp; only
the warp's activated threads will actually execute instructions. Given this
setup, the program can first deactivate the threads that failed the branch
condition, run the positive path, invert the mask, run the negative path, and
finally restore the mask to its original state prior to the branch. The
compiler may also insert mask checks to skip over branches when all of the
threads have been deactivated.

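As an illustration of this masking scheme, the comments in the following sketch
walk through how a SIMT device might conceptually execute a varying branch
(these are conceptual steps, not a real instruction set):

```glsl
in float value;
out vec4 frag_color;

void main() {
  vec4 result;

  if (value > 0.5) {   // Write the thread mask: only passing threads stay active.
    result = vec4(1);  // Executed with failing threads deactivated.
  } else {             // Invert the thread mask.
    result = vec4(0);  // Executed with passing threads deactivated.
  }                    // Restore the mask: all threads active again.

  // If every thread in the warp agreed on the condition, inserted mask
  // checks let the untaken path be skipped entirely.
  frag_color = result;
}
```
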
Therefore, the best case scenario for a SIMT branch is that it only incurs the
cost of the conditional. The worst case scenario is that some of the warp's
threads fail the conditional and the rest succeed, requiring the program to
execute both paths of the branch back-to-back in the warp. Note that this
compares very favorably to the SIMD scenario with non-uniform/varying branches,
as SIMT is able to retain significant parallelism in all cases, whereas SIMD
cannot.

## Recommendations

### Don't flatten uniform or constant branches

Uniforms are pipeline variables accessible within a shader which are guaranteed
to not vary during a GPU program's invocation.

Example of a uniform branch in action:

```glsl
uniform struct FrameInfo {
  mat4 mvp;
  bool invert_y;
} frame_info;

in vec2 position;

void main() {
  gl_Position = frame_info.mvp * vec4(position, 0, 1);

  if (frame_info.invert_y) {
    gl_Position *= vec4(1, -1, 1, 1);
  }
}
```

While it's true that driver stacks have the opportunity to generate multiple
pipeline variants ahead of time to handle these branches, this advanced
functionality isn't actually necessary for good runtime performance of uniform
branches on widely used mobile architectures:
* On SIMT architectures, branching on a uniform means that every thread in
  every warp will resolve to the same path, so only one path in the branch will
  ever execute.
* On VLIW/SIMD architectures, the compiler can be certain that all of the
  elements in the data path for every functional unit will resolve to the same
  path, and so it can safely emit fully parallelized instructions for the
  contents of the branch!

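For contrast, a flattened version of the `invert_y` branch above might look
like the sketch below. Per the guidance above, this buys nothing on common
mobile architectures while paying for the select on every vertex:

```glsl
// Hypothetical flattened form of the uniform branch above: mix() selects
// between the identity scale and the Y-flip scale on every vertex.
gl_Position *= mix(vec4(1), vec4(1, -1, 1, 1), float(frame_info.invert_y));
```
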
### Don't flatten simple varying branches

Widely used mobile GPU architectures generally don't benefit from flattening
simple varying branches. While it's true that compilers for VLIW/SIMD-based
architectures can't emit efficient instructions for these branches, the
detrimental effects of this are minimal with small branches. For modern SIMT
architectures, flattened branches can actually perform measurably worse than
straightforward branch solutions. Also, some shader compilers can collapse
small branches automatically.

Instead of this:

```glsl
vec3 ColorBurn(vec3 dst, vec3 src) {
  vec3 color = 1 - min(vec3(1), (1 - dst) / src);
  color = mix(color, vec3(1), 1 - abs(sign(dst - 1)));
  color = mix(color, vec3(0), 1 - abs(sign(src - 0)));
  return color;
}
```

...just do this:

```glsl
vec3 ColorBurn(vec3 dst, vec3 src) {
  vec3 color = 1 - min(vec3(1), (1 - dst) / src);
  if (1 - dst.r < kEhCloseEnough) {
    color.r = 1;
  }
  if (1 - dst.g < kEhCloseEnough) {
    color.g = 1;
  }
  if (1 - dst.b < kEhCloseEnough) {
    color.b = 1;
  }
  if (src.r < kEhCloseEnough) {
    color.r = 0;
  }
  if (src.g < kEhCloseEnough) {
    color.g = 0;
  }
  if (src.b < kEhCloseEnough) {
    color.b = 0;
  }
  return color;
}
```

It's easier to understand, doesn't prevent compiler optimizations, runs
measurably faster on SIMT devices, and works out to be at most marginally
slower on older VLIW devices.

### Avoid complex varying branches

Consider the following fragment shader:

```glsl
in vec4 color;
out vec4 frag_color;

void main() {
  vec4 result;

  if (color.a == 0) {
    result = vec4(0);
  } else {
    result = DoExtremelyExpensiveThing(color);
  }

  frag_color = result;
}
```

Note that `color` is _varying_. Specifically, it's an interpolated output from
a vertex shader, so the value may change from fragment to fragment (as opposed
to a _uniform_ or _constant_, which will remain the same for the whole draw
call).

On SIMT architectures, this branch incurs very little overhead because
`DoExtremelyExpensiveThing` will be skipped over if `color.a == 0` across all
the threads in a given warp. However, architectures that use instruction-level
parallelism (VLIW or SIMD) can't handle this branch efficiently because the
compiler can't safely emit parallelized instructions on either side of the
branch.

To achieve maximum parallelism across all of these architectures, one possible
solution is to unbranch the more complex path:

```glsl
in vec4 color;
out vec4 frag_color;

void main() {
  frag_color = DoExtremelyExpensiveThing(color);

  if (color.a == 0) {
    frag_color = vec4(0);
  }
}
```

However, this may be a big tradeoff depending on how the shader is used: this
solution will perform worse on SIMT devices in cases where `color.a == 0`
across all threads in a given warp, since `DoExtremelyExpensiveThing` will no
longer be skipped! So if the cheap branch path covers a large solid portion of
a draw call's coverage area, alternative designs may be favorable.

### Beware of return branching

Consider the following GLSL function:

```glsl
vec4 FrobnicateColor(vec4 color) {
  if (color.a == 0) {
    return vec4(0);
  }

  return DoExtremelyExpensiveThing(color);
}
```

At first glance, this may appear cheap due to its simple contents, but this
branch has two exclusive paths in practice, and the generated shader assembly
will reflect the same behavior as this code:

```glsl
vec4 FrobnicateColor(vec4 color) {
  vec4 result;

  if (color.a == 0) {
    result = vec4(0);
  } else {
    result = DoExtremelyExpensiveThing(color);
  }

  return result;
}
```

The same concerns and advice apply to this branch as to the scenario under
"Avoid complex varying branches".

### Use lower precision whenever possible

Most desktop GPUs don't support 16-bit (`mediump`) or 8-bit (`lowp`) floating
point operations. But many mobile GPUs (such as the Qualcomm Adreno series) do,
and according to the
[Adreno documentation](https://developer.qualcomm.com/sites/default/files/docs/adreno-gpu/developer-guide/gpu/best_practices_shaders.html#use-medium-precision-where-possible),
using lower precision floating point operations is more efficient on these
devices.
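
For instance, precision can be lowered with GLSL precision qualifiers. The
following is a minimal GLSL ES-style sketch, not code taken from Impeller:

```glsl
// Request reduced (at least 16-bit) precision for float operations.
// Hardware without native support simply computes at higher precision.
precision mediump float;

in vec4 color;
out vec4 frag_color;

void main() {
  // Simple color math rarely needs highp.
  frag_color = color * 0.5;
}
```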
