Address comments

bdero · bdero · commit 405b44322d08 · 2022-07-17T17:29:45.000-07:00
diff --git a/impeller/README.md b/impeller/README.md
@@ -185,3 +185,4 @@ To your `AndroidManifest.xml` file, add under the `<application>` tag:
 * [Learning to Read GPU Frame Captures](docs/read_frame_captures.md)
 * [How to Enable Metal Validation for Command Line Apps.](docs/metal_validation.md)
 * [How Impeller Works Around The Lack of Uniform Buffers in Open GL ES 2.0.](docs/ubo_gles2.md)
+* [Guidance for writing efficient shaders](docs/shader_optimization.md)
diff --git a/impeller/docs/shader_optimization.md b/impeller/docs/shader_optimization.md
@@ -8,21 +8,29 @@ for some other drivers that end users will run Flutter apps against.
 
 That being said, newer graphics devices have architectures that allow for both
 simpler shader compilation and better handling of traditionally slow shader
-code. In fact, straightforward "unoptimized" shader code filled with branches
-may significantly outperform the equivalent branchless optimized shader code
-when targeting newer GPU architectures.
+code. In fact, ostensibly "unoptimized" shader code filled with branches may
+significantly outperform the equivalent branchless optimized shader code when
+targeting newer GPU architectures. (See the "Don't flatten simple varying
+branches" recommendation for an explanation of this with respect to different
+architectures.)
 
-Flutter actively supports devices that are more than a decade old, which
+Flutter actively supports mobile devices that are more than a decade old, which
 requires us to write shaders that perform well across multiple generations of
 GPU architectures featuring radically different behavior. Most optimization
-choices are direct tradeoffs between GPU architectures, and having an accurate
-mental model for how these common architectures maximize parallelism is
-essential for making good tradeoff decisions while writing shaders.
+choices are direct tradeoffs between these GPU architectures, and so having an
+accurate mental model for how these common architectures maximize parallelism is
+essential for making good decisions while authoring shaders.
 
 For these reasons, it's also important to profile shaders against some of the
-older devices that Flutter can target (such as the iPhone 4s) when making
+older devices that Flutter can target (such as the iPhone 6s) when making
 changes intended to improve shader performance.
 
+Also, even though branching behavior is largely architecture-dependent and
+should remain the same when using different graphics APIs, it's still a good
+idea to test changes against the different backends supported by Impeller
+(Metal and GLES). Early-stage shader compilation (as well as the high-level
+shader code generated by ImpellerC) may vary quite a bit between APIs.
+
 ## GPU architecture primer
 
 GPUs are designed to have functional units running single instructions over many
@@ -33,25 +41,25 @@ essentially specialized SIMD engines.
 GPU parallelism generally comes in two broad architectural flavors:
 **Instruction-level parallelism** and **Thread-level parallelism** -- these
 architecture designs handle shader branching very differently and are covered
-in great detail in sections below. In general, older GPU architectures (on some
-products released before ~2015) leverage instruction-level parallelism, while
-most if not all newer GPUs leverage thread-level parallelism.
+in the sections below. In general, older GPU architectures (on some products
+released before ~2015) leverage instruction-level parallelism, while most if not
+all newer GPUs leverage thread-level parallelism.
 
 Some of the earliest GPU architectures had no runtime control flow primitives at
 all (i.e. jump instructions), and compilers for these architectures needed to
 handle branches ahead of time by unrolling loops, compiling a different program
 for every possible branch combination, and then executing all of them. However,
 virtually all GPU architectures in use today have instruction-level support for
 dynamic branching, and it's quite unlikely that we'll come across a mobile
-device capable of running Flutter that doesn't. For example, the oldest devices
-we test against in CI (iPhone 4s and Moto G4) run GPUs that support dynamic
+device capable of running Flutter that doesn't. For example, the old devices we
+test against in CI (iPhone 6s and Moto G4) run GPUs that support dynamic
 runtime branching. For these reasons, the optimization advice in this document
 isn't aimed at branchless architectures.
 
 ### Instruction-level parallelism
 
-Some older GPUs (including the PowerVR SGX543MP2 GPU on the iPhone 4s SOC) rely
-on SIMD vector or array instructions to maximize the number of computations
+Some older GPUs (including the PowerVR GT7600 GPU on the iPhone 6s SoC) rely on
+SIMD vector or array instructions to maximize the number of computations
 performed per clock cycle on each functional unit. This means that the shader
 compiler must figure out which parts of the program are safe to parallelize
 ahead of time and emit appropriate instructions. This presents a problem for
@@ -69,7 +77,7 @@ disadvantage that SIMD does.
 ### Thread-level parallelism
 
 Newer GPUs (but also some older hardware such as the Adreno 306 GPU found on the
-Moto G4's Snapdragon SOC) use scalar functional units (no SIMD/VLIW/MIMD) and
+Moto G4's Snapdragon SoC) use scalar functional units (no SIMD/VLIW/MIMD) and
 parallelize instructions at runtime by running the same instruction over many
 threads in groups often referred to as "warps" (Nvidia terminology) or
 "wavefronts" (AMD terminology), usually consisting of 32 or 64 threads per
@@ -110,9 +118,10 @@ uniform struct FrameInfo {
 in vec2 position;
 
 void main() {
-  gl_Position = mvp * vec4(position, 0, 1)
-  if (invert_y) {
-    gl_Position *= vec2(1, -1);
+  gl_Position = frame_info.mvp * vec4(position, 0, 1);
+
+  if (frame_info.invert_y) {
+    gl_Position *= vec4(1, -1, 1, 1);
   }
 }
 ```
@@ -207,7 +216,7 @@ vertex shader -- so the value may change from fragment to fragment (as opposed
 to a _uniform_ or _constant_, which will remain the same for the whole draw
 call).
 
-On SIMT architectures, this branch incurs very little overhead because, and
+On SIMT architectures, this branch incurs very little overhead because
 `DoExtremelyExpensiveThing` will be skipped over if `color.a == 0` across all
 the threads in a given warp.
 However, architectures that use instruction-level parallelism (VLIW or SIMD)