This repository was archived by the owner on Feb 25, 2025. It is now read-only.

Commit 405b443

Address comments
1 parent fcbabd7 commit 405b443

2 files changed: +30 -20 lines changed

impeller/README.md

Lines changed: 1 addition & 0 deletions
@@ -185,3 +185,4 @@ To your `AndroidManifest.xml` file, add under the `<application>` tag:
 * [Learning to Read GPU Frame Captures](docs/read_frame_captures.md)
 * [How to Enable Metal Validation for Command Line Apps.](docs/metal_validation.md)
 * [How Impeller Works Around The Lack of Uniform Buffers in Open GL ES 2.0.](docs/ubo_gles2.md)
+* [Guidance for writing efficient shaders](docs/shader_optimization.md)

impeller/docs/shader_optimization.md

Lines changed: 29 additions & 20 deletions
@@ -8,21 +8,29 @@ for some other drivers that end users will run Flutter apps against.
 
 That being said, newer graphics devices have architectures that allow for both
 simpler shader compilation and better handling of traditionally slow shader
-code. In fact, straightforward "unoptimized" shader code filled with branches
-may significantly outperform the equivalent branchless optimized shader code
-when targeting newer GPU architectures.
+code. In fact, ostensibly "unoptimized" shader code filled with branches may
+significantly outperform the equivalent branchless optimized shader code when
+targeting newer GPU architectures. (See the "Don't flatten simple varying
+branches" recommendation for an explanation of this with respect to different
+architectures).
 
-Flutter actively supports devices that are more than a decade old, which
+Flutter actively supports mobile devices that are more than a decade old, which
 requires us to write shaders that perform well across multiple generations of
 GPU architectures featuring radically different behavior. Most optimization
-choices are direct tradeoffs between GPU architectures, and having an accurate
-mental model for how these common architectures maximize parallelism is
-essential for making good tradeoff decisions while writing shaders.
+choices are direct tradeoffs between these GPU architectures, and so having an
+accurate mental model for how these common architectures maximize parallelism is
+essential for making good decisions while authoring shaders.
 
 For these reasons, it's also important to profile shaders against some of the
-older devices that Flutter can target (such as the iPhone 4s) when making
+older devices that Flutter can target (such as the iPhone 6s) when making
 changes intended to improve shader performance.
 
+Also, even though the branching behavior is largely architecture dependent and
+should remain the same when using different graphics APIs, it's still also a
+good idea to test changes against the different backends supported by Impeller
+(Metal and GLES). Early stage shader compilation (as well as the high level
+shader code generated by ImpellerC) may vary quite a bit between APIs.
+
 ## GPU architecture primer
 
 GPUs are designed to have functional units running single instructions over many
@@ -33,25 +41,25 @@ essentially specialized SIMD engines.
 GPU parallelism generally comes in two broad architectural flavors:
 **Instruction-level parallelism** and **Thread-level parallelism** -- these
 architecture designs handle shader branching very differently and are covered
-in great detail in sections below. In general, older GPU architectures (on some
-products released before ~2015) leverage instruction-level parallelism, while
-most if not all newer GPUs leverage thread-level parallelism.
+in the sections below. In general, older GPU architectures (on some products
+released before ~2015) leverage instruction-level parallelism, while most if not
+all newer GPUs leverage thread-level parallelism.
 
 Some of the earliest GPU architectures had no runtime control flow primitives at
 all (i.e. jump instructions), and compilers for these architectures needed to
 handle branches ahead of time by unrolling loops, compiling a different program
 for every possible branch combination, and then executing all of them. However,
 virtually all GPU architectures in use today have instruction-level support for
 dynamic branching, and it's quite unlikely that we'll come across a mobile
-device capable of running Flutter that doesn't. For example, the oldest devices
-we test against in CI (iPhone 4s and Moto G4) run GPUs that support dynamic
+device capable of running Flutter that doesn't. For example, the old devices we
+test against in CI (iPhone 6s and Moto G4) run GPUs that support dynamic
 runtime branching. For these reasons, the optimization advice in this document
 isn't aimed at branchless architectures.
 
 ### Instruction-level parallelism
 
-Some older GPUs (including the PowerVR SGX543MP2 GPU on the iPhone 4s SOC) rely
-on SIMD vector or array instructions to maximize the number of computations
+Some older GPUs (including the PowerVR GT7600 GPU on the iPhone 6s SoC) rely on
+SIMD vector or array instructions to maximize the number of computations
 performed per clock cycle on each functional unit. This means that the shader
 compiler must figure out which parts of the program are safe to parallelize
 ahead of time and emit appropriate instructions. This presents a problem for
@@ -69,7 +77,7 @@ disadvantage that SIMD does.
 ### Thread-level parallelism
 
 Newer GPUs (but also some older hardware such as the Adreno 306 GPU found on the
-Moto G4's Snapdragon SOC) use scalar functional units (no SIMD/VLIW/MIMD) and
+Moto G4's Snapdragon SoC) use scalar functional units (no SIMD/VLIW/MIMD) and
 parallelize instructions at runtime by running the same instruction over many
 threads in groups often referred to as "warps" (Nvidia terminology) or
 "wavefronts" (AMD terminology), usually consisting of 32 or 64 threads per
@@ -110,9 +118,10 @@ uniform struct FrameInfo {
 in vec2 position;
 
 void main() {
-  gl_Position = mvp * vec4(position, 0, 1)
-  if (invert_y) {
-    gl_Position *= vec2(1, -1);
+  gl_Position = frame_info.mvp * vec4(position, 0, 1)
+
+  if (frame_info.invert_y) {
+    gl_Position *= vec4(1, -1, 1, 1);
   }
 }
 ```
@@ -207,7 +216,7 @@ vertex shader -- so the value may change from fragment to fragment (as opposed
 to a _uniform_ or _constant_, which will remain the same for the whole draw
 call).
 
-On SIMT architectures, this branch incurs very little overhead because, and
+On SIMT architectures, this branch incurs very little overhead because
 `DoExtremelyExpensiveThing` will be skipped over if `color.a == 0` across all
 the threads in a given warp.
 However, architectures that use instruction-level parallelism (VLIW or SIMD)
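
For readers without the full file at hand, the varying branch that the last hunk describes might look roughly like the sketch below. The shader itself is not part of this diff. Only the `DoExtremelyExpensiveThing` name and the `color.a == 0` condition come from the hunk's context lines. The `color` input, the `frag_color` output, their locations, and the placeholder function body are assumptions made for illustration.

```glsl
#version 450

// Varying input: interpolated per fragment, so its value can differ between
// the threads of a single warp. (Name and location are assumptions.)
layout(location = 0) in vec4 color;
layout(location = 0) out vec4 frag_color;

// Hypothetical stand-in for an expensive computation. The real function body
// is not shown in the diff.
vec4 DoExtremelyExpensiveThing(vec4 c) {
  for (int i = 0; i < 64; i++) {
    c.rgb = sqrt(c.rgb * c.rgb + vec3(0.001));
  }
  return c;
}

void main() {
  if (color.a == 0.0) {
    // Fully transparent fragment: nothing worth computing.
    frag_color = vec4(0.0);
  } else {
    frag_color = DoExtremelyExpensiveThing(color);
  }
}
```

On a SIMT GPU the expensive call is skipped only when every thread in the warp takes the first path. A warp whose fragments have mixed alpha values still executes both sides of the branch.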
