[ETVK] Add benchmark binary + im2col/GEMM conv2d prototype#19580
This change does two related things on top of the existing direct conv2d path: it adds a new benchmark binary for general conv2d, and it adds an im2col-backed conv2d implementation that the benchmark exercises alongside the existing direct shader.
**Why a benchmark binary**
Profiling a sample CNN showed that the standard `conv2d_float` (general sliding window) shader accounts for ~93% of all conv time, with six 3x3 stride=1 same-channels shapes dominating. The existing custom-ops directory had benchmark binaries for pointwise and depthwise conv but no standalone way to iterate on the general kernel. The new `test_conv2d` binary fills that gap.
`test_conv2d.cpp` includes 7 small accuracy configs (validated against a CPU float reference) and 13 performance configs covering the sample CNN's hotspots: the six dominant `C_in == C_out` 3x3 stride=1 shapes, several stride=2 downsample variants, two channel-reduction cases, and the 3-channel RGB stem. Perf configs are run in FP32 and FP16; accuracy configs are FP32-only because the reference is float. The binary uses 5 warmup + 20 timed iterations per case so the GPU governor reaches a stable clock before measurement. On a Pixel device, the reported per-call latencies for the direct path match the in-model profile within 0.84x-0.99x for all six dominant shapes, confirming the binary is a faithful proxy for in-model conv latency.
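For reference, a minimal sketch of the warmup + timed-iteration pattern this describes. The `Conv2dConfig` struct and `run` callback are illustrative stand-ins, not the actual `test_conv2d.cpp` API:

```cpp
#include <chrono>
#include <functional>

// Hypothetical shape description for one benchmark case.
struct Conv2dConfig {
  int n, c_in, c_out, h, w, kh, kw, stride;
};

// Stand-in for a single dispatch of the shader under test.
using ConvFn = std::function<void(const Conv2dConfig&)>;

double benchmark_case(const Conv2dConfig& cfg, const ConvFn& run) {
  constexpr int kWarmup = 5;  // let the GPU governor reach a stable clock
  constexpr int kTimed = 20;  // iterations actually measured
  for (int i = 0; i < kWarmup; ++i) {
    run(cfg);
  }
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kTimed; ++i) {
    run(cfg);
  }
  auto end = std::chrono::steady_clock::now();
  double total_us =
      std::chrono::duration<double, std::micro>(end - start).count();
  return total_us / kTimed;  // mean per-call latency in microseconds
}
```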
**Why an im2col-backed conv2d**
The im2col approach materializes the conv input into a `[1, K_total, H_out, W_out]` (or `[M, K_total]`) intermediate and runs the conv as a single tiled GEMM. The im2col K-axis layout `K = (ki * Kw + kj) * Cin_padded + ci` is chosen so that every 4-tile of K holds 4 consecutive `ci` values for the same `(ki, kj)` — that way each im2col output texel reads exactly one input texel and the GEMM can use a clean 1x1-style load pattern. On the sample CNN's hotspots this gives 1.20x-1.43x FP32 and 1.50x-1.80x FP16 speedups vs. the direct shader (estimated ~21% reduction in total FP32 conv time, ~36% in FP16) on Pixel 9 Pro XL.
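A CPU-side sketch may make the index mapping concrete. This is an illustrative reference for the described `K = (ki * Kw + kj) * Cin_padded + ci` ordering, not the shader itself; padding `C_in` up to a multiple of 4 is assumed from the `Cin_padded` naming:

```cpp
#include <vector>

// Reference im2col over plain float buffers (the GPU uses textures/SSBOs).
// Each aligned 4-tile of the K axis covers 4 consecutive ci values for one
// (ki, kj), matching the layout described above.
std::vector<float> im2col(
    const std::vector<float>& input,  // [Cin, H, W], NCHW with N = 1
    int Cin, int H, int W,
    int Kh, int Kw, int stride, int pad,
    int H_out, int W_out) {
  const int Cin_padded = (Cin + 3) / 4 * 4;  // align ci to 4-tiles (assumed)
  const int K_total = Kh * Kw * Cin_padded;
  const int M = H_out * W_out;  // one row per output pixel
  std::vector<float> out(static_cast<size_t>(M) * K_total, 0.0f);
  for (int oh = 0; oh < H_out; ++oh) {
    for (int ow = 0; ow < W_out; ++ow) {
      const int m = oh * W_out + ow;
      for (int ki = 0; ki < Kh; ++ki) {
        for (int kj = 0; kj < Kw; ++kj) {
          const int ih = oh * stride - pad + ki;
          const int iw = ow * stride - pad + kj;
          if (ih < 0 || ih >= H || iw < 0 || iw >= W) continue;
          for (int ci = 0; ci < Cin; ++ci) {
            const int k = (ki * Kw + kj) * Cin_padded + ci;
            out[static_cast<size_t>(m) * K_total + k] =
                input[(static_cast<size_t>(ci) * H + ih) * W + iw];
          }
        }
      }
    }
  }
  // [M, K_total]; the conv is then output[M, C_out] = out x weights^T.
  return out;
}
```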
The implementation is split into three pieces so we can iterate on the GEMM step in isolation:
- `conv2d_im2col.glsl` + `impl/Conv2dIm2Col.{h,cpp}`: the im2col dispatch only.
- `conv2d_gemm.glsl` + the orchestration in `impl/Conv2dGemm.{h,cpp}`: a private GEMM shader for the im2col-backed case, separate from the production pointwise path so we can experiment with more aggressive optimizations (larger tiles, cooperative matrix, register blocking) without affecting `conv2d_pw_tiled`.
- `Conv2dGemm.cpp` also does the CPU-side weight repack from `[C_out, C_in, Kh, Kw]` into the matching `[C_out, K_total]` layout, wrapped in a `FreeableBuffer` so the graph owns the lifetime.
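A hedged sketch of the weight repack from the last bullet, using a plain `std::vector` where the real code wraps the result in a `FreeableBuffer`. The point is that the weights get the same `K = (ki * Kw + kj) * Cin_padded + ci` ordering as the im2col output, so the GEMM reads both operands with one indexing scheme:

```cpp
#include <vector>

// [C_out, C_in, Kh, Kw] -> [C_out, K_total] with K ordered to match the
// im2col intermediate. Padded ci slots stay zero, matching the zero-filled
// padding in the im2col output.
std::vector<float> repack_conv2d_weights(
    const std::vector<float>& w,  // [C_out, C_in, Kh, Kw]
    int C_out, int C_in, int Kh, int Kw) {
  const int Cin_padded = (C_in + 3) / 4 * 4;  // assumed 4-alignment
  const int K_total = Kh * Kw * Cin_padded;
  std::vector<float> out(static_cast<size_t>(C_out) * K_total, 0.0f);
  for (int co = 0; co < C_out; ++co) {
    for (int ci = 0; ci < C_in; ++ci) {
      for (int ki = 0; ki < Kh; ++ki) {
        for (int kj = 0; kj < Kw; ++kj) {
          const int k = (ki * Kw + kj) * Cin_padded + ci;
          out[static_cast<size_t>(co) * K_total + k] =
              w[((static_cast<size_t>(co) * C_in + ci) * Kh + ki) * Kw + kj];
        }
      }
    }
  }
  return out;
}
```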
**Device-specific storage selection**
Both shader templates codegen three variants of the im2col intermediate — `buffer`, `texture2d` width-packed `[K4_total, M]`, and `texture3d` channels-packed `[W_out, H_out, K4_total]` — and `conv2d_gemm_impl` picks at graph build time based on `graph.device_is_mali()` and the relevant max texture extents. Mali → buffer always (its texture sampling is comparatively slow vs SSBO reads). Adreno and others prefer `texture2d`, but for shapes where M would exceed `max_texture2d_dim` (e.g. `[1, 32, 144, 192]` with M = 27,648) the dispatch falls back to `texture3d`, then to `buffer` as a last resort.
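The selection order reads roughly like the following sketch. The enum, function name, and limit parameters are illustrative; the real code queries `graph.device_is_mali()` and the device's texture limits:

```cpp
enum class Im2ColStorage { kBuffer, kTexture2d, kTexture3d };

// Hypothetical condensation of the storage-selection logic described above.
Im2ColStorage pick_im2col_storage(
    bool is_mali,
    long long W_out, long long H_out, long long K4_total,
    long long max_texture2d_dim, long long max_texture3d_dim) {
  const long long M = W_out * H_out;  // rows of the im2col matrix
  if (is_mali) {
    // Mali texture sampling is comparatively slow vs SSBO reads.
    return Im2ColStorage::kBuffer;
  }
  if (M <= max_texture2d_dim && K4_total <= max_texture2d_dim) {
    return Im2ColStorage::kTexture2d;  // preferred on Adreno and others
  }
  // e.g. [1, 32, 144, 192]: M = 144 * 192 = 27,648 exceeds a typical
  // 16,384 texture2d limit, so try the [W_out, H_out, K4_total] texture3d.
  if (W_out <= max_texture3d_dim && H_out <= max_texture3d_dim &&
      K4_total <= max_texture3d_dim) {
    return Im2ColStorage::kTexture3d;
  }
  return Im2ColStorage::kBuffer;  // last resort
}
```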
On Adreno (Samsung S921), the device-specific routing brings im2col latency down to 0.47x-0.79x of the direct shader's in FP32 and 0.65x-0.96x in FP16 on the dominant shapes. On Mali (Pixel 9 Pro XL), the buffer routing brings it to 0.51x-0.78x in FP32 and 0.34x-0.46x in FP16.
**Test integration**
`test_etvk.test_conv2d.default` switches between `aten.convolution.default` and `et_vk.conv2d_gemm.default` based on the `impl_selector` string ("im2col" picks the new path), so the same benchmark binary exercises both implementations back-to-back per shape.
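Illustratively, the routing amounts to a string switch like the sketch below; the function name is hypothetical, and the actual selection lives in the test's op definition:

```cpp
#include <string>

// Hypothetical condensation of the impl_selector routing.
std::string select_conv2d_op(const std::string& impl_selector) {
  if (impl_selector == "im2col") {
    return "et_vk.conv2d_gemm.default";  // new im2col + GEMM path
  }
  return "aten.convolution.default";  // existing direct conv2d path
}
```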
Differential Revision: [D105120966](https://our.internmc.facebook.com/intern/diff/D105120966/)