[ETVK] Add benchmark binary + im2col/GEMM conv2d prototype #19580

Open

SS-JIA wants to merge 1 commit into gh/SS-JIA/538/base from gh/SS-JIA/538/head

Conversation

@SS-JIA (Contributor) commented May 14, 2026

Stack from ghstack (oldest at bottom):

This change does two related things on top of the existing direct conv2d path: it adds a new benchmark binary for general conv2d, and it adds an im2col-backed conv2d implementation that the benchmark exercises alongside the existing direct shader.

**Why a benchmark binary**

Profiling a sample CNN showed that the standard `conv2d_float` (general sliding window) shader accounts for ~93% of all conv time, with six 3x3 stride=1 same-channels shapes dominating. The existing custom-ops directory had benchmark binaries for pointwise and depthwise conv but no standalone way to iterate on the general kernel. The new `test_conv2d` binary fills that gap.

`test_conv2d.cpp` includes 7 small accuracy configs (validated against a CPU float reference) and 13 performance configs covering the sample CNN's hotspots: the six dominant `C_in == C_out` 3x3 stride=1 shapes, several stride=2 downsample variants, two channel-reduction cases, and the 3-channel RGB stem. Perf configs are run in FP32 and FP16; accuracy configs are FP32-only because the reference is float. The binary uses 5 warmup + 20 timed iterations per case so the GPU governor reaches a stable clock before measurement. On a Pixel device, the reported per-call latencies for the direct path match the in-model profile within 0.84x-0.99x for all six dominant shapes, confirming the binary is a faithful proxy for in-model conv latency.
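
A minimal sketch of the measurement pattern described above (5 warmup + 20 timed iterations per case); `Config`, `run_conv2d_case()`, and `mean_latency_us()` are illustrative stand-ins, not the binary's actual helpers.

```cpp
#include <chrono>

struct Config { /* shape, stride, dtype, ... */ };

// Stub: the real binary builds the graph, dispatches the shader, and
// waits for the GPU to finish here.
void run_conv2d_case(const Config& /*cfg*/) {}

double mean_latency_us(const Config& cfg) {
  constexpr int kWarmupIters = 5;  // let the GPU governor settle on a stable clock
  constexpr int kTimedIters = 20;  // averaged into the reported per-call latency
  for (int i = 0; i < kWarmupIters; ++i) {
    run_conv2d_case(cfg);
  }
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kTimedIters; ++i) {
    run_conv2d_case(cfg);
  }
  const std::chrono::duration<double, std::micro> total =
      std::chrono::steady_clock::now() - start;
  return total.count() / kTimedIters;
}
```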

**Why an im2col-backed conv2d**

The im2col approach materializes the conv input into a `[1, K_total, H_out, W_out]` (or `[M, K_total]`) intermediate and runs the conv as a single tiled GEMM. The im2col K-axis layout `K = (ki * Kw + kj) * Cin_padded + ci` is chosen so that every 4-tile of K holds 4 consecutive `ci` values for the same `(ki, kj)` — that way each im2col output texel reads exactly one input texel and the GEMM can use a clean 1x1-style load pattern. On the sample CNN's hotspots this gives 1.20x-1.43x FP32 and 1.50x-1.80x FP16 speedups vs. the direct shader (estimated ~21% reduction in total FP32 conv time, ~36% in FP16) on Pixel 9 Pro XL.
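
A minimal CPU sketch of this K-axis layout, assuming NCHW float input with N = 1, zero padding outside the image, and Cin padded up to a multiple of 4; all names are illustrative rather than the shader's or the repo's.

```cpp
#include <vector>

std::vector<float> im2col_reference(
    const std::vector<float>& input,  // [Cin, H, W], NCHW with N = 1
    int Cin, int H, int W,
    int Kh, int Kw, int stride, int pad,
    int H_out, int W_out) {
  const int Cin_padded = (Cin + 3) / 4 * 4;  // align Cin to texel width 4
  const int K_total = Kh * Kw * Cin_padded;
  const int M = H_out * W_out;
  std::vector<float> out(static_cast<size_t>(M) * K_total, 0.f);

  for (int m = 0; m < M; ++m) {
    const int oh = m / W_out, ow = m % W_out;
    for (int ki = 0; ki < Kh; ++ki) {
      for (int kj = 0; kj < Kw; ++kj) {
        for (int ci = 0; ci < Cin; ++ci) {
          // 4 consecutive ci values share one (ki, kj), so each 4-tile
          // along K maps to a single input texel.
          const int k = (ki * Kw + kj) * Cin_padded + ci;
          const int ih = oh * stride - pad + ki;
          const int iw = ow * stride - pad + kj;
          if (ih >= 0 && ih < H && iw >= 0 && iw < W) {
            out[static_cast<size_t>(m) * K_total + k] =
                input[(static_cast<size_t>(ci) * H + ih) * W + iw];
          }
        }
      }
    }
  }
  return out;
}
```

As a worked example under the same assumptions: for the `[1, 32, 144, 192]` shape cited below, assuming it is one of the 3x3 stride=1 same-padding hotspots, M = 144 × 192 = 27,648 and K_total = 3 × 3 × 32 = 288.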

The implementation is split into three pieces so we can iterate on the GEMM step in isolation:

- `conv2d_im2col.glsl` + `impl/Conv2dIm2Col.{h,cpp}`: the im2col dispatch only.
- `conv2d_gemm.glsl` + the orchestration in `impl/Conv2dGemm.{h,cpp}`: a private GEMM shader for the im2col-backed case, separate from the production pointwise path so we can experiment with more aggressive optimizations (larger tiles, cooperative matrix, register blocking) without affecting `conv2d_pw_tiled`.
- `Conv2dGemm.cpp` also does the CPU-side weight repack from `[C_out, C_in, Kh, Kw]` into the matching `[C_out, K_total]` layout, wrapped in a `FreeableBuffer` so the graph owns the lifetime (a sketch of the repack follows this list).
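
A hedged sketch of that repack, under the same illustrative naming as the im2col sketch above; the real implementation additionally wraps the result in a `FreeableBuffer`, which is omitted here.

```cpp
#include <vector>

std::vector<float> repack_weights(
    const std::vector<float>& w,  // [C_out, C_in, Kh, Kw], OIHW
    int C_out, int C_in, int Kh, int Kw) {
  const int Cin_padded = (C_in + 3) / 4 * 4;
  const int K_total = Kh * Kw * Cin_padded;
  // Zero-initialized so the padded ci tail contributes nothing to the GEMM.
  std::vector<float> out(static_cast<size_t>(C_out) * K_total, 0.f);
  for (int co = 0; co < C_out; ++co) {
    for (int ci = 0; ci < C_in; ++ci) {
      for (int ki = 0; ki < Kh; ++ki) {
        for (int kj = 0; kj < Kw; ++kj) {
          // Same K-axis layout as the im2col output above.
          const int k = (ki * Kw + kj) * Cin_padded + ci;
          out[static_cast<size_t>(co) * K_total + k] =
              w[((static_cast<size_t>(co) * C_in + ci) * Kh + ki) * Kw + kj];
        }
      }
    }
  }
  return out;
}
```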

**Device-specific storage selection**

Both shader templates codegen three variants of the im2col intermediate — `buffer`, `texture2d` width-packed `[K4_total, M]`, and `texture3d` channels-packed `[W_out, H_out, K4_total]` — and `conv2d_gemm_impl` picks at graph build time based on `graph.device_is_mali()` and the relevant max texture extents. Mali → buffer always (its texture sampling is comparatively slow vs SSBO reads). Adreno and others prefer `texture2d`, but for shapes where M would exceed `max_texture2d_dim` (e.g. `[1, 32, 144, 192]` with M = 27,648) the dispatch falls back to `texture3d`, then to `buffer` as a last resort.
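
The selection logic, as described, might look roughly like this sketch; only `graph.device_is_mali()` is named in the PR, so the enum, parameter names, and exact extent checks are assumptions.

```cpp
#include <cstdint>

enum class StorageType { kBuffer, kTexture2D, kTexture3D };

// Illustrative graph-build-time selection of the im2col intermediate's
// storage; the limit parameters stand in for whatever device extents the
// real conv2d_gemm_impl queries.
StorageType pick_im2col_storage(
    bool device_is_mali,  // e.g. graph.device_is_mali()
    int64_t M,            // H_out * W_out
    int64_t K4_total,     // texels along K (K_total / 4)
    int64_t W_out, int64_t H_out,
    int64_t max_texture2d_dim, int64_t max_texture3d_dim) {
  if (device_is_mali) {
    // Mali: texture sampling is comparatively slow vs. SSBO reads.
    return StorageType::kBuffer;
  }
  // Adreno and others: prefer width-packed texture2d [K4_total, M].
  if (M <= max_texture2d_dim && K4_total <= max_texture2d_dim) {
    return StorageType::kTexture2D;
  }
  // Fall back to channels-packed texture3d [W_out, H_out, K4_total].
  if (W_out <= max_texture3d_dim && H_out <= max_texture3d_dim &&
      K4_total <= max_texture3d_dim) {
    return StorageType::kTexture3D;
  }
  return StorageType::kBuffer;  // last resort
}
```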

On Adreno (Samsung S921), the device-specific routing pushes the im2col-to-direct latency ratio down to 0.47x-0.79x in FP32 and 0.65x-0.96x in FP16 on the dominant shapes. On Mali (Pixel 9 Pro XL), buffer routing pushes the ratio to 0.51x-0.78x FP32 and 0.34x-0.46x FP16 (lower is better).

**Test integration**

`test_etvk.test_conv2d.default` switches between `aten.convolution.default` and `et_vk.conv2d_gemm.default` based on the `impl_selector` string ("im2col" picks the new path), so the same benchmark binary exercises both implementations back-to-back per shape.
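
The switch itself reduces to a string compare on the selector; a trivial sketch, with everything around it assumed:

```cpp
#include <string>

// "im2col" routes to et_vk.conv2d_gemm.default; anything else keeps
// aten.convolution.default (the direct shader path).
bool use_im2col_path(const std::string& impl_selector) {
  return impl_selector == "im2col";
}
```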

Differential Revision: [D105120966](https://our.internmc.facebook.com/intern/diff/D105120966/)

[ghstack-poisoned]
@pytorch-bot Bot commented May 14, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19580

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 New Failure, 1 Unrelated Failure, 1 Unclassified Failure

As of commit 14ea170 with merge base 97cece2:

NEW FAILURE - The following job has failed:

UNCLASSIFIED FAILURE - DrCI could not classify the following job because the workflow did not run on the merge base. The failure may be pre-existing on trunk or introduced by this PR:

  • Check Labels / Check labels (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
    RuntimeError: GraphQL query

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla Bot added the CLA Signed label May 14, 2026