[ETVK] Add benchmark binary + im2col/GEMM conv2d prototype #19580

Open

SS-JIA wants to merge 1 commit into gh/SS-JIA/538/base from gh/SS-JIA/538/head

Conversation

@SS-JIA (Contributor) commented May 14, 2026

Stack from ghstack (oldest at bottom):

This change does two related things on top of the existing direct conv2d path: it adds a new benchmark binary for general conv2d, and it adds an im2col-backed conv2d implementation that the benchmark exercises alongside the existing direct shader.

**Why a benchmark binary**

Profiling a sample CNN showed that the standard `conv2d_float` (general sliding window) shader accounts for ~93% of all conv time, with six 3x3 stride=1 same-channels shapes dominating. The existing custom-ops directory had benchmark binaries for pointwise and depthwise conv but no standalone way to iterate on the general kernel. The new `test_conv2d` binary fills that gap.

`test_conv2d.cpp` includes 7 small accuracy configs (validated against a CPU float reference) and 13 performance configs covering the sample CNN's hotspots: the six dominant `C_in == C_out` 3x3 stride=1 shapes, several stride=2 downsample variants, two channel-reduction cases, and the 3-channel RGB stem. Perf configs are run in FP32 and FP16; accuracy configs are FP32-only because the reference is float. The binary uses 5 warmup + 20 timed iterations per case so the GPU governor reaches a stable clock before measurement. On a Pixel device, the reported per-call latencies for the direct path match the in-model profile within 0.84x-0.99x for all six dominant shapes, confirming the binary is a faithful proxy for in-model conv latency.
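
A minimal sketch of the measurement pattern described above (5 warmup + 20 timed iterations per case); `Config`, `run_conv2d_case()`, and `mean_latency_us()` are illustrative stand-ins, not the binary's actual helpers.

```cpp
#include <chrono>

struct Config { /* shape, stride, dtype, ... */ };

// Stub: the real binary builds the graph, dispatches the shader, and
// waits for the GPU to finish here.
void run_conv2d_case(const Config& /*cfg*/) {}

double mean_latency_us(const Config& cfg) {
  constexpr int kWarmupIters = 5;  // let the GPU governor settle on a stable clock
  constexpr int kTimedIters = 20;  // averaged into the reported per-call latency
  for (int i = 0; i < kWarmupIters; ++i) {
    run_conv2d_case(cfg);
  }
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kTimedIters; ++i) {
    run_conv2d_case(cfg);
  }
  const std::chrono::duration<double, std::micro> total =
      std::chrono::steady_clock::now() - start;
  return total.count() / kTimedIters;
}
```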

**Why an im2col-backed conv2d**

The im2col approach materializes the conv input into a `[1, K_total, H_out, W_out]` (or `[M, K_total]`) intermediate and runs the conv as a single tiled GEMM. The im2col K-axis layout `K = (ki * Kw + kj) * Cin_padded + ci` is chosen so that every 4-tile of K holds 4 consecutive `ci` values for the same `(ki, kj)` — that way each im2col output texel reads exactly one input texel and the GEMM can use a clean 1x1-style load pattern. On the sample CNN's hotspots this gives 1.20x-1.43x FP32 and 1.50x-1.80x FP16 speedups vs. the direct shader (estimated ~21% reduction in total FP32 conv time, ~36% in FP16) on Pixel 9 Pro XL.
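
A minimal CPU sketch of this K-axis layout, assuming NCHW float input with N = 1, zero padding outside the image, and Cin padded up to a multiple of 4; all names are illustrative rather than the shader's or the repo's.

```cpp
#include <vector>

std::vector<float> im2col_reference(
    const std::vector<float>& input,  // [Cin, H, W], NCHW with N = 1
    int Cin, int H, int W,
    int Kh, int Kw, int stride, int pad,
    int H_out, int W_out) {
  const int Cin_padded = (Cin + 3) / 4 * 4;  // align Cin to texel width 4
  const int K_total = Kh * Kw * Cin_padded;
  const int M = H_out * W_out;
  std::vector<float> out(static_cast<size_t>(M) * K_total, 0.f);

  for (int m = 0; m < M; ++m) {
    const int oh = m / W_out, ow = m % W_out;
    for (int ki = 0; ki < Kh; ++ki) {
      for (int kj = 0; kj < Kw; ++kj) {
        for (int ci = 0; ci < Cin; ++ci) {
          // 4 consecutive ci values share one (ki, kj), so each 4-tile
          // along K maps to a single input texel.
          const int k = (ki * Kw + kj) * Cin_padded + ci;
          const int ih = oh * stride - pad + ki;
          const int iw = ow * stride - pad + kj;
          if (ih >= 0 && ih < H && iw >= 0 && iw < W) {
            out[static_cast<size_t>(m) * K_total + k] =
                input[(static_cast<size_t>(ci) * H + ih) * W + iw];
          }
        }
      }
    }
  }
  return out;
}
```

As a worked example under the same assumptions: for the `[1, 32, 144, 192]` shape cited below, assuming it is one of the 3x3 stride=1 same-padding hotspots, M = 144 × 192 = 27,648 and K_total = 3 × 3 × 32 = 288.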

The implementation is split into three pieces so we can iterate on the GEMM step in isolation:

- `conv2d_im2col.glsl` + `impl/Conv2dIm2Col.{h,cpp}`: the im2col dispatch only.
- `conv2d_gemm.glsl` + the orchestration in `impl/Conv2dGemm.{h,cpp}`: a private GEMM shader for the im2col-backed case, separate from the production pointwise path so we can experiment with more aggressive optimizations (larger tiles, cooperative matrix, register blocking) without affecting `conv2d_pw_tiled`.
- `Conv2dGemm.cpp` also does the CPU-side weight repack from `[C_out, C_in, Kh, Kw]` into the matching `[C_out, K_total]` layout, wrapped in a `FreeableBuffer` so the graph owns the lifetime (a sketch of the repack follows this list).
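
A hedged sketch of that repack, under the same illustrative naming as the im2col sketch above; the real implementation additionally wraps the result in a `FreeableBuffer`, which is omitted here.

```cpp
#include <vector>

std::vector<float> repack_weights(
    const std::vector<float>& w,  // [C_out, C_in, Kh, Kw], OIHW
    int C_out, int C_in, int Kh, int Kw) {
  const int Cin_padded = (C_in + 3) / 4 * 4;
  const int K_total = Kh * Kw * Cin_padded;
  // Zero-initialized so the padded ci tail contributes nothing to the GEMM.
  std::vector<float> out(static_cast<size_t>(C_out) * K_total, 0.f);
  for (int co = 0; co < C_out; ++co) {
    for (int ci = 0; ci < C_in; ++ci) {
      for (int ki = 0; ki < Kh; ++ki) {
        for (int kj = 0; kj < Kw; ++kj) {
          // Same K-axis layout as the im2col output above.
          const int k = (ki * Kw + kj) * Cin_padded + ci;
          out[static_cast<size_t>(co) * K_total + k] =
              w[((static_cast<size_t>(co) * C_in + ci) * Kh + ki) * Kw + kj];
        }
      }
    }
  }
  return out;
}
```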

**Device-specific storage selection**

Both shader templates codegen three variants of the im2col intermediate — `buffer`, `texture2d` width-packed `[K4_total, M]`, and `texture3d` channels-packed `[W_out, H_out, K4_total]` — and `conv2d_gemm_impl` picks at graph build time based on `graph.device_is_mali()` and the relevant max texture extents. Mali → buffer always (its texture sampling is comparatively slow vs SSBO reads). Adreno and others prefer `texture2d`, but for shapes where M would exceed `max_texture2d_dim` (e.g. `[1, 32, 144, 192]` with M = 27,648) the dispatch falls back to `texture3d`, then to `buffer` as a last resort.
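
The selection logic, as described, might look roughly like this sketch; only `graph.device_is_mali()` is named in the PR, so the enum, parameter names, and exact extent checks are assumptions.

```cpp
#include <cstdint>

enum class StorageType { kBuffer, kTexture2D, kTexture3D };

// Illustrative graph-build-time selection of the im2col intermediate's
// storage; the limit parameters stand in for whatever device extents the
// real conv2d_gemm_impl queries.
StorageType pick_im2col_storage(
    bool device_is_mali,  // e.g. graph.device_is_mali()
    int64_t M,            // H_out * W_out
    int64_t K4_total,     // texels along K (K_total / 4)
    int64_t W_out, int64_t H_out,
    int64_t max_texture2d_dim, int64_t max_texture3d_dim) {
  if (device_is_mali) {
    // Mali: texture sampling is comparatively slow vs. SSBO reads.
    return StorageType::kBuffer;
  }
  // Adreno and others: prefer width-packed texture2d [K4_total, M].
  if (M <= max_texture2d_dim && K4_total <= max_texture2d_dim) {
    return StorageType::kTexture2D;
  }
  // Fall back to channels-packed texture3d [W_out, H_out, K4_total].
  if (W_out <= max_texture3d_dim && H_out <= max_texture3d_dim &&
      K4_total <= max_texture3d_dim) {
    return StorageType::kTexture3D;
  }
  return StorageType::kBuffer;  // last resort
}
```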

On Adreno (Samsung S921), the device-specific routing pushes the im2col-to-direct latency ratio down to 0.47x-0.79x in FP32 and 0.65x-0.96x in FP16 on the dominant shapes. On Mali (Pixel 9 Pro XL), buffer routing pushes the ratio to 0.51x-0.78x FP32 and 0.34x-0.46x FP16 (lower is better).

**Test integration**

`test_etvk.test_conv2d.default` switches between `aten.convolution.default` and `et_vk.conv2d_gemm.default` based on the `impl_selector` string ("im2col" picks the new path), so the same benchmark binary exercises both implementations back-to-back per shape.
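
The switch itself reduces to a string compare on the selector; a trivial sketch, with everything around it assumed:

```cpp
#include <string>

// "im2col" routes to et_vk.conv2d_gemm.default; anything else keeps
// aten.convolution.default (the direct shader path).
bool use_im2col_path(const std::string& impl_selector) {
  return impl_selector == "im2col";
}
```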

Differential Revision: [D105120966](https://our.internmc.facebook.com/intern/diff/D105120966/)

[ghstack-poisoned]
@pytorch-bot Bot commented May 14, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19580

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 New Failure, 1 Unrelated Failure, 1 Unclassified Failure

As of commit 14ea170 with merge base 97cece2:

NEW FAILURE - The following job has failed:

UNCLASSIFIED FAILURE - DrCI could not classify the following job because the workflow did not run on the merge base. The failure may be pre-existing on trunk or introduced by this PR:

  • Check Labels / Check labels (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
    RuntimeError: GraphQL query

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla Bot added the CLA Signed label May 14, 2026