
Performance Optimization: Optimized TileShape Configuration for bf16 and Mixed Formats (#3591) #3710


Closed
wants to merge 1 commit

Conversation

jiawenliu64 (Member) commented:

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/783

## Performance Issue with the Current BF16 and Mixed TileShape Configuration

The current FBGEMM bf16 kernel uses a TileShape configuration of 128x128x128,
while the optimal shape for the dense bf16 tensor core instruction on H100 is m64n256k16.
The current configuration leads to suboptimal tensor core and bandwidth utilization,
as evidenced by PTX warnings about
'wgmma.mma_async instruction serialization due to insufficient register resources'.

## Optimized TileShape (128x256x64) Implementation

The TileShape configuration is changed from 128x128x128 to 128x256x64 for large GEMM
operations using a cooperative kernel, enabling better bandwidth and tensor core utilization.
This configuration is notably used in Flash Attention V3 and was identified by Colfax-intl
as the optimal configuration for bf16 kernels after an empirical study.
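
For context, the change can be pictured as CUTLASS 3.x-style tile and schedule aliases. This is a minimal illustrative sketch, not the actual FBGEMM source; the cluster shape and header choices here are assumptions.

```cpp
// Illustrative CUTLASS 3.x-style aliases for the tile/schedule change described
// above; these are not the exact FBGEMM type definitions, and the cluster
// shape is an assumption.
#include <cute/tensor.hpp>
#include <cutlass/gemm/dispatch_policy.hpp>

// Previous configuration: 128x128x128 CTA tile.
using TileShapeOld = cute::Shape<cute::_128, cute::_128, cute::_128>;

// New configuration: 128x256x64 CTA tile, paired with the cooperative
// (warp-specialized, two consumer warp groups) kernel schedule.
using TileShapeNew   = cute::Shape<cute::_128, cute::_256, cute::_64>;
using ClusterShape   = cute::Shape<cute::_2, cute::_1, cute::_1>;
using KernelSchedule = cutlass::gemm::KernelTmaWarpSpecializedCooperative;
```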

## Benchmark Results on H100 GPU

### Benchmark configuration

- PyTorch 2.6
- CUDA 12.4
- CPU: AMD EPYC
- GPU: NVIDIA H100
- Benchmarks use 30 kernel launch iterations, and results are averaged over 25 benchmark runs.
- We used the same GEMM sizes as in the Colfax benchmarks.
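
For reference, the TFlops numbers below follow directly from the GEMM sizes: 2·G·M·N·K floating-point operations divided by the averaged kernel time. The sketch below shows the arithmetic for the grouped bf16 case; the timing value is a hypothetical placeholder chosen to land near the reported result, not a measurement from this PR.

```cpp
// Back-of-the-envelope TFlops arithmetic for the grouped bf16 case
// (G = 4, M = 2048, N = 8192, K = 8192). The kernel time below is a
// hypothetical placeholder, not a number measured in this PR.
#include <cstdio>

int main() {
  const double G = 4, M = 2048, N = 8192, K = 8192;
  const double flops = 2.0 * G * M * N * K;  // one multiply-accumulate = 2 flops
  const double avg_time_ms = 1.39;           // placeholder averaged kernel time
  const double tflops = flops / (avg_time_ms * 1e-3) / 1e12;
  std::printf("%.0f TFlops\n", tflops);      // ~791 with the placeholder above
  return 0;
}
```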

### Benchmark

#### bf16bf16bf16_grouped (G = 4, M = 2,048, N = 8,192, K = 8,192)

| TileShape   | TFlops |
|-------------|--------|
| 128-128-128 | 606    |
| 128-256-64  | 790    |

#### bf16i4bf16_rowwise_batched (B = 4, M = 2,048, N = 8,192, K = 8,192)

| TileShape   | TFlops bf16* | TFlops fp16* | TFlops float* |
|-------------|--------------|--------------|---------------|
| 128-128-128 | 354          | 341          | 383           |
| 128-256-64  | 704          | 727          | 763           |

#### bf16i4bf16_rowwise (M = N = K = 8,192)

| TileShape   | TFlops bf16* | TFlops fp16* | TFlops float* |
|-------------|--------------|--------------|---------------|
| 128-128-128 | 349          | 351          | 381           |
| 128-256-64  | 652          | 663          | 693           |

#### f8i4bf16_rowwise (M = N = K = 8,192)

| TileShape   | TFlops bf16* | TFlops fp16* | TFlops float* |
|-------------|--------------|--------------|---------------|
| 128-128-128 | 407          | 542          | 606           |
| 128-256-64  | 921          | 942          | 1088          |

*WEIGHT_SCALE_DTYPE

## Technical Implementation

Modified TileShape from 128-128-128 to 128-256-64 for:

- bf16bf16bf16_grouped
- bf16i4bf16_rowwise_batched
- bf16i4bf16_rowwise
- f8i4bf16_rowwise

Added the cooperative kernel by default for:

- bf16i4bf16_rowwise_batched
- bf16i4bf16_rowwise
- f8i4bf16_rowwise

The modifications only affect the large-mode and default kernels where N > 128
(see the sketch below). These changes were made by modifying the minimum necessary
code while respecting existing coding practices in FBGEMM.
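
A rough picture of that dispatch is sketched below; the function and type names are illustrative only and do not correspond to FBGEMM's actual kernel-selection code.

```cpp
#include <cstdint>

// Illustrative shape-based dispatch only; FBGEMM's real heuristic is part of
// its kernel-selection logic and is not reproduced here.
enum class TileConfig { kDefault128x128x128, kCooperative128x256x64 };

// large_mode: whether the caller is on the large-problem kernel path.
inline TileConfig select_tile_config(bool large_mode, int64_t N) {
  // Large problems with a wide N dimension get the 128x256x64 tile together
  // with the cooperative (warp-specialized) schedule; everything else keeps
  // the previous 128x128x128 default.
  if (large_mode && N > 128) {
    return TileConfig::kCooperative128x256x64;
  }
  return TileConfig::kDefault128x128x128;
}
```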

## Test Coverage

### Unit Test Results

The unit tests in fbgemm_gpu/experimental/gen_ai/test/quantize
have been verified for the modified kernels.

@jiawenliu64 @jwfromm Hello! I wanted to share this contribution to FBGEMM.
While this is my first PR, I hope these changes could be useful for this great project.
I'd welcome any feedback if you have time to take a look. Thank you!

Reviewed By: jianyuh

Differential Revision: D68609243

Pulled By: jiawenliu64

@facebook-github-bot (Contributor) commented:

This pull request was exported from Phabricator. Differential Revision: D68609243

netlify bot commented Feb 18, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

| Name                 | Link |
|----------------------|------|
| 🔨 Latest commit      | ec50d16 |
| 🔍 Latest deploy log  | https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/67b5063f4bf6130008741c0a |
| 😎 Deploy Preview     | https://deploy-preview-3710--pytorch-fbgemm-docs.netlify.app |


@facebook-github-bot (Contributor) commented:

@jiawenliu64 merged this pull request in 19f3713.

q10 pushed a commit to q10/FBGEMM that referenced this pull request Apr 10, 2025