
Performance Optimization: Optimized TileShape Configuration for bf16 and Mixed Formats (#3591) #3710


Closed
wants to merge 1 commit

Conversation

jiawenliu64 (Member) commented:

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/783

## Performance Issue with the Current BF16 and Mixed TileShape Configuration

The current FBGEMM bf16 kernel uses a TileShape configuration of 128x128x128,
while the optimal shape for the dense bf16 tensor core instruction on H100 is m64n256k16.
The current configuration leads to suboptimal tensor core and bandwidth utilization,
as evidenced by PTX warnings about
'wgmma.mma_async instruction serialization due to insufficient register resources'.

## Optimized TileShape (128x256x64) Implementation

The TileShape configuration is changed from 128x128x128 to 128x256x64 for large GEMM
operations using a cooperative kernel, enabling better bandwidth and tensor core utilization.
This configuration is notably used in Flash Attention V3 and was identified by Colfax-intl
as the optimal configuration for bf16 kernels after an empirical study.
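
For context, the change can be pictured as CUTLASS 3.x-style tile and schedule aliases. This is a minimal illustrative sketch, not the actual FBGEMM source; the cluster shape and header choices here are assumptions.

```cpp
// Illustrative CUTLASS 3.x-style aliases for the tile/schedule change described
// above; these are not the exact FBGEMM type definitions, and the cluster
// shape is an assumption.
#include <cute/tensor.hpp>
#include <cutlass/gemm/dispatch_policy.hpp>

// Previous configuration: 128x128x128 CTA tile.
using TileShapeOld = cute::Shape<cute::_128, cute::_128, cute::_128>;

// New configuration: 128x256x64 CTA tile, paired with the cooperative
// (warp-specialized, two consumer warp groups) kernel schedule.
using TileShapeNew   = cute::Shape<cute::_128, cute::_256, cute::_64>;
using ClusterShape   = cute::Shape<cute::_2, cute::_1, cute::_1>;
using KernelSchedule = cutlass::gemm::KernelTmaWarpSpecializedCooperative;
```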

## Benchmark Results on H100 GPU

### Benchmark configuration

- PyTorch 2.6
- CUDA 12.4
- CPU: AMD EPYC
- GPU: NVIDIA H100
- Benchmarks use 30 kernel launch iterations, and results are averaged over 25 benchmark runs.
- We used the same GEMM sizes as in the Colfax benchmarks.
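
For reference, the TFlops numbers below follow directly from the GEMM sizes: 2·G·M·N·K floating-point operations divided by the averaged kernel time. The sketch below shows the arithmetic for the grouped bf16 case; the timing value is a hypothetical placeholder chosen to land near the reported result, not a measurement from this PR.

```cpp
// Back-of-the-envelope TFlops arithmetic for the grouped bf16 case
// (G = 4, M = 2048, N = 8192, K = 8192). The kernel time below is a
// hypothetical placeholder, not a number measured in this PR.
#include <cstdio>

int main() {
  const double G = 4, M = 2048, N = 8192, K = 8192;
  const double flops = 2.0 * G * M * N * K;  // one multiply-accumulate = 2 flops
  const double avg_time_ms = 1.39;           // placeholder averaged kernel time
  const double tflops = flops / (avg_time_ms * 1e-3) / 1e12;
  std::printf("%.0f TFlops\n", tflops);      // ~791 with the placeholder above
  return 0;
}
```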

### Benchmark

#### bf16bf16bf16_grouped (G = 4, M = 2,048, N = 8,192, K = 8,192)

| TileShape   | TFlops |
|-------------|--------|
| 128-128-128 | 606    |
| 128-256-64  | 790    |

#### bf16i4bf16_rowwise_batched (B = 4, M = 2,048, N = 8,192, K = 8,192)

| TileShape   | TFlops bf16* | TFlops fp16* | TFlops float* |
|-------------|--------------|--------------|---------------|
| 128-128-128 | 354          | 341          | 383           |
| 128-256-64  | 704          | 727          | 763           |

#### bf16i4bf16_rowwise (M = N = K = 8,192)

| TileShape   | TFlops bf16* | TFlops fp16* | TFlops float* |
|-------------|--------------|--------------|---------------|
| 128-128-128 | 349          | 351          | 381           |
| 128-256-64  | 652          | 663          | 693           |

#### f8i4bf16_rowwise (M = N = K = 8,192)

| TileShape   | TFlops bf16* | TFlops fp16* | TFlops float* |
|-------------|--------------|--------------|---------------|
| 128-128-128 | 407          | 542          | 606           |
| 128-256-64  | 921          | 942          | 1088          |

*WEIGHT_SCALE_DTYPE

## Technical Implementation

Modified TileShape from 128-128-128 to 128-256-64 for:

- bf16bf16bf16_grouped
- bf16i4bf16_rowwise_batched
- bf16i4bf16_rowwise
- f8i4bf16_rowwise

Added the cooperative kernel by default for:

- bf16i4bf16_rowwise_batched
- bf16i4bf16_rowwise
- f8i4bf16_rowwise

The modifications only affect the large-mode and default kernels where N > 128
(see the sketch below). These changes were made by modifying the minimum necessary
code while respecting existing coding practices in FBGEMM.
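
A rough picture of that dispatch is sketched below; the function and type names are illustrative only and do not correspond to FBGEMM's actual kernel-selection code.

```cpp
#include <cstdint>

// Illustrative shape-based dispatch only; FBGEMM's real heuristic is part of
// its kernel-selection logic and is not reproduced here.
enum class TileConfig { kDefault128x128x128, kCooperative128x256x64 };

// large_mode: whether the caller is on the large-problem kernel path.
inline TileConfig select_tile_config(bool large_mode, int64_t N) {
  // Large problems with a wide N dimension get the 128x256x64 tile together
  // with the cooperative (warp-specialized) schedule; everything else keeps
  // the previous 128x128x128 default.
  if (large_mode && N > 128) {
    return TileConfig::kCooperative128x256x64;
  }
  return TileConfig::kDefault128x128x128;
}
```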

## Test Coverage

### Unit Test Results

The unit tests in fbgemm_gpu/experimental/gen_ai/test/quantize
have been verified for the modified kernels.

@jiawenliu64 @jwfromm Hello! I wanted to share this contribution to FBGEMM.
While this is my first PR, I hope these changes could be useful for this great project.
I'd welcome any feedback if you have time to take a look. Thank you!

Reviewed By: jianyuh

Differential Revision: D68609243

Pulled By: jiawenliu64

@facebook-github-bot (Contributor) commented:

This pull request was exported from Phabricator. Differential Revision: D68609243

netlify bot commented Feb 18, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

| Name                 | Link |
|----------------------|------|
| 🔨 Latest commit      | ec50d16 |
| 🔍 Latest deploy log  | https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/67b5063f4bf6130008741c0a |
| 😎 Deploy Preview     | https://deploy-preview-3710--pytorch-fbgemm-docs.netlify.app |


@facebook-github-bot (Contributor) commented:

@jiawenliu64 merged this pull request in 19f3713.

q10 pushed a commit to q10/FBGEMM that referenced this pull request Apr 10, 2025