Add Bfloat16 Benchmark and Benchmark Suite #71
Update: with NVIDIA/numba-cuda#48 in place for
We should add a README to document how to use the benchmark suite.
The work in this PR is still important, but as bfloat16 bindings are introduced into numba-cuda proper, we may need a separate way to introduce these benchmark suites. Pending discussion.
This PR adds a bfloat16 kernel benchmark suite that compares the runtime of a raw CUDA kernel against the runtime of a Numba kernel. The Numba kernel is expected to show high overhead as long as LTOIR is not supported.
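For readers unfamiliar with this kind of comparison, here is a minimal sketch of how a baseline-versus-candidate timing could be structured. It is not the suite added in this PR: `time_callable` and `slowdown` are hypothetical helpers, and the callables stand in for a raw CUDA kernel launch and a Numba kernel launch. It assumes each callable blocks until its work is finished; for real GPU kernels that would mean synchronizing the device inside the callable.

```python
import statistics
import time


def time_callable(fn, *, warmup=3, repeats=20):
    """Return the median wall-clock time of fn() in seconds (hypothetical helper)."""
    for _ in range(warmup):
        fn()  # warm-up runs absorb one-time costs such as JIT compilation
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()  # assumed to block until the work is done (synchronize for GPU kernels)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def slowdown(baseline_fn, candidate_fn, **kwargs):
    """Ratio of candidate time to baseline time; a value above 1 means slower."""
    baseline = time_callable(baseline_fn, **kwargs)
    candidate = time_callable(candidate_fn, **kwargs)
    if baseline == 0.0:
        return float("inf")  # timer resolution too coarse to measure the baseline
    return candidate / baseline


if __name__ == "__main__":
    # Stand-in workloads; in a real run these would launch the two kernels.
    ratio = slowdown(lambda: sum(range(1_000)), lambda: sum(range(10_000)))
    print(f"candidate / baseline = {ratio:.2f}x")
```

Taking the median rather than the mean keeps a single slow run (for example, a late compilation or a scheduler hiccup) from dominating the reported ratio.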
Profiling shows the slowdown:
Contributes to #12