Add Bfloat16 Benchmark and Benchmark Suite #71

isVoid · 2024-08-15T15:28:52Z

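This PR adds a bfloat16 kernel benchmark suite that compares the runtime of a raw CUDA kernel against the runtime of a Numba kernel. The Numba kernel is expected to show high overhead without LTOIR support.

Profiling shows that the Numba kernel takes roughly 2.2x to 2.4x as long as the raw CUDA kernel (GOLD is the raw CUDA kernel, PY is the Numba kernel):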

GOLD: simple_kernel(float *)
PY:   cudapy::__main__::kernel[abi:v1,cw51cXTLSUwv1sCUt9Ww0FEw09RRQPKzLTg4gaGKFsG2oMQGEYakJSQB1PQBk0Bynm21OiwU1a0UoLGhDpQE8oxrNQE_3d](Array<float, (int)1, C, mutable, aligned>)

                        GOLD         PY
Time (%)               100.0      100.0
Total Time (ns)    1164770.0  2753366.0
Instances             1000.0     1000.0
Avg (ns)              1164.8     2753.4
Med (ns)              1152.0     2528.0
Min (ns)              1120.0     2495.0
Max (ns)              1504.0     8992.0
StdDev (ns)             21.1      814.3

Perf Ratio (PY / GOLD, %):
Avg (ns)        236.383929
Med (ns)        219.444444
Min (ns)        222.767857
Max (ns)        597.872340
StdDev (ns)    3859.241706
dtype: float64
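
The `Perf Ratio` rows are consistent with dividing each PY statistic by the matching GOLD statistic and scaling to percent. A minimal sketch of that arithmetic follows, using the values from the profile above. The function name `perf_ratio` and the dictionary layout are illustrative assumptions, not the benchmark suite's actual code:

```python
# Hypothetical helper reproducing the "Perf Ratio (PY / GOLD, %)" arithmetic.
# The numbers are copied from the profile above; all names are illustrative.


def perf_ratio(gold, py):
    """Return PY / GOLD as a percentage for every statistic present in both."""
    return {name: py[name] / gold[name] * 100.0 for name in gold if name in py}


gold = {
    "Avg (ns)": 1164.8,
    "Med (ns)": 1152.0,
    "Min (ns)": 1120.0,
    "Max (ns)": 1504.0,
    "StdDev (ns)": 21.1,
}
py = {
    "Avg (ns)": 2753.4,
    "Med (ns)": 2528.0,
    "Min (ns)": 2495.0,
    "Max (ns)": 8992.0,
    "StdDev (ns)": 814.3,
}

ratios = perf_ratio(gold, py)
for name, value in ratios.items():
    print(f"{name:<12} {value:12.6f}")
```

A value of 100 means the Numba kernel matches the raw CUDA kernel, so the overhead in percent is the ratio minus 100.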

Contributes to #12

isVoid · 2024-08-28T16:29:20Z

Update: with NVIDIA/numba-cuda#48 in place for Numba-CUDA, the overhead of the Numba CUDA kernel relative to the raw CUDA kernel becomes very low (under 1% on average):

GOLD: simple_kernel(float *)
PY:   cudapy::__main__::kernel[abi:v1,cw51cXTLSUwv1sCUt9Ww0FEw09RRQPKiLTj0gIGIFp_2b2oLQFEYYkHSQB1OQAk0Bynm21OizQ1K0UoIGvDpQE8oxrNQE_3d](Array<float, (int)1, C, mutable, aligned>)

                        GOLD         PY
Time (%)               100.0      100.0
Total Time (ns)    1136068.0  1145038.0
Instances             1000.0     1000.0
Avg (ns)              1136.1     1145.0
Med (ns)              1121.0     1152.0
Min (ns)              1119.0     1119.0
Max (ns)              1504.0     1536.0
StdDev (ns)             21.6       53.1

Perf Ratio (PY / GOLD, %):
Avg (ns)       100.783382
Med (ns)       102.765388
Min (ns)       100.000000
Max (ns)       102.127660
StdDev (ns)    245.833333
dtype: float64
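
Each column in these profiles condenses 1000 kernel launches into a handful of summary rows. As a rough sketch of how such rows can be derived from per-launch durations, assume the durations are already available as a list of nanosecond values. The `summarize` helper and the three sample durations below are hypothetical, and the profiler may define StdDev differently from `statistics.stdev`:

```python
import statistics


def summarize(durations_ns):
    """Condense per-launch kernel durations (ns) into summary rows."""
    return {
        "Total Time (ns)": float(sum(durations_ns)),
        "Instances": float(len(durations_ns)),
        "Avg (ns)": statistics.mean(durations_ns),
        "Med (ns)": statistics.median(durations_ns),
        "Min (ns)": float(min(durations_ns)),
        "Max (ns)": float(max(durations_ns)),
        # Sample standard deviation; a profiler may use another definition.
        "StdDev (ns)": statistics.stdev(durations_ns),
    }


# Made-up durations for illustration only, not measurements from this PR.
sample_durations_ns = [1120, 1152, 1184]

summary = summarize(sample_durations_ns)
for name, value in summary.items():
    print(f"{name:<16} {value:10.1f}")
```

The median and minimum are usually the more stable rows to compare, since a single slow launch (such as the 8992 ns maximum in the first profile) inflates both Max and StdDev.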

isVoid · 2024-10-07T07:25:28Z

We should add a README that documents how to use the benchmark suite.

isVoid · 2025-07-09T21:03:38Z

The work in this PR is still important, but since bfloat16 bindings are being introduced into numba-cuda proper, we may need a separate way to introduce these benchmark suites. Pending discussion.

isVoid added 6 commits August 15, 2024 23:26

initial application on bfloat16 benchmark suite

bb592d6

Move bf16 and fp16 benchmarks to numbast-extensions

3a5246a

add PTX output to benchmark script

2563ab9

temporary change to enable FFI, pending performance debug

5ab4aed

update script to automatically print benchmark compares

9f36de8

apply nvrtc patch and update analyze script

d660c19

isVoid added 3 commits December 4, 2024 03:57

Merge branch 'main' of github.com:NVIDIA/numbast into test-ffi

1b7f4b0

remove pynvjitlink patch and augment benchmark tool to also benchmark the lto enabled case

4ddc7fe

remove compiled code

636abe7
