isVoid commented Aug 15, 2024

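This PR adds a bfloat16 kernel benchmark suite that compares the runtime of a raw CUDA kernel against the runtime of the equivalent Numba kernel. High overhead is expected on the Numba side without LTOIR support.

Profiling shows the Numba kernel running roughly 2.2x to 2.4x slower than the raw CUDA kernel (average, median, and minimum):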

GOLD: simple_kernel(float *)
PY:   cudapy::__main__::kernel[abi:v1,cw51cXTLSUwv1sCUt9Ww0FEw09RRQPKzLTg4gaGKFsG2oMQGEYakJSQB1PQBk0Bynm21OiwU1a0UoLGhDpQE8oxrNQE_3d](Array<float, (int)1, C, mutable, aligned>)

                        GOLD           PY
Time (%)               100.0        100.0
Total Time (ns)    1164770.0    2753366.0
Instances             1000.0       1000.0
Avg (ns)              1164.8       2753.4
Med (ns)              1152.0       2528.0
Min (ns)              1120.0       2495.0
Max (ns)              1504.0       8992.0
StdDev (ns)             21.1        814.3

Perf Ratio (PY / GOLD, %):
Avg (ns)        236.383929
Med (ns)        219.444444
Min (ns)        222.767857
Max (ns)        597.872340
StdDev (ns)    3859.241706
dtype: float64
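
For readers who want to check the ratio column by hand, here is a minimal sketch of how "Perf Ratio (PY / GOLD, %)" can be derived from the two timing columns. The helper name `perf_ratio` and the dictionary layout are assumptions made for illustration and are not taken from the benchmark suite itself; the numbers are copied from the profile above.

```python
# Timings copied from the profile above, in nanoseconds.
gold = {
    "Avg (ns)": 1164.8,
    "Med (ns)": 1152.0,
    "Min (ns)": 1120.0,
    "Max (ns)": 1504.0,
    "StdDev (ns)": 21.1,
}
py = {
    "Avg (ns)": 2753.4,
    "Med (ns)": 2528.0,
    "Min (ns)": 2495.0,
    "Max (ns)": 8992.0,
    "StdDev (ns)": 814.3,
}


def perf_ratio(py_stats, gold_stats):
    """Return PY / GOLD as a percentage for each statistic present in both.

    Hypothetical helper: the name and signature are illustrative only.
    """
    return {
        name: py_stats[name] / gold_stats[name] * 100.0
        for name in gold_stats
        if name in py_stats
    }


ratios = perf_ratio(py, gold)
for name, value in ratios.items():
    print(f"{name:<12} {value:>12.6f}")
```

A value of 100 means the Numba kernel matches the raw CUDA kernel; anything above 100 is overhead on the Numba side.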

Contributes to #12

isVoid commented Aug 28, 2024

Update: with NVIDIA/numba-cuda#48 in place for Numba-CUDA, the overhead of the Numba CUDA kernel relative to the raw CUDA kernel becomes very low:

GOLD: simple_kernel(float *)
PY:   cudapy::__main__::kernel[abi:v1,cw51cXTLSUwv1sCUt9Ww0FEw09RRQPKiLTj0gIGIFp_2b2oLQFEYYkHSQB1OQAk0Bynm21OizQ1K0UoIGvDpQE8oxrNQE_3d](Array<float, (int)1, C, mutable, aligned>)

                        GOLD           PY
Time (%)               100.0        100.0
Total Time (ns)    1136068.0    1145038.0
Instances             1000.0       1000.0
Avg (ns)              1136.1       1145.0
Med (ns)              1121.0       1152.0
Min (ns)              1119.0       1119.0
Max (ns)              1504.0       1536.0
StdDev (ns)             21.6         53.1

Perf Ratio (PY / GOLD, %):
Avg (ns)       100.783382
Med (ns)       102.765388
Min (ns)       100.000000
Max (ns)       102.127660
StdDev (ns)    245.833333
dtype: float64
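
The percentages can hide how small the remaining gap is in absolute terms, so the sketch below expresses the same comparison as extra nanoseconds per kernel launch, using the average timings from the Aug 15 and Aug 28 profiles in this thread. The names `runs` and `overhead_ns` are hypothetical and exist only for this illustration.

```python
# Average kernel times (ns) copied from the two profiles in this thread.
runs = {
    "Aug 15 run": {"gold_avg": 1164.8, "py_avg": 2753.4},
    "Aug 28 run": {"gold_avg": 1136.1, "py_avg": 1145.0},
}


def overhead_ns(run):
    """Extra time per launch of the Numba kernel over the raw CUDA kernel.

    Hypothetical helper, rounded to the 0.1 ns precision of the inputs.
    """
    return round(run["py_avg"] - run["gold_avg"], 1)


overheads = {label: overhead_ns(run) for label, run in runs.items()}
for label, extra in overheads.items():
    print(f"{label}: {extra} ns extra per launch on average")
```

This is only arithmetic on the reported averages; it says nothing about run-to-run variance, which the StdDev rows cover.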

isVoid commented Oct 7, 2024

We should add a README that documents how to use the benchmark suite.

isVoid commented Jul 9, 2025

The work in this PR is still important, but now that bfloat16 bindings are being introduced into numba-cuda proper, we may need a separate way to introduce these benchmark suites. Pending discussion.
