Skip to content

[BUG]: CUDA API calls through cuda-bindings 3x slower than direct CUDA C++ API calls #659

Open
@lucifer1004

Description

@lucifer1004

Is this a duplicate?

Type of Bug

Performance

Component

cuda.bindings

Describe the bug

I have a same CUDA kernel, and I am calling it using two different methods.

  1. Compile the kernel to .cubin, load it in Python and call it using cuda-bindings. The TMA descriptors are also created via cuda-bindings.
  2. Compile the kernel and relevant C++ host functions (including TMA desc, kernel attribute and launch config) to .so, load the .so in Python and call the C++ host function.

As revealed in the two timelines below, the first path is 3x slower than the second path.
Image
Image

How to Reproduce

Follow the instructions in the reproduction repo https://github.com/lucifer1004/cuda-python-repro

Expected behavior

CUDA Python's perf should be on par with CUDA C++.

Operating System

Ubuntu Linux 24.04

nvidia-smi output

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:53:00.0 Off | 0 |
| N/A 27C P0 69W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:64:00.0 Off | 0 |
| N/A 28C P0 68W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 On | 00000000:75:00.0 Off | 0 |
| N/A 27C P0 69W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 On | 00000000:86:00.0 Off | 0 |
| N/A 29C P0 68W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 On | 00000000:97:00.0 Off | 0 |
| N/A 30C P0 69W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 On | 00000000:A8:00.0 Off | 0 |
| N/A 28C P0 68W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 On | 00000000:B9:00.0 Off | 0 |
| N/A 28C P0 69W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 On | 00000000:CA:00.0 Off | 0 |
| N/A 26C P0 66W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingcuda.bindingsEverything related to the cuda.bindings moduletriageNeeds the team's attention

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions