Regenerate cust_raw for newer CUDA / support multiple versions #166

Closed
LegNeato opened this issue Mar 17, 2025 · 30 comments

@LegNeato
Contributor

Now that we are adding support for newer CUDA (#100), we need a way to select the version for cust_raw's bindgen (and elsewhere?). See:

The current version is based on CUDA 11.2.

@jorge-ortega
Collaborator

jorge-ortega commented Mar 21, 2025

Currently we have checked-in generated bindings that appear to be for Windows, based on the repr(i32) values for the enums. That probably won't break Linux, and we could update the current bindings and also generate Linux bindings, but neither seems ideal.

In the usual bindgen flow, the user selects the version through whatever they have installed on their machine, and the cust_raw (typically a -sys) crate generates bindings for that version. The cust crate can then configure itself by reading the version that was used to generate the bindings.

Say we have support up to 12.8 and the user has 12.0 installed. cust should be able to support APIs up to 12.0 and disable newer ones through its configuration. If the user has a newer version, then at best cust still compiles without any support for the newer features/APIs, but the user can still access them through the sys crate. At worst, we don't compile due to some backwards-incompatible change in the newer version.

@LegNeato
Contributor Author

Right, we need to decide whether we generate and check in the bindings, hiding the variants behind targets + features, or generate them on the fly via build.rs from whatever is installed. The former is sort of gross, but we see the complete picture and the changes for each version, and we have more visibility when bugs are filed. The latter means cross-compilation is unsupported.

@jorge-ortega
Collaborator

I gave generating bindings at build time a shot and have a WIP branch focusing on CUDA 12 and up, on Windows for now. I built with versions 12.0, 12.1, 12.3, and 12.8. The only breaking changes were in the graph API. 12.3 introduced conditional nodes, so cust checks support for that and sets a cfg so the wrapper enum can account for it. With these changes, cust is able to build on every version I tested. 12.0 can't codegen though, as libnvvm doesn't like our data layout for some reason, but 12.1 and up were OK with it.

@adamcavendish
Contributor

Hey @jorge-ortega, I'm also working on this and have just seen your changes. I can help write the rest if something else is keeping your hands full, or I can help test on Linux distributions with CUDA 12.8 if it's ready.

@jorge-ortega
Collaborator

Hey @adamcavendish, it would be great if you could give the changes a try on Linux.

@adamcavendish
Contributor

Hi @jorge-ortega, I've looked into your branch and I'd like to discuss a few considerations to refine the implementation of the bindgen-generated bindings. I know things are not finalized yet, so some of these may be ideas we already share that you just haven't gotten to. I'm just raising them for discussion ahead of time.

  1. Recommend Moving Bindings to OUT_DIR

Generating bindings directly into the Git workspace introduces unstaged files, which complicates git status tracking and risks conflicting with intentional changes. Additionally, this approach limits us to a single binding version (e.g., cust_raw/nvvm), conflicting with our goal of supporting multiple CUDA versions.

The bindgen documentation recommends placing generated files in the OUT_DIR to ensure they are treated as build artifacts. This approach keeps the repository clean and allows version-specific bindings to be managed dynamically during build time.
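
A minimal sketch of that flow, with the header and output file name chosen here purely for illustration:

// build.rs (sketch): generate bindings into OUT_DIR instead of the source tree.
use std::env;
use std::path::PathBuf;

fn main() {
    let bindings = bindgen::Builder::default()
        .header("wrapper.h")
        .generate()
        .expect("failed to generate CUDA bindings");

    // OUT_DIR is provided by cargo and is never part of the repository,
    // so the generated file cannot show up in `git status`.
    let out_dir = PathBuf::from(env::var("OUT_DIR").unwrap());
    bindings
        .write_to_file(out_dir.join("cuda_bindings.rs"))
        .expect("failed to write bindings");
}

The library crate would then pull the file in with include!(concat!(env!("OUT_DIR"), "/cuda_bindings.rs"));.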

  2. Suggest Renaming Crates to Follow -sys Convention

To align with Rust's ecosystem conventions for system bindings (e.g., *-sys naming), renaming:

  • cust_raw → cust-sys
  • nvvm → nvvm-sys

This improves discoverability and adheres to crate naming standards for FFI crates. The current cust crate already aliases cust_raw as sys (see here), which supports this transition.

  3. Enable non_exhaustive Enums for Backward Compatibility

To ensure compatibility across CUDA versions, we should mark generated enums (e.g., GraphNodeType) as non_exhaustive. This prevents breaking changes when new variants are introduced in future CUDA headers.

Example configuration for bindgen:

let bindings = bindgen::Builder::default()
    .header("wrapper.h")
    // ... allowlist_types ...
    .default_enum_style(bindgen::EnumVariation::Rust {
        non_exhaustive: true,
    });

The from_raw implementation should handle unknown variants gracefully:

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
#[repr(u8)]
#[non_exhaustive]
pub enum GraphNodeType {
    // ... existing variants ...
}

impl GraphNodeType {
    pub fn from_raw(raw: cuda::CUgraphNodeType) -> Self {
        match raw {
            // ... known variants ...
            _ => GraphNodeType::Empty, // Fallback for unrecognized values
        }
    }
}

  4. Simplify Version Checks via Direct CUDA_VERSION Usage

Instead of parsing CUDA versions from a JSON file, leverage the cust_raw::CUDA_VERSION constant in cust's build.rs. This reduces parsing errors and simplifies conditional compilation.

Additionally, use version-based feature flags (e.g., cuda_12_3_0) rather than feature flags tied to individual API changes (e.g., conditional_node). This approach encapsulates all version-specific changes under a single gate, improving maintainability.

Example build.rs code:

use cust_raw::CUDA_VERSION;

fn main() {
    if CUDA_VERSION >= 12030 {
        println!("cargo:rustc-cfg=cuda_12_3_0");
    }
}
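
On the consumer side, version-specific wrappers could then be gated on that cfg; a small sketch with hypothetical item names:

// Only compiled when build.rs detected CUDA 12.3 or newer.
#[cfg(cuda_12_3_0)]
pub mod graph_conditional;

// Runtime query that works on every supported toolkit version.
pub fn supports_conditional_nodes() -> bool {
    cfg!(cuda_12_3_0)
}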

@jorge-ortega
Collaborator

jorge-ortega commented Mar 25, 2025

Thanks @adamcavendish for providing this feedback, really appreciate it. I'm mostly in alignment with your suggestions.

  1. My changes are aligned here, so I read this as mostly being in favor of generating bindings at build time.

Technically, we're not limited to a single version if we do pre-gen bindings. See cudarc (more on this below).

  2. I wanted to suggest this as well. Ideally we would name this 'cuda-sys', but that crate name has already been taken by a completely different, now-archived project. Maybe there's a chance we can take that over?

Since the SDK contains several different libraries and different abstractions (driver vs. runtime API), I was also considering consolidating the bindings into a single crate (cuda-sdk-sys) and using feature flags to let users/downstream crates pick which libraries to generate and link against. There might be some benefits to this (consolidated build logic, maybe better build times?) and I'm sure there are downsides I just haven't thought of yet.

  3. I'm good with this approach, so long as we can agree on what "graceful" means depending on the context. Sometimes the most graceful thing you can do is panic.

  4. I was hesitant to link against the sys crate in the build script, as it seemed a bit heavy-handed just to get the version constant. I was also considering cases where users might build for a different target than the host, which would require building the sys crate twice. Though this isn't the 99% case, and linking sys in the build doesn't prevent the 1% case, so this could just be me overthinking it.

I don't necessarily agree that using version-based flags would be more maintainable. It's straightforward when the range is open-ended, but tricky if we need to gate on a closed range. We would either need to add not() cfgs for every version above the max supported version, or define greater-than/less-than cfgs for every version stop. We'd need something like this if we want to also support version 11. The SDK has a fairly large surface area that could evolve in different ways, and the idea with the API cfgs was to keep that version-range gating in the build script and apply a single cfg to the relevant API surfaces.
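
A sketch of what that could look like in the build script, with the version constant, bounds, and cfg name all illustrative:

// build.rs (sketch): gate an API surface on a version range instead of
// emitting one cfg per CUDA release.
fn main() {
    // Hypothetical: obtained from the sys crate or parsed from the headers.
    let cuda_version: u32 = 12030;

    // Conditional graph nodes exist from 12.3 onward in this sketch; an upper
    // bound could be added the same way if a later release changed the API.
    if (12030..13000).contains(&cuda_version) {
        println!("cargo:rustc-cfg=conditional_node");
    }
}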

cudarc

@LegNeato has mentioned interest in collaborating with cudarc, which also wraps the CUDA SDK libraries. Their abstractions are more mature than cust's, and they support multiple versions of the SDK through multiple checked-in bindings. I'm still surprised that all the prior work I've found in this space seems to go that direction. Users have reported issues where the bindings fail layout tests on their platform, and the fix has been to disable layout tests, possibly hiding UB. For the moment, this is the aspect I'd like to see changed if we do collaborate.

Multiple LLVMs

Since bindgen requires libclang 9+, we need users to have a more recent LLVM installed in addition to the LLVM 7 required by the codegen. This shouldn't be a major hurdle for most users, but it does increase the complexity. We might also want to consider renaming environment variables to better distinguish which version should be set and for what.

@adamcavendish
Contributor

@jorge-ortega Thanks for providing more info about the roadmap.

  1. cudarc

If we do plan to switch over to cudarc, I can help make some changes to see how much work is needed to switch over. cc @LegNeato

About the multiple checked-in bindings: yeah, I never considered this path previously, as many community crates use the standard bindgen-recommended approach.

  2. Multiple LLVMs

I can further improve the current Dockerfiles to support various Linux distributions once we have the exact environment variables settled. Hopefully the distribution-provided official LLVM versions are good enough for bindgen (most of the time, yes).

@jorge-ortega
Collaborator

I just became aware of how to use cargo metadata to pass information to dependent packages.[1] We could use this to pass the version information to higher-level crates. This approach could potentially supersede most of what the find_cuda_helper crate provides.

Footnotes

  1. https://doc.rust-lang.org/cargo/reference/build-script-examples.html#using-another-sys-crate
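
A sketch of that mechanism, assuming the sys crate declares links = "cuda" in its Cargo.toml (the key names here are illustrative):

// In the sys crate's build.rs: because the crate has a `links` key, every
// `cargo:KEY=VALUE` line becomes metadata visible to its direct dependents.
fn main() {
    // e.g. discovered from the installed SDK at build time
    println!("cargo:version=12.8.0");
    println!("cargo:root=/usr/local/cuda");
}

A dependent crate's build script then reads the metadata back through the DEP_<links name>_<KEY> environment variables that cargo sets:

// In cust's build.rs (sketch): the metadata above surfaces as DEP_CUDA_VERSION.
fn main() {
    if let Ok(version) = std::env::var("DEP_CUDA_VERSION") {
        println!("cargo:rustc-env=CUDA_SDK_VERSION={version}");
    }
}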

@adamcavendish
Contributor

adamcavendish commented Mar 27, 2025

Wow, that's awesome. This is exactly what we need. I can work on a refactor to make this cleaner.
It also shows a way to support conditional compilation for multiple versions, as done in the openssl-sys crate.

@jorge-ortega
Collaborator

Sure, give it a try. I was looking at how openssl-sys passes version info as an example here. They hex-encode it and then parse it out to set their version cfgs. I haven't figured out exactly how they get the version information out of the headers yet; it has something to do with expanding version macros. It might be easier to read the version.json file instead?

@adamcavendish
Contributor

For the version.json file: it's not available in the Linux container. I guess it's available in your Windows setup, but it's not cross-platform?

@jorge-ortega
Collaborator

I should have known that wouldn't be cross-platform... I guess you could just run bindgen and parse the version out of the bindings?
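
A rough sketch of that idea, assuming the generated bindings contain the constant bindgen emits for the CUDA_VERSION define (file handling and parsing are simplified):

// build.rs helper (sketch): recover the SDK version by scanning the generated
// bindings for the constant produced from `#define CUDA_VERSION`.
use std::fs;
use std::path::Path;

fn parse_cuda_version(bindings: &Path) -> Option<u32> {
    let text = fs::read_to_string(bindings).ok()?;
    for line in text.lines() {
        // bindgen renders `#define CUDA_VERSION 12080` roughly as
        // `pub const CUDA_VERSION: u32 = 12080;`
        if let Some(rest) = line.trim().strip_prefix("pub const CUDA_VERSION:") {
            let value = rest.split('=').nth(1)?.trim().trim_end_matches(';');
            return value.parse().ok();
        }
    }
    None
}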

@adamcavendish
Contributor

Yeah, that's OK. May I cherry-pick your changes (so they'll keep your author information) and start with that? Before the merge, I'll note that we co-authored.

@adamcavendish
Contributor

I'll be drafting in the branch feature/refactor-cust-raw-and-find-cuda and aim to be ready before next Wednesday.

@jorge-ortega
Collaborator

It hit me this morning how version ranges would work with version feature flags. It's a lot simpler than I made it out to be. I'm on board with that approach now.

@adamcavendish
Contributor

@jorge-ortega Sure, thanks for taking a look. I'll continue on that track.

@LegNeato
Contributor Author

LegNeato commented Mar 30, 2025

Do we want to try to get the cuda-*-sys crate names? See https://www.reddit.com/r/rust/comments/1jmh4af/rustcudacudasys_has_been_archived/

@jorge-ortega
Collaborator

It would be cool if we could get ownership of those crates transferred to us. If we do get them, our bindings should set a high bar of compatibility with what you would expect from the C/C++ code.

Along those lines, I've been digging deeper into the headers to understand their versioning and per-thread stream options. I was interested in why there was a breaking change in some of the functions from 11 to 12. The function names didn't show up in the docs, causing even further confusion. It turns out there's a series of defines in the headers that rename the functions. Some rename to a more descriptive name, some to a specific version, and, when CUDA_API_PER_THREAD_DEFAULT_STREAM is defined, to per-thread default/null stream variants. So, in C/C++ code, when you call e.g. cuCtxCreate, the macro defines make you instead call cuCtxCreate_v2. bindgen only ever sees the final name of the function after expanding macros, so the function signature that gets generated is pub fn cuCtxCreate_v2, causing a breaking change in Rust that would have just worked in C/C++.

The simplest fix would be handwritten aliases, e.g. pub use self::cuCtxCreate_v2 as cuCtxCreate;, repeated for all affected functions and version-gated if needed. A second option would be to use bindgen callbacks to modify the name and link_name. I tried this out with some success. I haven't gotten it to handle all cases, but when it does work, we get:

#[link_name = "\u{1}cuCtxCreate_v2"]
pub fn cuCtxCreate(...)

Note: \u{1} prevents name mangling.
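
For reference, a rough sketch of such a callback, assuming a bindgen version whose ParseCallbacks still takes the original name in item_name and provides generated_link_name_override; the rename map is illustrative:

use bindgen::callbacks::{ItemInfo, ParseCallbacks};
use std::collections::HashMap;

// Sketch: undo the header's #define renames so the generated Rust item gets the
// canonical name while linking still targets the real exported symbol.
#[derive(Debug)]
struct VersionedRenames {
    // expanded symbol -> canonical name, e.g. "cuCtxCreate_v2" -> "cuCtxCreate"
    canonical: HashMap<String, String>,
}

impl ParseCallbacks for VersionedRenames {
    // Rename the generated Rust item.
    fn item_name(&self, original_item_name: &str) -> Option<String> {
        self.canonical.get(original_item_name).cloned()
    }

    // Keep the real symbol as the link name for anything we renamed.
    fn generated_link_name_override(&self, item_info: ItemInfo<'_>) -> Option<String> {
        self.canonical
            .contains_key(item_info.name)
            .then(|| item_info.name.to_string())
    }
}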

With the per-thread settings, some functions also gain a _ptds or _ptsz suffix (possibly in addition to a version suffix). You can imagine how difficult it would be for users of the sys bindings to update their code without aliases or link_names if we choose to support them. There was never any intention to support these in cust, but I think it's worth having unopinionated sys bindings for those who dare.

@adamcavendish
Contributor

adamcavendish commented Mar 30, 2025

The same story played out with what NVIDIA did to cuBLAS: when cuBLAS v2 went into production, they likewise used #define to rename the v2 functions back to the original function signatures. I personally love the idea of giving users the same experience as C/C++ for the sys packages. Looking forward to seeing more experiment results as I push forward with finishing the bindgen work and merging the sys packages.

@LegNeato
Contributor Author

If we go the maintain-our-own-bindings route, I assume we'll use https://github.com/dtolnay/cxx?

@adamcavendish
Contributor

I personally don't yet see any use for cxx, and we'd likely still be using bindgen for most cases. cc @jorge-ortega in case he has anything experimental in his branches.

@jorge-ortega
Collaborator

Haven't found a need for it just yet. Most of the host-side API is plain C, which bindgen can handle. The C++-specific host APIs I have found are templated wrappers over the C API. Cooperative groups and atomics are C++ but more device-centric. We need more information on how to handle those.

@adamcavendish
Contributor

adamcavendish commented Mar 31, 2025

@jorge-ortega @LegNeato For OptiX, I find it hard to include in the cust_raw package because the OptiX SDK requires users to register. If I'm right, it's hard to redistribute a Docker image that includes the OptiX SDK.

I'll try to make something that builds on OptiX 9, or whatever we originally built against, and make a separate optix-sys package alongside the optix package.


Wait... I also wonder whether it is OK to commit the bindgen-generated file to the repository, since it's a kind of redistribution. If not, we should always require the user to have the SDK installed and generate the bindings from it.

@jorge-ortega
Collaborator

That should be the goal: all the bindings should be generated from the user's installed SDK.

I got a working solution for the function renames today. It uses macro expansion to check whether a function name is macro-defined, and if it is, expands a macro into an easy-to-parse line containing the original function name and its macro-expanded name. That mapping is passed to a custom bindgen callback, which takes care of the rest. With that, several functions across cust lost their _v2 suffix. At the moment, this unfortunately means keeping a list of hundreds of function names to expand, depending on the library. I do have another idea that would remove the need for that, but it would mean invoking the compiler to do another expansion for each define bindgen finds, instead of doing a single expansion as in the current solution.
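
A rough sketch of how a single expansion pass like that might look; the probe format, tool invocation, and file handling are illustrative rather than the actual implementation:

// Sketch: find macro-renamed functions by preprocessing a probe file and
// comparing each candidate name with its expansion.
use std::fs;
use std::process::Command;

fn expanded_names(cuda_include: &str, names: &[&str]) -> Vec<(String, String)> {
    // One probe line per candidate. The left token embeds the original name in a
    // unique prefix so the preprocessor leaves it alone; the bare right token is
    // rewritten by defines such as `#define cuCtxCreate cuCtxCreate_v2`.
    let mut probe = String::from("#include <cuda.h>\n");
    for name in names {
        probe.push_str(&format!("PROBE_{name} {name}\n"));
    }
    // A real build script would write this under OUT_DIR.
    fs::write("probe.c", &probe).expect("write probe file");

    // Preprocess only (-E); -P drops line markers to keep the output easy to scan.
    let output = Command::new("clang")
        .args(["-E", "-P", "-I", cuda_include, "probe.c"])
        .output()
        .expect("run clang -E");

    String::from_utf8_lossy(&output.stdout)
        .lines()
        .filter_map(|line| {
            let (left, right) = line.trim().split_once(' ')?;
            let original = left.strip_prefix("PROBE_")?;
            // Only keep names the headers actually renamed.
            (original != right).then(|| (original.to_string(), right.to_string()))
        })
        .collect()
}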

@adamcavendish
Contributor

Roger. I'll try to move optix-sys to use the host-installed SDK to avoid a complex feature matrix, e.g. OptiX version + CUDA version.

@jorge-ortega
Collaborator

With #181, we can now generate bindings for almost any version of CUDA, and it provides the foundation for adding more coverage of the CUDA libraries. #187 helps match the bindings to the C/C++ API experience, in addition to making the bindings more resilient across versions (and against NVIDIA's liberal use of macros).

So what's left on this front, or can we close this out now? I think we still need to determine which versions cust will officially support, as we've now dropped v11 from CI altogether.

@adamcavendish
Contributor

Yeah, I think the supported CUDA versions should align with major ML frameworks and render engines, e.g. PyTorch, TensorFlow, JAX, Blender Cycles, etc.

@LegNeato
Contributor Author

LegNeato commented Apr 5, 2025

I'm fine dropping 11 as we likely have no users on it.

@LegNeato
Contributor Author

LegNeato commented Apr 5, 2025

Let's close

@LegNeato LegNeato closed this as completed Apr 5, 2025