Fix gemm example on CUDA 13.0.
#317
Conversation
Force-pushed from d535696 to 066df38
@FractalFir can you review this?

EDIT: I was wrong about initialization of shared memory. The PR is fine as is, but I think we need to rethink the abstraction over the shared address space in general, to fix the

I agree that the shared memory handling is iffy, but this PR is a clear improvement over the status quo.
`#[address_space(shared)] static mut` is used to model GPU shared
memory. It's a bit weird. In particular, GPU shared memory is
uninitialized, but `static mut` requires an initializer in Rust. `gemm`
uses a zero initializer, but this initializer is ignored by NVVM. At
least, it was in CUDA 12.x, but in CUDA 13.0 the `gemm` example fails
with this error:
```
thread 'rustc' panicked at crates/rustc_codegen_nvvm/src/nvvm.rs:120:9:
Malformed NVVM IR program rejected by libnvvm, dumping verifier log:
error: Error: : Global Variable `_ZN12gemm_kernels10gemm_tiled10gemm_tiled6TILE_A17hc9c66e758c373a7eE':
context: @_ZN12gemm_kernels10gemm_tiled10gemm_tiled6TILE_A17hc9c66e758c373a7eE = internal unnamed_addr addrspace(3) global <{ [1024 x i8] }> zeroinitializer, align 4
Shared variables can't be initialized
```
This memory looks like it's initialized to zero but isn't, and then is
written and read normally. This is incredibly dodgy and very likely UB.
The proper way to deal with uninitialized memory in Rust is with
`MaybeUninit`, and there are strict rules around its use, e.g. writes
must be done with `write`, and `assume_init` must only be called on
values after they are written.
This commit changes `gemm` to use `MaybeUninit` for the shared memory.
This fixes the error on CUDA 13.0 and the example runs correctly.
(This is the only executed use of GPU shared memory in rust-cuda. There
is a `shared_array!` macro defined but it's only used in a compiletest
where it is compiled but not run. That macro is extremely dubious but I
will deal with it in a separate PR because it's not necessary to get
CUDA 13.0 working.)
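The write-then-`assume_init` discipline described above can be sketched in plain host-side Rust. The `fill_tile` helper and the tile size here are hypothetical stand-ins for illustration; on the GPU the backing store would be an `#[address_space(shared)] static mut` of `MaybeUninit` type rather than a local.

```rust
use std::mem::MaybeUninit;

// Hypothetical stand-in for filling a shared-memory tile. On the GPU the
// tile would live in `#[address_space(shared)] static mut` storage and
// start out genuinely uninitialized; a local models that here.
fn fill_tile() -> [f32; 4] {
    let mut tile: MaybeUninit<[f32; 4]> = MaybeUninit::uninit();
    // Writes to uninitialized memory must go through `write`, never a
    // plain assignment that reads or drops the uninitialized value.
    tile.write([1.0, 2.0, 3.0, 4.0]);
    // `assume_init` is only sound once the whole value has been written.
    unsafe { tile.assume_init() }
}

fn main() {
    let tile = fill_tile();
    println!("sum = {}", tile.iter().sum::<f32>());
}
```

The same two rules (`write` before any read, `assume_init` only after every element is written) are what the `gemm` kernels have to uphold, with the extra wrinkle that on the GPU the writes come from many threads and must be separated from the reads by a barrier.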
Force-pushed from 066df38 to 2a28b70
One of the nice things about using Rust for both CPU and GPU code is the ability to share things between them. So let's do that in `gemm`.
I added a second commit to share
Sending a `&T` that points to shared memory in one thread block to another thread block will make the pointer refer to different, potentially unrelated data. A pointer (or reference) to shared memory is only valid within the thread block it lives in. This is kind of like how sending a `&` that points to thread-local memory would be unsound (which is why Rust has some special handling around TLS). Your PR is an improvement, but I believe we should also do things like forbid taking references to shared memory altogether. That's just design work I did not have time to fully finish (I mention this whole mess in the Address spaces writeup).
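The thread-local analogy can be made concrete with a runnable sketch (the `COUNTER` name is illustrative). Rust's `thread_local!` only hands out access through the `with` closure, so a reference can never outlive the thread that owns the storage, and each thread observes its own independent value at the "same" static; that is the special handling the analogy is pointing at:

```rust
use std::cell::Cell;
use std::thread;

thread_local! {
    // Each thread gets its own copy of this static, much as each thread
    // block gets its own instance of a shared-memory variable.
    static COUNTER: Cell<u32> = Cell::new(0);
}

fn main() {
    // Set the value in the main thread's copy.
    COUNTER.with(|c| c.set(5));
    let main_val = COUNTER.with(|c| c.get());

    // A spawned thread sees a fresh, independent value at the same static,
    // just as another thread block would see different shared memory.
    let other_val = thread::spawn(|| COUNTER.with(|c| c.get()))
        .join()
        .unwrap();

    println!("main = {main_val}, other = {other_val}");
}
```

A reference that escaped `with` would dangle once its owning thread exits, which is exactly the hazard a `&T` to shared memory poses once it leaves its thread block.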