Conversation
Which exact instruction (modifiers) are you expecting? This is just LLVM not handling these orderings:

source_filename = "text"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
target triple = "nvptx64-nvidia-cuda"
; Relaxed (monotonic) atomic add of 1 to an i32 in global memory (addrspace(1)).
define void @kernel_monotonic(i32 addrspace(1)* %ptr) {
entry:
%0 = atomicrmw add i32 addrspace(1)* %ptr, i32 1 monotonic
ret void
}
; Sequentially-consistent atomic add of 1 to an i32 in global memory.
; Fixed: the original text read "i32 1seq_cst" (missing space before the
; ordering), which is not valid LLVM IR syntax.
define void @kernel_seq_cst(i32 addrspace(1)* %ptr) {
entry:
%0 = atomicrmw add i32 addrspace(1)* %ptr, i32 1 seq_cst
ret void
}

At the same time, NVVM (NVIDIA's closed-source back-end) doesn't seem to handle these either:

#include <nvvm.h>
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#define check(ans) { _check((ans), __FILE__, __LINE__); }
/*
 * Report an NVVM error with its source location and terminate.
 *
 * code: result returned by an nvvm* call; anything other than NVVM_SUCCESS
 *       is treated as fatal.
 * file/line: call site, supplied by the check() macro.
 */
void _check(nvvmResult code, const char *file, int line)
{
    if (code != NVVM_SUCCESS)
    {
        fprintf(stderr, "NVVM error: %s at %s:%d\n",
                nvvmGetErrorString(code), file, line);
        /* Fixed: the original `if (abort) exit(code);` tested the address of
         * the abort() function, which is always non-null — the condition was
         * a no-op. Exit unconditionally on failure. */
        exit(code);
    }
}
/*
 * Read LLVM IR from /tmp/test.ll, compile it with libNVVM, and print the
 * resulting PTX to stdout. On compilation failure, print the program log
 * to stderr and exit non-zero.
 */
int main(void) {
    FILE *f = fopen("/tmp/test.ll", "rb");
    if (!f) {
        perror("fopen /tmp/test.ll");
        return EXIT_FAILURE;
    }
    fseek(f, 0, SEEK_END);
    long input_size = ftell(f);
    fseek(f, 0, SEEK_SET);

    char *input = malloc(input_size);
    if (!input || fread(input, 1, input_size, f) != (size_t)input_size) {
        fprintf(stderr, "failed to read /tmp/test.ll\n");
        fclose(f);
        return EXIT_FAILURE;
    }
    fclose(f);

    nvvmProgram program;
    check(nvvmCreateProgram(&program));
    check(nvvmAddModuleToProgram(program, input, input_size, "main"));

    if (nvvmCompileProgram(program, 0, NULL) != NVVM_SUCCESS) {
        size_t log_size;
        check(nvvmGetProgramLogSize(program, &log_size));
        char *log = malloc(log_size);
        if (!log) return EXIT_FAILURE;
        check(nvvmGetProgramLog(program, log));
        fprintf(stderr, "Compilation failed: %s\n", log);
        return EXIT_FAILURE;
    }

    size_t result_size;
    check(nvvmGetCompiledResultSize(program, &result_size));
    char *result = malloc(result_size);
    if (!result) return EXIT_FAILURE;
    check(nvvmGetCompiledResult(program, result));
    /* Fixed: the original `fprintf(stdout, result, "%s\n", result)` passed
     * the compiled PTX as the FORMAT string — a format-string bug that would
     * misbehave on any '%' in the output. */
    fprintf(stdout, "%s\n", result);

    check(nvvmDestroyProgram(&program));
    free(result);
    free(input);
    return EXIT_SUCCESS;
}
I only tried it on godbolt https://godbolt.org/z/o6areY84z but NVCC with libcu++ compiles.
5d585c4 to
c850163
Compare
I tried if a simplistic approach can solve #1353. This patch simply lets users specify atomic ordering with
`Val(:monotonic)` etc. passed to `CUDA.atomic_*`. It seems to generate the correct LLVM IR. But the orderings are not reflected in the PTX (and the run-time).
prints
and
prints
i.e.,
`atomic_add!` with both `Val(:monotonic)` and `Val(:sequentially_consistent)` are compiled down to `atom.global.add.u64`. Note that `@device_code_llvm` prints the expected LLVM IR (i.e., `%2 = atomicrmw add i64 addrspace(1)* %1, i64 1 monotonic` for `Val(:monotonic)` and `%2 = atomicrmw add i64 addrspace(1)* %1, i64 1 seq_cst` for `Val(:sequentially_consistent)`). A similar program on shared memory also shows that both `Val(:monotonic)` and `Val(:sequentially_consistent)` generate the same instruction (`atom.shared.add.u64`). I'm not sure how to properly generate more accurate orderings as libcu++ does. Do we need to generate more ptx-specific LLVM IR?