In-place mutation allocates despite assert_nonallocating=true #1462

@alexandergunnarson

I'm trying to write a vector into one column of a column-major matrix, in place. What I have works for small sizes, but for large ones it allocates a buffer the size of the whole matrix despite assert_nonallocating=true.

# NOTE — There's probably a simpler way to do this that's not so low-level. But maybe not, because
#        XLA is immutable by default.
function write_row!(xs_mut, i, v)
    v_col = Reactant.Ops.reshape(v, [length(v), 1])
    updated_xs = Reactant.Ops.dynamic_update_slice(xs_mut, v_col, [1, i])

    # Mutatively update xs with the new data
    Reactant.TracedUtils.set_mlir_data!(xs_mut, Reactant.TracedUtils.get_mlir_data(updated_xs))

    nothing
end

# Same issue with this
function write_row_simple!(xs_mut, i, v)
    xs_mut[:, i] = v
    nothing
end

I compile and run it like so:

T = Float32
dim = 768
max_n = 10_000_000 # 768 * 10_000_000 * 4 bytes = 30720000000 bytes, matching the log below
vectors_gpu = Reactant.@jit Reactant.Ops.fill(T(0), (dim, max_n))
i_default = 1 # just to have a value to pass into `write_row_rx!`
vector_gpu = Reactant.@jit Reactant.Ops.fill(T(0), (dim,))

write_row_rx! = Reactant.@compile assert_nonallocating=true sync=true write_row!(
    vectors_gpu, i_default, vector_gpu
)

write_row_rx!(vectors_gpu, 1, vector_gpu)

And I get:

julia> write_row_rx!(vectors_gpu, 1, vector_gpu)
2025-07-16 22:08:55.628804: W external/xla/xla/tsl/framework/bfc_allocator.cc:501] Allocator (GPU_0_bfc) ran out of memory trying to allocate 28.61GiB (rounded to 30720000000)requested by op
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
Current allocation summary follows.
Current allocation summary follows.
2025-07-16 22:08:55.628872: I external/xla/xla/tsl/framework/bfc_allocator.cc:1049] BFCAllocator dump for GPU_0_bfc
2025-07-16 22:08:55.628888: I external/xla/xla/tsl/framework/bfc_allocator.cc:1056] Bin (256):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2025-07-16 22:08:55.628899: I external/xla/xla/tsl/framework/bfc_allocator.cc:1056] Bin (512):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2025-07-16 22:08:55.628909: I external/xla/xla/tsl/framework/bfc_allocator.cc:1056] Bin (1024):         Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2025-07-16 22:08:55.628947: I external/xla/xla/tsl/framework/bfc_allocator.cc:1056] Bin (2048):         Total Chunks: 1, Chunks in use: 1. 3.0KiB allocated for chunks. 3.0KiB in use in bin. 3.0KiB client-requested in use in bin.
2025-07-16 22:08:55.628968: I external/xla/xla/tsl/framework/bfc_allocator.cc:1056] Bin (4096):         Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2025-07-16 22:08:55.628992: I external/xla/xla/tsl/framework/bfc_allocator.cc:1056] Bin (8192):         Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2025-07-16 22:08:55.629005: I external/xla/xla/tsl/framework/bfc_allocator.cc:1056] Bin (16384):        Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2025-07-16 22:08:55.629034: I external/xla/xla/tsl/framework/bfc_allocator.cc:1056] Bin (32768):        Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2025-07-16 22:08:55.629053: I external/xla/xla/tsl/framework/bfc_allocator.cc:1056] Bin (65536):        Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2025-07-16 22:08:55.629074: I external/xla/xla/tsl/framework/bfc_allocator.cc:1056] Bin (131072):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2025-07-16 22:08:55.629087: I external/xla/xla/tsl/framework/bfc_allocator.cc:1056] Bin (262144):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2025-07-16 22:08:55.629132: I external/xla/xla/tsl/framework/bfc_allocator.cc:1056] Bin (524288):       Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2025-07-16 22:08:55.629142: I external/xla/xla/tsl/framework/bfc_allocator.cc:1056] Bin (1048576):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2025-07-16 22:08:55.629152: I external/xla/xla/tsl/framework/bfc_allocator.cc:1056] Bin (2097152):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2025-07-16 22:08:55.629162: I external/xla/xla/tsl/framework/bfc_allocator.cc:1056] Bin (4194304):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2025-07-16 22:08:55.629190: I external/xla/xla/tsl/framework/bfc_allocator.cc:1056] Bin (8388608):      Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2025-07-16 22:08:55.629210: I external/xla/xla/tsl/framework/bfc_allocator.cc:1056] Bin (16777216):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2025-07-16 22:08:55.629224: I external/xla/xla/tsl/framework/bfc_allocator.cc:1056] Bin (33554432):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2025-07-16 22:08:55.629233: I external/xla/xla/tsl/framework/bfc_allocator.cc:1056] Bin (67108864):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2025-07-16 22:08:55.629247: I external/xla/xla/tsl/framework/bfc_allocator.cc:1056] Bin (134217728):    Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2025-07-16 22:08:55.629263: I external/xla/xla/tsl/framework/bfc_allocator.cc:1056] Bin (268435456):    Total Chunks: 2, Chunks in use: 1. 32.00GiB allocated for chunks. 28.61GiB in use in bin. 28.61GiB client-requested in use in bin.
2025-07-16 22:08:55.629295: I external/xla/xla/tsl/framework/bfc_allocator.cc:1072] Bin for 28.61GiB was 256.00MiB, Chunk State:
2025-07-16 22:08:55.629320: I external/xla/xla/tsl/framework/bfc_allocator.cc:1078]   Size: 3.39GiB | Requested Size: 0B | in_use: 0 | bin_num: 20, prev:   Size: 3.0KiB | Requested Size: 3.0KiB | in_use: 1 | bin_num: -1
2025-07-16 22:08:55.629331: I external/xla/xla/tsl/framework/bfc_allocator.cc:1085] Next region of size 34359738368
2025-07-16 22:08:55.629342: I external/xla/xla/tsl/framework/bfc_allocator.cc:1105] InUse at 72fe14000000 of size 30720000000 next 1
2025-07-16 22:08:55.629356: I external/xla/xla/tsl/framework/bfc_allocator.cc:1105] InUse at 73053b0e0000 of size 3072 next 2
2025-07-16 22:08:55.629417: I external/xla/xla/tsl/framework/bfc_allocator.cc:1105] Free  at 73053b0e0c00 of size 3639735296 next 18446744073709551615
2025-07-16 22:08:55.629436: I external/xla/xla/tsl/framework/bfc_allocator.cc:1110]      Summary of in-use Chunks by size:
2025-07-16 22:08:55.629445: I external/xla/xla/tsl/framework/bfc_allocator.cc:1113] 1 Chunks of size 3072 totalling 3.0KiB
2025-07-16 22:08:55.629454: I external/xla/xla/tsl/framework/bfc_allocator.cc:1113] 1 Chunks of size 30720000000 totalling 28.61GiB
2025-07-16 22:08:55.629467: I external/xla/xla/tsl/framework/bfc_allocator.cc:1117] Sum Total of in-use chunks: 28.61GiB
2025-07-16 22:08:55.629479: I external/xla/xla/tsl/framework/bfc_allocator.cc:1119] Total bytes in pool: 34359738368 memory_limit_: 42909460070 available bytes: 8549721702 curr_region_allocation_bytes_: 34359738368
2025-07-16 22:08:55.629491: I external/xla/xla/tsl/framework/bfc_allocator.cc:1124] Stats:
Limit:                     42909460070
InUse:                     30720003072
MaxInUse:                  30720003072
NumAllocs:                           2
MaxAllocSize:              30720000000
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2025-07-16 22:08:55.629517: W external/xla/xla/tsl/framework/bfc_allocator.cc:512] ******************************************************************************************__________
E0000 00:00:1752703735.629561 2610542 pjrt_stream_executor_client.cc:2939] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 30720000000 bytes. [tf-allocator-allocation-error='']
ERROR: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 30720000000 bytes. [tf-allocator-allocation-error='']

Stacktrace:
 [1] reactant_err(msg::Cstring)
   @ Reactant.XLA ~/.julia/packages/Reactant/bK8OR/src/xla/Utils.jl:12
 [2] macro expansion
   @ ~/.julia/packages/Reactant/bK8OR/src/xla/PJRT/LoadedExecutable.jl:195 [inlined]
 [3] execute_sharded
   @ ~/.julia/packages/Reactant/bK8OR/src/xla/PJRT/LoadedExecutable.jl:164 [inlined]
 [4] macro expansion
   @ ~/.julia/packages/Reactant/bK8OR/src/Compiler.jl:3197 [inlined]
 [5] (::Reactant.Compiler.Thunk{…})(::Reactant.ConcretePJRTArray{…}, ::Int64, ::Reactant.ConcretePJRTArray{…})
   @ Reactant.Compiler ~/.julia/packages/Reactant/bK8OR/src/Compiler.jl:3644
 [6] top-level scope
   @ REPL[10]:1
Some type information was truncated. Use `show(err)` to see complete types.
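
For what it's worth, the numbers in the log are consistent with a second, full-size copy of the matrix being requested. Below is a quick sanity check in plain Julia (no Reactant calls). The column count of 10_000_000 is inferred from the requested byte count in the log, and the variable names are only for illustration:

```julia
# Sizes taken from the snippet and the allocator log above.
T = Float32
dim = 768
n_cols = 10_000_000              # inferred: 30720000000 bytes / (768 * 4 bytes)

matrix_bytes = dim * n_cols * sizeof(T)   # size of the full matrix buffer
vector_bytes = dim * sizeof(T)            # size of the single vector being written

requested = 30_720_000_000       # "trying to allocate" in the log
in_use    = 30_720_003_072       # "InUse" in the allocator stats
limit     = 42_909_460_070       # "Limit" in the allocator stats

# The failed request is exactly one more matrix-sized buffer.
@assert requested == matrix_bytes
# What is already resident is the original matrix plus the vector.
@assert in_use == matrix_bytes + vector_bytes
# Two matrix-sized buffers cannot fit under the allocator limit.
@assert in_use + requested > limit

println("matrix size in GiB: ", matrix_bytes / 2^30)
```

If that reading is right, the call is trying to materialize a new output buffer rather than updating the existing one, which is what I expected assert_nonallocating=true to rule out.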

Then if I try to run it again:

julia> write_row_rx!(vectors_gpu, 1, vector_gpu)
ERROR: Reactant.ConcretePJRTArray{Float32, 2, 1, Reactant.Sharding.ShardInfo{Reactant.Sharding.NoSharding, Nothing}} has already been donated!
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] getproperty(x::Reactant.ConcretePJRTArray{Float32, 2, 1, Reactant.Sharding.ShardInfo{Reactant.Sharding.NoSharding, Nothing}}, f::Symbol)
   @ Reactant ~/.julia/packages/Reactant/bK8OR/src/Types.jl:10
 [3] __resolve_device_and_client(client::Nothing, seen_args::Reactant.OrderedIdDict{Any, Any}, linear_args::Vector{Union{…}}, is_sharded::Bool)
   @ Reactant.Compiler ~/.julia/packages/Reactant/bK8OR/src/Compiler.jl:3343
 [4] compile_xla(f::Function, args::Tuple{…}; client::Nothing, serializable::Bool, kwargs::@Kwargs{…})
   @ Reactant.Compiler ~/.julia/packages/Reactant/bK8OR/src/Compiler.jl:3408
 [5] compile_xla
   @ ~/.julia/packages/Reactant/bK8OR/src/Compiler.jl:3377 [inlined]
 [6] compile(f::Function, args::Tuple{…}; kwargs::@Kwargs{…})
   @ Reactant.Compiler ~/.julia/packages/Reactant/bK8OR/src/Compiler.jl:3462
 [7] top-level scope
   @ ~/.julia/packages/Reactant/bK8OR/src/Compiler.jl:2557
Some type information was truncated. Use `show(err)` to see complete types.

Am I missing something? In CUDA.jl this kind of mutation is easy, but it doesn't optimize like Reactant.jl does (~4x slower for brute-force similarity search). I'm considering getting a pointer to the data and passing it to CUDA.jl, but that seems dangerous. Any suggestions?

Also curious how the concurrency/atomicity model works. What are the visibility guarantees of Reactant.jl operations? I'd want writes to be linearizably reflected by subsequent reads to the same location.
