GFX 942 Additions and test kernels by Zaneham · Pull Request #56 · Zaneham/BarraCUDA

Zaneham · 2026-03-05T09:44:37Z

Brings the GFX942 (CDNA3/MI300X) backend from "compiles hello world" to "compiles tinygrad kernels and passes 12 HW
test suites."

Backend (src/amdgpu/)

Dynamic SGPR layout: kernarg ptr at s[0:1], workgroup IDs at s2+, no dispatch_ptr. blockDim/gridDim read from hidden
kernarg (matches hipcc's ABI).
Wave64 divergence: s_and_saveexec_b64 with even-aligned physical SGPR pairs, s_xor_b64/s_or_b64 for else/merge
restore.
MFMA: all 22 matrix variants (f16/bf16/f32/i8/fp8/bf8/f64) mapped from BIR_MFMA subop to VOP3P_MAI encoding.
CDNA s_add_i32 zero-result errata: BIR_ADD, BIR_SUB, and scalar GEP promoted to VALU on is_cdna(). Two SMEM-sourced
operands cause s_add to return 0 on MI300X.
Shuffle operand fix (P0): correct BIR operand mapping ([0]=mask, [1]=val, [2]=delta), per-variant lane computation
(tid+delta for DOWN, tid-delta for UP, tid^delta for XOR) before ds_bpermute_b32.
Device call diagnostic: isel_call now errors instead of emitting garbage s_swappc_b64.
Tinygrad compat: __ockl_* thread builtins, __ocml_* math builtins (exp2, log2, sqrt, sin, cos, fabs, floor, ceil,
fmin, fmax), _Float16, half conversions.
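
The per-variant lane computation from the shuffle fix can be sketched as a small host-side model. This is an illustration only, not the backend's isel code: the function names are hypothetical, and the wrap-around of out-of-range lane addresses is an assumption of the model rather than something the PR states.

```python
WAVE_SIZE = 64  # wave64 on GFX942


def source_lane(variant: str, tid: int, delta: int) -> int:
    """Lane each thread reads from, per shuffle variant (no range handling)."""
    if variant == "DOWN":
        return tid + delta
    if variant == "UP":
        return tid - delta
    if variant == "XOR":
        return tid ^ delta
    raise ValueError(f"unknown shuffle variant: {variant}")


def ds_bpermute_b32(addr, data):
    """Backward permute model: lane i reads data[(addr[i] // 4) % WAVE_SIZE].

    The modulo wrap for out-of-range addresses is an assumption of this model.
    """
    return [data[(a // 4) % WAVE_SIZE] for a in addr]


def shuffle(variant, values, delta):
    # ds_bpermute_b32 takes a byte address, so the lane index is scaled by 4.
    addr = [source_lane(variant, tid, delta) * 4 for tid in range(WAVE_SIZE)]
    return ds_bpermute_b32(addr, values)


values = [10 * i for i in range(WAVE_SIZE)]
down = shuffle("DOWN", values, 1)
```

In this model `down[i]` equals `values[i + 1]` for every `i < 63`, which is the result test_shfl expects.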

BIR/frontend

BIR_MFMA opcode with subop variant selection
STYPE_BFLOAT16 in type system
Per-chip ELF target metadata
MFMA intrinsic lowering (__builtin_amdgcn_mfma_*, 22 variants)

Tests (38 new files)

Test kernels were generated by QwenCoder and edited by hand. Apologies for the lack of wit.

8 HW-validated pairs (64/64 on MI300X): test_justN, test_noload, test_vadd, test_3arg_dispatch, test_loadonly,
test_11sgpr, test_branch, test_load

4 new validation pairs (0 decode failures, awaiting HW):

test_loop — for-loop accumulation (PHI, branch, VALU add)
test_lds — shared memory + barrier + reversed read
test_shfl — __shfl_down_sync via ds_bpermute_b32
test_mfma — v_mfma_f32_4x4x1f32 encoding smoke test

Tinygrad compat: tg_compat + tg_runner (thread model + math builtins)

Thanks to https://hotaisle.xyz for providing MI300X compute for hardware validation.

…d compat

Major rework of the AMDGPU instruction selection for CDNA3/MI300X:

- Dynamic SGPR layout: kernarg ptr at s[0:1], TGID at s2+, no dispatch_ptr. blockDim/gridDim read from hidden kernarg (same approach as hipcc).
- Wave64 divergence: s_and_saveexec_b64 with even-aligned physical SGPR pairs, s_xor_b64/s_or_b64 for else/merge blocks.
- MFMA: full isel for all 22 matrix variants (f16/bf16/f32/i8/fp8/f64). VOP3P_MAI encoding validated via llvm-objdump.
- CDNA s_add_i32 errata: BIR_ADD, BIR_SUB, and scalar GEP all promoted to VALU on is_cdna() to avoid the zero-result bug with SMEM operands.
- Shuffle fix: correct operand mapping (mask/val/delta/width), per-variant lane computation (tid+delta for DOWN, tid-delta for UP, tid^delta for XOR).
- Device call diagnostic: isel_call now errors instead of emitting garbage.
- Tinygrad compat: __ockl_* thread builtins, __ocml_* math builtins, _Float16 alias, blockDim/gridDim via hidden kernarg.
- BIR extensions: BIR_MFMA opcode, STYPE_BFLOAT16, per-chip ELF metadata.
- 8/8 HW tests passing on MI300X + 30 test kernels for validation.
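
The wave64 divergence handling can be pictured by tracking the 64-bit EXEC mask through an if/else. The sketch below is one plausible reading of the s_and_saveexec_b64 / s_xor_b64 / s_or_b64 sequence named above, not the exact instruction stream the backend emits; `run_if_else` and `even_aligned_pair` are hypothetical helper names.

```python
MASK64 = (1 << 64) - 1  # one bit per lane in a wave64


def s_and_saveexec_b64(exec_mask: int, cond: int):
    """Return (saved, new_exec): save EXEC, then EXEC = EXEC & cond."""
    return exec_mask, exec_mask & cond & MASK64


def even_aligned_pair(first_free_sgpr: int) -> int:
    """Round up to an even SGPR index so the 64-bit pair s[n:n+1] is aligned."""
    return (first_free_sgpr + 1) & ~1


def run_if_else(exec_mask: int, cond: int):
    """EXEC masks seen by the then block, the else block, and the merge block."""
    saved, exec_then = s_and_saveexec_b64(exec_mask, cond)
    exec_else = (exec_then ^ saved) & MASK64   # s_xor_b64: lanes that skipped 'then'
    exec_merge = (exec_else | saved) & MASK64  # s_or_b64: restore the entry mask
    return exec_then, exec_else, exec_merge


# All 64 lanes active on entry; the condition holds on even lanes only.
even_lanes = sum(1 << lane for lane in range(0, 64, 2))
then_mask, else_mask, merge_mask = run_if_else(MASK64, even_lanes)
```

The then and else masks are disjoint and together cover the entry mask, and the merge block runs with the entry mask restored.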

Four new test pairs proving backend features work on GFX942:

- test_loop: for-loop accumulation exercising BIR_BR/BR_COND/PHI. Expected: out[i] = n*(n-1)/2 = 2016 for n=64.
- test_lds: shared memory write + barrier + reversed read. Expected: out[i] = (63-i)*2.
- test_shfl: __shfl_down_sync via ds_bpermute_b32 with tid+delta. Expected: out[i] = in[i+1] for i<63.
- test_mfma: v_mfma_f32_4x4x1f32 encoding validation. Expected: out[0] = a*b = 6.0.

All compile with 0 decode failures. Awaiting MI300X HW test.
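
The expected values for test_loop and test_lds follow from simple arithmetic, shown here as a host-side reference model. The kernel bodies are assumptions inferred from the expected outputs (a loop summing 0..n-1, and each thread writing tid*2 to shared memory before reading the mirrored slot); the actual test kernels may be written differently.

```python
N = 64  # threads per block, and the loop bound used by test_loop


def expected_loop(n: int = N) -> int:
    """Assumed kernel body: acc = 0; for k in 0..n-1: acc += k."""
    acc = 0
    for k in range(n):
        acc += k
    return acc  # equals n*(n-1)/2


def expected_lds(n: int = N):
    """Assumed kernel body: lds[tid] = tid*2; barrier; out[tid] = lds[n-1-tid]."""
    lds = [tid * 2 for tid in range(n)]           # shared-memory write
    return [lds[n - 1 - tid] for tid in range(n)]  # reversed read after the barrier


loop_value = expected_loop()
lds_values = expected_lds()
```

Here `loop_value` is 2016 and `lds_values[i]` is `(63 - i) * 2`, matching the figures listed above.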

The ELF layout puts the kernel descriptor in .rodata and code in .text. The emu was reading .text byte 0 as KD, getting instruction bytes as scratch_size (~4GB), causing OOM on the CI runner. Now reads .rodata for KD, pads to 256 bytes, appends .text code. Falls back to old .text-only layout for backwards compat.
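
The emulator-side layout fix described above amounts to the following, sketched with a hypothetical `build_emu_image` helper that takes already-extracted section bytes. Rejecting a .rodata larger than 256 bytes is an assumption of this sketch; the PR does not say how that case is handled.

```python
KD_PAD = 256  # size the kernel-descriptor region is padded to


def build_emu_image(sections: dict) -> bytes:
    """Build the image the emulator loads: padded KD from .rodata, then .text code."""
    text = sections[".text"]
    rodata = sections.get(".rodata")
    if not rodata:
        # Backwards-compatible path: old objects keep KD and code together in .text.
        return text
    if len(rodata) > KD_PAD:
        # Assumption of this sketch: a larger .rodata is treated as an error.
        raise ValueError("kernel descriptor region larger than 256 bytes")
    kd = rodata.ljust(KD_PAD, b"\x00")  # zero-pad the descriptor to 256 bytes
    return kd + text


new_style = build_emu_image({".rodata": b"\x01" * 64, ".text": b"\xaa\xbb"})
old_style = build_emu_image({".text": b"\xcc\xdd"})
```

With this layout the descriptor fields are never read out of instruction bytes, which is what produced the bogus ~4GB scratch_size.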

Three issues from the dynamic SGPR layout change:

1. USRDT ordering: kernel_code_properties now only sets bit 3 (KERNARG_PTR), not bit 1 (DISPATCH_PTR). Build USRDT from KPROP bits instead of hardcoding [dispatch, kernarg].
2. Hidden kernargs: blockDim/gridDim now come from hidden kernarg fields (block_count + group_size) at explicit arg offset + 32. Populate these in run_vadd.
3. Memory overlap: DSPKT at KAOFF+32 clobbered the hidden kernargs (group_size_x got overwritten with NELMS=256). Move DPOFF to KAOFF+64 so they don't overlap.
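
The USRDT ordering and the memory overlap can both be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the KAOFF value is arbitrary, the hidden-kernarg size (three 32-bit block counts followed by three 16-bit group sizes, 18 bytes) follows the code object v5 layout as I understand it, and the helper names are hypothetical. The 64-byte size is that of an AQL dispatch packet.

```python
KPROP_DISPATCH_PTR = 1 << 1  # kernel_code_properties bit 1
KPROP_KERNARG_PTR = 1 << 3   # kernel_code_properties bit 3


def build_usrdt(kprop: int):
    """List the user SGPR entries in order, driven by the enabled KPROP bits."""
    order = [(KPROP_DISPATCH_PTR, "dispatch"), (KPROP_KERNARG_PTR, "kernarg")]
    return [name for bit, name in order if kprop & bit]


def overlaps(a_start: int, a_size: int, b_start: int, b_size: int) -> bool:
    """True if byte ranges [a_start, a_start+a_size) and [b_start, ...) intersect."""
    return a_start < b_start + b_size and b_start < a_start + a_size


KAOFF = 0x1000           # hypothetical kernarg buffer offset
HIDDEN_OFF = KAOFF + 32  # hidden kernargs start after the explicit arguments
HIDDEN_SIZE = 3 * 4 + 3 * 2  # assumed: block_count x/y/z (u32) + group_size x/y/z (u16)
DSPKT_SIZE = 64          # AQL dispatch packet

old_clash = overlaps(HIDDEN_OFF, HIDDEN_SIZE, KAOFF + 32, DSPKT_SIZE)
new_clash = overlaps(HIDDEN_OFF, HIDDEN_SIZE, KAOFF + 64, DSPKT_SIZE)
usrdt = build_usrdt(KPROP_KERNARG_PTR)
```

With only KERNARG_PTR set, `usrdt` contains just the kernarg entry, so the kernarg pointer lands in s[0:1]. `old_clash` is true and `new_clash` is false, consistent with moving DPOFF to KAOFF+64.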

Zaneham added 4 commits March 5, 2026 21:33

Zaneham merged commit 8e61dc7 into master Mar 5, 2026
3 checks passed

Zaneham merged 4 commits into master from fixesAMD