GFX 942 Additions and test kernels #56

Merged: Zaneham merged 4 commits into master from fixesAMD on Mar 5, 2026

@Zaneham Zaneham commented Mar 5, 2026

Brings the GFX942 (CDNA3/MI300X) backend from "compiles hello world" to "compiles tinygrad kernels and passes 12 test
suites": 8 validated on MI300X hardware, plus 4 that decode cleanly and await hardware runs.

Backend (src/amdgpu/)

  • Dynamic SGPR layout: kernarg ptr at s[0:1], workgroup IDs at s2+, no dispatch_ptr. blockDim/gridDim read from hidden
    kernarg (matches hipcc's ABI).
  • Wave64 divergence: s_and_saveexec_b64 with even-aligned physical SGPR pairs, s_xor_b64/s_or_b64 for else/merge
    restore.
  • MFMA: all 22 matrix variants (f16/bf16/f32/i8/fp8/bf8/f64) mapped from BIR_MFMA subop to VOP3P_MAI encoding.
  • CDNA s_add_i32 zero-result errata: BIR_ADD, BIR_SUB, and scalar GEP promoted to VALU on is_cdna(). Two SMEM-sourced
    operands cause s_add to return 0 on MI300X.
  • Shuffle operand fix (P0): correct BIR operand mapping ([0]=mask, [1]=val, [2]=delta), per-variant lane computation
    (tid+delta for DOWN, tid-delta for UP, tid^delta for XOR) before ds_bpermute_b32.
  • Device call diagnostic: isel_call now errors instead of emitting garbage s_swappc_b64.
  • Tinygrad compat: __ockl_* thread builtins, __ocml_* math builtins (exp2, log2, sqrt, sin, cos, fabs, floor, ceil,
    fmin, fmax), _Float16, half conversions.
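The shuffle fix in the list above can be illustrated with a small reference model. This is only a sketch of the lane arithmetic described in the bullet, not the backend's isel code: the function names are hypothetical, the `width` argument is ignored, and the clamp for out-of-range lanes is an assumption borrowed from CUDA semantics (the PR does not say how the backend handles them). `ds_bpermute_b32` is modeled as each lane reading the value held by lane `addr / 4`.

```python
WAVE_SIZE = 64  # wave64 on GFX942


def source_lane(variant: str, tid: int, delta: int) -> int:
    """Lane that thread `tid` reads from, for each shuffle variant."""
    if variant == "down":
        lane = tid + delta
    elif variant == "up":
        lane = tid - delta
    elif variant == "xor":
        lane = tid ^ delta
    else:
        raise ValueError(f"unknown shuffle variant: {variant}")
    # Assumption: an out-of-range source lane falls back to the thread's own lane.
    return lane if 0 <= lane < WAVE_SIZE else tid


def bpermute(addr, data):
    """Model of ds_bpermute_b32: lane i receives data[addr[i] / 4]."""
    return [data[(a // 4) % WAVE_SIZE] for a in addr]


def shuffle(variant, values, delta):
    """Compute per-lane byte addresses, then do the backward permute."""
    addr = [source_lane(variant, tid, delta) * 4 for tid in range(WAVE_SIZE)]
    return bpermute(addr, values)
```

For example, `shuffle("down", values, 1)` gives lane 0 the value from lane 1, which is the pattern test_shfl checks.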

BIR/frontend

  • BIR_MFMA opcode with subop variant selection
  • STYPE_BFLOAT16 in type system
  • Per-chip ELF target metadata
  • MFMA intrinsic lowering (__builtin_amdgcn_mfma_*, 22 variants)

Tests (38 new files)

Test kernels were generated by QwenCoder and edited by hand. Apologies on the lack of wit.

8 HW-validated pairs (64/64 on MI300X): test_justN, test_noload, test_vadd, test_3arg_dispatch, test_loadonly,
test_11sgpr, test_branch, test_load

4 new validation pairs (0 decode failures, awaiting HW):

  • test_loop — for-loop accumulation (PHI, branch, VALU add)
  • test_lds — shared memory + barrier + reversed read
  • test_shfl — __shfl_down_sync via ds_bpermute_b32
  • test_mfma — v_mfma_f32_4x4x1f32 encoding smoke test

Tinygrad compat: tg_compat + tg_runner (thread model + math builtins)

Thanks to https://hotaisle.xyz for providing MI300X compute for hardware validation.

Zaneham added 4 commits March 5, 2026 21:33
…d compat

Major rework of the AMDGPU instruction selection for CDNA3/MI300X:

- Dynamic SGPR layout: kernarg ptr at s[0:1], TGID at s2+, no dispatch_ptr.
  blockDim/gridDim read from hidden kernarg (same approach as hipcc).
- Wave64 divergence: s_and_saveexec_b64 with even-aligned physical SGPR
  pairs, s_xor_b64/s_or_b64 for else/merge blocks.
- MFMA: full isel for all 22 matrix variants (f16/bf16/f32/i8/fp8/f64).
  VOP3P_MAI encoding validated via llvm-objdump.
- CDNA s_add_i32 errata: BIR_ADD, BIR_SUB, and scalar GEP all promoted
  to VALU on is_cdna() to avoid the zero-result bug with SMEM operands.
- Shuffle fix: correct operand mapping (mask/val/delta/width), per-variant
  lane computation (tid+delta for DOWN, tid-delta for UP, tid^delta for XOR).
- Device call diagnostic: isel_call now errors instead of emitting garbage.
- Tinygrad compat: __ockl_* thread builtins, __ocml_* math builtins,
  _Float16 alias, blockDim/gridDim via hidden kernarg.
- BIR extensions: BIR_MFMA opcode, STYPE_BFLOAT16, per-chip ELF metadata.
- 8/8 HW tests passing on MI300X + 30 test kernels for validation.
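
The wave64 divergence item above can be pictured at the level of EXEC masks. The following is a rough model under stated assumptions: it tracks only the 64-bit masks, the helper names are hypothetical, and the exact instruction sequence the backend emits is not shown in this PR. `align_pair` illustrates the even-alignment requirement for a 64-bit SGPR pair.

```python
MASK64 = (1 << 64) - 1  # one bit per lane in a wave64


def align_pair(reg: int) -> int:
    """Round an SGPR index up to the next even index (start of an s[n:n+1] pair)."""
    return (reg + 1) & ~1


def s_and_saveexec_b64(exec_mask: int, cond: int):
    """Return (saved, new_exec): saved = old EXEC, new EXEC = EXEC & cond."""
    return exec_mask, exec_mask & cond & MASK64


def if_else_masks(exec_mask: int, cond: int):
    """EXEC masks for the then block, the else block, and the merge point."""
    saved, then_exec = s_and_saveexec_b64(exec_mask, cond)
    else_exec = saved ^ then_exec    # s_xor_b64: lanes that skipped the then block
    merged = then_exec | else_exec   # s_or_b64: restore the lanes active on entry
    return then_exec, else_exec, merged
```

The invariant worth checking is that the merged mask equals the mask on entry, so no lane is lost or gained across the branch.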

Four new test pairs proving backend features work on GFX942:

- test_loop: for-loop accumulation exercising BIR_BR/BR_COND/PHI.
  Expected: out[i] = n*(n-1)/2 = 2016 for n=64.
- test_lds: shared memory write + barrier + reversed read.
  Expected: out[i] = (63-i)*2.
- test_shfl: __shfl_down_sync via ds_bpermute_b32 with tid+delta.
  Expected: out[i] = in[i+1] for i<63.
- test_mfma: v_mfma_f32_4x4x1f32 encoding validation.
  Expected: out[0] = a*b = 6.0.

All compile with 0 decode failures. Awaiting MI300X HW test.
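
The expected values listed above can be reproduced with a host-side reference model. The kernel sources are not shown in this PR, so the bodies below are inferred from the stated results and are assumptions: that test_loop sums 0..n-1, that test_lds has each thread store `tid*2` before the reversed read, and that the last lane of test_shfl keeps its own value.

```python
N = 64  # one wave64 worth of threads


def ref_loop(n: int = N):
    """test_loop: every thread accumulates 0 + 1 + ... + (n-1)."""
    acc = 0
    for k in range(n):
        acc += k
    return [acc] * n


def ref_lds(n: int = N):
    """test_lds: write tid*2 to shared memory, barrier, read index n-1-tid."""
    lds = [tid * 2 for tid in range(n)]  # assumed store pattern
    return [lds[n - 1 - tid] for tid in range(n)]


def ref_shfl_down(inp, delta: int = 1):
    """test_shfl: out[i] = in[i+delta]; out-of-range lanes keep in[i] (assumed)."""
    n = len(inp)
    return [inp[i + delta] if i + delta < n else inp[i] for i in range(n)]
```

Comparing device output against these lists element by element gives the kind of 64/64 score reported for the hardware-validated pairs.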

The ELF layout puts the kernel descriptor in .rodata and code in
.text.  The emu was reading .text byte 0 as KD, getting instruction
bytes as scratch_size (~4GB), causing OOM on the CI runner.

Now reads .rodata for KD, pads to 256 bytes, appends .text code.
Falls back to old .text-only layout for backwards compat.
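
A minimal sketch of the loading rule described above, assuming the ELF sections have already been extracted into a dict of bytes. `build_image` and the dict shape are hypothetical, not the emulator's actual interface, and rejecting a `.rodata` larger than 256 bytes is an assumption made to keep the example small.

```python
KD_PAD = 256  # kernel descriptor region is padded to 256 bytes


def build_image(sections: dict) -> bytes:
    """Lay out the emulator image: padded KD from .rodata, then .text code."""
    text = sections[".text"]
    rodata = sections.get(".rodata")
    if rodata is None:
        # Fallback: old .text-only layout, passed through unchanged.
        return text
    if len(rodata) > KD_PAD:
        # Assumption: the descriptor must fit in the padded region.
        raise ValueError(".rodata larger than the padded KD region")
    return rodata.ljust(KD_PAD, b"\x00") + text
```

With this layout the descriptor fields are read from the start of the image and the first instruction sits at byte 256, so instruction bytes are no longer misread as scratch_size.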

Three issues from the dynamic SGPR layout change:

1. USRDT ordering: kernel_code_properties now only sets bit 3
   (KERNARG_PTR), not bit 1 (DISPATCH_PTR). Build USRDT from
   KPROP bits instead of hardcoding [dispatch, kernarg].

2. Hidden kernargs: blockDim/gridDim now come from hidden
   kernarg fields (block_count + group_size) at explicit arg
   offset + 32. Populate these in run_vadd.

3. Memory overlap: DSPKT at KAOFF+32 clobbered the hidden
   kernargs (group_size_x got overwritten with NELMS=256).
   Move DPOFF to KAOFF+64 so they don't overlap.
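
Issues 1 and 3 above can be sketched as follows. This is an illustrative model, not the emulator's code: the function names are hypothetical, the bit positions and SGPR counts follow the AMDGPU kernel descriptor layout as I understand it, and the 18-byte hidden-kernarg size (three 32-bit block counts plus three 16-bit group sizes) and 64-byte dispatch packet size are assumptions. Offsets are relative to KAOFF.

```python
# (bit in kernel_code_properties, field name, number of SGPRs), in ABI order
USER_SGPR_FIELDS = [
    (0, "private_segment_buffer", 4),
    (1, "dispatch_ptr", 2),
    (2, "queue_ptr", 2),
    (3, "kernarg_segment_ptr", 2),
    (4, "dispatch_id", 2),
    (5, "flat_scratch_init", 2),
    (6, "private_segment_size", 1),
]


def user_sgpr_layout(kprop: int) -> dict:
    """Map each enabled field to its (first, last) SGPR index."""
    layout, nxt = {}, 0
    for bit, name, count in USER_SGPR_FIELDS:
        if kprop & (1 << bit):
            layout[name] = (nxt, nxt + count - 1)
            nxt += count
    return layout


def overlaps(a_off: int, a_len: int, b_off: int, b_len: int) -> bool:
    """True if byte ranges [a_off, a_off+a_len) and [b_off, b_off+b_len) intersect."""
    return a_off < b_off + b_len and b_off < a_off + a_len
```

With only bit 3 set, the kernarg pointer lands in s[0:1], matching the new layout; with bits 1 and 3 set it moves to s[2:3], which is what the old hardcoded [dispatch, kernarg] order assumed.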
@Zaneham Zaneham merged commit 8e61dc7 into master Mar 5, 2026
3 checks passed