…d compat

Major rework of the AMDGPU instruction selection for CDNA3/MI300X:
- Dynamic SGPR layout: kernarg ptr at s[0:1], TGID at s2+, no dispatch_ptr. blockDim/gridDim read from hidden kernarg (same approach as hipcc).
- Wave64 divergence: s_and_saveexec_b64 with even-aligned physical SGPR pairs, s_xor_b64/s_or_b64 for else/merge blocks.
- MFMA: full isel for all 22 matrix variants (f16/bf16/f32/i8/fp8/f64). VOP3P_MAI encoding validated via llvm-objdump.
- CDNA s_add_i32 errata: BIR_ADD, BIR_SUB, and scalar GEP all promoted to VALU on is_cdna() to avoid the zero-result bug with SMEM operands.
- Shuffle fix: correct operand mapping (mask/val/delta/width), per-variant lane computation (tid+delta for DOWN, tid-delta for UP, tid^delta for XOR).
- Device call diagnostic: isel_call now errors instead of emitting garbage.
- Tinygrad compat: __ockl_* thread builtins, __ocml_* math builtins, _Float16 alias, blockDim/gridDim via hidden kernarg.
- BIR extensions: BIR_MFMA opcode, STYPE_BFLOAT16, per-chip ELF metadata.
- 8/8 HW tests passing on MI300X + 30 test kernels for validation.
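The per-variant lane computation from the shuffle fix can be sketched as below. This is a minimal illustration, not the PR's code: `shfl_kind`, `shfl_src_lane`, and `bpermute_addr` are hypothetical names, and width or out-of-range handling is deliberately left out. The only hardware detail assumed is that ds_bpermute_b32 takes a byte address, so the lane index is scaled by 4.

```c
#include <stdint.h>

/* Hypothetical names for illustration; not taken from the PR's source. */
enum shfl_kind { SHFL_DOWN, SHFL_UP, SHFL_XOR };

/* Source lane for one thread, before any width or bounds handling. */
static int32_t shfl_src_lane(enum shfl_kind kind, int32_t tid, int32_t delta)
{
    switch (kind) {
    case SHFL_DOWN: return tid + delta;  /* read from a higher lane */
    case SHFL_UP:   return tid - delta;  /* read from a lower lane */
    case SHFL_XOR:  return tid ^ delta;  /* butterfly exchange */
    }
    return tid;  /* unknown variant: read own lane */
}

/* ds_bpermute_b32 addresses lanes in bytes, so scale the lane index by 4. */
static int32_t bpermute_addr(int32_t lane)
{
    return lane * 4;
}
```

For example, lane 5 with delta 1 reads from lane 6 for DOWN and from lane 4 for both UP and XOR.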
Four new test pairs proving backend features work on GFX942:
- test_loop: for-loop accumulation exercising BIR_BR/BR_COND/PHI. Expected: out[i] = n*(n-1)/2 = 2016 for n=64.
- test_lds: shared memory write + barrier + reversed read. Expected: out[i] = (63-i)*2.
- test_shfl: __shfl_down_sync via ds_bpermute_b32 with tid+delta. Expected: out[i] = in[i+1] for i<63.
- test_mfma: v_mfma_f32_4x4x1f32 encoding validation. Expected: out[0] = a*b = 6.0.

All compile with 0 decode failures. Awaiting MI300X HW test.
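A host-side reference for two of the expected values might look like the sketch below. It rests on assumptions about the kernels that the commit message does not spell out: that test_loop sums the integers 0..n-1, and that test_lds has each lane store 2*tid to shared memory before reading the mirrored slot. `ref_loop` and `ref_lds` are hypothetical helper names.

```c
#include <stdint.h>

#define WAVE 64

/* test_loop (assumed body): sum of 0..n-1, closed form n*(n-1)/2. */
static int32_t ref_loop(int32_t n)
{
    int32_t acc = 0;
    for (int32_t k = 0; k < n; k++)
        acc += k;
    return acc;
}

/* test_lds (assumed body): lane i stores 2*i, then reads slot 63-i. */
static void ref_lds(int32_t out[WAVE])
{
    int32_t lds[WAVE];
    for (int i = 0; i < WAVE; i++)
        lds[i] = 2 * i;
    for (int i = 0; i < WAVE; i++)
        out[i] = lds[WAVE - 1 - i];
}
```

Under those assumptions, `ref_loop(64)` gives 2016 and `ref_lds` fills `out[i]` with (63-i)*2, matching the expected values listed above.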
The ELF layout puts the kernel descriptor in .rodata and code in .text. The emu was reading .text byte 0 as KD, getting instruction bytes as scratch_size (~4GB), causing OOM on the CI runner. Now reads .rodata for KD, pads to 256 bytes, appends .text code. Falls back to old .text-only layout for backwards compat.
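The image layout described above can be sketched roughly as follows. `build_image` and `KD_PAD` are hypothetical names, the section bytes are assumed to have been extracted from the ELF already, and rejecting a .rodata larger than the pad is a simplification of this sketch rather than something the commit states.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define KD_PAD 256  /* kernel descriptor area is padded to 256 bytes */

/* Returns a malloc'd image: .rodata (KD) zero-padded to KD_PAD, then .text.
 * With no .rodata, falls back to the old .text-only layout. */
static uint8_t *build_image(const uint8_t *rodata, size_t rodata_len,
                            const uint8_t *text, size_t text_len,
                            size_t *out_len)
{
    size_t head = 0;
    if (rodata != NULL && rodata_len > 0) {
        if (rodata_len > KD_PAD)
            return NULL;  /* sketch only handles a KD that fits the pad */
        head = KD_PAD;
    }
    uint8_t *img = calloc(head + text_len, 1);  /* zero-filled padding */
    if (img == NULL)
        return NULL;
    if (head != 0)
        memcpy(img, rodata, rodata_len);
    if (text_len != 0)
        memcpy(img + head, text, text_len);
    *out_len = head + text_len;
    return img;
}
```

With a 64-byte descriptor and 8 bytes of code, the result is 264 bytes long and the first instruction byte sits at offset 256, so the descriptor fields are never read from instruction bytes.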
Three issues from the dynamic SGPR layout change:
1. USRDT ordering: kernel_code_properties now only sets bit 3 (KERNARG_PTR), not bit 1 (DISPATCH_PTR). Build USRDT from KPROP bits instead of hardcoding [dispatch, kernarg].
2. Hidden kernargs: blockDim/gridDim now come from hidden kernarg fields (block_count + group_size) at explicit arg offset + 32. Populate these in run_vadd.
3. Memory overlap: DSPKT at KAOFF+32 clobbered the hidden kernargs (group_size_x got overwritten with NELMS=256). Move DPOFF to KAOFF+64 so they don't overlap.
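The memory overlap in the third issue can be checked with a small interval test. This sketch assumes the hidden kernargs start at KAOFF+32 and occupy 18 bytes (three 32-bit block counts followed by three 16-bit group sizes), and that the dispatch packet is a 64-byte AQL packet. The helper names and the `HIDDEN_SIZE` value are assumptions, not taken from the test harness.

```c
#include <stdbool.h>
#include <stddef.h>

#define HIDDEN_OFF  32  /* from the commit message: explicit arg offset + 32 */
#define HIDDEN_SIZE 18  /* assumed: 3 x u32 block_count + 3 x u16 group_size */
#define DSPKT_SIZE  64  /* assumed: 64-byte AQL dispatch packet */

/* True when half-open ranges [a, a+a_len) and [b, b+b_len) intersect. */
static bool ranges_overlap(size_t a, size_t a_len, size_t b, size_t b_len)
{
    return a < b + b_len && b < a + a_len;
}

/* True when a dispatch packet placed at dpoff lands on the hidden kernargs. */
static bool dspkt_clobbers_hidden(size_t kaoff, size_t dpoff)
{
    return ranges_overlap(kaoff + HIDDEN_OFF, HIDDEN_SIZE, dpoff, DSPKT_SIZE);
}
```

Under these assumptions, a packet at KAOFF+32 overlaps the hidden fields while one at KAOFF+64 does not, which is consistent with the fix described in item 3.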
Brings the GFX942 (CDNA3/MI300X) backend from "compiles hello world" to "compiles tinygrad kernels and passes 12 HW
test suites."
Backend (src/amdgpu/)
kernarg (matches hipcc's ABI).
restore.
operands cause s_add to return 0 on MI300X.
(tid+delta for DOWN, tid-delta for UP, tid^delta for XOR) before ds_bpermute_b32.
fmin, fmax), _Float16, half conversions.
BIR/frontend
Tests (38 new files)
Test kernels were generated by QwenCoder and edited by hand. Apologies for the lack of wit.
8 HW-validated pairs (64/64 on MI300X): test_justN, test_noload, test_vadd, test_3arg_dispatch, test_loadonly,
test_11sgpr, test_branch, test_load
4 new validation pairs (0 decode failures, awaiting HW):
Tinygrad compat: tg_compat + tg_runner (thread model + math builtins)
Thanks to https://hotaisle.xyz for providing MI300X compute for hardware validation.