add scheduler#52
Conversation
2c1aca3 to
e63c858
Compare
|
Hello @nataliakokoromyti! I'll get to working on reviewing this soon I'm just having a couple days of low tempo with BarraCUDA now the hype train has moved on. I'll have a look at this tonight and thanks again for your submission! |
There was a problem hiding this comment.
Hey!! Thanks for this, the overall design is really solid. Stripping waits, DAG from register deps, critical-path scheduling, re-inserting waits at consumers. The architecture is exactly right and the epoch-based barrier handling is clean :-) A few things need fixing before we can merge though!
Correctness issues:
-
Missing WAR (Write-After-Read) on special registers. The DAG tracks RAW and WAW but not WAR. Vregs are SSA (single-def) so they're fine, but VCC/SCC get redefined. If isel emits
v_cmp(writes VCC) thenv_cndmask(reads VCC) then anotherv_cmp(writes VCC), there's no edge preventing the second cmp from being scheduled before the cndmask. It's like a librarian reshelving your book while you're still reading it. Need to track last_use alongside last_def and add use->def edges. -
Missing implicit SCC read on
s_cselect_b32. Isel emits it asemit2(dst, src0, src1)with no SCC operand, so the scheduler has no RAW edge from the SCC-writing cmp. Same issue for any instruction that implicitly reads SCC. -
Missing implicit SCC write on most SOP2 instructions.
s_and_b32,s_or_b32,s_lshl_b32, etc. all set SCC but onlys_add_u32/s_sub_u32and SOPC are tracked.
Safety violations (we're pretty strict about these!):
-
Unbounded loop. The main scheduling loop is
for (;;)with a break. All loops need bounded iteration counts -- something likefor (uint32_t guard = 0; guard < SCHED_MAX_BLOCK * 2; guard++). Runaway loops in a compiler are like runaway trains except the train is on fire and the tracks are your user's deadline. -
Missing bounds checks on
s_final,s_ready, ands_output. Thefnindex during wait reinsertion,nreadyin the scheduling loop, andout_posinamdgpu_sched()all increment without overflow guards. If they exceed the array size we get silent memory corruption. -
add_edgesilently drops edges atSCHED_MAX_DEPS(16). If a register has more than 16 consumers, dependency edges vanish and the scheduler could misorder those instructions. It's like air traffic control just forgetting about plane 17 because the whiteboard ran out of space. Should either bump the limit or pin the node as unschedulable when it overflows.
Housekeeping:
-
File naming. Other files in
src/amdgpu/use short names (isel.c,emit.c,encode.c). Would be great to rename tosched.c/sched.hto match. -
Merge conflict. The branch is based on pre-tensix-reorg so the Makefile has
src/tensix_isel.cetc. which moved tosrc/tensix/isel.c. Will need a rebase on master.
The core scheduling logic is great work though!! Really looking forward to getting this merged once the fixes are in :-) If you have any questions or need any help please feel free to reach out. Thanks again!
90b9af4 to
610595f
Compare
610595f to
75e8a99
Compare
The scheduler strips s_waitcnt after every load, builds a DAG from register deps (RAW/WAW), schedules by critical path priority, and re-inserts waits just before the first consumer of each load. So two independent loads issue back to back instead of load wait load wait. Runs on vregs between isel and regalloc, per-block only, barriers split into epochs. Should address #9.