Skip to content

[WIP] Enhance 3D register allocation strategy #442

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 7 commits into
base: aie-public
Choose a base branch
from

Conversation

krishnamtibrewala
Copy link
Collaborator

@krishnamtibrewala krishnamtibrewala commented Apr 22, 2025

I have been investigating Conv2D_bfp16_0 and why it performs so poorly. The issue is that we spill 3D addressing registers frequently.

Upon further investigation of how we have implemented its staged register allocation and how the greedy register allocator works with registers (including sub-registers), I believe the problem lies in how LLVM treats and assumes the impact of modifications in a sub-register on the super-register. This assumption does not hold for AIE 2D/3D registers.
When we split a 2D/3D register, say "Z" (which was used by PADD_3D), we create two new virtual registers, "X" and "Y," which are treated as the same register class as "Z." Instead of "Z," "Y" is used with PADD_3D. The key point is that "X," even though it is considered a "3D" register (i.e., $d1_3d), does not need to follow the constraint of assigning sub-registers to (m1 dn1 dj1 dc1 dn5 dj5 dc5). But we end up forcing this constrain on reg-alloc.

We started by splitting the live ranges of one 3D register "Z" and ended with two 3D registers in the same "3D" register class. This has a snowball effect, as we continually split to achieve the "3D" register constraints for every new virtual register greedy-alloc ends up creating. In Conv2D_bfp16_0, we start with five 3D virtual registers and end up creating 32. Furthermore, due to the handling mechanism of sub-registers, we cannot use getLargestLegalSuperClass(...) to avoid spilling and use other scalar registers. (Note: If spilling has already occurred—common in our staged allocation when allocating 3D/2D registers—the aie-superreg-rewrite pass cannot undo its effects. It only helps in the next pass of the greedy register allocator.)

Ideally, what we could do is:

  1. If we split "Z" into "X" and "Y" virtual registers, focus only on allocating "Y" (used by PADD_3D after splitting) during the first pass of staged register allocation where we only allocate 3D registers.
  2. Retype the unassigned virtual register "X," which is a "3D" register only used in COPY, into the individually legal largest superclass.
  3. In the final pass of our staged register allocation, we will automatically have the flexibility to move these registers to other scalar registers. If necessary, they can be further split into smaller live ranges and moved to other 20-bit registers.

The potential benefits of this approach include reduced spilling and increasing the greedy allocator's flexibility to manage 2D/3D sub-registers.

To implement this with minimal modifications to the target-independent code:

  1. To focus on only the virtual registers used by 2D/3D instructions, we could extend RegAllocBase::enqueue(const LiveInterval *LI). After splitting and creating new virtual registers, RegAllocBase::enqueue(...) is called on them. Currently, we use ShouldAllocateClass(*TRI, RC), which allows us to focus only on specific reg-class like "onlyAllocate3DRegisters." We could establish a similar mechanism to check if the live range "LI" is used by a 2D/3D instruction and focus only on them.
  2. To retype unassigned 3D registers that were not directly used by a 2D/3D instruction, we could extend our aie-superreg-rewrite. This pass currently works only on virtual registers with an assigned physical register. We would need to extend this to include virtual registers that are 2D/3D, not assigned, and not used by any 2D/3D instructions (this should be fairly easy).
  3. There is no need for special handling to allocate unassigned sub-reg of 2D/3D registers (now as full registers) since the final pass of the GreedyRegisterAllocator will handle this.

@krishnamtibrewala
Copy link
Collaborator Author

krishnamtibrewala commented Apr 23, 2025

TODO : Verify if there is any negative impact of following change

Implement TRI.getSubClassWithSubReg to avoid generation of

15704B	  undef %1266.sub_lo_dim:eds = COPY %1267.sub_lo_dim:eds {
	    internal %1266.sub_hi_dim_then_sub_dim_count:eds = COPY %1267.sub_hi_dim_then_sub_dim_count:eds
	    internal %1266.sub_hi_dim_then_sub_dim_size:eds = COPY %1267.sub_hi_dim_then_sub_dim_size:eds
	    internal %1266.sub_hi_dim_then_sub_dim_stride:eds = COPY %1267.sub_hi_dim_then_sub_dim_stride:eds

Rather force it to generate 7 copies. Basically for all the individual sub-regs.
undef %1266.sub_lo_dim:eds = COPY %1267.sub_lo_dim:eds Should get split into 4 individual reg copy

This will help super-reg-rewrite to split them into new virtual reg with there own live-ranges

@krishnamtibrewala
Copy link
Collaborator Author

TODO :

  1. Extend the work to 2D reg
  2. Understand the scope of extending this 1024 bit FIFO register. CHECK : if its even possible.

@@ -66,6 +67,173 @@ void AIE2PPassConfig::addPreRegBankSelect() {
}
}

static bool onlyAllocateLIwith3DInstruction(MachineRegisterInfo &MRI,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: onlyAllocateLIWith3DInstruction

Copy link
Collaborator

@andcarminati andcarminati Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have access to TII, so we could move this switch to that class in a AIE-specific hook. We could also use a nice name, something like isHighPriorityLIUseInstruction (not sure about this name).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the idea of exposing TII was to move switch under a AIE-specific hook.
Chose to delay that until I see some result.

; AIE2P-VREGS-NEXT: [[COPY3:%[0-9]+]]:em = COPY [[LDA_dms_lda_idx_imm]]
; AIE2P-VREGS-NEXT: [[COPY4:%[0-9]+]]:edj = COPY [[LDA_dms_lda_idx_imm2]]
; AIE2P-VREGS-NEXT: [[COPY5:%[0-9]+]]:edc = COPY [[LDA_dms_lda_idx_imm3]]
; AIE2P-VREGS-NEXT: [[COPY2:%[0-9]+]]:edn_as_32bit = COPY [[LDA_dms_lda_idx_imm1]]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we mixing AIE2 and AIE2P here? I mean, there is nothing related to your implementation, but I see no edn_as_32bit for AIE2P. I see spill_eDN_to_eR instead.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, maybe this test is failing.....

DebugVars);
}

// Re-write all the collected unassigned VRegs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe include a description in the commit message!


// Re-write all the collected unassigned VRegs
for (auto &VReg : UnAssignedPhysRegs) {
MCRegister DummyPhysReg;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A brief explanation of this part will be nice!

@andcarminati
Copy link
Collaborator

Hi @krishnamtibrewala, I see a good potential in your approach! Did you manage to get some QoR numbers?

@krishnamtibrewala
Copy link
Collaborator Author

Hi @krishnamtibrewala, I see a good potential in your approach! Did you manage to get some QoR numbers?

Thank you @andcarminati for the feedback, I will incorporate all of them.
Currently focusing on fixing a small issue(an annoying one) that get exposed due to this strategy.

Following are the result for Conv2D_bfp16*
image

@krishnamtibrewala krishnamtibrewala force-pushed the aie-reg-alloc branch 4 times, most recently from 03a5a5b to d944642 Compare May 1, 2025 20:35
; AIE2P-VREGS-NEXT: [[COPY4:%[0-9]+]]:edc_as_32bit = COPY [[MOV_PD_imm11_pseudo]]
; AIE2P-VREGS-NEXT: [[COPY5:%[0-9]+]]:edn_as_32bit = COPY [[COPY2]]
; AIE2P-VREGS-NEXT: [[COPY6:%[0-9]+]]:edj_as_32bit = COPY [[COPY3]]
; AIE2P-VREGS-NEXT: [[COPY7:%[0-9]+]]:em_as_32bit = COPY [[COPY1]]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO : Try to understand why COPY from physical reg leads to spill_* and COPY from some [[COPY*]] lead to *_32bit RC

…f copies

With the new strategy liverange splitting end up creating bundle copies where in some of sub-reg are no longer in use at all and when we split them in Super-Reg-Rewrite we end up creating live range that start of index and ends as dead on same slot index. But there is another reg on the same slot-index (since we have a MOV bundle) which actually have a valid live range.
The new strategy exposes a fundamental problem on how bundled instruction in case of sub-reg are created by live range splitting logic(Refer : SplitEditor::buildSingleSubRegCopy) . From standard llvm perspective it is not a problem but when it comes to AIE and what we do in Super-Reg-Rewrite pass. We make them a complete register(which we want/need to do) but now there are COPY instr where in we end a live range on the Bundle and create a new live range by a different COPY instruction in the same bundle which are using the same reg class for src & dst. The major issue comes when reg-alloc end up assigning same register to such COPY in the same bundle, AFAIK this happens because the bundle is assign one unique stack slot. By expanding the CopyBundle we provide the COPY MI a unique slot and the associate operands a proper LiveInterval
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants