[PATCH] D143335: [AMDGPU] Use instruction order in machine function to process workList of moveToVALU

Tue Feb 21 04:15:00 PST 2023

skc7 added inline comments.

================
Comment at: llvm/lib/Target/AMDGPU/SIInstrInfo.cpp:6166
+      return false;
+    const MachineBasicBlock *BA = InstA->getParent();
+    const MachineBasicBlock *BB = InstB->getParent();
----------------
arsenm wrote:
> skc7 wrote:
> > arsenm wrote:
> > > skc7 wrote:
> > > > foad wrote:
> > > > > Do you really need to guarantee the order between different basic blocks? If so, surely you need some kind of topological ordering, not just based on BB number. (And what if there are cycles in the CFG?)
> > > > I'm finding it difficult to compare two instructions from different basic blocks other than using BB number. Can we use dominator tree information? May be DFS numbers be used to compare them?
> > > We used to do this in RPO. Not sure why this was changed.
> > > 
> > > Would it be easier to check all the operands of the one buffer instruction the first time you encounter it?
> > The first time a buffer instruction is processed in the worklist for legalizeOperands, both rsrc and soffset operands wouldn't have vgprs assigned by the pass. Only one of them would have vgpr. V2S copy result used by the other operand still need to be processed in the worklist.
> > 
> > So, if legalizeOperands has to legalize soffset and rsrc together, V2S copies that occur prior to buffer instruction have to be processed in the worklist first.
> Do they really need to be processed first? If you processed them at once, when those instructions are later processed they should see a copy with no uses which can be deleted
In the example below, si-fix-sgpr-copies pass calls moveToVALU for V2S copies in random order. So, BUFFER_LOAD instruction will have VGPRs for its soffset after processing  %8:sreg_32 = COPY %5:vgpr_32 in the worklist. Legalize Operands would be called on the BUFFER_LOAD instruction which would only add a waterfall loop for soffset.

In further iterations, when rsrc would be converted to VGPR, then again LegalizeOperands would be called which would then add waterfall loop for it. 

LegalizeOperands is called on the BUFFER_LOAD two times which is the cause of two waterfall loops in D141030.

So, the solution I think would work is to process the worklist instructions in order and each instruction will be called with legalizeOperands only once. This would make sure both rsrc and soffset are vgprs together once legalizeOperands is called. Then, D141030 logic can add single waterfall loop for them.

  define float @llvm_amdgcn_raw_buffer_load_f32(<4 x i32> %rsrc, i32 %voffset, i32 %soffset) {
    %val = call float @llvm.amdgcn.raw.buffer.load.f32(<4 x i32> %rsrc, i32 %voffset, i32 %soffset, i32 0)
    ret float %val
  }

  Input to si-fix-sgpr-copies pass:

  Function Live Ins: $vgpr0 in %0, $vgpr1 in %1, $vgpr2 in %2, $vgpr3 in %3, $vgpr4 in %4, $vgpr5 in %5
  bb.0 (%ir-block.0):
  liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5
  %5:vgpr_32 = COPY $vgpr5
  %4:vgpr_32 = COPY $vgpr4
  %3:vgpr_32 = COPY $vgpr3
  %2:vgpr_32 = COPY $vgpr2
  %1:vgpr_32 = COPY $vgpr1
  %0:vgpr_32 = COPY $vgpr0
  %9:sgpr_32 = COPY %0:vgpr_32
  %10:sgpr_32 = COPY %1:vgpr_32
  %11:sgpr_32 = COPY %2:vgpr_32
  %12:sgpr_32 = COPY %3:vgpr_32
  %6:sgpr_128 = REG_SEQUENCE %9:sgpr_32, %subreg.sub0, %10:sgpr_32, %subreg.sub1, %11:sgpr_32, %subreg.sub2, %12:sgpr_32, %subreg.sub3
  %8:sreg_32 = COPY %5:vgpr_32
  %7:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN %4:vgpr_32, killed %6:sgpr_128, %8:sreg_32, 0, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 7)
  $vgpr0 = COPY %7:vgpr_32
  SI_RETURN implicit $vgpr0

++++++++++++++++++++++++++++++++++++++++++++++++

   1st iteration
   Top of worklist -> %8:sreg_32 = COPY %5:vgpr_32

   %6:sgpr_128 = REG_SEQUENCE %9:sgpr_32, %subreg.sub0, %10:sgpr_32, %subreg.sub1, %11:sgpr_32, 
   %subreg.sub2, %12:sgpr_32, %subreg.sub3 
   %8:sreg_32 = COPY %5:vgpr_32 
   %7:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN %4:vgpr_32, killed %6:sgpr_128, %8:sreg_32, 0, 0, 0, implicit $exec

++++++++++++++++++++++++++++++++++++++++++++++++
    2nd iteration 
    Top of worklist -> %11:sgpr_32 = COPY %2:vgpr_32

    %6:sgpr_128 = REG_SEQUENCE %9:sgpr_32, %subreg.sub0, %10:sgpr_32, %subreg.sub1, %11:sgpr_32, 
    %subreg.sub2, %12:sgpr_32, %subreg.sub3 
    %8:sreg_32 = IMPLICIT_DEF 
    %7:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN %4:vgpr_32, killed %6:sgpr_128, %5:vgpr_32, 0, 0, 0, implicit $exec

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D143335/new/

https://reviews.llvm.org/D143335