[PATCH] D85517: [Scheduling] Implement a new way to cluster loads/stores

Qing Shan Zhang via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Aug 7 04:57:30 PDT 2020


steven.zhang created this revision.
steven.zhang added reviewers: fhahn, evandro, arsenm, rampitec, foad, kbarton, jsji, PowerPC.
Herald added subscribers: kerbowa, wuzish, javed.absar, hiraditya, tpr, nhaehnle, jvesely, MatzeB.
Herald added a project: LLVM.
steven.zhang requested review of this revision.
Herald added a subscriber: wdng.

Before calling the target hook to determine whether two loads/stores are clusterable, we put them into different groups to avoid fake clustering caused by dependencies. For now, we put loads/stores into the same group if they have the same predecessor, on the assumption that two loads/stores sharing a predecessor are unlikely to have a dependency on each other.

However, one SUnit might have several predecessors, and for now we just pick the first predecessor that has a non-data/non-artificial dependency, which is too arbitrary, and we have been struggling to fix it. D84139 <https://reviews.llvm.org/D84139> D74524 <https://reviews.llvm.org/D74524> D71717 <https://reviews.llvm.org/D71717>
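
For reference, here is a condensed, illustrative sketch of that existing grouping heuristic (simplified from BaseMemOpClusterMutation in llvm/lib/CodeGen/MachineScheduler.cpp; helper and member names are approximate, not the exact code):

  #include "llvm/ADT/DenseMap.h"
  #include "llvm/ADT/SmallVector.h"
  #include "llvm/CodeGen/ScheduleDAGInstrs.h"
  using namespace llvm;

  // Group loads (or stores) by the NodeNum of the *first* non-data,
  // non-artificial predecessor. Picking only the first such predecessor
  // is the arbitrary part described above.
  static void groupMemOpsByFirstChainPred(ScheduleDAGInstrs *DAG, bool IsLoad) {
    DenseMap<unsigned, SmallVector<SUnit *, 4>> Groups;
    for (SUnit &SU : DAG->SUnits) {
      if ((IsLoad && !SU.getInstr()->mayLoad()) ||
          (!IsLoad && !SU.getInstr()->mayStore()))
        continue;
      unsigned ChainPredID = DAG->SUnits.size(); // "no chain predecessor"
      for (const SDep &Pred : SU.Preds) {
        if (Pred.isCtrl() && !Pred.isArtificial()) {
          ChainPredID = Pred.getSUnit()->NodeNum;
          break;
        }
      }
      Groups[ChainPredID].push_back(&SU);
    }
    // Only loads/stores inside the same group are later handed to the
    // target hook (shouldClusterMemOps) as clustering candidates.
    (void)Groups;
  }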

This algorithm becomes completely broken if the mutation is used post-RA, because we then add more dependencies (anti, output). See this example:

             +----------------+
             |   y = add      +<----+
             +----------------+     |
                                    |
             +----------------+     |
        +---->   x = add      |     |
        |    +----------------+     |output
  output|                           |
        |    +----------------+     |
        +----+  x = load A    |     |
             +----------------+     |
                                    |
                                    |
             +----------------+     |
             |  y = load A,4  +-----+
             +----------------+

We won't cluster these two loads because each has an output dependency with a different instruction, so we will put them into different groups. Luckily, no one has used load/store clustering post-RA until now. But we need it.

Even for the pre-RA mutation, there are still problems, e.g.:

        +----------+
     +--+   Load A |                   +----------+
     |  +-----^----+                   |  Instr   |
     |        |                        +--+--^----+
     |        |         Dep               |  |
     |        +---------------------------+  |
  Mem|                                       |
     |                  +----------+  Data   |
     |                  |  Load B  +---------+
     |                  +----+-----+
     |          Mem          |
     |     +-----------------+
     |     |
     |     |
  +--v-----v--+
  |   Store   |
  +-----------+

Load A and Load B both have a memory dependency on Store, so for now they will be put into the same group, regardless of whether the data that Load B depends on itself has a dependency on Load A. If it does, we cannot cluster them, as there is a barrier instruction in between.

So, I am proposing a better implementation. (There is no perfect grouping algorithm as far as I know, since this is an NP-complete problem.)

- First, collect all the loads/stores that have memory info, to reduce the complexity.
- Sort these loads/stores so that we can stop the search as early as possible.
- For each load/store, search in sorted order for the first instruction it has no dependency on, and check whether the pair can be clustered (see the sketch after this list).
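
A minimal sketch of that proposed walk follows. The MemOpInfo record, the reaches() reachability test, and the byte-count computation are simplified stand-ins for what the patch actually does, and the exact target-hook usage may differ in detail:

  #include "llvm/ADT/ArrayRef.h"
  #include "llvm/ADT/SmallVector.h"
  #include "llvm/CodeGen/ScheduleDAGInstrs.h"
  #include "llvm/CodeGen/TargetInstrInfo.h"
  using namespace llvm;

  // Simplified per-memop record: the SUnit plus the base operands, offset
  // and width used for sorting and for the target query.
  struct MemOpInfo {
    SUnit *SU;
    SmallVector<const MachineOperand *, 4> BaseOps;
    int64_t Offset;
    unsigned Width;
  };

  // Naive DFS reachability stand-in; the real patch would answer this
  // far more cheaply via the scheduler's topological order.
  static bool reaches(const SUnit *From, const SUnit *To) {
    if (From == To)
      return true;
    for (const SDep &Succ : From->Succs)
      if (reaches(Succ.getSUnit(), To))
        return true;
    return false;
  }

  // MemOps is assumed to be pre-sorted (e.g. by base operands and offset),
  // so clusterable candidates are adjacent and the inner scan stops early.
  static void clusterSortedMemOps(ScheduleDAGInstrs *DAG,
                                  ArrayRef<MemOpInfo> MemOps) {
    for (unsigned I = 0, E = MemOps.size(); I != E; ++I) {
      for (unsigned J = I + 1; J != E; ++J) {
        SUnit *SUa = MemOps[I].SU, *SUb = MemOps[J].SU;
        // Skip pairs connected by a dependency chain in either direction;
        // clustering them would only add a fake edge (or a cycle).
        if (reaches(SUa, SUb) || reaches(SUb, SUa))
          continue;
        // First non-dependent candidate in sorted order: ask the target
        // whether the pair is worth clustering, then stop seeking.
        if (DAG->TII->shouldClusterMemOps(MemOps[I].BaseOps, MemOps[J].BaseOps,
                                          /*NumLoads=*/2,
                                          MemOps[I].Width + MemOps[J].Width))
          DAG->addEdge(SUb, SDep(SUa, SDep::Cluster));
        break;
      }
    }
  }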

Our internal tests show a great improvement in load/store clustering with this patch. Any thoughts?

For the test changes, I do see some improvement in the stats data, but I need confirmation from AMDGPU experts. Notice that the old implementation adds fake cluster edges which hurt the scheduler, e.g.:

CodeGen/AMDGPU/callee-special-input-vgprs.ll

  Cluster ld/st SU(11) - SU(14)
    Copy Pred SU(13)
    Copy Pred SU(12)
    Copy Pred SU(12)
    Curr cluster length: 2, Curr cluster bytes: 8
  Cluster ld/st SU(14) - SU(16)
    Copy Pred SU(15)
    Copy Pred SU(12)


And this is the final scheduler output with the old implementation (SU(11), SU(14) and SU(16) are not scheduled together due to the dependency). There are many similar cases.

  SU(11):   BUFFER_STORE_DWORD_OFFEN %13:vgpr_32, %stack.0.alloca, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (volatile store 4 into %ir.alloca, addrspace 5)
  SU(12):   ADJCALLSTACKUP 0, 0, implicit-def dead $scc, implicit-def $sgpr32, implicit $sgpr32, implicit $fp_reg, implicit-def $sgpr32, implicit $sgpr32, implicit $fp_reg
  SU(14):   BUFFER_STORE_DWORD_OFFSET %14:vgpr_32, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into stack, align 16, addrspace 5)
  SU(15):   %15:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN %stack.0.alloca, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (load 4 from %stack.0.alloca, addrspace 5)
  SU(17):   %16:sreg_64 = SI_PC_ADD_REL_OFFSET target-flags(amdgpu-gotprel32-lo) @too_many_args_use_workitem_id_x_byval + 4, target-flags(amdgpu-gotprel32-hi) @too_many_args_use_workitem_id_x_byval + 4, implicit-def dead $scc
  SU(18):   %17:sreg_64_xexec = S_LOAD_DWORDX2_IMM %16:sreg_64, 0, 0, 0 :: (dereferenceable invariant load 8 from got, addrspace 4)
  SU(19):   %20:vgpr_32 = V_LSHLREV_B32_e32 20, %2:vgpr_32(s32), implicit $exec
  SU(20):   %22:vgpr_32 = V_LSHLREV_B32_e32 10, %1:vgpr_32(s32), implicit $exec
  SU(21):   %23:vgpr_32 = V_OR_B32_e32 %0:vgpr_32(s32), %22:vgpr_32, implicit $exec
  SU(22):   %24:vgpr_32 = V_OR_B32_e32 %23:vgpr_32, %20:vgpr_32, implicit $exec
  SU(16):   BUFFER_STORE_DWORD_OFFSET %15:vgpr_32, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 4, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into stack + 4, addrspace 5)


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D85517

Files:
  llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h
  llvm/lib/CodeGen/MachineScheduler.cpp
  llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.div.fmas.ll
  llvm/test/CodeGen/AMDGPU/callee-special-input-vgprs.ll
  llvm/test/CodeGen/AMDGPU/copy-illegal-type.ll
  llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll
  llvm/test/CodeGen/AMDGPU/global-saddr.ll
  llvm/test/CodeGen/AMDGPU/llvm.round.f64.ll
  llvm/test/CodeGen/AMDGPU/sgpr-control-flow.ll
  llvm/test/CodeGen/AMDGPU/shift-i128.ll
  llvm/test/CodeGen/AMDGPU/stack-realign.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D85517.283866.patch
Type: text/x-patch
Size: 28000 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20200807/4307ec07/attachment.bin>

