[PATCH] D85517: [Scheduling] Implement a new way to cluster loads/stores

Mon Aug 10 20:45:12 PDT 2020

steven.zhang added inline comments.

================
Comment at: llvm/lib/CodeGen/MachineScheduler.cpp:1607
+    unsigned ClusterLength = 2;
+    unsigned CurrentClusterBytes = MemOpb.Width;
+    if (SUnit2ClusterInfo.count(MemOpa.SU->NodeNum)) {
----------------
foad wrote:
> Shouldn't this be `MemOpa.Width+MemOpb.Width`?
Good catch. I miss this and this is the main reason we see more clustered loads/stores from the tests.

================
Comment at: llvm/test/CodeGen/AMDGPU/callee-special-input-vgprs.ll:627-630
+; FIXEDABI: buffer_store_dword [[K0]], off, s[0:3], 0 offset:4{{$}}
 ; FIXEDABI: s_movk_i32 s32, 0x400{{$}}
 ; FIXEDABI: v_mov_b32_e32 [[K1:v[0-9]+]], 0x140
 ; FIXEDABI: buffer_store_dword [[K1]], off, s[0:3], s32{{$}}
----------------
steven.zhang wrote:
> foad wrote:
> > This looks like a regression because the stores on lines 627 and 630 are no longer clustered. BUT see D85530: I don't think there is any reason for AMDGPU to try to cluster stores, so this may become a non-issue.
> This is the final seq with old implementation. In fact, there is a stack operation in-between these two stores which acts as a barrier for them, and that's why we didn't cluster them. So, this works as expected. But if there is any other concern on this, please let me know.
> ```
> SU(11):   BUFFER_STORE_DWORD_OFFEN %13:vgpr_32, %stack.0.alloca, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (volatile store 4 into %ir.alloca, addrspace 5)
> SU(12):   ADJCALLSTACKUP 0, 0, implicit-def dead $scc, implicit-def $sgpr32, implicit $sgpr32, implicit $fp_reg, implicit-def $sgpr32, implicit $sgpr32, implicit $fp_reg
> SU(14):   BUFFER_STORE_DWORD_OFFSET %14:vgpr_32, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into stack, align 16, addrspace 5)
> ```
> 
> Old implementation still cluster them, but appear to be good as we do nothing for the stack operation.
> ```
> Cluster ld/st SU(11) - SU(14)
>   Copy Pred SU(13)
>   Copy Pred SU(12)
>   Copy Pred SU(12)
>   Curr cluster length: 2, Curr cluster bytes: 8
> ```
This comment still apply. Old implementation cluster these three stores which is not right.

================
Comment at: llvm/test/CodeGen/AMDGPU/callee-special-input-vgprs.ll:672-674
 ; FIXEDABI: buffer_store_dword [[K0]], off, s[0:3], s33{{$}}
+; FIXEDABI: v_mov_b32_e32 [[K1:v[0-9]+]], 0x140{{$}}
 ; FIXEDABI: buffer_store_dword [[K1]], off, s[0:3], s32{{$}}
----------------
steven.zhang wrote:
> foad wrote:
> > As above, these stores are no longer clustered.
> The same reason as above. Old implementation cluster them two stores.
> ```
> SU(10):   BUFFER_STORE_DWORD_OFFEN %10:vgpr_32, %stack.0.alloca, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (volatile store 4 into %ir.alloca, addrspace 5)
> SU(11):   ADJCALLSTACKUP 0, 0, implicit-def dead $scc, implicit-def $sgpr32, implicit $sgpr32, implicit $sgpr33, implicit-def $sgpr32, implicit $sgpr32, implicit $sgpr33
> SU(13):   BUFFER_STORE_DWORD_OFFSET %11:vgpr_32, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into stack, align 16, addrspace 5)
> ```
The same reason as above.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D85517/new/

https://reviews.llvm.org/D85517