[PATCH] D85517: [Scheduling] Implement a new way to cluster loads/stores
Qing Shan Zhang via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Aug 10 20:45:12 PDT 2020
steven.zhang added inline comments.
================
Comment at: llvm/lib/CodeGen/MachineScheduler.cpp:1607
+ unsigned ClusterLength = 2;
+ unsigned CurrentClusterBytes = MemOpb.Width;
+ if (SUnit2ClusterInfo.count(MemOpa.SU->NodeNum)) {
----------------
foad wrote:
> Shouldn't this be `MemOpa.Width+MemOpb.Width`?
Good catch. I miss this and this is the main reason we see more clustered loads/stores from the tests.
================
Comment at: llvm/test/CodeGen/AMDGPU/callee-special-input-vgprs.ll:627-630
+; FIXEDABI: buffer_store_dword [[K0]], off, s[0:3], 0 offset:4{{$}}
; FIXEDABI: s_movk_i32 s32, 0x400{{$}}
; FIXEDABI: v_mov_b32_e32 [[K1:v[0-9]+]], 0x140
; FIXEDABI: buffer_store_dword [[K1]], off, s[0:3], s32{{$}}
----------------
steven.zhang wrote:
> foad wrote:
> > This looks like a regression because the stores on lines 627 and 630 are no longer clustered. BUT see D85530: I don't think there is any reason for AMDGPU to try to cluster stores, so this may become a non-issue.
> This is the final seq with old implementation. In fact, there is a stack operation in-between these two stores which acts as a barrier for them, and that's why we didn't cluster them. So, this works as expected. But if there is any other concern on this, please let me know.
> ```
> SU(11): BUFFER_STORE_DWORD_OFFEN %13:vgpr_32, %stack.0.alloca, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (volatile store 4 into %ir.alloca, addrspace 5)
> SU(12): ADJCALLSTACKUP 0, 0, implicit-def dead $scc, implicit-def $sgpr32, implicit $sgpr32, implicit $fp_reg, implicit-def $sgpr32, implicit $sgpr32, implicit $fp_reg
> SU(14): BUFFER_STORE_DWORD_OFFSET %14:vgpr_32, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into stack, align 16, addrspace 5)
> ```
>
> Old implementation still cluster them, but appear to be good as we do nothing for the stack operation.
> ```
> Cluster ld/st SU(11) - SU(14)
> Copy Pred SU(13)
> Copy Pred SU(12)
> Copy Pred SU(12)
> Curr cluster length: 2, Curr cluster bytes: 8
> ```
This comment still apply. Old implementation cluster these three stores which is not right.
================
Comment at: llvm/test/CodeGen/AMDGPU/callee-special-input-vgprs.ll:672-674
; FIXEDABI: buffer_store_dword [[K0]], off, s[0:3], s33{{$}}
+; FIXEDABI: v_mov_b32_e32 [[K1:v[0-9]+]], 0x140{{$}}
; FIXEDABI: buffer_store_dword [[K1]], off, s[0:3], s32{{$}}
----------------
steven.zhang wrote:
> foad wrote:
> > As above, these stores are no longer clustered.
> The same reason as above. Old implementation cluster them two stores.
> ```
> SU(10): BUFFER_STORE_DWORD_OFFEN %10:vgpr_32, %stack.0.alloca, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (volatile store 4 into %ir.alloca, addrspace 5)
> SU(11): ADJCALLSTACKUP 0, 0, implicit-def dead $scc, implicit-def $sgpr32, implicit $sgpr32, implicit $sgpr33, implicit-def $sgpr32, implicit $sgpr32, implicit $sgpr33
> SU(13): BUFFER_STORE_DWORD_OFFSET %11:vgpr_32, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into stack, align 16, addrspace 5)
> ```
The same reason as above.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D85517/new/
https://reviews.llvm.org/D85517
More information about the llvm-commits
mailing list