[PATCH] D85517: [Scheduling] Implement a new way to cluster loads/stores

Qing Shan Zhang via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Sun Aug 9 22:19:54 PDT 2020


steven.zhang added a comment.

Thank you for the review. I have added all the necessary comments to the test changes to explain what happens inside. Everything works as expected from my point of view.



================
Comment at: llvm/lib/CodeGen/MachineScheduler.cpp:1620-1621
     SUnit *SUb = MemOpb.SU;
     if (SUa->NodeNum > SUb->NodeNum)
       std::swap(SUa, SUb);
 
----------------
foad wrote:
> This is supposed to prefer to keep loads/stores in their original code order. From the AMDGPU test case diffs (e.g. test/CodeGen/AMDGPU/global-saddr.ll) it looks like a lot of the clusters have been reordered. Do you have any idea why?
Good point.
The old implementation only clusters two loads (SU(8) and SU(9)); SU(12) isn't clustered with them, so the order looks right.
```
SU(8):   %16:vreg_128 = GLOBAL_LOAD_DWORDX4_SADDR %108:vreg_64, %4.sub2_sub3:sgpr_128, 16, 0, 0, 0, implicit $exec, implicit $exec :: (load 16 from %ir.4 + 48, align 8, addrspace 1)
SU(9):   %21:vreg_128 = GLOBAL_LOAD_DWORDX4_SADDR %108:vreg_64, %4.sub2_sub3:sgpr_128, 0, 0, 0, 0, implicit $exec, implicit $exec :: (load 16 from %ir.4 + 32, align 8, addrspace 1)
SU(12):   %44:vreg_64 = GLOBAL_LOAD_DWORDX2_SADDR %108:vreg_64, %4.sub2_sub3:sgpr_128, 32, 0, 0, 0, implicit $exec, implicit $exec :: (load 8 from %ir.ptr3, addrspace 1)
```
If we are clustering more than two SUs, I think the swap logic isn't right here, as it cannot keep them sorted. That is why the new implementation (which now clusters three SUs) produces this:
```
SU(12):   %44:vreg_64 = GLOBAL_LOAD_DWORDX2_SADDR %108:vreg_64, %4.sub2_sub3:sgpr_128, 32, 0, 0, 0, implicit $exec, implicit $exec :: (load 8 from %ir.ptr3, addrspace 1)
SU(8):   %16:vreg_128 = GLOBAL_LOAD_DWORDX4_SADDR %108:vreg_64, %4.sub2_sub3:sgpr_128, 16, 0, 0, 0, implicit $exec, implicit $exec :: (load 16 from %ir.4 + 48, align 8, addrspace 1)
SU(9):   %21:vreg_128 = GLOBAL_LOAD_DWORDX4_SADDR %108:vreg_64, %4.sub2_sub3:sgpr_128, 0, 0, 0, 0, implicit $exec, implicit $exec :: (load 16 from %ir.4 + 32, align 8, addrspace 1)

Cluster ld/st SU(8) - SU(9)
  Copy Succ SU(16)
  Copy Succ SU(15)
  Copy Succ SU(14)
  Copy Succ SU(13)
  Copy Succ SU(29)
  Curr cluster length: 2, Curr cluster bytes: 16
Cluster ld/st SU(8) - SU(12)
  Copy Succ SU(16)
  Copy Succ SU(15)
  Copy Succ SU(14)
  Copy Succ SU(13)
  Copy Succ SU(29)
  Copy Succ SU(9)
  Curr cluster length: 3, Curr cluster bytes: 24
```

SU(8) and SU(9) are clustered first, and then we try to cluster SU(8) and SU(12). SU(12) cannot become the successor of SU(8) because SU(8) and SU(9) are already clustered. So the only available sequences are:
```
SU(12) SU(8) SU(9) or
SU(9) SU(8) SU(12) (without swapping; this is the offset order)
```
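To make the limitation concrete, here is a toy model of the situation (an illustrative sketch, not LLVM's actual data structures or code): each SU can carry at most one cluster successor and one cluster predecessor, so once the pair (8, 9) is linked, linking (8, 12) can only place SU(12) in front of SU(8), which breaks the offset order.

```python
# Toy model: cluster pairs are linked one at a time, swapping so the
# lower NodeNum comes first, as in the snippet from MachineScheduler.cpp.

class SU:
    def __init__(self, num, offset):
        self.num, self.offset = num, offset
        self.cluster_succ = None
        self.cluster_pred = None

def cluster(a, b):
    # Prefer NodeNum order, mirroring the std::swap above.
    if a.num > b.num:
        a, b = b, a
    if a.cluster_succ is None and b.cluster_pred is None:
        # Normal case: a -> b.
        a.cluster_succ, b.cluster_pred = b, a
    elif b.cluster_succ is None and a.cluster_pred is None:
        # a's successor slot is already taken, so the only remaining
        # link is b -> a, reversing the intended order.
        b.cluster_succ, a.cluster_pred = a, b

def chain_order(sus):
    # Walk the chain from the SU that has no cluster predecessor.
    head = next(s for s in sus if s.cluster_pred is None and s.cluster_succ)
    order = []
    while head:
        order.append(head.num)
        head = head.cluster_succ
    return order

# Offsets taken from the MIR above: SU(9) at 0, SU(8) at 16, SU(12) at 32.
su8, su9, su12 = SU(8, 16), SU(9, 0), SU(12, 32)
cluster(su8, su9)    # first pair: 8 -> 9
cluster(su8, su12)   # 8's successor slot is taken, so we get 12 -> 8
print(chain_order([su8, su9, su12]))  # [12, 8, 9], not offset order [9, 8, 12]
```

This reproduces the SU(12) SU(8) SU(9) sequence seen in the log: pairwise swapping can order any two SUs, but it cannot sort a chain of three once earlier links are fixed.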




================
Comment at: llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.div.fmas.ll:531
+; GFX7-NEXT:    s_load_dwordx8 s[4:11], s[0:1], 0x9
+; GFX7-NEXT:    s_load_dword s0, s[0:1], 0x11
 ; GFX7-NEXT:    s_waitcnt lgkmcnt(0)
----------------
According to the log, s_load_dwordx8 and s_load_dword are not a cluster pair, and the new order is aligned with the node order.
Old:
```
SU(0):   %5:sreg_64 = COPY $sgpr0_sgpr1
SU(2):   %30:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %5:sreg_64, 17, 0, 0 :: (dereferenceable invariant load 4 from %ir.d.kernarg.offset.align.down.cast, addrspace 4)
SU(1):   %12:sgpr_256 = S_LOAD_DWORDX8_IMM %5:sreg_64, 9, 0, 0 :: (dereferenceable invariant load 32 from %ir.1, align 4, addrspace 4)
SU(3):   %33:vreg_64 = COPY %12.sub2_sub3:sgpr_256
```
New:
```
SU(0):   %5:sreg_64 = COPY $sgpr0_sgpr1
SU(1):   %12:sgpr_256 = S_LOAD_DWORDX8_IMM %5:sreg_64, 9, 0, 0 :: (dereferenceable invariant load 32 from %ir.1, align 4, addrspace 4)
SU(2):   %30:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %5:sreg_64, 17, 0, 0 :: (dereferenceable invariant load 4 from %ir.d.kernarg.offset.align.down.cast, addrspace 4)
SU(3):   %33:vreg_64 = COPY %12.sub2_sub3:sgpr_256
```


================
Comment at: llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.div.fmas.ll:551
+; GFX8-NEXT:    s_load_dwordx8 s[4:11], s[0:1], 0x24
+; GFX8-NEXT:    s_load_dword s0, s[0:1], 0x44
 ; GFX8-NEXT:    s_waitcnt lgkmcnt(0)
----------------
The old implementation didn't cluster these two loads, though they happened to be scheduled together by luck.
```
SU(0):   %5:sreg_64 = COPY $sgpr0_sgpr1
SU(2):   %30:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %5:sreg_64, 68, 0, 0 :: (dereferenceable invariant load 4 from %ir.d.kernarg.offset.align.down.cast, addrspace 4)
SU(1):   %12:sgpr_256 = S_LOAD_DWORDX8_IMM %5:sreg_64, 36, 0, 0 :: (dereferenceable invariant load 32 from %ir.1, align 4, addrspace 4)
SU(3):   %33:vreg_64 = COPY %12.sub2_sub3:sgpr_256
```
The new implementation treats them as a cluster pair and schedules them in node order.
```
Cluster ld/st SU(1) - SU(2)
  Copy Succ SU(10)
  Copy Succ SU(5)
  Copy Succ SU(4)
  Copy Succ SU(3)
  Curr cluster length: 2, Curr cluster bytes: 4

*** Final schedule for %bb.0 ***
SU(0):   %5:sreg_64 = COPY $sgpr0_sgpr1
SU(1):   %12:sgpr_256 = S_LOAD_DWORDX8_IMM %5:sreg_64, 36, 0, 0 :: (dereferenceable invariant load 32 from %ir.1, align 4, addrspace 4)
SU(2):   %30:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %5:sreg_64, 68, 0, 0 :: (dereferenceable invariant load 4 from %ir.d.kernarg.offset.align.down.cast, addrspace 4)
SU(3):   %33:vreg_64 = COPY %12.sub2_sub3:sgpr_256
```


================
Comment at: llvm/test/CodeGen/AMDGPU/callee-special-input-vgprs.ll:627-630
+; FIXEDABI: buffer_store_dword [[K0]], off, s[0:3], 0 offset:4{{$}}
 ; FIXEDABI: s_movk_i32 s32, 0x400{{$}}
 ; FIXEDABI: v_mov_b32_e32 [[K1:v[0-9]+]], 0x140
 ; FIXEDABI: buffer_store_dword [[K1]], off, s[0:3], s32{{$}}
----------------
foad wrote:
> This looks like a regression because the stores on lines 627 and 630 are no longer clustered. BUT see D85530: I don't think there is any reason for AMDGPU to try to cluster stores, so this may become a non-issue.
This is the final sequence with the old implementation. In fact, there is a stack operation in between these two stores which acts as a barrier, and that's why we don't cluster them. So this works as expected. If there is any other concern about this, please let me know.
```
SU(11):   BUFFER_STORE_DWORD_OFFEN %13:vgpr_32, %stack.0.alloca, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (volatile store 4 into %ir.alloca, addrspace 5)
SU(12):   ADJCALLSTACKUP 0, 0, implicit-def dead $scc, implicit-def $sgpr32, implicit $sgpr32, implicit $fp_reg, implicit-def $sgpr32, implicit $sgpr32, implicit $fp_reg
SU(14):   BUFFER_STORE_DWORD_OFFSET %14:vgpr_32, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into stack, align 16, addrspace 5)
```

The old implementation still clusters them, but the result happens to look fine because we do nothing for the stack operation.
```
Cluster ld/st SU(11) - SU(14)
  Copy Pred SU(13)
  Copy Pred SU(12)
  Copy Pred SU(12)
  Curr cluster length: 2, Curr cluster bytes: 8
```
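The barrier behavior can be sketched as follows (a toy model under my reading of the explanation above, not LLVM's actual implementation): when collecting candidate memory operations, an instruction that may act as a barrier ends the current group, so stores on opposite sides of it are never considered for clustering with each other.

```python
# Toy model: split a linear instruction sequence into clustering groups,
# cutting at barrier-like instructions (here, the stack adjustment).

def split_at_barriers(instrs, is_barrier):
    groups, current = [], []
    for instr in instrs:
        if is_barrier(instr):
            # A barrier closes the current group; nothing before it can
            # cluster with anything after it.
            if current:
                groups.append(current)
            current = []
        else:
            current.append(instr)
    if current:
        groups.append(current)
    return groups

seq = ["SU(11) store", "SU(12) ADJCALLSTACKUP", "SU(14) store"]
print(split_at_barriers(seq, lambda i: "ADJCALLSTACKUP" in i))
# -> [['SU(11) store'], ['SU(14) store']]
```

The two stores land in different groups, so the new implementation never forms a cluster pair from them, matching the test diff.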


================
Comment at: llvm/test/CodeGen/AMDGPU/callee-special-input-vgprs.ll:672-674
 ; FIXEDABI: buffer_store_dword [[K0]], off, s[0:3], s33{{$}}
+; FIXEDABI: v_mov_b32_e32 [[K1:v[0-9]+]], 0x140{{$}}
 ; FIXEDABI: buffer_store_dword [[K1]], off, s[0:3], s32{{$}}
----------------
foad wrote:
> As above, these stores are no longer clustered.
Same reason as above. The old implementation clusters these two stores.
```
SU(10):   BUFFER_STORE_DWORD_OFFEN %10:vgpr_32, %stack.0.alloca, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (volatile store 4 into %ir.alloca, addrspace 5)
SU(11):   ADJCALLSTACKUP 0, 0, implicit-def dead $scc, implicit-def $sgpr32, implicit $sgpr32, implicit $sgpr33, implicit-def $sgpr32, implicit $sgpr32, implicit $sgpr33
SU(13):   BUFFER_STORE_DWORD_OFFSET %11:vgpr_32, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into stack, align 16, addrspace 5)
```


================
Comment at: llvm/test/CodeGen/AMDGPU/global-saddr.ll:50
 ; GFX9: global_load_dwordx2 v{{\[[0-9]+:[0-9]+\]}}, v{{\[[0-9]+:[0-9]+\]}}, s[{{[0-9]+}}:{{[0-9]+}}] offset:4088{{$}}
+; GFX9: global_load_dwordx4 v{{\[[0-9]+:[0-9]+\]}}, v{{\[[0-9]+:[0-9]+\]}}, s[{{[0-9]+}}:{{[0-9]+}}] offset:2040{{$}}
   %gep11 = getelementptr inbounds i64, i64 addrspace(1)* %gep, i64 511
----------------
See the reason I explained above. This exposes an issue in our swap logic when more than two SUs are clustered.


================
Comment at: llvm/test/CodeGen/AMDGPU/llvm.round.f64.ll:241
 ; SI-NEXT:    s_load_dwordx2 s[4:5], s[0:1], 0x9
+; SI-NEXT:    s_load_dwordx8 s[8:15], s[0:1], 0x11
 ; SI-NEXT:    s_mov_b32 s6, -1
----------------
With the old implementation, SU(1) and SU(2) are not clustered.
```
SU(1):   %10.sub0_sub1:sgpr_128 = S_LOAD_DWORDX2_IMM %1:sgpr_64(p4), 9, 0, 0 :: (dereferenceable invariant load 8 from %ir.out.kernarg.offset.cast, align 4, addrspace 4)
  # preds left       : 1
  # succs left       : 2
  # rdefs left       : 0
  Latency            : 5
  Depth              : 1
  Height             : 5
  Predecessors:
    SU(0): Data Latency=1 Reg=%1
  Successors:
    SU(104): Data Latency=5 Reg=%10
    SU(103): Data Latency=5 Reg=%10
  Pressure Diff      : SReg_32 -2
  Single Issue       : false;
```
The new implementation clusters them and sorts them in node order.
```
SU(1):   %10.sub0_sub1:sgpr_128 = S_LOAD_DWORDX2_IMM %1:sgpr_64(p4), 9, 0, 0 :: (dereferenceable invariant load 8 from %ir.out.kernarg.offset.cast, align 4, addrspace 4)
  # preds left       : 1
  # succs left       : 2
  # weak succs left  : 1
  # rdefs left       : 0
  Latency            : 5
  Depth              : 1
  Height             : 17
  Predecessors:
    SU(0): Data Latency=1 Reg=%1
  Successors:
    SU(104): Data Latency=5 Reg=%10
    SU(103): Data Latency=5 Reg=%10
    SU(2): Ord  Latency=0 Cluster
  Pressure Diff      : SReg_32 -2
  Single Issue       : false;
```
```
*** Final schedule for %bb.0 ***
SU(0):   %1:sgpr_64(p4) = COPY $sgpr0_sgpr1
SU(3):   undef %10.sub3:sgpr_128 = S_MOV_B32 61440
SU(4):   %10.sub2:sgpr_128 = S_MOV_B32 -1
SU(6):   %13:sreg_32 = S_MOV_B32 -1023
SU(1):   %10.sub0_sub1:sgpr_128 = S_LOAD_DWORDX2_IMM %1:sgpr_64(p4), 9, 0, 0 :: (dereferenceable invariant load 8 from %ir.out.kernarg.offset.cast, align 4, addrspace 4)
SU(2):   %5:sgpr_256 = S_LOAD_DWORDX8_IMM %1:sgpr_64(p4), 17, 0, 0 :: (dereferenceable invariant load 32 from %ir.in.kernarg.offset.cast, align 4, addrspace 4)
```


================
Comment at: llvm/test/CodeGen/AMDGPU/sgpr-control-flow.ll:14
 ; SI:       ; %bb.0: ; %entry
-; SI-NEXT:    s_load_dwordx2 s[4:5], s[0:1], 0x9
-; SI-NEXT:    s_load_dwordx4 s[8:11], s[0:1], 0xb
-; SI-NEXT:    s_load_dword s0, s[0:1], 0xf
+; SI-NEXT:    s_load_dwordx4 s[4:7], s[0:1], 0xb
+; SI-NEXT:    s_load_dword s2, s[0:1], 0xf
----------------
The old implementation only clusters SU(1) and SU(2):
```
Cluster ld/st SU(1) - SU(2)
  Copy Succ SU(4294967295)
  Curr cluster length: 2, Curr cluster bytes: 24

SU(0):   %10:sgpr_64(p4) = COPY $sgpr0_sgpr1
SU(1):   undef %36.sub0_sub1:sgpr_128 = S_LOAD_DWORDX2_IMM %10:sgpr_64(p4), 9, 0, 0 :: (dereferenceable invariant load 8 from %ir.out.kernarg.offset.cast, align 4, addrspace 4)
SU(2):   %16:sgpr_128 = S_LOAD_DWORDX4_IMM %10:sgpr_64(p4), 11, 0, 0 :: (dereferenceable invariant load 16 from %ir.0, align 4, addrspace 4)
SU(4):   S_CMP_LG_U32 %16.sub0:sgpr_128, 0, implicit-def $scc
SU(3):   %17:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %10:sgpr_64(p4), 15, 0, 0 :: (dereferenceable invariant load 4 from %ir.0 + 16, addrspace 4)
```

The new implementation clusters three SUs:
```
Cluster ld/st SU(1) - SU(2)
  Copy Succ SU(4294967295)
  Curr cluster length: 2, Curr cluster bytes: 16
Cluster ld/st SU(2) - SU(3)
  Copy Succ SU(4)
  Copy Succ SU(4294967295)
  Curr cluster length: 3, Curr cluster bytes: 20


*** Final schedule for %bb.0 ***
SU(0):   %10:sgpr_64(p4) = COPY $sgpr0_sgpr1
SU(2):   %16:sgpr_128 = S_LOAD_DWORDX4_IMM %10:sgpr_64(p4), 11, 0, 0 :: (dereferenceable invariant load 16 from %ir.0, align 4, addrspace 4)
SU(3):   %17:sreg_32_xm0_xexec = S_LOAD_DWORD_IMM %10:sgpr_64(p4), 15, 0, 0 :: (dereferenceable invariant load 4 from %ir.0 + 16, addrspace 4)
SU(4):   S_CMP_LG_U32 %16.sub0:sgpr_128, 0, implicit-def $scc
SU(1):   undef %36.sub0_sub1:sgpr_128 = S_LOAD_DWORDX2_IMM %10:sgpr_64(p4), 9, 0, 0 :: (dereferenceable invariant load 8 from %ir.out.kernarg.offset.cast, align 4, addrspace 4)
```

The reason why we don't get SU(1) SU(2) SU(3) is a bit tricky and is a limitation of the scheduler. I can explain more if needed.


================
Comment at: llvm/test/CodeGen/AMDGPU/shift-i128.ll:450
 ; GCN-NEXT:    s_load_dwordx8 s[8:15], s[4:5], 0x0
+; GCN-NEXT:    s_load_dwordx8 s[16:23], s[4:5], 0x8
 ; GCN-NEXT:    v_mov_b32_e32 v10, 16
----------------
The old implementation didn't cluster these two loads, while the new implementation does, and they are sorted in node order.
Old implementation:
```
SU(1):   %5:sgpr_256 = S_LOAD_DWORDX8_IMM %2:sgpr_64(p4), 0, 0, 0 :: (dereferenceable invariant load 32 from %ir.lhs.kernarg.offset.cast, align 16, addrspace 4)
  # preds left       : 1
  # succs left       : 12
  # rdefs left       : 0
  Latency            : 5
  Depth              : 1
  Height             : 11
  Predecessors:
    SU(0): Data Latency=1 Reg=%2
  Successors:
    SU(51): Data Latency=5 Reg=%5
    SU(46): Data Latency=5 Reg=%5
    SU(44): Data Latency=5 Reg=%5
    SU(39): Data Latency=5 Reg=%5
    SU(32): Data Latency=5 Reg=%5
    SU(31): Data Latency=5 Reg=%5
    SU(29): Data Latency=5 Reg=%5
    SU(23): Data Latency=5 Reg=%5
    SU(18): Data Latency=5 Reg=%5
    SU(11): Data Latency=5 Reg=%5
    SU(10): Data Latency=5 Reg=%5
    SU(8): Data Latency=5 Reg=%5
  Pressure Diff      : SReg_32 -
```
New Implementation:
```
Cluster ld/st SU(1) - SU(2)
  Copy Succ SU(51)
  Copy Succ SU(46)
  Copy Succ SU(44)
  Copy Succ SU(39)
  Copy Succ SU(32)
  Copy Succ SU(31)
  Copy Succ SU(29)
  Copy Succ SU(23)
  Copy Succ SU(18)
  Copy Succ SU(11)
  Copy Succ SU(10)
  Copy Succ SU(8)
  Curr cluster length: 2, Curr cluster bytes: 32

SU(1):   %5:sgpr_256 = S_LOAD_DWORDX8_IMM %2:sgpr_64(p4), 0, 0, 0 :: (dereferenceable invariant load 32 from %ir.lhs.kernarg.offset.cast, align 16, addrspace 4)
SU(2):   %6:sgpr_256 = S_LOAD_DWORDX8_IMM %2:sgpr_64(p4), 8, 0, 0 :: (dereferenceable invariant load 32 from %ir.rhs.kernarg.offset.cast, align 16, addrspace 4)
```


================
Comment at: llvm/test/CodeGen/AMDGPU/shift-i128.ll:520
 ; GCN-NEXT:    s_load_dwordx8 s[8:15], s[4:5], 0x0
+; GCN-NEXT:    s_load_dwordx8 s[16:23], s[4:5], 0x8
 ; GCN-NEXT:    v_mov_b32_e32 v10, 16
----------------
Same reason as above.


================
Comment at: llvm/test/CodeGen/AMDGPU/shift-i128.ll:590
 ; GCN-NEXT:    s_load_dwordx8 s[8:15], s[4:5], 0x0
+; GCN-NEXT:    s_load_dwordx8 s[16:23], s[4:5], 0x8
 ; GCN-NEXT:    v_mov_b32_e32 v8, 0
----------------
Same reason as above.


================
Comment at: llvm/test/CodeGen/AMDGPU/stack-realign.ll:167-171
+; GCN: buffer_store_dword v32, off, s[0:3], s33 offset:1024
 ; GCN-NEXT: buffer_load_dword v{{[0-9]+}}, off, s[0:3], s34
 ; GCN-NEXT: s_add_u32 s32, s32, 0x30000
 
 ; GCN: buffer_store_dword v{{[0-9]+}}, off, s[0:3], s32
----------------
foad wrote:
> The stores are no longer clustered.
Same reason as before. There is an ADJCALLSTACKUP in between these two stores, so we don't cluster them.

```
SU(35):   BUFFER_STORE_DWORD_OFFEN %35:vgpr_32, %stack.0.temp, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (volatile store 4 into %ir.temp, align 1024, addrspace 5)
SU(36):   ADJCALLSTACKUP 0, 0, implicit-def dead $scc, implicit-def $sgpr32, implicit $sgpr32, implicit $sgpr33, implicit-def $sgpr32, implicit $sgpr32, implicit $sgpr33
SU(37):   BUFFER_STORE_DWORD_OFFSET %34:vgpr_32, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (store 4 into stack, align 16, addrspace 5)
```
And this is the final assembly for the old implementation. They are in fact not scheduled together.
```
        buffer_store_dword v33, off, s[0:3], s33 offset:1024
        s_waitcnt vmcnt(1) lgkmcnt(0)
        buffer_store_dword v32, off, s[0:3], s32
```


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D85517/new/

https://reviews.llvm.org/D85517


