[PATCH] D89618: [AMDGPU] Optimize waitcnt insertion for flat memory operations

Tue Oct 20 15:38:42 PDT 2020

t-tye marked 4 inline comments as done.
t-tye added inline comments.

================
Comment at: llvm/test/CodeGen/AMDGPU/waitcnt.mir:103
+    $vgpr3 = FLAT_LOAD_DWORD $vgpr1_vgpr2, 0, 0, 0, 0, implicit $exec, implicit $flat_scr :: (load 4 from %ir.flat4)
+    $vgpr4 = FLAT_LOAD_DWORD $vgpr1_vgpr2, 0, 0, 0, 0, implicit $exec, implicit $flat_scr :: (load 4 from %ir.global4)
+    $vgpr0 = V_MOV_B32_e32 $vgpr3, implicit $exec
----------------
rampitec wrote:
> t-tye wrote:
> > rampitec wrote:
> > > Can you keep just load from flat here? The other load obscures the result.
> > Add the extra BB3 you suggested.
> > 
> > The waitcnts being generated seem correct from my inspection.
> They seem to be correct, but with two loads per block it is hard to understand which of the loads has actually caused the wait. If you want to keep it this way, add yet another bb.4, but with only a load from flat.
Add a bb.4 that has a single load from flat.

================
Comment at: llvm/test/CodeGen/AMDGPU/waitcnt.mir:66
 # CHECK: FLAT_LOAD_DWORD
-# GFX89: S_WAITCNT 112
+# GFX89: S_WAITCNT 3952
 # CHECK: FLAT_LOAD_DWORDX4
----------------
t-tye wrote:
> rampitec wrote:
> > rampitec wrote:
> > > t-tye wrote:
> > > > rampitec wrote:
> > > > > t-tye wrote:
> > > > > > rampitec wrote:
> > > > > > > That one was not supposed to change? The pointer is flat here.
> > > > > > Yes. Previously it was "s_waitcnt vmcnt(0) lgkmcnt(0)". Now it is "s_waitcnt vmcnt(0)" as the address space of global16 is 1 which is GLOBAL. Therefore there is no need to wait on LGKM.
> > > > > It is not global, it is flat:
> > > > > 
> > > > > 
> > > > > ```
> > > > > <4 x i32>* %flat16
> > > > > ...
> > > > >     $vgpr3_vgpr4_vgpr5_vgpr6 = FLAT_LOAD_DWORDX4 $vgpr7_vgpr8, 0, 0, 0, 0, implicit $exec, implicit $flat_scr :: (load 16 from %ir.flat16)
> > > > > ```
> > > > But isn't this test checking:
> > > > 
> > > >     $vgpr0 = FLAT_LOAD_DWORD $vgpr1_vgpr2, 0, 0, 0, 0, implicit $exec, implicit $flat_scr :: (load 4 from %ir.global4)
> > > >     $vgpr3_vgpr4_vgpr5_vgpr6 = FLAT_LOAD_DWORDX4 $vgpr7_vgpr8, 0, 0, 0, 0, implicit $exec, implicit $flat_scr :: (load 16 from %ir.global16)
> > > > 
> > > > These are referencing global4 and global16 which are:
> > > > 
> > > >   i32 addrspace(1)* %global4,
> > > >   <4 x i32> addrspace(1)* %global16
> > > > 
> > > > Which are both marked as the global (1) not flat (0) address space.
> > > > 
> > > > Am I missing something?
> > > No, it is not. Note it first checks label bb.2. And after it:
> > > 
> > >     $vgpr3_vgpr4_vgpr5_vgpr6 = FLAT_LOAD_DWORDX4 $vgpr7_vgpr8, 0, 0, 0, 0, implicit $exec, implicit $flat_scr :: (load 16 from %ir.flat16)
> > > 
> > > It is flat pointer. Not global.
> > > 
> > > Think about the testcase itself: it is a standalone function (not kernel) taking a generic pointer. You are checking for the question: "is is this DEFINITELY an LDS pointer?" The answer is no, so you say: "this is DEFINITELY NOT an LDS pointer".
> > Or VMEM for that matter.
> I believe the waitcnts are correct, and added the extra test you recommended.
On checking the test the waitcnts do seem correct because the registers being waited on are produced by loads in earlier basic blocks. Those earlier loads are either global, or they are flat but there is intervening waitcnt that satisfies a vmemcnt(0). Add two additional basic blocks to test this better.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D89618/new/

https://reviews.llvm.org/D89618