[PATCH] D73880: AMDGPU/GlobalISel: Fix move s.buffer.load to VALU

Nicolai Hähnle via Phabricator via llvm-commits <llvm-commits at lists.llvm.org>
Fri Feb 7 01:46:58 PST 2020


nhaehnle accepted this revision.
nhaehnle added a comment.
This revision is now accepted and ready to land.

One more comment, but apart from that LGTM.



================
Comment at: llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp:1375-1377
+  // Use the alignment to ensure that the required offsets will fit into the
+  // immediate offsets.
+  const unsigned Align = NumLoads > 1 ? 16 * NumLoads : 4;
----------------
arsenm wrote:
> arsenm wrote:
> > nhaehnle wrote:
> > > What's the justification for the alignment in the `NumLoads > 1` case?
> > This ensures the last load's offset will fit in the immediate field
> Really the align 4 in the 1 case isn't necessary
Okay, that makes sense. Thanks.
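
To make that concrete, here is a minimal C++ sketch of the reasoning (invented names, not the actual splitting code in AMDGPURegisterBankInfo.cpp): with the combined offset aligned down to 16 * NumLoads, every 16-byte piece keeps a small immediate.
```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical illustration only: split a combined byte offset into an SGPR
// base plus the immediate offset used by the I-th 16-byte piece of the
// widened load.
std::pair<int64_t, int64_t> splitForPiece(int64_t CombinedOffset,
                                          unsigned NumLoads, unsigned I) {
  assert(CombinedOffset >= 0 && NumLoads >= 1 && I < NumLoads);
  const int64_t Align = NumLoads > 1 ? 16 * NumLoads : 4;
  int64_t Base = (CombinedOffset / Align) * Align; // carried in the SGPR offset
  int64_t Imm = (CombinedOffset - Base) + 16 * I;  // per-piece immediate
  // Remainder < Align and 16 * I <= 16 * (NumLoads - 1), so for the small
  // NumLoads produced here the immediate stays far below the 12-bit MUBUF
  // offset limit of 4096.
  assert(Imm < 4096 && "immediate offset out of range");
  return {Base, Imm};
}
```
For example, splitForPiece(4100, 4, 3) yields base 4096 and immediate 4 + 48 = 52, which fits even though the unsplit offset of the last piece (4148) would not.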


================
Comment at: llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn-s-buffer-load.mir:79
     ; CHECK: [[BUILD_VECTOR:%[0-9]+]]:sgpr(<4 x s32>) = G_BUILD_VECTOR [[V_READFIRSTLANE_B32_]](s32), [[V_READFIRSTLANE_B32_1]](s32), [[V_READFIRSTLANE_B32_2]](s32), [[V_READFIRSTLANE_B32_3]](s32)
-    ; CHECK: [[INT:%[0-9]+]]:sgpr(<4 x s32>) = G_INTRINSIC intrinsic(@llvm.amdgcn.s.buffer.load), [[BUILD_VECTOR]](<4 x s32>), [[COPY1]](s32), 0
+    ; CHECK: [[AMDGPU_BUFFER_LOAD:%[0-9]+]]:vgpr(<4 x s32>) = G_AMDGPU_BUFFER_LOAD [[BUILD_VECTOR]](<4 x s32>), [[C1]](s32), [[COPY2]], [[C]], 0, 0, 0 :: (dereferenceable invariant load 16, align 4)
     ; CHECK: [[S_AND_SAVEEXEC_B64_:%[0-9]+]]:sreg_64_xexec = S_AND_SAVEEXEC_B64 killed [[S_AND_B64_]], implicit-def $exec, implicit-def $scc, implicit $exec
----------------
arsenm wrote:
> nhaehnle wrote:
> > This should still be an s_buffer_load...
> How can this be a scalar result with a vector resource?
It's not a vector resource anymore after waterfalling...

Okay, so admittedly it's not entirely clear which way ultimately ends up with better code in practice. One case is a loop with
```
buffer_load_dwordx4 v[0:3], ...
```
in it. The other case is a loop with:
```
s_buffer_load_dwordx4 s[0:3], ...
s_waitcnt lgkmcnt(0)
v_mov_b32 v0, s0
v_mov_b32 v1, s1
v_mov_b32 v2, s2
v_mov_b32 v3, s3
```
The second one would likely be better if it has better K$ behavior, e.g. if the resource is often dynamically uniform in practice. That is hard to predict... but it should be a conscious decision. It's not a high priority for graphics at the moment, but at the very least, please take note of it in a comment in the relevant part of the code.
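
For instance, a note along these lines near the waterfall lowering in AMDGPURegisterBankInfo.cpp would be enough (suggested wording only, nothing that exists in the tree):
```cpp
// Suggested comment, not committed code: after waterfalling, the resource is
// scalar within each loop iteration, so this could instead stay an
// s_buffer_load whose result is copied to VGPRs with v_mov_b32.  Which form
// wins depends on K$ behavior when the resource is dynamically uniform in
// practice; we currently emit the VALU G_AMDGPU_BUFFER_LOAD without having
// measured the alternative.
```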


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D73880/new/

https://reviews.llvm.org/D73880




