[PATCH] D70118: [AMDGPU] Lower llvm.amdgcn.s.buffer.load.v3[i|f]32

Fri Nov 15 00:46:05 PST 2019

piotr marked an inline comment as done.
piotr added inline comments.

================
Comment at: llvm/test/CodeGen/AMDGPU/llvm.amdgcn.s.buffer.load.ll:97-98
+;GCN-NOT: s_waitcnt;
+;GCN: s_buffer_load_dword s{{[0-9]+}}, s[{{[0-9]+:[0-9]+}}], s{{[0-9]+}}
+;GCN: s_buffer_load_dwordx2 s[{{[0-9]+:[0-9]+}}], s[{{[0-9]+:[0-9]+}}], s{{[0-9]+}}
+define amdgpu_ps void @s_buffer_loadx3_index(<4 x i32> inreg %desc, i32 inreg %index) {
----------------
piotr wrote:
> arsenm wrote:
> > There is no load dwordx3, so I'm slightly confused about why you need this, but I would expect this ot widen to 4x loads?
> The big picture is that I am working on cutting down the number of loaded components with various buffer loads. I have another change in instcombine (soon to be uploaded for review) that trims loads based on the components used. With that patch vec3 s_buffer_load crashes in the lowering so I am adding support for that. 
> 
> It is useful to have s_buffer_load.v3 for the case with divergent index, where s_buffer_load cannot be used and buffer_load_dword is generated instead. On newer GPU (VI and later) buffer_load_dwordx3 is present, only on SI we generate buffer_load_dwordx4 for that (see s_buffer_loadx3_index_divergent test). 
> 
> As for whether it is better to split or widen the s_buffer_load (non-divergent index), the advantage of splitting is that the split loads can be merged with an adjacent load more easily. But I do not have a strong opinion on that.
On second thought widening seems to make more sense, will update the patch.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70118/new/

https://reviews.llvm.org/D70118