[PATCH] D44585: [AMDGPU] Scalarize when scalar code cheaper than vector code.

Mon Mar 26 09:21:09 PDT 2018

arsenm added inline comments.

================
Comment at: lib/Target/AMDGPU/SIISelLowering.cpp:6740-6742
+  // In general, shuffles are expensive.
+  if (N.getOpcode() == ISD::VECTOR_SHUFFLE && N.hasOneUse())
+    return true;
----------------
FarhanaAleen wrote:
> arsenm wrote:
> > For the 2x 16 vectors it's 2 instructions which isn't really that expensive.
> > 
> > Do you really need to consider VECTOR_SHUFFLE? They aren't legal so I would expect any useful combines to happen on them in the generic combiner
> There is a ticket where this optimization matters. From code complexity and compiler overhead perspective, this is a simple fix to implement in the compiler. Instruction count will have some implication at least on power though we are not focused on it now. Also, I understand latency doesn't matter as much in GPUs, but do we really want to say instruction count doesn't matter for GPUs? Don't those extra instructions make each thread run longer and if we do have well tuned workloads that happen to be ALU or thread context limited won't we be better off by having each thread consume the GPU for shorter time?
>  
> VECTOR_SHUFLLE operation is legal on AMDGPU. AMDGPU just does not have native support for vector_shuffle, that's why they get expanded. This optimization is not required for all sub-targets of AMD-GPU. Doing this in the generic part will cause unnecessary compile time checks for the other sub-targets which can matter for the runtime compilation, right? 
It's of course worse to have the extra instructions, but I wouldn't call it expensive.

VECTOR_SHUFFLE is not not legal. We could make v2x16 shuffles legal if it would make things easier, since that's basically what op_sel is.

There's no point in worrying about cheap checks in the generic combiner. SelectionDAG is already very slow, and all of the basic operations are slow.

================
Comment at: test/CodeGen/AMDGPU/scalarize.ll:7
+; FUNC-LABEL:  {{^}}vadd_half2:
+; {{BB|%bb}}{{.*}}:
+; GFX9: s_waitcnt
----------------
FarhanaAleen wrote:
> arsenm wrote:
> > You don't need these
> I wanted to make sure there are no other instructions in this block. Having {{BB|%bb}}{{.*}}: would have checked for that, right?
You would need to use -NEXT for that. For callable function's I've been just using the initial waiting with -NEXT checks after it

================
Comment at: test/CodeGen/AMDGPU/scalarize.ll:12
+entry:
+ %rdx.shuf1 = shufflevector <2 x half> %a, <2 x half> undef, <2 x i32> <i32 1, i32 undef>
+ %bin.rdx2 = fadd fast <2 x half> %a, %rdx.shuf1
----------------
FarhanaAleen wrote:
> arsenm wrote:
> > All of these test cases use an undef for one of the components. While it's useful to have some examples with undef components, most of these should use a valid component.
> > 
> > Because the component is undef, I would also expect all of these to happen on all targets (and I would expect the generic visitEXTRACT_VECTOR_ELT would already handle this). Why doesn't it now?
> This is an example of reduction. Other targets have native vector instructions that can handle reduction very well.
If one of the components is undef, it's just a scalar add with an unnecessary vector component

================
Comment at: test/CodeGen/AMDGPU/scalarize.ll:97
+entry:
+ %rdx.shuf1 = shufflevector <2 x i16> %a, <2 x i16> %b, <2 x i32> <i32 1, i32 2>
+ %bin.rdx2 = add <2 x i16> %a, %rdx.shuf1
----------------
This out of bounds vector index is undefined

https://reviews.llvm.org/D44585