[PATCH] D44585: [AMDGPU] Scalarize when scalar code cheaper than vector code.

Farhana Aleen via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Mar 19 12:24:21 PDT 2018


FarhanaAleen added inline comments.


================
Comment at: lib/Target/AMDGPU/SIISelLowering.cpp:6740-6742
+  // In general, shuffles are expensive.
+  if (N.getOpcode() == ISD::VECTOR_SHUFFLE && N.hasOneUse())
+    return true;
----------------
arsenm wrote:
> For the 2x 16 vectors it's 2 instructions which isn't really that expensive.
> 
> Do you really need to consider VECTOR_SHUFFLE? They aren't legal so I would expect any useful combines to happen on them in the generic combiner
There is a ticket where this optimization matters. From a code-complexity and compiler-overhead perspective, this is a simple fix to implement in the compiler. The extra instruction count will have some implication, at least on power, though we are not focused on that now. Also, I understand latency doesn't matter as much on GPUs, but do we really want to say instruction count doesn't matter for GPUs? Those extra instructions make each thread run longer, and for well-tuned workloads that happen to be ALU- or thread-context-limited, wouldn't we be better off having each thread occupy the GPU for a shorter time?
 
The VECTOR_SHUFFLE operation is legal on AMDGPU; AMDGPU just does not have native hardware support for vector shuffles, which is why they get expanded. This optimization is not required for all AMDGPU sub-targets. Doing this in the generic part would add unnecessary compile-time checks for the other sub-targets, which can matter for runtime compilation, right?


================
Comment at: lib/Target/AMDGPU/SIISelLowering.cpp:6745-6746
+  if (N.getNumOperands() == 2) {
+    if (cheaperToScalarize(N.getOperand(0)) ||
+        cheaperToScalarize(N.getOperand(1)))
+      return true;
----------------
arsenm wrote:
> I would expect this to be an and?
No, it should be an OR. A single shuffle operand already gives enough incentive to do this optimization.
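The heuristic being debated in these two hunks can be sketched as follows. This is a hypothetical Python model of the patch's `cheaperToScalarize` (the real code operates on `SDValue` in C++; `Node` here is a made-up stand-in), shown only to illustrate why OR rather than AND is the intended combining rule:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """Toy stand-in for an SDNode: an opcode, operands, and a use count."""
    opcode: str
    operands: List["Node"] = field(default_factory=list)
    num_uses: int = 1

def cheaper_to_scalarize(n: Node) -> bool:
    # In general, shuffles are expensive on this target.
    if n.opcode == "VECTOR_SHUFFLE" and n.num_uses == 1:
        return True
    # For a binary op, a single shuffle operand is enough incentive
    # to scalarize -- hence OR, not AND.
    if len(n.operands) == 2:
        return (cheaper_to_scalarize(n.operands[0]) or
                cheaper_to_scalarize(n.operands[1]))
    return False
```

With one shuffle operand and one plain operand, the OR makes the binary op a scalarization candidate; an AND would reject exactly the reduction patterns this patch targets.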


================
Comment at: lib/Target/AMDGPU/SIISelLowering.cpp:6762
+
   if (Vec.getOpcode() == ISD::FNEG && allUsesHaveSourceMods(N)) {
     SDValue Elt = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, SL, EltVT,
----------------
arsenm wrote:
> I don't see what changed here, is there trailing whitespace or something?
There is no trailing whitespace; the line has just been moved.


================
Comment at: lib/Target/AMDGPU/SIISelLowering.cpp:6786
+  case ISD::MUL:
+    if (Vec.hasOneUse() && cheaperToScalarize(Vec)) {
+      SDValue Lhs = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, SL, EltVT,
----------------
arsenm wrote:
> Needs testcase with multiple uses
Added.


================
Comment at: test/CodeGen/AMDGPU/scalarize.ll:7
+; FUNC-LABEL:  {{^}}vadd_half2:
+; {{BB|%bb}}{{.*}}:
+; GFX9: s_waitcnt
----------------
arsenm wrote:
> You don't need these
I wanted to make sure there are no other instructions in this block. Checking for {{BB|%bb}}{{.*}}: would verify that, right?


================
Comment at: test/CodeGen/AMDGPU/scalarize.ll:12
+entry:
+ %rdx.shuf1 = shufflevector <2 x half> %a, <2 x half> undef, <2 x i32> <i32 1, i32 undef>
+ %bin.rdx2 = fadd fast <2 x half> %a, %rdx.shuf1
----------------
arsenm wrote:
> All of these test cases use an undef for one of the components. While it's useful to have some examples with undef components, most of these should use a valid component.
> 
> Because the component is undef, I would also expect all of these to happen on all targets (and I would expect the generic visitEXTRACT_VECTOR_ELT would already handle this). Why doesn't it now?
This is an example of a reduction. Other targets have native vector instructions that handle reductions very well.
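To make the pattern concrete, here is a hypothetical Python model of what the shuffle/fadd reduction in the test computes (lane semantics only; `undef` lanes are simply unused, and only lane 0 of the result matters to the extract that follows in the test):

```python
def vadd_half2_reduce(a):
    """Model of the <2 x half> reduction IR:
    the shufflevector moves lane 1 of %a into lane 0,
    the fadd adds that to lane 0 of %a, and only lane 0
    of the result is consumed."""
    # %rdx.shuf1 = shufflevector %a, undef, <1, undef>
    shuf_lane0 = a[1]            # lane 1 becomes lane 0; other lane is undef
    # %bin.rdx2 = fadd %a, %rdx.shuf1  (only lane 0 is meaningful)
    return a[0] + shuf_lane0     # scalarized form: a[0] + a[1]
```

Scalarized, the whole pattern collapses to one scalar add of the two extracted lanes, instead of a shuffle expansion followed by a vector add.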


================
Comment at: test/CodeGen/AMDGPU/scalarize.ll:38
+; GFX9-NEXT: v_mul_f16_sdwa
+define void @vmul_half2(<2 x half> %a, half addrspace(3)* %c) {
+entry:
----------------
arsenm wrote:
> Probably should use another parameter to shuffle rather than adding a shuffled operand to the original?
This is an example of a reduction. I want to make sure the compiler generates fewer instructions for reductions on AMDGPU when they are vectorized.


https://reviews.llvm.org/D44585




