[PATCH] D44585: [AMDGPU] Scalarize when scalar code cheaper than vector code.

Sat Mar 17 08:38:35 PDT 2018

arsenm added inline comments.

================
Comment at: lib/Target/AMDGPU/SIISelLowering.cpp:6740-6742
+  // In general, shuffles are expensive.
+  if (N.getOpcode() == ISD::VECTOR_SHUFFLE && N.hasOneUse())
+    return true;
----------------
For the 2x 16 vectors it's 2 instructions which isn't really that expensive.

Do you really need to consider VECTOR_SHUFFLE? They aren't legal so I would expect any useful combines to happen on them in the generic combiner

================
Comment at: lib/Target/AMDGPU/SIISelLowering.cpp:6745-6746
+  if (N.getNumOperands() == 2) {
+    if (cheaperToScalarize(N.getOperand(0)) ||
+        cheaperToScalarize(N.getOperand(1)))
+      return true;
----------------
I would expect this to be an and?

================
Comment at: lib/Target/AMDGPU/SIISelLowering.cpp:6762
+
   if (Vec.getOpcode() == ISD::FNEG && allUsesHaveSourceMods(N)) {
     SDValue Elt = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, SL, EltVT,
----------------
I don't see what changed here, is there trailing whitespace or something?

================
Comment at: lib/Target/AMDGPU/SIISelLowering.cpp:6786
+  case ISD::MUL:
+    if (Vec.hasOneUse() && cheaperToScalarize(Vec)) {
+      SDValue Lhs = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, SL, EltVT,
----------------
Needs testcase with multiple uses

================
Comment at: test/CodeGen/AMDGPU/scalarize.ll:1
+; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -check-prefixes=FUNC,GFX9 %s
+
----------------
s/FUNC/GCN

================
Comment at: test/CodeGen/AMDGPU/scalarize.ll:7
+; FUNC-LABEL:  {{^}}vadd_half2:
+; {{BB|%bb}}{{.*}}:
+; GFX9: s_waitcnt
----------------
You don't need these

================
Comment at: test/CodeGen/AMDGPU/scalarize.ll:12
+entry:
+ %rdx.shuf1 = shufflevector <2 x half> %a, <2 x half> undef, <2 x i32> <i32 1, i32 undef>
+ %bin.rdx2 = fadd fast <2 x half> %a, %rdx.shuf1
----------------
All of these test cases use an undef for one of the components. While it's useful to have some examples with undef components, most of these should use a valid component.

Because the component is undef, I would also expect all of these to happen on all targets (and I would expect the generic visitEXTRACT_VECTOR_ELT would already handle this). Why doesn't it now?

================
Comment at: test/CodeGen/AMDGPU/scalarize.ll:14
+ %bin.rdx2 = fadd fast <2 x half> %a, %rdx.shuf1
+ %0 = extractelement <2 x half> %bin.rdx2, i32 0
+
----------------
Give these values names, opt -instnamer will do this

================
Comment at: test/CodeGen/AMDGPU/scalarize.ll:38
+; GFX9-NEXT: v_mul_f16_sdwa
+define void @vmul_half2(<2 x half> %a, half addrspace(3)* %c) {
+entry:
----------------
Probably should use another parameter to shuffle rather than adding a shuffled operand to the original?

https://reviews.llvm.org/D44585