[PATCH] D121354: [SLP] Fix lookahead operand reordering for splat loads.

Mon Mar 14 12:39:58 PDT 2022

ABataev added inline comments.

================
Comment at: llvm/lib/Target/X86/X86TargetTransformInfo.cpp:5117
+  FixedVectorType *Ty = dyn_cast<FixedVectorType>(VecTy);
+  if (ST->hasSSSE3() && Ty && Ty->getNumElements() == 2 &&
+      Ty->getElementType() == Type::getDoubleTy(Ty->getContext()))
----------------
I assume you need to tweak the cost model for broadcasts with loads.

================
Comment at: llvm/lib/Target/X86/X86TargetTransformInfo.cpp:5117-5121
+  if (ST->hasSSSE3() && Ty && Ty->getNumElements() == 2 &&
+      Ty->getElementType() == Type::getDoubleTy(Ty->getContext()))
+    // movddup
+    return true;
+  return false;
----------------
ABataev wrote:
> I assume you need to tweak the cost model for broadcasts with loads.
```
return ST->hasSSSE3() && VecTy && VecTy->getElementCount().getKnownMinValue() == 2 &&
       VecTy->getElementType() == Type::getDoubleTy(VecTy->getContext());
```

================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:1090-1091
+    /// The same load multiple times. This should have a better score than
+    /// `ScoreSplat` because it takes only 1 instruction in x86 for a 2-lane
+    /// vector using `movddup (%reg), xmm0`.
+    static const int ScoreSplatLoads = 3;
----------------
For SLP we care about throughput, not instruction count.

================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:1124-1125
+          // A broadcast of a load can be cheaper on some targets.
+          VectorType *VecTy = FixedVectorType::get(V1->getType(), NumLanes);
+          if (TTI->isLegalBroadcastLoad(VecTy))
+            return VLOperands::ScoreSplatLoads;
----------------
Maybe pass the scalar type and the number of elements to avoid constructing a vector type?
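
A rough sketch of that interface shape, written as a standalone illustration that does not use LLVM headers. `ElemKind`, `SubtargetSketch`, and this `isLegalBroadcastLoad` signature are hypothetical stand-ins, not the patch's actual code; the condition simply mirrors the check quoted from X86TargetTransformInfo.cpp.

```cpp
// Hypothetical stand-in for the scalar element type (not llvm::Type).
enum class ElemKind { Float, Double, Int32 };

// Hypothetical stand-in for the X86 subtarget feature query used in the patch.
struct SubtargetSketch {
  bool HasSSSE3;
  bool hasSSSE3() const { return HasSSSE3; }
};

// Takes the scalar element kind and the lane count directly, so the caller
// does not have to build a vector type just to ask the question.
bool isLegalBroadcastLoad(const SubtargetSketch &ST, ElemKind Elem,
                          unsigned NumLanes) {
  // Same condition as the patch: only a 2 x double broadcast (movddup).
  return ST.hasSSSE3() && NumLanes == 2 && Elem == ElemKind::Double;
}
```

With this shape, the SLP call site could pass `V1->getType()` and `NumLanes` as they are, without calling `FixedVectorType::get` first.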

================
Comment at: llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll:181-190
+; CHECK-NEXT:    [[FROM_1:%.*]] = getelementptr double, double* [[FROM:%.*]], i32 1
+; CHECK-NEXT:    [[V0_1:%.*]] = load double, double* [[FROM]], align 4
+; CHECK-NEXT:    [[V0_2:%.*]] = load double, double* [[FROM_1]], align 4
+; CHECK-NEXT:    [[TMP0:%.*]] = insertelement <2 x double> poison, double [[V0_2]], i64 0
+; CHECK-NEXT:    [[TMP1:%.*]] = insertelement <2 x double> [[TMP0]], double [[P]], i64 1
+; CHECK-NEXT:    [[TMP2:%.*]] = insertelement <2 x double> poison, double [[V0_1]], i64 0
+; CHECK-NEXT:    [[TMP3:%.*]] = shufflevector <2 x double> [[TMP2]], <2 x double> poison, <2 x i32> zeroinitializer
----------------
This block of code now has a throughput of 2.5, compared with 2.0 before. I assume there are other cases like this; it needs some extra investigation.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D121354/new/

https://reviews.llvm.org/D121354