[PATCH] D121354: [SLP] Fix lookahead operand reordering for splat loads.
Alexey Bataev via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Mar 14 12:39:58 PDT 2022
ABataev added inline comments.
================
Comment at: llvm/lib/Target/X86/X86TargetTransformInfo.cpp:5117
+ FixedVectorType *Ty = dyn_cast<FixedVectorType>(VecTy);
+ if (ST->hasSSSE3() && Ty && Ty->getNumElements() == 2 &&
+ Ty->getElementType() == Type::getDoubleTy(Ty->getContext()))
----------------
I assume you need to tweak the cost model for broadcasts of loads.
================
Comment at: llvm/lib/Target/X86/X86TargetTransformInfo.cpp:5117-5121
+ if (ST->hasSSSE3() && Ty && Ty->getNumElements() == 2 &&
+ Ty->getElementType() == Type::getDoubleTy(Ty->getContext()))
+ // movddup
+ return true;
+ return false;
----------------
ABataev wrote:
> I assume you need to tweak the cost model for broadcasts of loads.
```
return ST->hasSSSE3() && VecTy && VecTy->getElementCount().getKnownMinValue() == 2 &&
       VecTy->getElementType() == Type::getDoubleTy(VecTy->getContext());
```
================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:1090-1091
+ /// The same load multiple times. This should have a better score than
+ /// `ScoreSplat` because it takes only 1 instruction in x86 for a 2-lane
+ /// vector using `movddup (%reg), xmm0`.
+ static const int ScoreSplatLoads = 3;
----------------
For SLP we care about throughput, not instruction count.
================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:1124-1125
+ // A broadcast of a load can be cheaper on some targets.
+ VectorType *VecTy = FixedVectorType::get(V1->getType(), NumLanes);
+ if (TTI->isLegalBroadcastLoad(VecTy))
+ return VLOperands::ScoreSplatLoads;
----------------
Maybe pass the scalar type and the number of elements to avoid constructing a vector type?
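A rough sketch of the shape I have in mind, written against stand-in types rather than the real LLVM classes. `ElemKind`, `Subtarget`, and this `isLegalBroadcastLoad` signature are hypothetical and only illustrate the idea; the feature check simply mirrors the one in the patch.

```cpp
// Stand-in types for illustration only; these are not the LLVM classes.
enum class ElemKind { Float, Double, I32, I64 };

struct Subtarget {
  bool HasSSSE3 = false; // mirrors the ST->hasSSSE3() check in the patch
};

// Hypothetical signature: the caller passes the scalar element kind and the
// lane count, so no vector type has to be built just to ask the question.
inline bool isLegalBroadcastLoad(const Subtarget &ST, ElemKind Elem,
                                 unsigned NumElts) {
  // movddup case: broadcast-load of a double into a 2-lane vector.
  return ST.HasSSSE3 && NumElts == 2 && Elem == ElemKind::Double;
}
```

With something like this, the caller in SLPVectorizer.cpp could hand over the scalar type of `V1` and `NumLanes` directly instead of calling `FixedVectorType::get` first.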
================
Comment at: llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll:181-190
+; CHECK-NEXT: [[FROM_1:%.*]] = getelementptr double, double* [[FROM:%.*]], i32 1
+; CHECK-NEXT: [[V0_1:%.*]] = load double, double* [[FROM]], align 4
+; CHECK-NEXT: [[V0_2:%.*]] = load double, double* [[FROM_1]], align 4
+; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> poison, double [[V0_2]], i64 0
+; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> [[TMP0]], double [[P]], i64 1
+; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> poison, double [[V0_1]], i64 0
+; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x double> [[TMP2]], <2 x double> poison, <2 x i32> zeroinitializer
----------------
This block of code now has a throughput of 2.5 instead of 2.0 before the patch. I assume there are other cases like this; it needs some extra investigation.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D121354/new/
https://reviews.llvm.org/D121354