[PATCH] D121354: [SLP] Fix lookahead operand reordering for splat loads.
Alexey Bataev via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed Mar 16 13:36:55 PDT 2022
ABataev added inline comments.
================
Comment at: llvm/include/llvm/Analysis/TargetTransformInfo.h:1058
+ VectorType *SubTp = nullptr,
+ ArrayRef<const Value *> Args = None) const;
----------------
You can drop `const` in `const Value *`
================
Comment at: llvm/lib/Target/X86/X86TargetTransformInfo.cpp:1557-1559
+ if (ST->hasSSE3() && IsLoad)
+ if (const auto *Entry =
+ CostTableLookup(SSE2BroadcastLoadTbl, Kind, LT.second)) {
----------------
Maybe rename it to `SSE3BroadcastLoadTbl`?
================
Comment at: llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll:181-190
+; CHECK-NEXT: [[FROM_1:%.*]] = getelementptr double, double* [[FROM:%.*]], i32 1
+; CHECK-NEXT: [[V0_1:%.*]] = load double, double* [[FROM]], align 4
+; CHECK-NEXT: [[V0_2:%.*]] = load double, double* [[FROM_1]], align 4
+; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> poison, double [[V0_2]], i64 0
+; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> [[TMP0]], double [[P]], i64 1
+; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> poison, double [[V0_1]], i64 0
+; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x double> [[TMP2]], <2 x double> poison, <2 x i32> zeroinitializer
----------------
ABataev wrote:
> vporpo wrote:
> > ABataev wrote:
> > > This block of code has a throughput of 2.5 instead of the 2.0 it had before. I assume there are some other cases; this needs some extra investigation.
> > How did you come up with these throughput values?
> >
> >
> > The assembly code that comes out of llc for the original code is:
> > ```
> > movl 8(%esp), %eax
> > movl 4(%esp), %ecx
> > vmovupd (%ecx), %xmm0
> > vpermilpd $1, %xmm0, %xmm1 ## xmm1 = xmm0[1,0]
> > vmovq %xmm0, %xmm0 ## xmm0 = xmm0[0],zero
> > vaddpd %xmm1, %xmm0, %xmm0
> > vmovupd %xmm0, (%eax)
> > ```
> >
> > The new code is:
> > ```
> > movl 8(%esp), %eax
> > movl 4(%esp), %ecx
> > vmovsd 8(%ecx), %xmm0 ## xmm0 = mem[0],zero
> > vmovddup (%ecx), %xmm1 ## xmm1 = mem[0,0]
> > vaddpd %xmm1, %xmm0, %xmm0
> > vmovupd %xmm0, (%eax)
> > ```
> > I ran the function in a loop on a Skylake and the new code is 25% faster.
> https://godbolt.org/z/3rhGajsaT
>
> The first page shows the result without the patch, the second with the patch.
What about this?
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D121354/new/
https://reviews.llvm.org/D121354