[PATCH] D121354: [SLP] Fix lookahead operand reordering for splat loads.

Alexey Bataev via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Mar 16 13:36:55 PDT 2022


ABataev added inline comments.


================
Comment at: llvm/include/llvm/Analysis/TargetTransformInfo.h:1058
+                                 VectorType *SubTp = nullptr,
+                                 ArrayRef<const Value *> Args = None) const;
 
----------------
You can drop `const` in `const Value *`


================
Comment at: llvm/lib/Target/X86/X86TargetTransformInfo.cpp:1557-1559
+    if (ST->hasSSE3() && IsLoad)
+      if (const auto *Entry =
+              CostTableLookup(SSE2BroadcastLoadTbl, Kind, LT.second)) {
----------------
Maybe rename it to `SSE3BroadcastLoadTbl`?
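For readers following the thread: the table in question is a `CostTblEntry` array searched linearly by `CostTableLookup`. A minimal self-contained mimic of that mechanism (the enum values, field names, and the cost of 0 are illustrative assumptions, not the real LLVM headers or the patch's actual entries):

```cpp
#include <cassert>
#include <cstddef>
#include <optional>

// Minimal mimic of LLVM's CostTblEntry/CostTableLookup machinery.
// Names and values are assumptions for illustration only.
enum ShuffleKind { SK_Broadcast, SK_Reverse };
enum SimpleVT { v2f64, v4f32 };

struct CostTblEntry {
  ShuffleKind Kind;
  SimpleVT Type;
  int Cost;
};

// A broadcast fed directly by a load (e.g. movddup on SSE3) can fold
// the shuffle into the load, so its shuffle cost is modeled as cheap.
static const CostTblEntry SSE3BroadcastLoadTbl[] = {
    {SK_Broadcast, v2f64, 0},
};

std::optional<int> costTableLookup(const CostTblEntry *Tbl, size_t N,
                                   ShuffleKind Kind, SimpleVT VT) {
  // Linear scan: return the cost of the first matching (Kind, Type) entry.
  for (size_t I = 0; I < N; ++I)
    if (Tbl[I].Kind == Kind && Tbl[I].Type == VT)
      return Tbl[I].Cost;
  return std::nullopt;  // no entry: caller falls back to default costs
}
```

The rename suggestion makes sense because the lookup is guarded by `ST->hasSSE3()`, so naming the table after SSE2 is misleading.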


================
Comment at: llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll:181-190
+; CHECK-NEXT:    [[FROM_1:%.*]] = getelementptr double, double* [[FROM:%.*]], i32 1
+; CHECK-NEXT:    [[V0_1:%.*]] = load double, double* [[FROM]], align 4
+; CHECK-NEXT:    [[V0_2:%.*]] = load double, double* [[FROM_1]], align 4
+; CHECK-NEXT:    [[TMP0:%.*]] = insertelement <2 x double> poison, double [[V0_2]], i64 0
+; CHECK-NEXT:    [[TMP1:%.*]] = insertelement <2 x double> [[TMP0]], double [[P]], i64 1
+; CHECK-NEXT:    [[TMP2:%.*]] = insertelement <2 x double> poison, double [[V0_1]], i64 0
+; CHECK-NEXT:    [[TMP3:%.*]] = shufflevector <2 x double> [[TMP2]], <2 x double> poison, <2 x i32> zeroinitializer
----------------
ABataev wrote:
> vporpo wrote:
> > ABataev wrote:
> > > This block of code has a throughput of 2.5, compared to 2.0 before. I assume there are other cases like this; it needs some extra investigation.
> > How did you come up with these throughput values?
> > 
> > 
> > The assembly code that comes out of llc for the original code is:
> > ```
> >   movl  8(%esp), %eax
> >   movl  4(%esp), %ecx
> >   vmovupd (%ecx), %xmm0
> >   vpermilpd $1, %xmm0, %xmm1        ## xmm1 = xmm0[1,0]
> >   vmovq %xmm0, %xmm0                ## xmm0 = xmm0[0],zero
> >   vaddpd  %xmm1, %xmm0, %xmm0
> >   vmovupd %xmm0, (%eax)
> > ```
> > 
> > The new code is:
> > ```
> >   movl  8(%esp), %eax
> >   movl  4(%esp), %ecx
> >   vmovsd  8(%ecx), %xmm0            ## xmm0 = mem[0],zero
> >   vmovddup  (%ecx), %xmm1           ## xmm1 = mem[0,0]
> >   vaddpd  %xmm1, %xmm0, %xmm0
> >   vmovupd %xmm0, (%eax)
> > ```
> > I ran the function in a loop on a Skylake, and the new code is 25% faster.
> https://godbolt.org/z/3rhGajsaT
> 
> The first pane is the result without the patch, the second with the patch.
What about this?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D121354/new/

https://reviews.llvm.org/D121354



More information about the llvm-commits mailing list