[PATCH] D121354: [SLP] Fix lookahead operand reordering for splat loads.

Alexey Bataev via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Mar 14 16:29:17 PDT 2022


ABataev added inline comments.


================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:1090-1091
+    /// The same load multiple times. This should have a better score than
+    /// `ScoreSplat` because it takes only 1 instruction in x86 for a 2-lane
+    /// vector using `movddup (%reg), xmm0`.
+    static const int ScoreSplatLoads = 3;
----------------
vporpo wrote:
> ABataev wrote:
> > For SLP we don't care about the instruction count, but about the throughput.
> Agreed, but code with fewer instructions is usually better for various reasons. How would you want me to rephrase this?
Use the throughput, not the number of instructions.
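
For context, a minimal sketch of the relationship these score constants encode (the value of `ScoreSplat` and the helper below are illustrative assumptions, not the actual SLPVectorizer code): the lookahead reorderer keeps the operand ordering with the highest aggregate score, so ranking splat loads above plain splats is what steers SLP toward the single-instruction `movddup` form discussed below.

```cpp
// Illustrative sketch only -- names and values are assumptions, except
// ScoreSplatLoads, which comes from the patch hunk above.
static const int ScoreSplat = 1;      // assumed: reusing the same non-load value
static const int ScoreSplatLoads = 3; // from the patch: reusing the same load

// A pair of identical operands scores higher when it is a load, because a
// broadcast load (movddup on x86) is cheaper than a load plus a shuffle.
static int getSplatScore(bool OperandIsLoad) {
  return OperandIsLoad ? ScoreSplatLoads : ScoreSplat;
}
```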


================
Comment at: llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll:181-190
+; CHECK-NEXT:    [[FROM_1:%.*]] = getelementptr double, double* [[FROM:%.*]], i32 1
+; CHECK-NEXT:    [[V0_1:%.*]] = load double, double* [[FROM]], align 4
+; CHECK-NEXT:    [[V0_2:%.*]] = load double, double* [[FROM_1]], align 4
+; CHECK-NEXT:    [[TMP0:%.*]] = insertelement <2 x double> poison, double [[V0_2]], i64 0
+; CHECK-NEXT:    [[TMP1:%.*]] = insertelement <2 x double> [[TMP0]], double [[P]], i64 1
+; CHECK-NEXT:    [[TMP2:%.*]] = insertelement <2 x double> poison, double [[V0_1]], i64 0
+; CHECK-NEXT:    [[TMP3:%.*]] = shufflevector <2 x double> [[TMP2]], <2 x double> poison, <2 x i32> zeroinitializer
----------------
vporpo wrote:
> ABataev wrote:
> > This block of code has throughput 2.5 instead of the 2.0 it had before the patch. I assume there are some other cases like this; it needs some extra investigation.
> How did you come up with these throughput values?
> 
> 
> The assembly code that comes out of llc for the original code is:
> ```
>   movl  8(%esp), %eax
>   movl  4(%esp), %ecx
>   vmovupd (%ecx), %xmm0
>   vpermilpd $1, %xmm0, %xmm1          ## xmm1 = xmm0[1,0]
>   vmovq %xmm0, %xmm0                  ## xmm0 = xmm0[0],zero
>   vaddpd  %xmm1, %xmm0, %xmm0
>   vmovupd %xmm0, (%eax)
> ```
> 
> The new code is:
> ```
>   movl  8(%esp), %eax
>   movl  4(%esp), %ecx
>   vmovsd  8(%ecx), %xmm0              ## xmm0 = mem[0],zero
>   vmovddup  (%ecx), %xmm1             ## xmm1 = mem[0,0]
>   vaddpd  %xmm1, %xmm0, %xmm0
>   vmovupd %xmm0, (%eax)
> ```
> I ran the function in a loop on a Skylake, and the new code is 25% faster.
https://godbolt.org/z/3rhGajsaT

The first pane shows the result without the patch, the second with the patch.
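
One way to sanity-check reciprocal-throughput estimates like 2.5 vs. 2.0 is to run the two snippets above through `llvm-mca -mcpu=skylake` (an assumption about methodology; the thread does not say which tool produced the numbers). A loop-based wall-clock measurement along the lines vporpo describes might look like the sketch below; the kernel is a hypothetical scalar equivalent of the pattern in the CHECK lines, not the actual test function from operandorder.ll.

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical stand-in for the test kernel: the splat of From[0] is added
// to <From[1], P>, matching the insertelement/shufflevector pattern above.
__attribute__((noinline)) void kernel(double *From, double *To, double P) {
  double A = From[0];
  To[0] = From[1] + A;
  To[1] = P + A;
}

int main() {
  double From[2] = {1.0, 2.0}, To[2];
  auto Start = std::chrono::steady_clock::now();
  for (long I = 0; I < 100000000; ++I)
    kernel(From, To, 3.0); // hot loop, as in the measurement described above
  auto End = std::chrono::steady_clock::now();
  long long Ns =
      std::chrono::duration_cast<std::chrono::nanoseconds>(End - Start).count();
  std::printf("total: %lld ns, To = {%f, %f}\n", Ns, To[0], To[1]);
  return 0;
}
```

A real measurement would also need to keep the compiler from hoisting or folding the loop-invariant call, e.g. by making the inputs `volatile` or inserting an optimizer barrier.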


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D121354/new/

https://reviews.llvm.org/D121354


