[PATCH] D121354: [SLP] Fix lookahead operand reordering for splat loads.

Wed Mar 16 14:03:49 PDT 2022

ABataev added inline comments.

================
Comment at: llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll:181-190
+; CHECK-NEXT:    [[FROM_1:%.*]] = getelementptr double, double* [[FROM:%.*]], i32 1
+; CHECK-NEXT:    [[V0_1:%.*]] = load double, double* [[FROM]], align 4
+; CHECK-NEXT:    [[V0_2:%.*]] = load double, double* [[FROM_1]], align 4
+; CHECK-NEXT:    [[TMP0:%.*]] = insertelement <2 x double> poison, double [[V0_2]], i64 0
+; CHECK-NEXT:    [[TMP1:%.*]] = insertelement <2 x double> [[TMP0]], double [[P]], i64 1
+; CHECK-NEXT:    [[TMP2:%.*]] = insertelement <2 x double> poison, double [[V0_1]], i64 0
+; CHECK-NEXT:    [[TMP3:%.*]] = shufflevector <2 x double> [[TMP2]], <2 x double> poison, <2 x i32> zeroinitializer
----------------
vporpo wrote:
> ABataev wrote:
> > ABataev wrote:
> > > vporpo wrote:
> > > > ABataev wrote:
> > > > > This block of code has throughput 2.5 instead of 2.0 before. I assume there are other cases like this; it needs some extra investigation.
> > > > How did you come up with these throughput values?
> > > > 
> > > > 
> > > > The assembly code that comes out of llc for the original code is:
> > > > ```
> > > >   movl  8(%esp), %eax
> > > >   movl  4(%esp), %ecx
> > > >   vmovupd (%ecx), %xmm0
> > > >   vpermilpd $1, %xmm0, %xmm1        ## xmm1 = xmm0[1,0]
> > > >   vmovq %xmm0, %xmm0                    ## xmm0 = xmm0[0],zero
> > > >   vaddpd  %xmm1, %xmm0, %xmm0
> > > >   vmovupd %xmm0, (%eax)
> > > > ```
> > > > 
> > > > The new code is:
> > > > ```
> > > >   movl  8(%esp), %eax
> > > >   movl  4(%esp), %ecx
> > > >   vmovsd  8(%ecx), %xmm0                  ## xmm0 = mem[0],zero
> > > >   vmovddup  (%ecx), %xmm1                   ## xmm1 = mem[0,0]
> > > >   vaddpd  %xmm1, %xmm0, %xmm0
> > > >   vmovupd %xmm0, (%eax)
> > > > ```
> > > > I ran the function in a loop on a Skylake and the new code is 25% faster.
> > > https://godbolt.org/z/3rhGajsaT
> > > 
> > > The first page is the result without the patch, the second is with the patch.
> > What about this?
> I am not sure what to do about this. It may have lower throughput, but it has lower latency, so it runs faster. Are we always considering throughput? In TTI we seem to be mostly counting instructions, at least from what I can see in getShuffleCost():
> ```
>         {TTI::SK_PermuteSingleSrc, MVT::v2i64, 1},  // pshufd
>         {TTI::SK_PermuteSingleSrc, MVT::v4i32, 1},  // pshufd
>         {TTI::SK_PermuteSingleSrc, MVT::v8i16, 5},  // 2*pshuflw + 2*pshufhw
>                                                     // + pshufd/unpck
>         {TTI::SK_PermuteSingleSrc, MVT::v16i8, 10}, // 2*pshuflw + 2*pshufhw
>                                                     // + 2*pshufd + 2*unpck + 2*packus
> ```
It is still in terms of throughput.
Yeah, it may be faster on Skylake but not on corei7-avx. There might be similar cases for other CPUs. We need to tweak the estimation criteria.
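
For reference, the four shuffle cost entries quoted above can be checked mechanically. The sketch below is a hypothetical Python mirror of those entries, not LLVM code: the names `ENTRIES`, `table_cost` and `instruction_count` are made up for illustration, and the instruction sequences are copied from the comments in the quote, with `pshufd/unpck` assumed to count as one instruction. It only shows that the table cost and the listed instruction count coincide for these four entries; it says nothing about how the numbers were originally derived.

```python
# Hypothetical mirror of the four quoted SK_PermuteSingleSrc entries.
# Each value is (cost from the table, instruction sequence from the comment).
ENTRIES = {
    "v2i64": (1, [(1, "pshufd")]),
    "v4i32": (1, [(1, "pshufd")]),
    "v8i16": (5, [(2, "pshuflw"), (2, "pshufhw"), (1, "pshufd/unpck")]),
    "v16i8": (10, [(2, "pshuflw"), (2, "pshufhw"), (2, "pshufd"),
                   (2, "unpck"), (2, "packus")]),
}


def table_cost(vector_type):
    """Return the cost number stored in the quoted table entry."""
    return ENTRIES[vector_type][0]


def instruction_count(vector_type):
    """Sum the per-instruction counts listed in the entry's comment."""
    _, sequence = ENTRIES[vector_type]
    return sum(count for count, _ in sequence)


def cost_matches_count(vector_type):
    """True when the table cost equals the number of listed instructions."""
    return table_cost(vector_type) == instruction_count(vector_type)
```

Calling `cost_matches_count` on each key of `ENTRIES` compares the two numbers per vector type, so any entry where cost and instruction count diverge would stand out.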

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D121354/new/

https://reviews.llvm.org/D121354