[PATCH] D121354: [SLP] Fix lookahead operand reordering for splat loads.

Vasileios Porpodas via Phabricator via llvm-commits <llvm-commits at lists.llvm.org>
Wed Mar 16 13:59:15 PDT 2022


vporpo added inline comments.


================
Comment at: llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll:181-190
+; CHECK-NEXT:    [[FROM_1:%.*]] = getelementptr double, double* [[FROM:%.*]], i32 1
+; CHECK-NEXT:    [[V0_1:%.*]] = load double, double* [[FROM]], align 4
+; CHECK-NEXT:    [[V0_2:%.*]] = load double, double* [[FROM_1]], align 4
+; CHECK-NEXT:    [[TMP0:%.*]] = insertelement <2 x double> poison, double [[V0_2]], i64 0
+; CHECK-NEXT:    [[TMP1:%.*]] = insertelement <2 x double> [[TMP0]], double [[P]], i64 1
+; CHECK-NEXT:    [[TMP2:%.*]] = insertelement <2 x double> poison, double [[V0_1]], i64 0
+; CHECK-NEXT:    [[TMP3:%.*]] = shufflevector <2 x double> [[TMP2]], <2 x double> poison, <2 x i32> zeroinitializer
----------------
ABataev wrote:
> ABataev wrote:
> > vporpo wrote:
> > > ABataev wrote:
> > > > This block of code has thruoughput 2.5 instead of 2.0 before. I assume, there are some other cases, need some extra investigation.
> > > How did you come up with these throughput values ?
> > > 
> > > 
> > > The assembly code that comes out of llc for the original code is:
> > > ```
> > >   movl  8(%esp), %eax
> > >   movl  4(%esp), %ecx
> > >   vmovupd (%ecx), %xmm0
> > >   vpermilpd $1, %xmm0, %xmm1        ## xmm1 = xmm0[1,0]
> > >   vmovq %xmm0, %xmm0                ## xmm0 = xmm0[0],zero
> > >   vaddpd  %xmm1, %xmm0, %xmm0
> > >   vmovupd %xmm0, (%eax)
> > > ```
> > > 
> > > The new code is:
> > > ```
> > >   movl  8(%esp), %eax
> > >   movl  4(%esp), %ecx
> > >   vmovsd  8(%ecx), %xmm0            ## xmm0 = mem[0],zero
> > >   vmovddup  (%ecx), %xmm1           ## xmm1 = mem[0,0]
> > >   vaddpd  %xmm1, %xmm0, %xmm0
> > >   vmovupd %xmm0, (%eax)
> > > ```
> > > I ran the function in a loop on a Skylake machine, and the new code is 25% faster. A rough sketch of the timing harness is below (the function name and signature are illustrative placeholders, not the exact test function).
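> > > ```
> > > // Call the compiled function in a loop and time it; keep the callee
> > > // in a separate translation unit so the call cannot be inlined or
> > > // optimized away. "store_splat" is a hypothetical name for the test
> > > // function, not the one in operandorder.ll.
> > > #include <chrono>
> > > #include <cstdio>
> > > 
> > > extern "C" void store_splat(double *to, double *from, double p);
> > > 
> > > int main() {
> > >   double from[2] = {1.0, 2.0}, to[2];
> > >   auto t0 = std::chrono::steady_clock::now();
> > >   for (long i = 0; i < 100000000L; ++i)
> > >     store_splat(to, from, 3.0);
> > >   std::chrono::duration<double> dt =
> > >       std::chrono::steady_clock::now() - t0;
> > >   std::printf("%.3f s\n", dt.count());
> > > }
> > > ```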
> > https://godbolt.org/z/3rhGajsaT
> > 
> > The first pane shows the result without the patch, the second pane with the patch.
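> > These look like llvm-mca reciprocal throughput estimates (the godbolt analysis view runs llvm-mca). Assuming that setup, something like the following should reproduce the numbers locally; the file names are placeholders:
> > ```
> > llc -mcpu=skylake test.ll -o test.s
> > llvm-mca -mcpu=skylake test.s
> > ```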
> What about this?
I am not sure what to do about this: the new sequence may have lower throughput, but it also has lower latency, so it runs faster. Are we always considering throughput? It looks like in TTI we are mostly counting instructions, at least from what I can see in getShuffleCost():
```
{TTI::SK_PermuteSingleSrc, MVT::v2i64,  1}, // pshufd
{TTI::SK_PermuteSingleSrc, MVT::v4i32,  1}, // pshufd
{TTI::SK_PermuteSingleSrc, MVT::v8i16,  5}, // 2*pshuflw + 2*pshufhw
                                            //  + pshufd/unpck
{TTI::SK_PermuteSingleSrc, MVT::v16i8, 10}, // 2*pshuflw + 2*pshufhw
                                            //  + 2*pshufd + 2*unpck + 2*packus
```
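For reference, these entries are consumed by a plain table lookup, so the unit is effectively "number of instructions". A minimal sketch of that lookup pattern, modeled on llvm/include/llvm/CodeGen/CostTable.h (simplified, not the exact code):
```
// Simplified stand-ins for LLVM's CostTblEntry/CostTableLookup; the
// real versions live in llvm/include/llvm/CodeGen/CostTable.h.
struct CostTblEntry {
  int ISD;       // operation kind, e.g. TTI::SK_PermuteSingleSrc
  int Type;      // value type, e.g. MVT::v8i16
  unsigned Cost; // an instruction count, not a latency/throughput figure
};

// Linear scan over the table; callers fall back to a generic
// estimate when this returns nullptr.
template <unsigned N>
const CostTblEntry *CostTableLookup(const CostTblEntry (&Tbl)[N], int ISD,
                                    int Type) {
  for (const CostTblEntry &E : Tbl)
    if (E.ISD == ISD && E.Type == Type)
      return &E;
  return nullptr;
}
```
So a latency/throughput distinction never enters this part of the model; the cost is just the per-type instruction count from the table.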


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D121354/new/

https://reviews.llvm.org/D121354


