[PATCH] D121354: [SLP] Fix lookahead operand reordering for splat loads.

Vasileios Porpodas via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Mar 17 15:43:34 PDT 2022


vporpo added inline comments.


================
Comment at: llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll:181-190
+; CHECK-NEXT:    [[FROM_1:%.*]] = getelementptr double, double* [[FROM:%.*]], i32 1
+; CHECK-NEXT:    [[V0_1:%.*]] = load double, double* [[FROM]], align 4
+; CHECK-NEXT:    [[V0_2:%.*]] = load double, double* [[FROM_1]], align 4
+; CHECK-NEXT:    [[TMP0:%.*]] = insertelement <2 x double> poison, double [[V0_2]], i64 0
+; CHECK-NEXT:    [[TMP1:%.*]] = insertelement <2 x double> [[TMP0]], double [[P]], i64 1
+; CHECK-NEXT:    [[TMP2:%.*]] = insertelement <2 x double> poison, double [[V0_1]], i64 0
+; CHECK-NEXT:    [[TMP3:%.*]] = shufflevector <2 x double> [[TMP2]], <2 x double> poison, <2 x i32> zeroinitializer
----------------
vporpo wrote:
> ABataev wrote:
> > vporpo wrote:
> > > ABataev wrote:
> > > > ABataev wrote:
> > > > > vporpo wrote:
> > > > > > ABataev wrote:
> > > > > > > This block of code has a throughput of 2.5, versus 2.0 before the patch. I assume there are some other cases like this; it needs some extra investigation.
> > > > > > How did you come up with these throughput values?
> > > > > > 
> > > > > > The assembly code that comes out of llc for the original code is:
> > > > > > ```
> > > > > >   movl  8(%esp), %eax
> > > > > >   movl  4(%esp), %ecx
> > > > > >   vmovupd (%ecx), %xmm0
> > > > > >   vpermilpd $1, %xmm0, %xmm1        ## xmm1 = xmm0[1,0]                                                                                                                          
> > > > > >   vmovq %xmm0, %xmm0                    ## xmm0 = xmm0[0],zero                                                                                                                   
> > > > > >   vaddpd  %xmm1, %xmm0, %xmm0
> > > > > >   vmovupd %xmm0, (%eax)
> > > > > > ```
> > > > > > 
> > > > > > The new code is:
> > > > > > ```
> > > > > >   movl  8(%esp), %eax
> > > > > >   movl  4(%esp), %ecx
> > > > > >   vmovsd  8(%ecx), %xmm0                  ## xmm0 = mem[0],zero                                                                                                                  
> > > > > >   vmovddup  (%ecx), %xmm1                   ## xmm1 = mem[0,0]                                                                                                                   
> > > > > >   vaddpd  %xmm1, %xmm0, %xmm0
> > > > > >   vmovupd %xmm0, (%eax)
> > > > > > ```
> > > > > > I ran the function in a loop on a Skylake and the new code is 25% faster.
> > > > > https://godbolt.org/z/3rhGajsaT
> > > > > 
> > > > > The first page is the result without patch, the second - with patch.
> > > > What about this?
> > > I am not sure what to do about this: it may have lower throughput, but it has lower latency, so it runs faster. Are we always considering throughput? It looks like in TTI we are mostly counting instructions, at least from what I can see in getShuffleCost():
> > > ```
> > >         {TTI::SK_PermuteSingleSrc, MVT::v2i64, 1}, // pshufd                                                                                                                     
> > >         {TTI::SK_PermuteSingleSrc, MVT::v4i32, 1}, // pshufd                                                                                                                     
> > >         {TTI::SK_PermuteSingleSrc, MVT::v8i16, 5}, // 2*pshuflw + 2*pshufhw                                                                                                      
> > >                                                     // + pshufd/unpck                                                                                                            
> > >       { TTI::SK_PermuteSingleSrc, MVT::v16i8, 10 }, // 2*pshuflw + 2*pshufhw                                                                                                     
> > >                                                     // + 2*pshufd + 2*unpck + 2*packus  
> > > ```
> > It is still in terms of throughput.
> > Yeah, it may be faster on Skylake but not on corei7-avx. And there might be similar cases for other CPUs. We need to tweak the estimation criteria.
> What kind of tweaking are you proposing?
As far as I understand, the reason for the lower throughput is that we are adding pressure to the memory units ([4] and [5]). This happens because we are loading twice: once with the scalar load and once with `vmovddup`. So if we try to pack many instances of this code back-to-back, the pipeline would stall more than with the original code. But in any other case, e.g. if this is part of surrounding code that is not heavy on the load units, it is latency that matters, and the new code is better with respect to latency.
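
To make the tradeoff concrete, here is a minimal sketch of the two situations: a dependence chain, where only latency matters, and independent iterations, where the extra load issued by `vmovddup` competes for the load ports. This is not the exact benchmark mentioned above, and the `kernel` body only approximates what the vectorized block in operandorder.ll computes; a real measurement would time each loop separately.

```
// Minimal sketch, not the exact benchmark from this thread: the kernel only
// approximates the block in operandorder.ll.
#include <cstdio>

__attribute__((noinline))
static void kernel(double *to, const double *from, double p) {
  // Roughly what the vectorized block computes: one lane adds the two loaded
  // elements, the other adds a splat of from[0] to the scalar p.
  to[0] = from[0] + from[1];
  to[1] = from[0] + p;
}

int main() {
  double buf[2] = {1.0, 2.0};
  double out[2] = {0.0, 0.0};

  // Latency-bound use: each call feeds the next one through buf, so the
  // length of the dependence chain dominates and the lower-latency
  // vmovsd+vmovddup sequence wins.
  for (long i = 0; i < 100000000; ++i)
    kernel(buf, buf, buf[0]);

  // Throughput-bound use: iterations are independent and can overlap, so the
  // extra load issued by vmovddup competes for the load ports and the
  // reciprocal throughput dominates instead.
  for (long i = 0; i < 100000000; ++i)
    kernel(out, buf, static_cast<double>(i));

  std::printf("%f %f %f\n", buf[0], out[0], out[1]);
  return 0;
}
```

The point is only that the same pair of instruction sequences can rank differently depending on whether the surrounding code is latency-bound or throughput-bound.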

Modeling throughput would require an accurate pipeline model, which we do not have in SLP. The cost modeling we do looks more like a latency model than a throughput one, since it has no concept of pipeline resources. If we were actually modeling the pipeline, we could evaluate both the latency and the throughput of the code and decide which sequence to choose in each case. Given that we are not doing this, I would argue against trying to fix a potential pipeline stall in an ad hoc way.
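
For illustration, a rough sketch of the kind of additive query the cost model makes; this is a hypothetical helper, not actual SLP code, and the exact `getShuffleCost()` signature varies across LLVM versions:

```
// Hypothetical helper, not actual SLP code: it only illustrates that the
// cost model adds up per-instruction TTI table entries (like the ones quoted
// above) and never looks at port pressure in the surrounding code.
// Note: the getShuffleCost() signature differs across LLVM versions.
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/DerivedTypes.h"

using namespace llvm;

static InstructionCost splatVsPermuteCost(const TargetTransformInfo &TTI,
                                          FixedVectorType *VecTy) {
  // Cost of broadcasting lane 0 (what vmovddup implements for <2 x double>).
  InstructionCost Splat =
      TTI.getShuffleCost(TargetTransformInfo::SK_Broadcast, VecTy);
  // Cost of a single-source permute (what vpermilpd implements).
  InstructionCost Permute =
      TTI.getShuffleCost(TargetTransformInfo::SK_PermuteSingleSrc, VecTy);
  // A plain difference of per-entry table costs; nothing here models
  // throughput or the load/store units already in use.
  return Splat - Permute;
}
```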


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D121354/new/

https://reviews.llvm.org/D121354


