[PATCH] D101555: [SLP]Improve handling of compensate external uses cost.
Alexey Bataev via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed May 26 06:43:55 PDT 2021
ABataev added a comment.
In D101555#2781369 <https://reviews.llvm.org/D101555#2781369>, @dyung wrote:
> Hi, we are noticing a regression in the quality of the code generated by the compiler for btver2 after this change.
>
> Consider the following code (ymm-1undef-add_ps_002.cpp):
>
> #include <x86intrin.h>
>
> __attribute__((noinline))
> __m256 add_ps_002(__m256 a, __m256 b) {
>   __m256 r = (__m256){ a[0] + a[1], a[2] + a[3], a[4] + a[5], a[6] + a[7],
>                        b[0] + b[1], b[2] + b[3], b[4] + b[5], b[6] + b[7] };
>   return __builtin_shufflevector(r, a, 0, -1, 2, 3, 4, 5, 6, 7);
> }
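>
> The assembly and llvm-mca numbers below can be reproduced with commands
> along these lines (the exact invocation is illustrative; llvm-mca runs
> 100 iterations by default):
>
>   clang -g0 -O3 -march=btver2 -S ymm-1undef-add_ps_002.cpp -o ymm-1undef-add_ps_002.s
>   llvm-mca -mcpu=btver2 ymm-1undef-add_ps_002.s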
>
> Prior to this change, when compiled with "-g0 -O3 -march=btver2" the compiler would generate the following assembly:
>
> # %bb.0: # %entry
> vhaddps %xmm0, %xmm0, %xmm2
> vextractf128 $1, %ymm0, %xmm0
> vhaddps %xmm0, %xmm1, %xmm3
> vinsertf128 $1, %xmm3, %ymm0, %ymm3
> vhaddps %ymm0, %ymm1, %ymm0
> vblendps $3, %ymm2, %ymm3, %ymm2 # ymm2 = ymm2[0,1],ymm3[2,3,4,5,6,7]
> vshufpd $2, %ymm0, %ymm2, %ymm0 # ymm0 = ymm2[0],ymm0[1],ymm2[2],ymm0[2]
> retq
>
> With the following characteristics according to llvm-mca:
>
> Iterations: 100
> Instructions: 800
> Total Cycles: 902
> Total uOps: 1200
>
> Dispatch Width: 2
> uOps Per Cycle: 1.33
> IPC: 0.89
> Block RThroughput: 6.0
>
> But after this change, the compiler is now producing the following assembly for the same code:
>
> # %bb.0: # %entry
> vextractf128 $1, %ymm0, %xmm2
> vmovlhps %xmm2, %xmm0, %xmm3 # xmm3 = xmm0[0],xmm2[0]
> vshufps $17, %xmm2, %xmm0, %xmm0 # xmm0 = xmm0[1,0],xmm2[1,0]
> vshufps $232, %xmm2, %xmm3, %xmm3 # xmm3 = xmm3[0,2],xmm2[2,3]
> vshufps $248, %xmm2, %xmm0, %xmm0 # xmm0 = xmm0[0,2],xmm2[3,3]
> vextractf128 $1, %ymm1, %xmm2
> vinsertps $48, %xmm1, %xmm3, %xmm3 # xmm3 = xmm3[0,1,2],xmm1[0]
> vinsertps $112, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[1]
> vhaddps %xmm2, %xmm1, %xmm1
> vhaddps %xmm2, %xmm2, %xmm2
> vaddps %xmm0, %xmm3, %xmm0
> vpermilps $148, %xmm0, %xmm3 # xmm3 = xmm0[0,1,1,2]
> vinsertps $200, %xmm0, %xmm1, %xmm0 # xmm0 = xmm0[3],xmm1[1,2],zero
> vinsertps $112, %xmm2, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm2[1]
> vinsertf128 $1, %xmm0, %ymm3, %ymm0
> retq
>
> Which has the following characteristics according to llvm-mca:
>
> Iterations: 100
> Instructions: 1600
> Total Cycles: 1007
> Total uOps: 1700
>
> Dispatch Width: 2
> uOps Per Cycle: 1.69
> IPC: 1.59
> Block RThroughput: 8.5
>
> With some help from @RKSimon in interpreting the llvm-mca output, I understand that the increased Block RThroughput is bad for hot loops, while the increase in total cycles is worse for straight-line code.
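> In concrete terms: for straight-line code that is 902/100 ≈ 9.0 cycles per call before versus 1007/100 ≈ 10.1 after, and in a hot loop the best sustainable rate drops from one call every 6.0 cycles to one every 8.5.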
>
> Could you take a look?
It looks like codegen or some other later pass previously recognized this pattern, while the SLP vectorizer does not. In fact, without the SLP vectorizer I'm getting just this:
vperm2f128 $49, %ymm1, %ymm0, %ymm2 # ymm2 = ymm0[2,3],ymm1[2,3]
vinsertf128 $1, %xmm1, %ymm0, %ymm0
vhaddps %ymm2, %ymm0, %ymm0
retq
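(SLP can be disabled for such an experiment with clang's `-fno-slp-vectorize` flag, e.g. `clang -g0 -O3 -march=btver2 -fno-slp-vectorize -S ymm-1undef-add_ps_002.cpp`.)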
I assume SLP will be able to produce something similar (or even better) once we support vectorization of non-power-of-2 vectors. This test hits that case exactly:
  return __builtin_shufflevector(r, a, 0, -1, 2, 3, 4, 5, 6, 7);
The `-1` marks lane 1 as undef, so the optimizer eliminates the `a[2] + a[3]` operation, and SLP does not recognize the remaining 7 additions as a vectorizable group. This is the price we have to pay until non-power-of-2 vectorization lands. I will try to speed that up.
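To illustrate what SLP actually sees (a hand-written sketch of the effect of dead-code elimination, not compiler output; the function name is made up):

  __attribute__((noinline))
  __m256 add_ps_002_after_dce(__m256 a, __m256 b) {
    // Lane 1 of the shuffle mask is -1 (undef), so the a[2] + a[3] add is
    // dead. Only the 7 additions below stay live, and 7 is not a power of
    // two, so SLP cannot vectorize the group as a whole.
    __m256 r = (__m256){ a[0] + a[1], /* undef lane */ 0,
                         a[4] + a[5], a[6] + a[7],
                         b[0] + b[1], b[2] + b[3],
                         b[4] + b[5], b[6] + b[7] };
    return r;
  }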
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D101555/new/
https://reviews.llvm.org/D101555