[PATCH] D101555: [SLP]Improve handling of compensate external uses cost.

Wed May 26 02:31:25 PDT 2021

dyung added a comment.

Hi, we are noticing a regression in the quality of the code generated by the compiler for btver2 after this change.

Consider the following code (ymm-1undef-add_ps_002.cpp):

  #include <x86intrin.h>

  __attribute__((noinline))
  __m256 add_ps_002(__m256 a, __m256 b) {
    __m256 r = (__m256){ a[0] + a[1], a[2] + a[3], a[4] + a[5], a[6] + a[7],
                         b[0] + b[1], b[2] + b[3], b[4] + b[5], b[6] + b[7] };
    return __builtin_shufflevector(r, a, 0, -1, 2, 3, 4, 5, 6, 7);
  }
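
For reference, the lane contents this function is expected to produce can be modelled in plain scalar C++. This is only a sketch: `add_ps_002_ref` is a hypothetical helper name, not part of the test case, and it uses `std::array` in place of `__m256` so that it does not depend on AVX. Lane 1 is undefined in the original (shuffle index -1); the model simply leaves the computed sum there, which is one of the permitted results.

```cpp
#include <array>
#include <cstddef>

// Scalar reference model of add_ps_002 (hypothetical helper for illustration).
// Lanes 0-3 hold the pairwise sums of a, lanes 4-7 the pairwise sums of b.
// Lane 1 is undefined in the original test case (shuffle index -1); this
// model keeps a[2] + a[3] there, so callers should not compare that lane.
std::array<float, 8> add_ps_002_ref(const std::array<float, 8>& a,
                                    const std::array<float, 8>& b) {
  std::array<float, 8> r{};
  for (std::size_t i = 0; i < 4; ++i) {
    r[i] = a[2 * i] + a[2 * i + 1];      // horizontal add within a
    r[i + 4] = b[2 * i] + b[2 * i + 1];  // horizontal add within b
  }
  return r;
}
```

Comparing every lane except lane 1 against the vectorized result would be enough to check that both instruction sequences below are functionally equivalent.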

Prior to this change, when compiled with "-g0 -O3 -march=btver2" the compiler would generate the following assembly:

  # %bb.0:                                # %entry
          vhaddps %xmm0, %xmm0, %xmm2
          vextractf128    $1, %ymm0, %xmm0
          vhaddps %xmm0, %xmm1, %xmm3
          vinsertf128     $1, %xmm3, %ymm0, %ymm3
          vhaddps %ymm0, %ymm1, %ymm0
          vblendps        $3, %ymm2, %ymm3, %ymm2         # ymm2 = ymm2[0,1],ymm3[2,3,4,5,6,7]
          vshufpd $2, %ymm0, %ymm2, %ymm0         # ymm0 = ymm2[0],ymm0[1],ymm2[2],ymm0[2]
          retq

llvm-mca reports the following characteristics for this sequence:

  Iterations:        100
  Instructions:      800
  Total Cycles:      902
  Total uOps:        1200

  Dispatch Width:    2
  uOps Per Cycle:    1.33
  IPC:               0.89
  Block RThroughput: 6.0

After this change, the compiler produces the following assembly for the same code:

  # %bb.0:                                # %entry
          vextractf128    $1, %ymm0, %xmm2
          vmovlhps        %xmm2, %xmm0, %xmm3             # xmm3 = xmm0[0],xmm2[0]
          vshufps $17, %xmm2, %xmm0, %xmm0        # xmm0 = xmm0[1,0],xmm2[1,0]
          vshufps $232, %xmm2, %xmm3, %xmm3       # xmm3 = xmm3[0,2],xmm2[2,3]
          vshufps $248, %xmm2, %xmm0, %xmm0       # xmm0 = xmm0[0,2],xmm2[3,3]
          vextractf128    $1, %ymm1, %xmm2
          vinsertps       $48, %xmm1, %xmm3, %xmm3 # xmm3 = xmm3[0,1,2],xmm1[0]
          vinsertps       $112, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[1]
          vhaddps %xmm2, %xmm1, %xmm1
          vhaddps %xmm2, %xmm2, %xmm2
          vaddps  %xmm0, %xmm3, %xmm0
          vpermilps       $148, %xmm0, %xmm3      # xmm3 = xmm0[0,1,1,2]
          vinsertps       $200, %xmm0, %xmm1, %xmm0 # xmm0 = xmm0[3],xmm1[1,2],zero
          vinsertps       $112, %xmm2, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm2[1]
          vinsertf128     $1, %xmm0, %ymm3, %ymm0
          retq

llvm-mca reports the following characteristics for the new sequence:

  Iterations:        100
  Instructions:      1600
  Total Cycles:      1007
  Total uOps:        1700

  Dispatch Width:    2
  uOps Per Cycle:    1.69
  IPC:               1.59
  Block RThroughput: 8.5

With some help from @RKSimon in interpreting the llvm-mca output, I understand that the higher Block RThroughput (6.0 before, 8.5 after) is bad for hot loops, while the higher total cycle count (902 before, 1007 after) is worse for straight-line code.
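
As a sanity check on those numbers, the derived figures in the two reports can be recomputed from the raw counts. The sketch below is my own illustration, not llvm-mca code: the `McaSummary` struct and the function names are hypothetical, and `dispatch_bound_rthroughput` covers only the dispatch-width limit (uOps per iteration divided by dispatch width). llvm-mca also takes resource pressure into account, so this is a lower bound that happens to match the reported Block RThroughput in both cases here.

```cpp
// Hypothetical container for the raw counts printed in an llvm-mca summary.
struct McaSummary {
  double iterations;
  double instructions;
  double total_cycles;
  double total_uops;
  double dispatch_width;
};

// Instructions retired per cycle over the whole simulated run.
double ipc(const McaSummary& s) { return s.instructions / s.total_cycles; }

// Micro-ops per cycle over the whole simulated run.
double uops_per_cycle(const McaSummary& s) {
  return s.total_uops / s.total_cycles;
}

// Dispatch-limited bound on reciprocal throughput, in cycles per iteration.
// Assumption: resource pressure is not modelled, so this is only a lower bound.
double dispatch_bound_rthroughput(const McaSummary& s) {
  return s.total_uops / s.iterations / s.dispatch_width;
}

// Average simulated cycles per iteration of the block.
double cycles_per_iteration(const McaSummary& s) {
  return s.total_cycles / s.iterations;
}
```

With the counts above, `McaSummary{100, 800, 902, 1200, 2}` gives a dispatch bound of 6.0 and 9.02 cycles per iteration, while `McaSummary{100, 1600, 1007, 1700, 2}` gives 8.5 and 10.07. The new sequence has a higher IPC only because it executes twice as many instructions; it is slower on both measures.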

Could you take a look?

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D101555