[PATCH] D101555: [SLP]Improve handling of compensate external uses cost.
Douglas Yung via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed May 26 02:31:25 PDT 2021
dyung added a comment.
Hi, we are noticing a regression in the quality of the code generated by the compiler for btver2 after this change.
Consider the following code (ymm-1undef-add_ps_002.cpp):
#include <x86intrin.h>

__attribute__((noinline))
__m256 add_ps_002(__m256 a, __m256 b) {
  __m256 r = (__m256){ a[0] + a[1], a[2] + a[3], a[4] + a[5], a[6] + a[7],
                       b[0] + b[1], b[2] + b[3], b[4] + b[5], b[6] + b[7] };
  return __builtin_shufflevector(r, a, 0, -1, 2, 3, 4, 5, 6, 7);
}
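For reference, here is a scalar model of what the test case computes, written as a rough sketch rather than anything taken from the patch or the test suite. The names `Vec8`, `add_ps_002_reference` and `lanes_match` are hypothetical. Lanes 0..3 hold pairwise sums of `a`, lanes 4..7 hold pairwise sums of `b`, and lane 1 is selected with shuffle index -1, so its value is unspecified and the compiler is free to put anything there.

```cpp
#include <array>
#include <cstddef>

// Hypothetical 8-lane float vector standing in for __m256.
using Vec8 = std::array<float, 8>;

// Scalar model of add_ps_002: pairwise sums of a in lanes 0..3 and
// pairwise sums of b in lanes 4..7. The original selects lane 1 with
// shuffle index -1 (undefined); this model simply leaves the sum there.
Vec8 add_ps_002_reference(const Vec8 &a, const Vec8 &b) {
  Vec8 r{};
  for (std::size_t i = 0; i < 4; ++i) {
    r[i] = a[2 * i] + a[2 * i + 1];
    r[i + 4] = b[2 * i] + b[2 * i + 1];
  }
  return r;
}

// Compare two results while ignoring lane 1, whose value is unspecified.
bool lanes_match(const Vec8 &got, const Vec8 &want) {
  for (std::size_t i = 0; i < 8; ++i) {
    if (i == 1)
      continue;
    if (got[i] != want[i])
      return false;
  }
  return true;
}
```

Both assembly listings in this comment only need to agree with this model on the seven defined lanes, which is why they can differ so much in shape.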
Prior to this change, when compiled with "-g0 -O3 -march=btver2" the compiler would generate the following assembly:
# %bb.0: # %entry
vhaddps %xmm0, %xmm0, %xmm2
vextractf128 $1, %ymm0, %xmm0
vhaddps %xmm0, %xmm1, %xmm3
vinsertf128 $1, %xmm3, %ymm0, %ymm3
vhaddps %ymm0, %ymm1, %ymm0
vblendps $3, %ymm2, %ymm3, %ymm2 # ymm2 = ymm2[0,1],ymm3[2,3,4,5,6,7]
vshufpd $2, %ymm0, %ymm2, %ymm0 # ymm0 = ymm2[0],ymm0[1],ymm2[2],ymm0[2]
retq
With the following characteristics according to llvm-mca:
Iterations:        100
Instructions:      800
Total Cycles:      902
Total uOps:        1200

Dispatch Width:    2
uOps Per Cycle:    1.33
IPC:               0.89
Block RThroughput: 6.0
But after this change, the compiler is now producing the following assembly for the same code:
# %bb.0: # %entry
vextractf128 $1, %ymm0, %xmm2
vmovlhps %xmm2, %xmm0, %xmm3 # xmm3 = xmm0[0],xmm2[0]
vshufps $17, %xmm2, %xmm0, %xmm0 # xmm0 = xmm0[1,0],xmm2[1,0]
vshufps $232, %xmm2, %xmm3, %xmm3 # xmm3 = xmm3[0,2],xmm2[2,3]
vshufps $248, %xmm2, %xmm0, %xmm0 # xmm0 = xmm0[0,2],xmm2[3,3]
vextractf128 $1, %ymm1, %xmm2
vinsertps $48, %xmm1, %xmm3, %xmm3 # xmm3 = xmm3[0,1,2],xmm1[0]
vinsertps $112, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[1]
vhaddps %xmm2, %xmm1, %xmm1
vhaddps %xmm2, %xmm2, %xmm2
vaddps %xmm0, %xmm3, %xmm0
vpermilps $148, %xmm0, %xmm3 # xmm3 = xmm0[0,1,1,2]
vinsertps $200, %xmm0, %xmm1, %xmm0 # xmm0 = xmm0[3],xmm1[1,2],zero
vinsertps $112, %xmm2, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm2[1]
vinsertf128 $1, %xmm0, %ymm3, %ymm0
retq
Which has the following characteristics according to llvm-mca:
Iterations:        100
Instructions:      1600
Total Cycles:      1007
Total uOps:        1700

Dispatch Width:    2
uOps Per Cycle:    1.69
IPC:               1.59
Block RThroughput: 8.5
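With some help from @RKSimon in understanding the llvm-mca output, my understanding is that the higher Block RThroughput (from 6.0 to 8.5) is bad for hot loops, while the higher total cycle count (from 902 to 1007) is worse for straight-line code. Note that IPC goes up only because the new sequence executes twice as many instructions, so it does not indicate an improvement.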
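The derived figures in the two llvm-mca summaries can be reproduced from the raw counters, as in the rough sketch below. The struct and function names are hypothetical, and the throughput function computes only the dispatch-bound estimate (uOps per iteration divided by dispatch width). llvm-mca's Block RThroughput is also limited by hardware resource pressure, so treating the two as equal is an assumption that happens to hold for both listings here.

```cpp
#include <cassert>

// Raw counters copied from an llvm-mca summary (hypothetical struct).
struct McaSummary {
  int iterations;
  int instructions;
  int total_cycles;
  int total_uops;
  int dispatch_width;
};

// Instructions retired per simulated cycle.
double ipc(const McaSummary &s) {
  assert(s.total_cycles > 0);
  return static_cast<double>(s.instructions) / s.total_cycles;
}

// Micro-ops executed per simulated cycle.
double uops_per_cycle(const McaSummary &s) {
  assert(s.total_cycles > 0);
  return static_cast<double>(s.total_uops) / s.total_cycles;
}

// Average simulated cycles per iteration of the block.
double cycles_per_iteration(const McaSummary &s) {
  assert(s.iterations > 0);
  return static_cast<double>(s.total_cycles) / s.iterations;
}

// Dispatch-bound estimate of the block reciprocal throughput:
// uOps per iteration divided by the dispatch width. This ignores
// resource pressure, which can raise the value llvm-mca reports.
double dispatch_bound_rthroughput(const McaSummary &s) {
  assert(s.iterations > 0 && s.dispatch_width > 0);
  double uops_per_iter = static_cast<double>(s.total_uops) / s.iterations;
  return uops_per_iter / s.dispatch_width;
}

// Counters from the two summaries in this comment.
const McaSummary kBefore{100, 800, 902, 1200, 2};
const McaSummary kAfter{100, 1600, 1007, 1700, 2};
```

With these inputs the dispatch-bound estimate comes out at 6.0 before and 8.5 after, matching the reported Block RThroughput, and the cycles per iteration rise from 9.02 to 10.07.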
Could you take a look?
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D101555/new/
https://reviews.llvm.org/D101555