[PATCH] D101555: [SLP]Improve handling of compensate external uses cost.
Alexey Bataev via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed May 26 06:43:55 PDT 2021
ABataev added a comment.
In D101555#2781369 <https://reviews.llvm.org/D101555#2781369>, @dyung wrote:
> Hi, we are noticing a regression in the quality of the code generated by the compiler for btver2 after this change.
>
> Consider the following code (ymm-1undef-add_ps_002.cpp):
>
> #include <x86intrin.h>
>
> __attribute__((noinline))
> __m256 add_ps_002(__m256 a, __m256 b) {
>   __m256 r = (__m256){ a[0] + a[1], a[2] + a[3], a[4] + a[5], a[6] + a[7],
>                        b[0] + b[1], b[2] + b[3], b[4] + b[5], b[6] + b[7] };
>   return __builtin_shufflevector(r, a, 0, -1, 2, 3, 4, 5, 6, 7);
> }
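>
> The assembly and llvm-mca numbers below can be reproduced with commands
> along these lines (the exact invocation is illustrative; llvm-mca runs
> 100 iterations by default):
>
>   clang -g0 -O3 -march=btver2 -S ymm-1undef-add_ps_002.cpp -o ymm-1undef-add_ps_002.s
>   llvm-mca -mcpu=btver2 ymm-1undef-add_ps_002.s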
>
> Prior to this change, when compiled with "-g0 -O3 -march=btver2" the compiler would generate the following assembly:
>
> # %bb.0: # %entry
> vhaddps %xmm0, %xmm0, %xmm2
> vextractf128 $1, %ymm0, %xmm0
> vhaddps %xmm0, %xmm1, %xmm3
> vinsertf128 $1, %xmm3, %ymm0, %ymm3
> vhaddps %ymm0, %ymm1, %ymm0
> vblendps $3, %ymm2, %ymm3, %ymm2 # ymm2 = ymm2[0,1],ymm3[2,3,4,5,6,7]
> vshufpd $2, %ymm0, %ymm2, %ymm0 # ymm0 = ymm2[0],ymm0[1],ymm2[2],ymm0[2]
> retq
>
> With the following characteristics according to llvm-mca:
>
> Iterations: 100
> Instructions: 800
> Total Cycles: 902
> Total uOps: 1200
>
> Dispatch Width: 2
> uOps Per Cycle: 1.33
> IPC: 0.89
> Block RThroughput: 6.0
>
> But after this change, the compiler is now producing the following assembly for the same code:
>
> # %bb.0: # %entry
> vextractf128 $1, %ymm0, %xmm2
> vmovlhps %xmm2, %xmm0, %xmm3 # xmm3 = xmm0[0],xmm2[0]
> vshufps $17, %xmm2, %xmm0, %xmm0 # xmm0 = xmm0[1,0],xmm2[1,0]
> vshufps $232, %xmm2, %xmm3, %xmm3 # xmm3 = xmm3[0,2],xmm2[2,3]
> vshufps $248, %xmm2, %xmm0, %xmm0 # xmm0 = xmm0[0,2],xmm2[3,3]
> vextractf128 $1, %ymm1, %xmm2
> vinsertps $48, %xmm1, %xmm3, %xmm3 # xmm3 = xmm3[0,1,2],xmm1[0]
> vinsertps $112, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[1]
> vhaddps %xmm2, %xmm1, %xmm1
> vhaddps %xmm2, %xmm2, %xmm2
> vaddps %xmm0, %xmm3, %xmm0
> vpermilps $148, %xmm0, %xmm3 # xmm3 = xmm0[0,1,1,2]
> vinsertps $200, %xmm0, %xmm1, %xmm0 # xmm0 = xmm0[3],xmm1[1,2],zero
> vinsertps $112, %xmm2, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm2[1]
> vinsertf128 $1, %xmm0, %ymm3, %ymm0
> retq
>
> Which has the following characteristics according to llvm-mca:
>
> Iterations: 100
> Instructions: 1600
> Total Cycles: 1007
> Total uOps: 1700
>
> Dispatch Width: 2
> uOps Per Cycle: 1.69
> IPC: 1.59
> Block RThroughput: 8.5
>
> With some help from @RKSimon in interpreting the llvm-mca output, I understand that the increased Block RThroughput is bad for hot loops, while the increase in total cycles is worse for straight-line code.
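> In concrete terms: for straight-line code that is 902/100 ≈ 9.0 cycles per call before versus 1007/100 ≈ 10.1 after, and in a hot loop the best sustainable rate drops from one call every 6.0 cycles to one every 8.5.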
>
> Could you take a look?
It looks like codegen or some other later pass previously recognized this pattern, while the SLP vectorizer does not. In fact, without the SLP vectorizer I'm getting just this:
vperm2f128 $49, %ymm1, %ymm0, %ymm2 # ymm2 = ymm0[2,3],ymm1[2,3]
vinsertf128 $1, %xmm1, %ymm0, %ymm0
vhaddps %ymm2, %ymm0, %ymm0
retq
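(SLP can be disabled for such an experiment with clang's `-fno-slp-vectorize` flag, e.g. `clang -g0 -O3 -march=btver2 -fno-slp-vectorize -S ymm-1undef-add_ps_002.cpp`.)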
I assume SLP will be able to produce something similar (or even better) once we support vectorization of non-power-of-2 vectors. This test hits that case exactly:
  return __builtin_shufflevector(r, a, 0, -1, 2, 3, 4, 5, 6, 7);
The `-1` marks lane 1 as undef, so the optimizer eliminates the `a[2] + a[3]` operation, and SLP does not recognize the remaining 7 additions as a vectorizable group. This is the price we have to pay until non-power-of-2 vectorization lands. I will try to speed that up.
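To illustrate what SLP actually sees (a hand-written sketch of the effect of dead-code elimination, not compiler output; the function name is made up):

  __attribute__((noinline))
  __m256 add_ps_002_after_dce(__m256 a, __m256 b) {
    // Lane 1 of the shuffle mask is -1 (undef), so the a[2] + a[3] add is
    // dead. Only the 7 additions below stay live, and 7 is not a power of
    // two, so SLP cannot vectorize the group as a whole.
    __m256 r = (__m256){ a[0] + a[1], /* undef lane */ 0,
                         a[4] + a[5], a[6] + a[7],
                         b[0] + b[1], b[2] + b[3],
                         b[4] + b[5], b[6] + b[7] };
    return r;
  }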
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D101555/new/
https://reviews.llvm.org/D101555