[PATCH] Improve Cost model for SLPVectorizer when we have a vector division by power of 2
Andrea Di Biagio
andrea.dibiagio at gmail.com
Thu Aug 21 07:16:33 PDT 2014
On Thu, Aug 21, 2014 at 2:44 PM, Karthik Bhat <kv.bhat at samsung.com> wrote:
> Hi Andrea,
> Thanks for the i/p's. Yes for UDIV the cost can be reduced to that of SRL.
> For SDIV though I checked the below code with gcc and clang we seem to get 3 extra instructions (vpsrld,vpaddd,vpsrad) for SDIV. Please correct me if i'm wrong.
> void f(int* restrict a,int* restrict b,int* restrict c) {
> int i;
> for(i=0;i<4;i=i+4) {
> a[i] = (b[i]+c[i])/2;
> a[i+1] = (b[i+1]+c[i+1])/2;
> a[i+2] = (b[i+2]+c[i+2])/2;
> a[i+3] = (b[i+3]+c[i+3])/2;
> }
> }
> compiled with clang -O3 -mavx2 test.c -S -o test.s
> #BB#0: # %entry
> vmovdqu (%rdx), %xmm0
> vpaddd (%rsi), %xmm0, %xmm0
> vpsrld $31, %xmm0, %xmm1
> vpaddd %xmm1, %xmm0, %xmm0
> vpsrad $1, %xmm0, %xmm0
> vmovdqu %xmm0, (%rdi)
> retq
That's only because you are dividing by 2.
If you replace 2 with 4 you should get an extra shift.
What you see there is the result of a DAGCombine transformation that
folds (srl (sra X, Y), 31) -> (srl X, 31). That's the reason why the
"extra" SRA disappears and you get a smaller cost.
If you divide by a power-of-2 different from number 2, then you get
the extra vpsrad (on AVX/AVX2).
Try for example this:
define <4 x i32> @foo(<4 x i32> %A) {
%div = sdiv <4 x i32> %A, <i32 4, i32 4, i32 4, i32 4>
}
You should get:
vpsrad $31, %xmm0, %xmm1
vpsrld $30, %xmm1, %xmm1
vpaddd %xmm1, %xmm0, %xmm0
vpsrad $2, %xmm0, %xmm0
Back to the cost model: unless we know that this is a "special
division by 2", you cannot remove the cost for the extra SRA.
So, conservatively I think we should keep it.
>
> I agree that reusing the existing table is a good option. I will update the patch accordingly.
Cool thanks :)
>
> Yes we can vectorize non constant power of 2 in avx2 but i suppose it will be using vpsrlvd,vpsravd?
> Thanks all for helping me out here.
Yes, I think it should use vpsrlvd/vpsravd. However, because of that
missing combine, we now end up scalarizing the sequence.
Cheers,
Andrea
More information about the llvm-commits
mailing list