[PATCH] Improve Cost model for SLPVectorizer when we have a vector division by power of 2

Thu Aug 21 07:16:33 PDT 2014

On Thu, Aug 21, 2014 at 2:44 PM, Karthik Bhat <kv.bhat at samsung.com> wrote:
> Hi Andrea,
> Thanks for the i/p's. Yes for UDIV the cost can be reduced to that of SRL.
> For SDIV though I checked the below code with gcc and clang we seem to get 3 extra instructions (vpsrld,vpaddd,vpsrad) for SDIV. Please correct me if i'm wrong.
>   void f(int* restrict a,int* restrict b,int* restrict c) {
>     int i;
>     for(i=0;i<4;i=i+4) {
>       a[i] = (b[i]+c[i])/2;
>       a[i+1] = (b[i+1]+c[i+1])/2;
>       a[i+2] = (b[i+2]+c[i+2])/2;
>       a[i+3] = (b[i+3]+c[i+3])/2;
>     }
>   }
> compiled with clang -O3 -mavx2 test.c -S -o test.s
>   #BB#0:                                 # %entry
>   vmovdqu       (%rdx), %xmm0
>   vpaddd        (%rsi), %xmm0, %xmm0
>   vpsrld        $31, %xmm0, %xmm1
>   vpaddd        %xmm1, %xmm0, %xmm0
>   vpsrad        $1, %xmm0, %xmm0
>   vmovdqu       %xmm0, (%rdi)
>   retq

That's only because you are dividing by 2.
If you replace 2 with 4 you should get an extra shift.
What you see there is the result of a DAGCombine transformation that
folds (srl (sra X, Y), 31) -> (srl X, 31). That's the reason why the
"extra" SRA disappears and you get a smaller cost.

If you divide by a power-of-2 different from number 2, then you get
the extra vpsrad (on AVX/AVX2).

Try for example this:
define <4 x i32> @foo(<4 x i32> %A) {
  %div = sdiv <4 x i32> %A, <i32 4, i32 4, i32 4, i32 4>
}

You should get:
  vpsrad $31, %xmm0, %xmm1
  vpsrld $30, %xmm1, %xmm1
  vpaddd %xmm1, %xmm0, %xmm0
  vpsrad $2, %xmm0, %xmm0

Back to the cost model: unless we know that this is a "special
division by 2", you cannot remove the cost for the extra SRA.
So, conservatively I think we should keep it.

>
> I agree that reusing the existing table is a good option. I will update the patch accordingly.

Cool thanks :)

>
> Yes we can vectorize non constant power of 2 in avx2 but i suppose it will be using vpsrlvd,vpsravd?
> Thanks all for helping me out here.

Yes, I think it should use vpsrlvd/vpsravd. However, because of that
missing combine, we now end up scalarizing the sequence.

Cheers,
Andrea