[PATCH] Adjust the cost of vectorized SHL/SRL/SRA

Fri May 22 10:18:25 PDT 2015

Sorry, the formatting of the assembly code was lost when I replied from
Phabricator. Here it is again:

>   #define TYPE char
>   #define OP >>
>   #define SIZE 1024
>   #define TYPE_ALIGN __attribute__((aligned(16)))
>
>   TYPE A1[SIZE] TYPE_ALIGN;
>   TYPE B1[SIZE] TYPE_ALIGN;
>   TYPE C1[SIZE] TYPE_ALIGN;
>
>   void kernel1() {
>     for (int i = 0; i < SIZE; ++i) {
>       A1[i] = B1[i] OP C1[i];
>     }
>   }

Without the patch, the kernel loop:
.LBB0_1:                                # %for.body
        movsbl  B1+1024(%rax), %edx
        movb    C1+1024(%rax), %cl
        sarl    %cl, %edx
        movb    %dl, A1+1024(%rax)
        incq    %rax
        jne     .LBB0_1

With the patch, the kernel loop:
.LBB0_1:                                # %vector.body
        movd    B1+1024(%rax), %xmm1    # xmm1 = mem[0],zero,zero,zero
        punpcklbw       %xmm1, %xmm1    # xmm1 = xmm1[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
        punpcklwd       %xmm1, %xmm1    # xmm1 = xmm1[0,0,1,1,2,2,3,3]
        psrad   $24, %xmm1
        movd    C1+1024(%rax), %xmm2    # xmm2 = mem[0],zero,zero,zero
        punpcklbw       %xmm2, %xmm2    # xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
        punpcklwd       %xmm2, %xmm2    # xmm2 = xmm2[0,0,1,1,2,2,3,3]
        psrad   $24, %xmm2
        psrad   %xmm2, %xmm1
        pand    %xmm0, %xmm1
        packuswb        %xmm1, %xmm1
        packuswb        %xmm1, %xmm1
        movd    %xmm1, A1+1024(%rax)
        addq    $4, %rax
        jne     .LBB0_1

Although the vectorized version is slightly better, the cost estimate
is not precise: the vectorizer's cost estimation says VF==4 (cost 2)
is much better than VF==1 (cost 8).
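
To put rough numbers on that gap, here is a small sketch in C that
compares the two ratios. The instruction counts are read off the two
listings above (6 per scalar iteration, 15 per vector iteration of 4
elements). The helper names are made up for illustration, and
instruction count is only a crude proxy for run time: it ignores
latency, throughput and port pressure. It also assumes the two reported
costs (8 and 2) are on the same per-element scale.

```c
#include <assert.h>

/* Instructions per loop iteration, counted from the two listings above. */
#define SCALAR_INSNS 6   /* movsbl, movb, sarl, movb, incq, jne */
#define VECTOR_INSNS 15  /* movd through jne in the vector.body loop */
#define SCALAR_VF 1      /* elements handled per scalar iteration */
#define VECTOR_VF 4      /* elements handled per vector iteration */

/* Instructions executed per array element at a given vectorization factor. */
static double insns_per_element(int insns_per_iteration, int vf) {
  assert(vf > 0);
  return (double)insns_per_iteration / vf;
}

/* Improvement suggested by instruction count alone (hypothetical metric). */
static double insn_count_ratio(void) {
  return insns_per_element(SCALAR_INSNS, SCALAR_VF) /
         insns_per_element(VECTOR_INSNS, VECTOR_VF);
}

/* Improvement claimed by the reported costs: 8 at VF==1, 2 at VF==4. */
static double cost_model_ratio(void) {
  return 8.0 / 2.0;
}
```

Calling insn_count_ratio() from a main() gives about 1.6, while
cost_model_ratio() gives 4. By this rough measure the model overstates
the benefit of VF==4 by a wide margin.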

Wei.

On Fri, May 22, 2015 at 10:03 AM, Wei Mi <wmi at google.com> wrote:
> In http://reviews.llvm.org/D9923#177012, @aschwaighofer wrote:
>
>> I share Simon's concerns. Please make sure that we still get a good estimate for kernels like (these are from the rdar mentioned in the commit).
>>
>>   #define TYPE char
>>   #define OP >>
>>   #define SIZE 1024
>>   #define TYPE_ALIGN __attribute__((aligned(16)))
>>
>>   TYPE A1[SIZE] TYPE_ALIGN;
>>   TYPE B1[SIZE] TYPE_ALIGN;
>>   TYPE C1[SIZE] TYPE_ALIGN;
>>
>>   void kernel1() {
>>     for (int i = 0; i < SIZE; ++i) {
>>       A1[i] = B1[i] OP C1[i];
>>     }
>>   }
>>
>>
>> or:
>>
>>   for(k=0, r=0; k<pos; k++)
>>     r += (MAX_UNSIGNED) 1 << k;
>
>
> Thanks for sharing the testcase. For the first testcase:
>
> Without the patch, the generated code for the kernel loop is:
> .LBB0_1:                                # %for.body
>                                         # =>This Inner Loop Header: Depth=1
>         movsbl  B1+1024(%rax), %edx
>         movb    C1+1024(%rax), %cl
>         sarl    %cl, %edx
>         movb    %dl, A1+1024(%rax)
>         incq    %rax
>         jne     .LBB0_1
>
> With the patch, the generated code for the kernel loop is:
> .LBB0_1:                                # %vector.body
>                                         # =>This Inner Loop Header: Depth=1
>         movd    B1+1024(%rax), %xmm1    # xmm1 = mem[0],zero,zero,zero
>         punpcklbw       %xmm1, %xmm1    # xmm1 = xmm1[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
>         punpcklwd       %xmm1, %xmm1    # xmm1 = xmm1[0,0,1,1,2,2,3,3]
>         psrad   $24, %xmm1
>         movd    C1+1024(%rax), %xmm2    # xmm2 = mem[0],zero,zero,zero
>         punpcklbw       %xmm2, %xmm2    # xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
>         punpcklwd       %xmm2, %xmm2    # xmm2 = xmm2[0,0,1,1,2,2,3,3]
>         psrad   $24, %xmm2
>         psrad   %xmm2, %xmm1
>         pand    %xmm0, %xmm1
>         packuswb        %xmm1, %xmm1
>         packuswb        %xmm1, %xmm1
>         movd    %xmm1, A1+1024(%rax)
>         addq    $4, %rax
>         jne     .LBB0_1
>
> The vectorized version is slightly better than the scalar version, but the cost estimation used to choose the VF is not very good: it reports a cost of 8 when VF==1 and a cost of 2 when VF==4. The estimated costs of the vectorized sext and trunc are too low and don't match the real costs.
>
> Another problem is that the vectorizer doesn't know that the char->int type promotion here is unnecessary.
>
> Can you give me the complete version of the second testcase? I am not sure my tweaked version is the right one.
>
>
> REPOSITORY
>   rL LLVM
>
> http://reviews.llvm.org/D9923
>
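
On the remark above that the char->int promotion is unnecessary for
this kernel: the following C sketch checks that claim exhaustively for
an 8-bit signed element and shift counts 0..31. It is only an
illustration, not what the vectorizer does. The function names are made
up, and it assumes that char is signed (as the movsbl in the scalar
loop suggests), that >> on a negative int is an arithmetic shift, and
that converting an out-of-range value to int8_t wraps (both are
implementation-defined in C, but hold on the usual x86 compilers).

```c
#include <assert.h>
#include <stdint.h>

/* What the C source computes: promote to int, shift, truncate to 8 bits. */
static int8_t shift_promoted(int8_t value, unsigned count) {
  return (int8_t)((int)value >> count);
}

/* An arithmetic shift that only looks at the 8-bit pattern.
 * Counts above 7 are clamped to 7, which yields pure sign fill. */
static int8_t shift_narrow(int8_t value, unsigned count) {
  uint8_t bits = (uint8_t)value;
  unsigned n = count > 7 ? 7 : count;
  uint8_t shifted = (uint8_t)(bits >> n);
  if (bits & 0x80u)                          /* negative: fill vacated bits */
    shifted |= (uint8_t)(0xFFu << (8 - n));
  return (int8_t)shifted;
}

/* Number of (value, count) pairs on which the two versions disagree. */
static int count_mismatches(void) {
  int mismatches = 0;
  for (int v = -128; v <= 127; ++v)
    for (unsigned c = 0; c < 32; ++c)
      if (shift_promoted((int8_t)v, c) != shift_narrow((int8_t)v, c))
        ++mismatches;
  return mismatches;
}
```

count_mismatches() returns 0 under those assumptions, so for in-range
shift counts the result can be computed from the 8-bit values alone,
without the sext to 32 bits and the trunc back.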