[PATCH] Adjust the cost of vectorized SHL/SRL/SRA
Wei Mi
wmi at google.com
Fri May 22 10:18:25 PDT 2015
Sorry, the formatting of the assembly code was lost when I replied from
Phabricator. Reposting it here:
> #define TYPE char
> #define OP >>
> #define SIZE 1024
> #define TYPE_ALIGN __attribute__((aligned(16)))
>
> TYPE A1[SIZE] TYPE_ALIGN;
> TYPE B1[SIZE] TYPE_ALIGN;
> TYPE C1[SIZE] TYPE_ALIGN;
>
> void kernel1() {
>   for (int i = 0; i < SIZE; ++i) {
>     A1[i] = B1[i] OP C1[i];
>   }
> }
Without the patch, the kernel loop:
.LBB0_1:                                # %for.body
        movsbl  B1+1024(%rax), %edx
        movb    C1+1024(%rax), %cl
        sarl    %cl, %edx
        movb    %dl, A1+1024(%rax)
        incq    %rax
        jne     .LBB0_1
With the patch, the kernel loop:
.LBB0_1:                                # %vector.body
        movd    B1+1024(%rax), %xmm1    # xmm1 = mem[0],zero,zero,zero
        punpcklbw %xmm1, %xmm1          # xmm1 = xmm1[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
        punpcklwd %xmm1, %xmm1          # xmm1 = xmm1[0,0,1,1,2,2,3,3]
        psrad   $24, %xmm1
        movd    C1+1024(%rax), %xmm2    # xmm2 = mem[0],zero,zero,zero
        punpcklbw %xmm2, %xmm2          # xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
        punpcklwd %xmm2, %xmm2          # xmm2 = xmm2[0,0,1,1,2,2,3,3]
        psrad   $24, %xmm2
        psrad   %xmm2, %xmm1
        pand    %xmm0, %xmm1
        packuswb %xmm1, %xmm1
        packuswb %xmm1, %xmm1
        movd    %xmm1, A1+1024(%rax)
        addq    $4, %rax
        jne     .LBB0_1
Although the vectorized version is slightly better, the cost
estimate is not precise: the vectorizer's cost estimation says that
VF==4 (cost 2) is much better than VF==1 (cost 8).
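
As a rough cross-check of the two kernel loops above, here is a minimal
sketch in C that compares static instruction counts per array element
(6 instructions per element for the scalar loop, 15 instructions per 4
elements for the vector loop, both counted from the listings) against the
ratio implied by the reported costs. Instruction count is only a crude
proxy for real cost, since it ignores latency and throughput, and the
helper names are made up for illustration. This is not how the vectorizer
computes its costs.

```c
#include <assert.h>
#include <stdio.h>

/* Static instruction counts read off the two kernel loops above. */
#define SCALAR_INSNS 6   /* movsbl, movb, sarl, movb, incq, jne */
#define SCALAR_ELEMS 1   /* one char processed per iteration */
#define VECTOR_INSNS 15  /* 3 movd, 2 punpcklbw, 2 punpcklwd, 3 psrad,
                            1 pand, 2 packuswb, addq, jne */
#define VECTOR_ELEMS 4   /* four chars processed per iteration */

/* Costs reported by the vectorizer in this thread. */
#define COST_VF1 8.0
#define COST_VF4 2.0

/* Hypothetical helper: instructions executed per array element. */
static double insns_per_element(int insns, int elems) {
    assert(elems > 0);
    return (double)insns / (double)elems;
}

/* Hypothetical helper: how many times cheaper `vector` is than `scalar`. */
static double ratio(double scalar, double vector) {
    assert(vector > 0.0);
    return scalar / vector;
}

/* Print the instruction-count ratio next to the estimated-cost ratio. */
static void print_comparison(void) {
    double scalar = insns_per_element(SCALAR_INSNS, SCALAR_ELEMS);
    double vector = insns_per_element(VECTOR_INSNS, VECTOR_ELEMS);
    printf("insns/element: scalar %.2f, vector %.2f, ratio %.2f\n",
           scalar, vector, ratio(scalar, vector));
    printf("estimated cost ratio (VF==1 / VF==4): %.2f\n",
           ratio(COST_VF1, COST_VF4));
}
```

Calling print_comparison() from a main() shows an instruction-count ratio
of about 1.6 against an estimated-cost ratio of 4, which is consistent
with the estimate overstating the benefit of VF==4.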
Wei.
On Fri, May 22, 2015 at 10:03 AM, Wei Mi <wmi at google.com> wrote:
> In http://reviews.llvm.org/D9923#177012, @aschwaighofer wrote:
>
>> I share Simon's concerns. Please make sure that we still get a good estimate for kernels like these (they are from the rdar mentioned in the commit).
>>
>> #define TYPE char
>> #define OP >>
>> #define SIZE 1024
>> #define TYPE_ALIGN __attribute__((aligned(16)))
>>
>> TYPE A1[SIZE] TYPE_ALIGN;
>> TYPE B1[SIZE] TYPE_ALIGN;
>> TYPE C1[SIZE] TYPE_ALIGN;
>>
>> void kernel1() {
>>   for (int i = 0; i < SIZE; ++i) {
>>     A1[i] = B1[i] OP C1[i];
>>   }
>> }
>>
>>
>> or:
>>
>> for(k=0, r=0; k<pos; k++)
>> r += (MAX_UNSIGNED) 1 << k;
>
>
> Thanks for sharing the testcase. For the first testcase:
>
> Without the patch, the generated code for the kernel loop is:
> .LBB0_1:                                # %for.body
>                                         # =>This Inner Loop Header: Depth=1
>         movsbl  B1+1024(%rax), %edx
>         movb    C1+1024(%rax), %cl
>         sarl    %cl, %edx
>         movb    %dl, A1+1024(%rax)
>         incq    %rax
>         jne     .LBB0_1
>
> With the patch, the generated code for the kernel loop is:
> .LBB0_1:                                # %vector.body
>                                         # =>This Inner Loop Header: Depth=1
>         movd    B1+1024(%rax), %xmm1    # xmm1 = mem[0],zero,zero,zero
>         punpcklbw %xmm1, %xmm1          # xmm1 = xmm1[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
>         punpcklwd %xmm1, %xmm1          # xmm1 = xmm1[0,0,1,1,2,2,3,3]
>         psrad   $24, %xmm1
>         movd    C1+1024(%rax), %xmm2    # xmm2 = mem[0],zero,zero,zero
>         punpcklbw %xmm2, %xmm2          # xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
>         punpcklwd %xmm2, %xmm2          # xmm2 = xmm2[0,0,1,1,2,2,3,3]
>         psrad   $24, %xmm2
>         psrad   %xmm2, %xmm1
>         pand    %xmm0, %xmm1
>         packuswb %xmm1, %xmm1
>         packuswb %xmm1, %xmm1
>         movd    %xmm1, A1+1024(%rax)
>         addq    $4, %rax
>         jne     .LBB0_1
>
> The vectorized version is slightly better than the scalar version, but the cost estimation used to choose VF is not very good: it shows a cost of 8 when VF==1 and a cost of 2 when VF==4. The estimated costs of the vectorized sext and trunc are too low and don't match the real costs.
>
> Another problem is that the vectorizer doesn't know that the char->int type promotion here is unnecessary.
>
> Can you give me the complete version of the second testcase? I am not sure my tweaked version is the right one.
>
>
> REPOSITORY
> rL LLVM
>
> http://reviews.llvm.org/D9923