[LLVMdev] [Vectorization] Mis match in code generated

suyog sarda sardask01 at gmail.com
Thu Sep 18 11:48:03 PDT 2014


Hi,

I am trying to understand LLVM vectorization implementation and was looking
into both loop and SLP vectorization.

test case 1:






*int foo(int *a) {int sum = 0,i;for(i=0; i<16; i++)  sum += a[i];return
sum;}*

This code is vectorized by loop vectorizer where we calculate scalar loop
cost as 4 and vector loop cost as 2.
Since vector loop cost is less and above reduction is legal to vectorize,
loop vectorizer produces vectorized code :

*IR without vectorization:*





















*target datalayout = "e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"target
triple = "x86_64-pc-linux-gnu"; Function Attrs: nounwind readonlydefine i32
@foo(i32* nocapture readonly %a, i32 %n) #0 {entry:  br label
%for.bodyfor.body:                                         ; preds =
%for.body, %entry  %i.05 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
%sum.04 = phi i32 [ 0, %entry ], [ %add, %for.body ]  %arrayidx =
getelementptr inbounds i32* %a, i32 %i.05  %0 = load i32* %arrayidx, align
4, !tbaa !1  %add = add nsw i32 %0, %sum.04  %inc = add nsw i32 %i.05, 1
%exitcond = icmp eq i32 %i.05, 15  br i1 %exitcond, label %for.end, label
%for.bodyfor.end:                                          ; preds =
%for.body  ret i32 %add}*


*IR after vectorization:*











































































*target datalayout = "e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"target
triple = "x86_64-pc-linux-gnu"; Function Attrs: nounwind readonlydefine i32
@foo(i32* nocapture readonly %a, i32 %n) #0 {entry:  %backedge.overflow =
icmp eq i32 15, -1  %overflow.check.anchor = add i32 0, 0  br i1
%backedge.overflow, label %scalar.ph <http://scalar.ph>, label
%overflow.checkedoverflow.checked:                                 ; preds
= %entry  br i1 false, label %middle.block, label %vector.ph
<http://vector.ph>vector.ph
<http://vector.ph>:                                        ; preds =
%overflow.checked  br label
%vector.bodyvector.body:                                      ; preds =
%vector.body, %vector.ph <http://vector.ph>  %index = phi i32 [ 0,
%vector.ph <http://vector.ph> ], [ %index.next, %vector.body ]  %vec.phi =
phi <4 x i32> [ zeroinitializer, %vector.ph <http://vector.ph> ], [ %14,
%vector.body ]  %broadcast.splatinsert = insertelement <4 x i32> undef, i32
%index, i32 0  %broadcast.splat = shufflevector <4 x i32>
%broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
%induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
%0 = extractelement <4 x i32> %induction, i32 0  %1 = getelementptr
inbounds i32* %a, i32 %0  %2 = insertelement <4 x i32*> undef, i32* %1, i32
0  %3 = extractelement <4 x i32> %induction, i32 1  %4 = getelementptr
inbounds i32* %a, i32 %3  %5 = insertelement <4 x i32*> %2, i32* %4, i32 1
%6 = extractelement <4 x i32> %induction, i32 2  %7 = getelementptr
inbounds i32* %a, i32 %6  %8 = insertelement <4 x i32*> %5, i32* %7, i32 2
%9 = extractelement <4 x i32> %induction, i32 3  %10 = getelementptr
inbounds i32* %a, i32 %9  %11 = insertelement <4 x i32*> %8, i32* %10, i32
3  %12 = getelementptr i32* %1, i32 0  %13 = bitcast i32* %12 to <4 x
i32>*  %wide.load = load <4 x i32>* %13, align 4, !tbaa !1  %14 = add nsw
<4 x i32> %wide.load, %vec.phi  %15 = add nsw <4 x i32> %induction, <i32 1,
i32 1, i32 1, i32 1>  %16 = icmp eq <4 x i32> %induction, <i32 15, i32 15,
i32 15, i32 15>  %index.next = add i32 %index, 4  %17 = icmp eq i32
%index.next, 16  br i1 %17, label %middle.block, label %vector.body,
!llvm.loop !5middle.block:                                     ; preds =
%vector.body, %overflow.checked  %resume.val = phi i32 [ 0,
%overflow.checked ], [ 16, %vector.body ]  %trunc.resume.val = phi i32 [ 0,
%overflow.checked ], [ 16, %vector.body ]  %rdx.vec.exit.phi = phi <4 x
i32> [ zeroinitializer, %overflow.checked ], [ %14, %vector.body ]
%rdx.shuf = shufflevector <4 x i32> %rdx.vec.exit.phi, <4 x i32> undef, <4
x i32> <i32 2, i32 3, i32 undef, i32 undef>  %bin.rdx = add <4 x i32>
%rdx.vec.exit.phi, %rdx.shuf  %rdx.shuf1 = shufflevector <4 x i32>
%bin.rdx, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32
undef>  %bin.rdx2 = add <4 x i32> %bin.rdx, %rdx.shuf1  %18 =
extractelement <4 x i32> %bin.rdx2, i32 0  %cmp.n = icmp eq i32 16,
%resume.val  br i1 %cmp.n, label %for.end, label %scalar.ph
<http://scalar.ph>scalar.ph
<http://scalar.ph>:                                        ; preds =
%middle.block, %entry  %bc.resume.val = phi i32 [ %resume.val,
%middle.block ], [ 0, %entry ]  %bc.trunc.resume.val = phi i32 [
%trunc.resume.val, %middle.block ], [ 0, %entry ]  %bc.merge.rdx = phi i32
[ 0, %entry ], [ %18, %middle.block ]  br label
%for.bodyfor.body:                                         ; preds =
%for.body, %scalar.ph <http://scalar.ph>  %i.05 = phi i32 [
%bc.trunc.resume.val, %scalar.ph <http://scalar.ph> ], [ %inc, %for.body ]
%sum.04 = phi i32 [ %bc.merge.rdx, %scalar.ph <http://scalar.ph> ], [ %add,
%for.body ]  %arrayidx = getelementptr inbounds i32* %a, i32 %i.05  %19 =
load i32* %arrayidx, align 4, !tbaa !1  %add = add nsw i32 %19, %sum.04
%inc = add nsw i32 %i.05, 1  %exitcond = icmp eq i32 %i.05, 15  br i1
%exitcond, label %for.end, label %for.body, !llvm.loop
!8for.end:                                          ; preds =
%middle.block, %for.body  %add.lcssa = phi i32 [ %add, %for.body ], [ %18,
%middle.block ]  ret i32 %add.lcssa}*


Now for test case 2, where we unroll the loop totally :





*int foo(int *a) {int sum = 0;sum =
a[0]+a[1]+a[2]+a[3]+a[4]+a[5]+a[6]+a[7]+a[8]+a[9]+a[10]+a[11]+a[12]+a[13]+a[14]+a[15];return
sum;}*

This code is not vectorized by SLP vectorization.

*IR before vectorization: *

test.ll























































*target datalayout = "e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"target
triple = "x86_64-pc-linux-gnu"; Function Attrs: nounwind readonlydefine i32
@foo(i32* nocapture readonly %a, i32 %n) #0 {entry:  %0 = load i32* %a,
align 4, !tbaa !1  %arrayidx1 = getelementptr inbounds i32* %a, i32 1  %1 =
load i32* %arrayidx1, align 4, !tbaa !1  %add = add nsw i32 %1, %0
%arrayidx2 = getelementptr inbounds i32* %a, i32 2  %2 = load i32*
%arrayidx2, align 4, !tbaa !1  %add3 = add nsw i32 %add, %2  %arrayidx4 =
getelementptr inbounds i32* %a, i32 3  %3 = load i32* %arrayidx4, align 4,
!tbaa !1  %add5 = add nsw i32 %add3, %3  %arrayidx6 = getelementptr
inbounds i32* %a, i32 4  %4 = load i32* %arrayidx6, align 4, !tbaa !1
%add7 = add nsw i32 %add5, %4  %arrayidx8 = getelementptr inbounds i32* %a,
i32 5  %5 = load i32* %arrayidx8, align 4, !tbaa !1  %add9 = add nsw i32
%add7, %5  %arrayidx10 = getelementptr inbounds i32* %a, i32 6  %6 = load
i32* %arrayidx10, align 4, !tbaa !1  %add11 = add nsw i32 %add9, %6
%arrayidx12 = getelementptr inbounds i32* %a, i32 7  %7 = load i32*
%arrayidx12, align 4, !tbaa !1  %add13 = add nsw i32 %add11, %7
%arrayidx14 = getelementptr inbounds i32* %a, i32 8  %8 = load i32*
%arrayidx14, align 4, !tbaa !1  %add15 = add nsw i32 %add13, %8
%arrayidx16 = getelementptr inbounds i32* %a, i32 9  %9 = load i32*
%arrayidx16, align 4, !tbaa !1  %add17 = add nsw i32 %add15, %9
%arrayidx18 = getelementptr inbounds i32* %a, i32 10  %10 = load i32*
%arrayidx18, align 4, !tbaa !1  %add19 = add nsw i32 %add17, %10
%arrayidx20 = getelementptr inbounds i32* %a, i32 11  %11 = load i32*
%arrayidx20, align 4, !tbaa !1  %add21 = add nsw i32 %add19, %11
%arrayidx22 = getelementptr inbounds i32* %a, i32 12  %12 = load i32*
%arrayidx22, align 4, !tbaa !1  %add23 = add nsw i32 %add21, %12
%arrayidx24 = getelementptr inbounds i32* %a, i32 13  %13 = load i32*
%arrayidx24, align 4, !tbaa !1  %add25 = add nsw i32 %add23, %13
%arrayidx26 = getelementptr inbounds i32* %a, i32 14  %14 = load i32*
%arrayidx26, align 4, !tbaa !1  %add27 = add nsw i32 %add25, %14
%arrayidx28 = getelementptr inbounds i32* %a, i32 15  %15 = load i32*
%arrayidx28, align 4, !tbaa !1  %add29 = add nsw i32 %add27, %15  ret i32
%add29}*
$ opt -S -slp-vectorizer -slp-vectorize-hor test.ll -debug -o test2.ll

Features:+64bit,+sse2
CPU:generic

Subtarget features: SSELevel 3, 3DNowLevel 0, 64bit 1
SLP: Analyzing blocks in foo.

*test2.ll (IR after SLP vectorization) :*
























































*target datalayout = "e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"target
triple = "x86_64-pc-linux-gnu"; Function Attrs: nounwind readonlydefine i32
@foo(i32* nocapture readonly %a, i32 %n) #0 {entry:  %0 = load i32* %a,
align 4, !tbaa !1  %arrayidx1 = getelementptr inbounds i32* %a, i32 1  %1 =
load i32* %arrayidx1, align 4, !tbaa !1  %add = add nsw i32 %1, %0
%arrayidx2 = getelementptr inbounds i32* %a, i32 2  %2 = load i32*
%arrayidx2, align 4, !tbaa !1  %add3 = add nsw i32 %add, %2  %arrayidx4 =
getelementptr inbounds i32* %a, i32 3  %3 = load i32* %arrayidx4, align 4,
!tbaa !1  %add5 = add nsw i32 %add3, %3  %arrayidx6 = getelementptr
inbounds i32* %a, i32 4  %4 = load i32* %arrayidx6, align 4, !tbaa !1
%add7 = add nsw i32 %add5, %4  %arrayidx8 = getelementptr inbounds i32* %a,
i32 5  %5 = load i32* %arrayidx8, align 4, !tbaa !1  %add9 = add nsw i32
%add7, %5  %arrayidx10 = getelementptr inbounds i32* %a, i32 6  %6 = load
i32* %arrayidx10, align 4, !tbaa !1  %add11 = add nsw i32 %add9, %6
%arrayidx12 = getelementptr inbounds i32* %a, i32 7  %7 = load i32*
%arrayidx12, align 4, !tbaa !1  %add13 = add nsw i32 %add11, %7
%arrayidx14 = getelementptr inbounds i32* %a, i32 8  %8 = load i32*
%arrayidx14, align 4, !tbaa !1  %add15 = add nsw i32 %add13, %8
%arrayidx16 = getelementptr inbounds i32* %a, i32 9  %9 = load i32*
%arrayidx16, align 4, !tbaa !1  %add17 = add nsw i32 %add15, %9
%arrayidx18 = getelementptr inbounds i32* %a, i32 10  %10 = load i32*
%arrayidx18, align 4, !tbaa !1  %add19 = add nsw i32 %add17, %10
%arrayidx20 = getelementptr inbounds i32* %a, i32 11  %11 = load i32*
%arrayidx20, align 4, !tbaa !1  %add21 = add nsw i32 %add19, %11
%arrayidx22 = getelementptr inbounds i32* %a, i32 12  %12 = load i32*
%arrayidx22, align 4, !tbaa !1  %add23 = add nsw i32 %add21, %12
%arrayidx24 = getelementptr inbounds i32* %a, i32 13  %13 = load i32*
%arrayidx24, align 4, !tbaa !1  %add25 = add nsw i32 %add23, %13
%arrayidx26 = getelementptr inbounds i32* %a, i32 14  %14 = load i32*
%arrayidx26, align 4, !tbaa !1  %add27 = add nsw i32 %add25, %14
%arrayidx28 = getelementptr inbounds i32* %a, i32 15  %15 = load i32*
%arrayidx28, align 4, !tbaa !1  %add29 = add nsw i32 %add27, %15  ret i32
%add29}*
Shouldn't the SLP vectorizer vectorize above code? Am I missing out
something?
How does loop unrolling affect the final vectorization of code?

I tried to debug above IR. Since there is no store in above IR, SLP
vectorization doesn't call
vectorizeStoreChains(). In vectorizeChainsInBlock(), as there is no phi
node in above IR, it
also fails to vectorize the IR.

If it is perfectly legal to vectorize above IR, are we lacking code for it
right now? Or its not
possible to vectorize above code (as far as i know, it is perfectly
possible to vectorize above code)?
If its possible, can someone help me/point out specifics how to start for
implementing vectorization for such code?
(We do have function calls isConsecutiveAccess() to identify it. Is it a
case for horizontal reduction, though the term
'reduction' is associated with loops/phi nodes.)

This brings to me another question - if we perform aggressive loop
unrolling, we might loose opportunity for
vectorization as visible above !!

(I was looking into bug 20035, when i encountered above issue. Though both
are not related.)

Your comments/suggestions are most awaited.


-- 
With regards,
Suyog Sarda
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140919/192dd430/attachment.html>


More information about the llvm-dev mailing list