[LLVMdev] [Vectorization] Mismatch in code generated
James Molloy
james at jamesmolloy.co.uk
Mon Nov 10 07:43:26 PST 2014
Hi Suyog,
Thanks for looking at this. This has recently got itself onto my TODO list
too.
> I am not sure how much all this will improve the code quality for
> horizontal reduction (I don't know how frequently such patterns of
> horizontal reduction from the same array occur in real-world code/SPEC).
Actually, the main loop of 470.lbm can be SLP vectorized like this. It has
three parts: a fully unrolled reduction, a scatter/gather part (not really
vectorizable), and a contiguous-load-scatter-store part which should be
vectorized even when scatter stores are not available in hardware.
Cheers,
James
On 10 November 2014 14:30, Arnold Schwaighofer <aschwaighofer at apple.com>
wrote:
> Hi Suyog,
>
> the horizontal reduction code needs work. It was written long ago and was
> never enabled, so it has bit-rotted somewhat and has lurking bugs.
> For example, it could not handle reduction value nodes that were not
> binary operations.
>
> It could vectorize
>
> (+ (* x y) (* z a) (* b c) (* d e))
>
> but not
>
>
> (+ (load a) (load b) (load c) (load d))
>
>
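> For illustration, hedged C-level equivalents of the two shapes (the
> function and parameter names here are made up):
>
> /* Reduced values are binary operations (fmul): a shape the reduction
>    matcher could handle (with fast-math to allow reassociation). */
> float mul_reduce(float x, float y, float z, float a,
>                  float b, float c, float d, float e) {
>   return x * y + z * a + b * c + d * e;
> }
>
> /* Reduced values are loads: a shape it could not handle. */
> float load_reduce(float *p, float *q, float *r, float *s) {
>   return *p + *q + *r + *s;
> }
>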
> With the attached patches:
> 0001-Test-hor-redux.patch
> 0002-Print-out-tree.patch
>
> We would attempt to vectorize the tree stemming from your example:
>
>
> > float a[4], sum;
> > void foo() {
> > sum = a[0]+a[1]+a[2]+a[3];
> > }
>
> but because we happen to end up with the vector tuple (a[1], a[0], a[2],
> a[3]) in need of gathering, the existing code fails. Note that I used
> fast-math because the horizontal reduction code assumes it can
> reassociate freely.
>
> > clang -O3 test_hor_redux.c -emit-llvm -S -o - -ffast-math -mllvm -slp-vectorize-hor-store -mllvm -debug-only=SLP
>
> ...
>
> Reduced val: %1 = load float* getelementptr inbounds ([4 x float]* @a, i64 0, i64 1), align 4, !tbaa !1
> Reduced val: %0 = load float* getelementptr inbounds ([4 x float]* @a, i64 0, i64 0), align 16, !tbaa !1
> Reduced val: %2 = load float* getelementptr inbounds ([4 x float]* @a, i64 0, i64 2), align 8, !tbaa !1
> Reduced val: %3 = load float* getelementptr inbounds ([4 x float]* @a, i64 0, i64 3), align 4, !tbaa !1
> Reduction ops: %add = fadd fast float %1, %0
> Reduction ops: %add1 = fadd fast float %add, %2
> Reduction ops: %add2 = fadd fast float %add1, %3
> Assertion failed: (VectorizedValue && "Need to have a vectorized tree node"), function emitReduction, file ../lib/Transforms/Vectorize/SLPVectorizer.cpp, line 3518.
>
> The reduction looks like
>
> (((a[1] + a[0]) + a[2]) + a[3])
>
> So we build a tree entry (load a[1], load a[0], load a[2], load a[3]) and
> end up with a single-entry vectorized tree whose entry needs gathering.
>
> The code does not handle this case correctly (in normal SLP mode there is
> no point in "vectorizing" a single-entry tree that needs gathering).
> The patch
> 0003-We-don-t-correctly-vectorize-single-entry-gather-nod.patch fixes this
> by disabling vectorization in such a case, but then we don't vectorize at
> all. As you can see, this needs more work.
>
> * We could reassociate reductions like the one above.
>
> (((a[1] + a[0]) + a[2]) + a[3])
>
> =>
>
> (((a[0] + a[1]) + a[2]) + a[3])
>
> Then the reduction tree builder code would not have to be clever. That
> will only work if your initial reduction values are loads. If the loads
> are hidden behind a deeper tree, that won't help. In such a case, the
> next solution would not be so bad, though.
>
> * We could teach the tree code to do a vector load + permute operation if
> applicable. This will not generate the best code, but for deeper trees the
> overhead of the permute probably does not matter much.
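>
> A minimal IRBuilder sketch of that idea (loadAndPermute is a hypothetical
> helper for illustration, not existing SLPVectorizer code): emit one wide
> load of a[0..3] and shuffle the lanes into the (a[1], a[0], a[2], a[3])
> order the reduction tree expects.
>
> #include "llvm/IR/Constants.h"
> #include "llvm/IR/DerivedTypes.h"
> #include "llvm/IR/IRBuilder.h"
>
> llvm::Value *loadAndPermute(llvm::IRBuilder<> &B, llvm::Value *ScalarBase) {
>   llvm::Type *VecTy = llvm::VectorType::get(B.getFloatTy(), 4);
>   // One wide load covering a[0..3] instead of four scalar loads.
>   llvm::Value *VecPtr = B.CreateBitCast(ScalarBase, VecTy->getPointerTo());
>   llvm::Value *Wide = B.CreateLoad(VecPtr, "wide.load");
>   // The mask {1, 0, 2, 3} permutes the lanes into reduction order.
>   llvm::Constant *MaskElts[] = {B.getInt32(1), B.getInt32(0),
>                                 B.getInt32(2), B.getInt32(3)};
>   return B.CreateShuffleVector(Wide, llvm::UndefValue::get(VecTy),
>                                llvm::ConstantVector::get(MaskElts));
> }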
>
>
> Best,
> Arnold
>
> On 11/09/14, suyog sarda wrote:
> > Hi Arnold, Nadav, Hal,
> > all,
> >
> >
> > Restarting this thread with a few more observations on horizontal
> > reduction stores (after using the flags -slp-vectorize-hor-store and
> > -slp-vectorize-hor; thanks to Arnold for pointing these out earlier).
> >
> >
> > The SLP vectorizer does vectorize straight-line code:
> >
> >
> > (This code may result from loop unrolling, as stated earlier in the
> > thread. Though for loop vectorization the trip count should be at least
> > 16, I am taking smaller code here to present some findings.)
> >
> >
> > float a[4], sum;
> >
> > void foo() {
> >   int i;
> >   for (i = 0; i < 4; i++)
> >     sum += a[i];
> > }
> >
> > after loop unrolling -
> >
> > float a[4], sum;
> > void foo() {
> > sum = a[0]+a[1]+a[2]+a[3];
> > }
> >
> >
> > Whether this code gets vectorized depends on how we organize the code,
> > and subsequently on how the expression tree is formed.
> >
> >
> > case 1:
> >
> > float a[4], sum;
> > void foo() {
> > sum = a[0]+a[1]+a[2]+a[3];
> > }
> >
> >
> > The expression tree for this will be:
> >
> >
> >   a[0]  a[1]
> >     \    /
> >      \  /
> >       +    a[2]
> >        \    /
> >         \  /
> >          +    a[3]
> >           \    /
> >            \  /
> >             +
> >             |
> >             ↓
> >            sum
> >
> >
> > This doesn't vectorize because, in tryToVectorizeList(), we check
> > whether the scalar instructions are of the same type. In the above case
> > they are fadd and load, and hence no vectorization takes place.
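> >
> > A simplified sketch of that same-opcode check (my own rendering for
> > illustration, not the exact LLVM source):
> >
> > #include "llvm/ADT/ArrayRef.h"
> > #include "llvm/IR/Instruction.h"
> >
> > // Every candidate scalar must be an instruction with one shared opcode
> > // before a bundle is built; case 1 mixes fadd and load, so the list is
> > // rejected.
> > static bool allSameOpcode(llvm::ArrayRef<llvm::Value *> VL) {
> >   auto *I0 = llvm::dyn_cast<llvm::Instruction>(VL.front());
> >   if (!I0)
> >     return false;
> >   unsigned Opcode = I0->getOpcode();
> >   for (llvm::Value *V : VL) {
> >     auto *I = llvm::dyn_cast<llvm::Instruction>(V);
> >     if (!I || I->getOpcode() != Opcode)
> >       return false;
> >   }
> >   return true;
> > }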
> >
> >
> >
> > case 2:
> >
> > float a[4], sum;
> > void foo() {
> > sum = (a[0]+a[1])+(a[2]+a[3]);
> > }
> >
> >
> > The expression tree for this will be (note the brackets above):
> >
> >
> >   a[0]  a[1]    a[2]  a[3]
> >     \    /        \    /
> >      \  /          \  /
> >       +             +
> >        \           /
> >         \         /
> >          \       /
> >           \     /
> >             +
> >             |
> >             ↓
> >            sum
> >
> >
> > In this case, in tryToVectorizeList(), both scalar instructions are the
> > same, i.e. fadd, so we proceed with vectorization.
> >
> > But this still doesn't vectorize the code. Later, while trying to
> > schedule the bundle, it checks whether the loads across the two subtrees
> > are consecutive memory accesses from the same base pointer.
> >
> > For the above code, it checks whether (a[0], a[2]) and (a[1], a[3]) are
> > consecutive memory accesses (which, obviously, they are not). This
> > cancels scheduling of the load bundle, so the calculated cost for the
> > non-consecutive loads comes out higher, increasing the overall cost of
> > vectorization (and hence no vectorization happens).
> >
> >
> > It seems this check for consecutive memory accesses was written with the
> > following sort of code in mind:
> >
> > sum = (a[0] + b[0]) + (a[1] + b[1]);
> >
> >
> >
> > Finally,
> >
> > case 3:
> >
> > float a[4], sum;
> > void foo() {
> > sum = (a[0]+a[2])+(a[1]+a[3]);
> > }
> >
> >
> > The expression tree for this will be (note the brackets above):
> >
> >
> >   a[0]  a[2]    a[1]  a[3]
> >     \    /        \    /
> >      \  /          \  /
> >       +             +
> >        \           /
> >         \         /
> >          \       /
> >           \     /
> >             +
> >             |
> >             ↓
> >            sum
> >
> >
> >
> > This code gets vectorized: tryToVectorizeList() sees both scalar
> > operations as fadd, and the memory accesses across the subtrees rooted
> > at the two fadds are consecutive.
> >
> >
> > case 1 IR after SLP vectorization :
> >
> > define void @foo() #0 {
> > entry:
> > %0 = load float* getelementptr inbounds ([8 x float]* @a, i64 0, i64 0), align 16, !tbaa !1
> > %1 = load float* getelementptr inbounds ([8 x float]* @a, i64 0, i64 1), align 4, !tbaa !1
> > %add = fadd float %0, %1
> > %2 = load float* getelementptr inbounds ([8 x float]* @a, i64 0, i64 2), align 8, !tbaa !1
> > %add1 = fadd float %add, %2
> > %3 = load float* getelementptr inbounds ([8 x float]* @a, i64 0, i64 3), align 4, !tbaa !1
> > %add2 = fadd float %add1, %3
> > store float %add2, float* @sum, align 4, !tbaa !1
> > ret void
> > }
> >
> >
> > case 2 IR after SLP vectorization :
> >
> > define void @foo() #0 {
> > entry:
> > %0 = load float* getelementptr inbounds ([8 x float]* @a, i64 0, i64 0), align 16, !tbaa !1
> > %1 = load float* getelementptr inbounds ([8 x float]* @a, i64 0, i64 1), align 4, !tbaa !1
> > %add = fadd float %0, %1
> > %2 = load float* getelementptr inbounds ([8 x float]* @a, i64 0, i64 2), align 8, !tbaa !1
> > %3 = load float* getelementptr inbounds ([8 x float]* @a, i64 0, i64 3), align 4, !tbaa !1
> > %add1 = fadd float %2, %3
> > %add2 = fadd float %add, %add1
> > store float %add2, float* @sum, align 4, !tbaa !1
> > ret void
> > }
> >
> >
> > case 3 IR after SLP vectorization :
> >
> >
> > define void @foo() #0 {
> > entry:
> > %0 = load <2 x float>* bitcast ([4 x float]* @a to <2 x float>*), align 16, !tbaa !1
> > %1 = load <2 x float>* bitcast (float* getelementptr inbounds ([4 x float]* @a, i64 0, i64 2) to <2 x float>*), align 8, !tbaa !1
> > %2 = fadd <2 x float> %0, %1
> > %3 = extractelement <2 x float> %2, i32 0
> > %4 = extractelement <2 x float> %2, i32 1
> > %add2 = fadd float %3, %4
> > store float %add2, float* @sum, align 4, !tbaa !1
> > ret void
> > }
> >
> >
> >
> > A few possible improvements, as visible from the above (noted down
> > roughly; I haven't given them detailed thought yet):
> >
> > 1. The IR expression tree can be re-organized to cater to the above need
> > (some kind of balancing, as in an AVL tree). Alternatively, parse the
> > expression tree without changing it, and walk up the tree whenever we
> > encounter a load instruction.
> >
> > 2. There seems to be room for improvement in checking whether memory
> > accesses are consecutive. The check needs to take into consideration the
> > loads of all the subtrees: if all the loads are from the same base
> > pointer, check whether the accesses are consecutive both within each
> > subtree and across the subtrees, as sketched below. (I will try to write
> > some code for this; any suggestions are welcome.)
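> >
> > A rough sketch of that idea (LoadInfo and loadsAreConsecutive are
> > assumed names for illustration, not existing LLVM code):
> >
> > #include "llvm/ADT/SmallVector.h"
> > #include "llvm/IR/Instructions.h"
> > #include <algorithm>
> >
> > // A load feeding the reduction, with its constant element offset from
> > // the common base pointer.
> > struct LoadInfo {
> >   llvm::LoadInst *LI;
> >   int64_t Offset;
> > };
> >
> > // After collecting the loads from *all* subtrees, sort them by offset
> > // and require offsets 0, 1, 2, ... from the smallest one, so a single
> > // wide load can feed the whole reduction; any gap means gathering.
> > static bool loadsAreConsecutive(llvm::SmallVectorImpl<LoadInfo> &Loads) {
> >   std::sort(Loads.begin(), Loads.end(),
> >             [](const LoadInfo &A, const LoadInfo &B) {
> >               return A.Offset < B.Offset;
> >             });
> >   for (size_t i = 0, e = Loads.size(); i != e; ++i)
> >     if (Loads[i].Offset != Loads[0].Offset + (int64_t)i)
> >       return false;
> >   return true;
> > }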
> >
> >
> >
> > I am not sure how much all this will improve the code quality for
> > horizontal reductions (I don't know how frequently such patterns of
> > horizontal reduction from the same array occur in real-world code/SPEC).
> >
> > Perhaps the best case I can think of is when loops are unrolled, as
> > noted earlier in the thread.
> >
> >
> > Adding a few more people to pitch in!
> >
> >
> > Comments/suggestions are most welcome.
> >
> >
> > Regards,
> >
> > Suyog
> >
> > On Sat, Sep 20, 2014 at 12:05 AM, suyog sarda <sardask01 at gmail.com> wrote:
> >
> > > Hi Arnold,
> > >
> > >
> > > Thanks for your reply.
> > >
> > >
> > > I tried test case as suggested by you.
> > >
> > > void foo(int *a, int *sum) {
> > > *sum = a[0]+a[1]+a[2]+a[3]+a[4]+a[5]+a[6]+a[7]+a[8]+a[9]+a[10]+a[11]+a[12]+a[13]+a[14]+a[15];
> > > }
> > >
> > >
> > > so that it has a 'store' in its IR.
> > >
> > >
> > > IR before vectorization :
> > >
> > > target datalayout = "e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"
> > > target triple = "x86_64-pc-linux-gnu"
> > >
> > > ; Function Attrs: nounwind
> > > define void @foo(i32* nocapture readonly %a, i32* nocapture %sum) #0 {
> > > entry:
> > > %0 = load i32* %a, align 4, !tbaa !1
> > > %arrayidx1 = getelementptr inbounds i32* %a, i32 1
> > > %1 = load i32* %arrayidx1, align 4, !tbaa !1
> > > %add = add nsw i32 %1, %0
> > > %arrayidx2 = getelementptr inbounds i32* %a, i32 2
> > > %2 = load i32* %arrayidx2, align 4, !tbaa !1
> > > %add3 = add nsw i32 %add, %2
> > > %arrayidx4 = getelementptr inbounds i32* %a, i32 3
> > > %3 = load i32* %arrayidx4, align 4, !tbaa !1
> > > %add5 = add nsw i32 %add3, %3
> > > %arrayidx6 = getelementptr inbounds i32* %a, i32 4
> > > %4 = load i32* %arrayidx6, align 4, !tbaa !1
> > > %add7 = add nsw i32 %add5, %4
> > > %arrayidx8 = getelementptr inbounds i32* %a, i32 5
> > > %5 = load i32* %arrayidx8, align 4, !tbaa !1
> > > %add9 = add nsw i32 %add7, %5
> > > %arrayidx10 = getelementptr inbounds i32* %a, i32 6
> > > %6 = load i32* %arrayidx10, align 4, !tbaa !1
> > > %add11 = add nsw i32 %add9, %6
> > > %arrayidx12 = getelementptr inbounds i32* %a, i32 7
> > > %7 = load i32* %arrayidx12, align 4, !tbaa !1
> > > %add13 = add nsw i32 %add11, %7
> > > %arrayidx14 = getelementptr inbounds i32* %a, i32 8
> > > %8 = load i32* %arrayidx14, align 4, !tbaa !1
> > > %add15 = add nsw i32 %add13, %8
> > > %arrayidx16 = getelementptr inbounds i32* %a, i32 9
> > > %9 = load i32* %arrayidx16, align 4, !tbaa !1
> > > %add17 = add nsw i32 %add15, %9
> > > %arrayidx18 = getelementptr inbounds i32* %a, i32 10
> > > %10 = load i32* %arrayidx18, align 4, !tbaa !1
> > > %add19 = add nsw i32 %add17, %10
> > > %arrayidx20 = getelementptr inbounds i32* %a, i32 11
> > > %11 = load i32* %arrayidx20, align 4, !tbaa !1
> > > %add21 = add nsw i32 %add19, %11
> > > %arrayidx22 = getelementptr inbounds i32* %a, i32 12
> > > %12 = load i32* %arrayidx22, align 4, !tbaa !1
> > > %add23 = add nsw i32 %add21, %12
> > > %arrayidx24 = getelementptr inbounds i32* %a, i32 13
> > > %13 = load i32* %arrayidx24, align 4, !tbaa !1
> > > %add25 = add nsw i32 %add23, %13
> > > %arrayidx26 = getelementptr inbounds i32* %a, i32 14
> > > %14 = load i32* %arrayidx26, align 4, !tbaa !1
> > > %add27 = add nsw i32 %add25, %14
> > > %arrayidx28 = getelementptr inbounds i32* %a, i32 15
> > > %15 = load i32* %arrayidx28, align 4, !tbaa !1
> > > %add29 = add nsw i32 %add27, %15
> > >
> > >
> > > store i32 %add29, i32* %sum, align 4, !tbaa !1
> > > ret void
> > > }
> > >
> > >
> > >
> > > IR after SLP vectorization with appropriate flags :
> > >
> > > $ opt -S -slp-vectorizer -slp-vectorize-hor=1 -slp-vectorize-hor-store=1 test.ll -debug
> > >
> > >
> > > (I hope I am passing the args correctly to opt.)
> > >
> > > Subtarget features: SSELevel 3, 3DNowLevel 0, 64bit 1
> > > SLP: Analyzing blocks in foo.
> > > SLP: Found 1 stores to vectorize.
> > > SLP: Vectorizing a list of length = 2.
> > > ; ModuleID = 'test.ll'
> > > target datalayout = "e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"
> > > target triple = "x86_64-pc-linux-gnu"
> > >
> > > ; Function Attrs: nounwind
> > > define void @foo(i32* nocapture readonly %a, i32* nocapture %sum) #0 {
> > > entry:
> > > %0 = load i32* %a, align 4, !tbaa !1
> > > %arrayidx1 = getelementptr inbounds i32* %a, i32 1
> > > %1 = load i32* %arrayidx1, align 4, !tbaa !1
> > > %add = add nsw i32 %1, %0
> > > %arrayidx2 = getelementptr inbounds i32* %a, i32 2
> > > %2 = load i32* %arrayidx2, align 4, !tbaa !1
> > > %add3 = add nsw i32 %add, %2
> > > %arrayidx4 = getelementptr inbounds i32* %a, i32 3
> > > %3 = load i32* %arrayidx4, align 4, !tbaa !1
> > > %add5 = add nsw i32 %add3, %3
> > > %arrayidx6 = getelementptr inbounds i32* %a, i32 4
> > > %4 = load i32* %arrayidx6, align 4, !tbaa !1
> > > %add7 = add nsw i32 %add5, %4
> > > %arrayidx8 = getelementptr inbounds i32* %a, i32 5
> > > %5 = load i32* %arrayidx8, align 4, !tbaa !1
> > > %add9 = add nsw i32 %add7, %5
> > > %arrayidx10 = getelementptr inbounds i32* %a, i32 6
> > > %6 = load i32* %arrayidx10, align 4, !tbaa !1
> > > %add11 = add nsw i32 %add9, %6
> > > %arrayidx12 = getelementptr inbounds i32* %a, i32 7
> > > %7 = load i32* %arrayidx12, align 4, !tbaa !1
> > > %add13 = add nsw i32 %add11, %7
> > > %arrayidx14 = getelementptr inbounds i32* %a, i32 8
> > > %8 = load i32* %arrayidx14, align 4, !tbaa !1
> > > %add15 = add nsw i32 %add13, %8
> > > %arrayidx16 = getelementptr inbounds i32* %a, i32 9
> > > %9 = load i32* %arrayidx16, align 4, !tbaa !1
> > > %add17 = add nsw i32 %add15, %9
> > > %arrayidx18 = getelementptr inbounds i32* %a, i32 10
> > > %10 = load i32* %arrayidx18, align 4, !tbaa !1
> > > %add19 = add nsw i32 %add17, %10
> > > %arrayidx20 = getelementptr inbounds i32* %a, i32 11
> > > %11 = load i32* %arrayidx20, align 4, !tbaa !1
> > > %add21 = add nsw i32 %add19, %11
> > > %arrayidx22 = getelementptr inbounds i32* %a, i32 12
> > > %12 = load i32* %arrayidx22, align 4, !tbaa !1
> > > %add23 = add nsw i32 %add21, %12
> > > %arrayidx24 = getelementptr inbounds i32* %a, i32 13
> > > %13 = load i32* %arrayidx24, align 4, !tbaa !1
> > > %add25 = add nsw i32 %add23, %13
> > > %arrayidx26 = getelementptr inbounds i32* %a, i32 14
> > > %14 = load i32* %arrayidx26, align 4, !tbaa !1
> > > %add27 = add nsw i32 %add25, %14
> > > %arrayidx28 = getelementptr inbounds i32* %a, i32 15
> > > %15 = load i32* %arrayidx28, align 4, !tbaa !1
> > > %add29 = add nsw i32 %add27, %15
> > >
> > >
> > > store i32 %add29, i32* %sum, align 4, !tbaa !1
> > > ret void
> > > }
> > >
> > >
> > > As observed above, SLP vectorization doesn't vectorize this code.
> > >
> > >
> > > I tried debugging; just mentioning what I found -
> > > the code hits vectorizeChainsInBlock() ->
> > > if (ShouldStartVectorizeHorAtStore) -> tryToVectorize() ->
> > > tryToVectorizePair() -> tryToVectorizeList(). Inside this function, it
> > > iterates in the for loop checking whether the scalar instructions are
> > > of the same type. It exits on the 2nd iteration of the loop as the
> > > opcodes do not match: one opcode is 'add' and the second is 'load'. So
> > > it exits from tryToVectorizeList() and comes back to tryToVectorize(),
> > > but fails to satisfy any of the further 'if's in that function and
> > > returns without vectorizing any of the code.
> > >
> > >
> > > I will try to dig more into it and try to understand where the code
> can be implemented.
> > >
> > >
> > > Comments/suggestions/explanations are always welcome!
> > >
> > > On Fri, Sep 19, 2014 at 1:49 AM, Arnold <aschwaighofer at apple.com> wrote:
> > >
> > > > There is experimental code to handle horizontal reductions hidden
> > > > behind the flag "slp-vectorize-hor"; you could try to enable that. It
> > > > won't currently work for your example though, because we only start
> > > > to look for reductions at phis and stores ("slp-vectorize-hor-store").
> > > >
> > > >
> > > > You could start off by rewriting your example with a store and see
> whether the tree match works after using both flags.
> > > >
> > > > F(int *sum ...) {
> > > >   *sum = a[0]+a[1]+a[2]+a[3]+a[4]+a[5]+a[6]+a[7]+a[8]+a[9]+a[10]+a[11]+a[12]+a[13]+a[14]+a[15];
> > > > }
> > > >
> > > >
> > > > If it works we would only need to look for horizontal reductions at
> returns.
> > > > Sent from my iPhone
> > > >
> > > > On Sep 18, 2014, at 12:24 PM, suyog sarda <sardask01 at gmail.com> wrote:
> > > >
> > > >
> > > >
> > > > > Hi Nadav,
> > > > >
> > > > >
> > > > > Thanks for the quick reply !!
> > > > >
> > > > >
> > > > > Ok, so as of now we are lacking the capability to handle flat,
> > > > > large reductions.
> > > > >
> > > > >
> > > > > I did go through vectorizeChainsInBlock() (line number 2862). In
> > > > > this function, we try to vectorize if we have phi nodes in the IR
> > > > > (several if's check for phi nodes), i.e. we try to construct trees
> > > > > that start at chains.
> > > > >
> > > > >
> > > > > Any pointers on how to join multiple trees? I will also try to dig
> more into it.
> > > > >
> > > > >
> > > > > Comments/suggestions are most welcome!
> > > > >
> > > > >
> > > > > On Fri, Sep 19, 2014 at 12:23 AM, Nadav Rotem <nrotem at apple.com> wrote:
> > > > >
> > > > > >
> > > > > >
> > > > > > > On Sep 18, 2014, at 11:48 AM, suyog sarda <sardask01 at gmail.com> wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > >
> > > > > > > I am trying to understand LLVM's vectorization implementation
> > > > > > > and was looking into both loop and SLP vectorization.
> > > > > > >
> > > > > > >
> > > > > > > test case 1:
> > > > > > >
> > > > > > >
> > > > > > > int foo(int *a) {
> > > > > > > int sum = 0,i;
> > > > > > > for(i=0; i<16; i++)
> > > > > > > sum += a[i];
> > > > > > > return sum;
> > > > > > > }
> > > > > > >
> > > > > > >
> > > > > > > This code is vectorized by the loop vectorizer, where we
> > > > > > > calculate the scalar loop cost as 4 and the vector loop cost as 2.
> > > > > > >
> > > > > > > Since the vector loop cost is lower and the above reduction is
> > > > > > > legal to vectorize, the loop vectorizer produces vectorized code:
> > > > > > >
> > > > > > >
> > > > > > > IR without vectorization:
> > > > > > >
> > > > > > > target datalayout = "e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"
> > > > > > > target triple = "x86_64-pc-linux-gnu"
> > > > > > >
> > > > > > > ; Function Attrs: nounwind readonly
> > > > > > > define i32 @foo(i32* nocapture readonly %a, i32 %n) #0 {
> > > > > > > entry:
> > > > > > > br label %for.body
> > > > > > >
> > > > > > > for.body:                                         ; preds = %for.body, %entry
> > > > > > > %i.05 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
> > > > > > > %sum.04 = phi i32 [ 0, %entry ], [ %add, %for.body ]
> > > > > > > %arrayidx = getelementptr inbounds i32* %a, i32 %i.05
> > > > > > > %0 = load i32* %arrayidx, align 4, !tbaa !1
> > > > > > > %add = add nsw i32 %0, %sum.04
> > > > > > > %inc = add nsw i32 %i.05, 1
> > > > > > > %exitcond = icmp eq i32 %i.05, 15
> > > > > > > br i1 %exitcond, label %for.end, label %for.body
> > > > > > >
> > > > > > > for.end:                                          ; preds = %for.body
> > > > > > > ret i32 %add
> > > > > > > }
> > > > > > >
> > > > > > >
> > > > > > > IR after vectorization:
> > > > > > >
> > > > > > > target datalayout = "e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"
> > > > > > > target triple = "x86_64-pc-linux-gnu"
> > > > > > >
> > > > > > > ; Function Attrs: nounwind readonly
> > > > > > > define i32 @foo(i32* nocapture readonly %a, i32 %n) #0 {
> > > > > > > entry:
> > > > > > >   %backedge.overflow = icmp eq i32 15, -1
> > > > > > >   %overflow.check.anchor = add i32 0, 0
> > > > > > >   br i1 %backedge.overflow, label %scalar.ph, label %overflow.checked
> > > > > > >
> > > > > > > overflow.checked:                                 ; preds = %entry
> > > > > > >   br i1 false, label %middle.block, label %vector.ph
> > > > > > >
> > > > > > > vector.ph:                                        ; preds = %overflow.checked
> > > > > > >   br label %vector.body
> > > > > > >
> > > > > > > vector.body:                                      ; preds = %vector.body, %vector.ph
> > > > > > >   %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
> > > > > > >   %vec.phi = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ %14, %vector.body ]
> > > > > > >   %broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
> > > > > > >   %broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
> > > > > > >   %induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
> > > > > > >   %0 = extractelement <4 x i32> %induction, i32 0
> > > > > > >   %1 = getelementptr inbounds i32* %a, i32 %0
> > > > > > >   %2 = insertelement <4 x i32*> undef, i32* %1, i32 0
> > > > > > >   %3 = extractelement <4 x i32> %induction, i32 1
> > > > > > >   %4 = getelementptr inbounds i32* %a, i32 %3
> > > > > > >   %5 = insertelement <4 x i32*> %2, i32* %4, i32 1
> > > > > > >   %6 = extractelement <4 x i32> %induction, i32 2
> > > > > > >   %7 = getelementptr inbounds i32* %a, i32 %6
> > > > > > >   %8 = insertelement <4 x i32*> %5, i32* %7, i32 2
> > > > > > >   %9 = extractelement <4 x i32> %induction, i32 3
> > > > > > >   %10 = getelementptr inbounds i32* %a, i32 %9
> > > > > > >   %11 = insertelement <4 x i32*> %8, i32* %10, i32 3
> > > > > > >   %12 = getelementptr i32* %1, i32 0
> > > > > > >   %13 = bitcast i32* %12 to <4 x i32>*
> > > > > > >   %wide.load = load <4 x i32>* %13, align 4, !tbaa !1
> > > > > > >   %14 = add nsw <4 x i32> %wide.load, %vec.phi
> > > > > > >   %15 = add nsw <4 x i32> %induction, <i32 1, i32 1, i32 1, i32 1>
> > > > > > >   %16 = icmp eq <4 x i32> %induction, <i32 15, i32 15, i32 15, i32 15>
> > > > > > >   %index.next = add i32 %index, 4
> > > > > > >   %17 = icmp eq i32 %index.next, 16
> > > > > > >   br i1 %17, label %middle.block, label %vector.body, !llvm.loop !5
> > > > > > >
> > > > > > > middle.block:                                     ; preds = %vector.body, %overflow.checked
> > > > > > >   %resume.val = phi i32 [ 0, %overflow.checked ], [ 16, %vector.body ]
> > > > > > >   %trunc.resume.val = phi i32 [ 0, %overflow.checked ], [ 16, %vector.body ]
> > > > > > >   %rdx.vec.exit.phi = phi <4 x i32> [ zeroinitializer, %overflow.checked ], [ %14, %vector.body ]
> > > > > > >   %rdx.shuf = shufflevector <4 x i32> %rdx.vec.exit.phi, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
> > > > > > >   %bin.rdx = add <4 x i32> %rdx.vec.exit.phi, %rdx.shuf
> > > > > > >   %rdx.shuf1 = shufflevector <4 x i32> %bin.rdx, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
> > > > > > >   %bin.rdx2 = add <4 x i32> %bin.rdx, %rdx.shuf1
> > > > > > >   %18 = extractelement <4 x i32> %bin.rdx2, i32 0
> > > > > > >   %cmp.n = icmp eq i32 16, %resume.val
> > > > > > >   br i1 %cmp.n, label %for.end, label %scalar.ph
> > > > > > >
> > > > > > > scalar.ph:                                        ; preds = %middle.block, %entry
> > > > > > >   %bc.resume.val = phi i32 [ %resume.val, %middle.block ], [ 0, %entry ]
> > > > > > >   %bc.trunc.resume.val = phi i32 [ %trunc.resume.val, %middle.block ], [ 0, %entry ]
> > > > > > >   %bc.merge.rdx = phi i32 [ 0, %entry ], [ %18, %middle.block ]
> > > > > > >   br label %for.body
> > > > > > >
> > > > > > > for.body:                                         ; preds = %for.body, %scalar.ph
> > > > > > >   %i.05 = phi i32 [ %bc.trunc.resume.val, %scalar.ph ], [ %inc, %for.body ]
> > > > > > >   %sum.04 = phi i32 [ %bc.merge.rdx, %scalar.ph ], [ %add, %for.body ]
> > > > > > >   %arrayidx = getelementptr inbounds i32* %a, i32 %i.05
> > > > > > >   %19 = load i32* %arrayidx, align 4, !tbaa !1
> > > > > > >   %add = add nsw i32 %19, %sum.04
> > > > > > >   %inc = add nsw i32 %i.05, 1
> > > > > > >   %exitcond = icmp eq i32 %i.05, 15
> > > > > > >   br i1 %exitcond, label %for.end, label %for.body, !llvm.loop !8
> > > > > > >
> > > > > > > for.end:                                          ; preds = %middle.block, %for.body
> > > > > > >   %add.lcssa = phi i32 [ %add, %for.body ], [ %18, %middle.block ]
> > > > > > >   ret i32 %add.lcssa
> > > > > > > }
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Now for test case 2, where we unroll the loop fully:
> > > > > > >
> > > > > > > int foo(int *a) {
> > > > > > > int sum = 0;
> > > > > > > sum = a[0]+a[1]+a[2]+a[3]+a[4]+a[5]+a[6]+a[7]+a[8]+a[9]+a[10]+a[11]+a[12]+a[13]+a[14]+a[15];
> > > > > > > return sum;
> > > > > > > }
> > > > > > >
> > > > > > >
> > > > > > > This code is not vectorized by SLP vectorization.
> > > > > > >
> > > > > > >
> > > > > > > IR before vectorization:
> > > > > > >
> > > > > > >
> > > > > > > test.ll
> > > > > > >
> > > > > > >
> > > > > > > target datalayout = "e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"
> > > > > > > target triple = "x86_64-pc-linux-gnu"
> > > > > > >
> > > > > > > ; Function Attrs: nounwind readonly
> > > > > > > define i32 @foo(i32* nocapture readonly %a, i32 %n) #0 {
> > > > > > > entry:
> > > > > > > %0 = load i32* %a, align 4, !tbaa !1
> > > > > > > %arrayidx1 = getelementptr inbounds i32* %a, i32 1
> > > > > > > %1 = load i32* %arrayidx1, align 4, !tbaa !1
> > > > > > > %add = add nsw i32 %1, %0
> > > > > > > %arrayidx2 = getelementptr inbounds i32* %a, i32 2
> > > > > > > %2 = load i32* %arrayidx2, align 4, !tbaa !1
> > > > > > > %add3 = add nsw i32 %add, %2
> > > > > > > %arrayidx4 = getelementptr inbounds i32* %a, i32 3
> > > > > > > %3 = load i32* %arrayidx4, align 4, !tbaa !1
> > > > > > > %add5 = add nsw i32 %add3, %3
> > > > > > > %arrayidx6 = getelementptr inbounds i32* %a, i32 4
> > > > > > > %4 = load i32* %arrayidx6, align 4, !tbaa !1
> > > > > > > %add7 = add nsw i32 %add5, %4
> > > > > > > %arrayidx8 = getelementptr inbounds i32* %a, i32 5
> > > > > > > %5 = load i32* %arrayidx8, align 4, !tbaa !1
> > > > > > > %add9 = add nsw i32 %add7, %5
> > > > > > > %arrayidx10 = getelementptr inbounds i32* %a, i32 6
> > > > > > > %6 = load i32* %arrayidx10, align 4, !tbaa !1
> > > > > > > %add11 = add nsw i32 %add9, %6
> > > > > > > %arrayidx12 = getelementptr inbounds i32* %a, i32 7
> > > > > > > %7 = load i32* %arrayidx12, align 4, !tbaa !1
> > > > > > > %add13 = add nsw i32 %add11, %7
> > > > > > > %arrayidx14 = getelementptr inbounds i32* %a, i32 8
> > > > > > > %8 = load i32* %arrayidx14, align 4, !tbaa !1
> > > > > > > %add15 = add nsw i32 %add13, %8
> > > > > > > %arrayidx16 = getelementptr inbounds i32* %a, i32 9
> > > > > > > %9 = load i32* %arrayidx16, align 4, !tbaa !1
> > > > > > > %add17 = add nsw i32 %add15, %9
> > > > > > > %arrayidx18 = getelementptr inbounds i32* %a, i32 10
> > > > > > > %10 = load i32* %arrayidx18, align 4, !tbaa !1
> > > > > > > %add19 = add nsw i32 %add17, %10
> > > > > > > %arrayidx20 = getelementptr inbounds i32* %a, i32 11
> > > > > > > %11 = load i32* %arrayidx20, align 4, !tbaa !1
> > > > > > > %add21 = add nsw i32 %add19, %11
> > > > > > > %arrayidx22 = getelementptr inbounds i32* %a, i32 12
> > > > > > > %12 = load i32* %arrayidx22, align 4, !tbaa !1
> > > > > > > %add23 = add nsw i32 %add21, %12
> > > > > > > %arrayidx24 = getelementptr inbounds i32* %a, i32 13
> > > > > > > %13 = load i32* %arrayidx24, align 4, !tbaa !1
> > > > > > > %add25 = add nsw i32 %add23, %13
> > > > > > > %arrayidx26 = getelementptr inbounds i32* %a, i32 14
> > > > > > > %14 = load i32* %arrayidx26, align 4, !tbaa !1
> > > > > > > %add27 = add nsw i32 %add25, %14
> > > > > > > %arrayidx28 = getelementptr inbounds i32* %a, i32 15
> > > > > > > %15 = load i32* %arrayidx28, align 4, !tbaa !1
> > > > > > > %add29 = add nsw i32 %add27, %15
> > > > > > > ret i32 %add29
> > > > > > > }
> > > > > > >
> > > > > > > $ opt -S -slp-vectorizer -slp-vectorize-hor test.ll -debug -o test2.ll
> > > > > > >
> > > > > > > Features:+64bit,+sse2
> > > > > > > CPU:generic
> > > > > > >
> > > > > > > Subtarget features: SSELevel 3, 3DNowLevel 0, 64bit 1
> > > > > > > SLP: Analyzing blocks in foo.
> > > > > > >
> > > > > > >
> > > > > > > test2.ll (IR after SLP vectorization) :
> > > > > > >
> > > > > > > target datalayout = "e-m:e-p:32:32-f64:32:64-f80:32-n8:16:32-S128"
> > > > > > > target triple = "x86_64-pc-linux-gnu"
> > > > > > >
> > > > > > > ; Function Attrs: nounwind readonly
> > > > > > > define i32 @foo(i32* nocapture readonly %a, i32 %n) #0 {
> > > > > > > entry:
> > > > > > > %0 = load i32* %a, align 4, !tbaa !1
> > > > > > > %arrayidx1 = getelementptr inbounds i32* %a, i32 1
> > > > > > > %1 = load i32* %arrayidx1, align 4, !tbaa !1
> > > > > > > %add = add nsw i32 %1, %0
> > > > > > > %arrayidx2 = getelementptr inbounds i32* %a, i32 2
> > > > > > > %2 = load i32* %arrayidx2, align 4, !tbaa !1
> > > > > > > %add3 = add nsw i32 %add, %2
> > > > > > > %arrayidx4 = getelementptr inbounds i32* %a, i32 3
> > > > > > > %3 = load i32* %arrayidx4, align 4, !tbaa !1
> > > > > > > %add5 = add nsw i32 %add3, %3
> > > > > > > %arrayidx6 = getelementptr inbounds i32* %a, i32 4
> > > > > > > %4 = load i32* %arrayidx6, align 4, !tbaa !1
> > > > > > > %add7 = add nsw i32 %add5, %4
> > > > > > > %arrayidx8 = getelementptr inbounds i32* %a, i32 5
> > > > > > > %5 = load i32* %arrayidx8, align 4, !tbaa !1
> > > > > > > %add9 = add nsw i32 %add7, %5
> > > > > > > %arrayidx10 = getelementptr inbounds i32* %a, i32 6
> > > > > > > %6 = load i32* %arrayidx10, align 4, !tbaa !1
> > > > > > > %add11 = add nsw i32 %add9, %6
> > > > > > > %arrayidx12 = getelementptr inbounds i32* %a, i32 7
> > > > > > > %7 = load i32* %arrayidx12, align 4, !tbaa !1
> > > > > > > %add13 = add nsw i32 %add11, %7
> > > > > > > %arrayidx14 = getelementptr inbounds i32* %a, i32 8
> > > > > > > %8 = load i32* %arrayidx14, align 4, !tbaa !1
> > > > > > > %add15 = add nsw i32 %add13, %8
> > > > > > > %arrayidx16 = getelementptr inbounds i32* %a, i32 9
> > > > > > > %9 = load i32* %arrayidx16, align 4, !tbaa !1
> > > > > > > %add17 = add nsw i32 %add15, %9
> > > > > > > %arrayidx18 = getelementptr inbounds i32* %a, i32 10
> > > > > > > %10 = load i32* %arrayidx18, align 4, !tbaa !1
> > > > > > > %add19 = add nsw i32 %add17, %10
> > > > > > > %arrayidx20 = getelementptr inbounds i32* %a, i32 11
> > > > > > > %11 = load i32* %arrayidx20, align 4, !tbaa !1
> > > > > > > %add21 = add nsw i32 %add19, %11
> > > > > > > %arrayidx22 = getelementptr inbounds i32* %a, i32 12
> > > > > > > %12 = load i32* %arrayidx22, align 4, !tbaa !1
> > > > > > > %add23 = add nsw i32 %add21, %12
> > > > > > > %arrayidx24 = getelementptr inbounds i32* %a, i32 13
> > > > > > > %13 = load i32* %arrayidx24, align 4, !tbaa !1
> > > > > > > %add25 = add nsw i32 %add23, %13
> > > > > > > %arrayidx26 = getelementptr inbounds i32* %a, i32 14
> > > > > > > %14 = load i32* %arrayidx26, align 4, !tbaa !1
> > > > > > > %add27 = add nsw i32 %add25, %14
> > > > > > > %arrayidx28 = getelementptr inbounds i32* %a, i32 15
> > > > > > > %15 = load i32* %arrayidx28, align 4, !tbaa !1
> > > > > > > %add29 = add nsw i32 %add27, %15
> > > > > > > ret i32 %add29
> > > > > > > }
> > > > > > >
> > > > > > >
> > > > > > > Shouldn't the SLP vectorizer vectorize the above code? Am I
> > > > > > > missing something?
> > > > > > >
> > > > > > > How does loop unrolling affect the final vectorization of code?
> > > > > >
> > > > > > Hi Suyog,
> > > > > >
> > > > > >
> > > > > > Thank you for taking the time to investigate the performance of
> the vectorizers. You are correct that the SLP-vectorizer should be able to
> do something about the second case (the unrolled one). Please look into
> line 2861 in the SLP-vectorizer. This is where we try to analyze reductions
> and vectorize them. At the moment we don’t have support for a flat yet
> large reduction that is the result of unrolled loops because the current
> algorithm tries to construct trees that start at the chains, and it does
> not support joining multiple trees.
> > > > > >
> > > > > >
> > > > > >
> > > > > > http://llvm.org/docs/doxygen/html/SLPVectorizer_8cpp_source.html#l02861
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Nadav
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > > I tried to debug the above IR. Since there is no store in it,
> > > > > > > SLP vectorization doesn't call vectorizeStoreChains(). In
> > > > > > > vectorizeChainsInBlock(), as there is no phi node in the IR, it
> > > > > > > also fails to vectorize.
> > > > > > >
> > > > > > >
> > > > > > > If it is perfectly legal to vectorize the above IR, are we
> > > > > > > lacking code for it right now? Or is it not possible to
> > > > > > > vectorize the above code (as far as I know, it is perfectly
> > > > > > > possible)? If it is possible, can someone help me / point out
> > > > > > > specifics on how to start implementing vectorization for such
> > > > > > > code?
> > > > > > >
> > > > > > > (We do have the function isConsecutiveAccess() to identify
> > > > > > > this. Is it a case of horizontal reduction, even though the
> > > > > > > term 'reduction' is usually associated with loops/phi nodes?)
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > This brings me to another question - if we perform aggressive
> > > > > > > loop unrolling, we might lose the opportunity for
> > > > > > > vectorization, as visible above!
> > > > > > >
> > > > > > >
> > > > > > > (I was looking into bug 20035 when I encountered the above
> > > > > > > issue, though the two are not related.)
> > > > > > >
> > > > > > >
> > > > > > > Your comments/suggestions are most awaited.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > With regards,
> > > > > > > Suyog Sarda
> > > > >
> > > > > --
> > > > > With regards,
> > > > > Suyog Sarda
> > >
> > > --
> > > With regards,
> > > Suyog Sarda
> >
> > --
> > With regards,
> > Suyog Sarda
>