[PATCH] [SLPVectorization] Enhance Ability to Vectorize Horizontal Reductions from Consecutive Loads

James Molloy james at jamesmolloy.co.uk
Tue Dec 16 00:09:41 PST 2014


Hi suyog,

Yes, we could pattern match this at the DAG level, but I think it would be
rather fragile. My previous suggestion was to use IR-level intrinsics to
model horizontal reductions, because that avoids the pattern matching
entirely and each backend could lower the intrinsic into an efficient form.

As it's you rather than me doing this work, however, I won't push hard for
it :)

Cheers,

James
On Tue, 16 Dec 2014 at 08:00, Suyog Kamal Sarda <suyog.sarda at samsung.com>
wrote:

> Hi James,
>
> Thanks for the review.
>
> Yes, I agree the generated code can be optimized further to use v0.4s
> (addition of 4 scalars at a time) instead of v0.2s (addition of 2 scalars
> at a time). So basically, we need to emit <4x> instead of <2x> vectors in
> the IR.
>
> AFAIK, the way we build the bottom-up tree in SLP (for the kind of tree I
> have described in the description below), when we try to bundle up loads
> for a vector binary operator, we always bundle them in pairs of 2 loads.
>
> For ex (I am numbering the + operators for reference):
>
>                         + 1
>                       /     \
>                     /         \
>                   + 2         + 3
>                  /   \       /   \
>                + 4   + 5   + 6   + 7
>                / \   / \   / \   / \
>               0   1 2   3 4   5 6   7
>
> When visiting this tree, we encounter + 2 and + 3 and decide that they can
> be vectorized. Then we visit the left children of + 2 and + 3, and come to
> + 4 and + 6, which can again be vectorized. So we visit the Left and Right
> operands of + 4 and + 6. We now find that
>
> Left  -> a[0] and a[4]
> Right -> a[1] and a[5]
>
> With my patch, we re-arrange them as
>
> Left  -> a[0] and a[1]
> Right -> a[4] and a[5]
>
> We see that both Left and Right now hold consecutive loads, and hence each
> can be bundled into a <2x> vector load. Note that at this point we are
> unaware of the other loads and of + 5 and + 7; hence we are not emitting
> <4x> vector loads.
>
> This traversal of operators and operands was already in existing code, and
> I didn't disturb that :).
>
> Maybe we can put code to handle this type of IR in a DAG combine, where,
> if we encounter two consecutive <2x> vector loads, we can combine them
> into a single <4x> vector load.
>
> So basically, we need to reduce the tree:
>
>                            +
>                         /     \
>                       /         \
>                     +             +
>                    / \           / \
>                load   load   load   load
>               2x at 0 2x at 2 2x at 4 2x at 6
>
> to something like:
>
>                      +
>                    /   \
>                  load   load
>                4x at 0 4x at 4
>
> Feel free to correct me in my understanding :)
> I am trying to solve this type of problems in incremental steps.
>
> (Exciting thing is, as I was writing this mail, I got an idea for the
> above type of reduction, where I can vectorize 2x loads into 4x loads :).
> Need to check if it already exists and come up with a patch if not.)
>
> Regards,
> Suyog
>
>  ------- Original Message -------
> Sender : James Molloy<james at jamesmolloy.co.uk>
> Date : Dec 16, 2014 16:25 (GMT+09:00)
> Title : Re: [PATCH] [SLPVectorization] Enhance Ability to Vectorize
> Horizontal Reductions from Consecutive Loads
>
> Hi suyog,
> This is a good improvement, thanks for working on it!
>
> I'll take a closer look today, but for now I did notice that the generated
> AArch64 assembly isn't as optimal as it could be. I'd expect:
>
> ldp  q0, q1
> add  v0.4s, v0.4s, v1.4s
> addv s0, v0.4s
>
> Cheers,
>
> James
>
> On Tue, 16 Dec 2014 at 05:29, suyog <suyog.sarda at samsung.com> wrote:
>
> Hi nadav, aschwaighofer, jmolloy,
>
> This patch is enhancement to r224119 which vectorizes horizontal
> reductions from consecutive loads.
>
> Earlier in r224119, we handled tree :
>
>                           +
>                         /   \
>                       /       \
>                     +           +
>                    / \         / \
>                 a[0] a[1]   a[2] a[3]
>
> where originally, we had
>
>                 Left      Right
>                 a[0]      a[1]
>                 a[2]      a[3]
>
> In r224119, we compared (Left[i], Right[i]) and (Right[i], Left[i+1]):
>
>                 Left      Right
>                 a[0] ---> a[1]
>                          /
>                         /
>                        v
>                 a[2]      a[3]
>
> and then rearranged them to
>
>                 Left      Right
>                 a[0]      a[2]
>                 a[1]      a[3]
>
> so that we can bundle Left and Right into vectors of loads.
>
> However, with a bigger tree,
>
>                           +
>                         /   \
>                       /       \
>                     +           +
>                    /   \       /   \
>                  +     +     +     +
>                 / \   / \   / \   / \
>                0   1 2   3 4   5 6   7
>
>
>                 Left      Right
>                  0          1
>                  4          5
>                  2          3
>                  6          7
>
> In this case, the comparison of Right[i] and Left[i+1] fails, and the code
> remains scalar.
>
> If we eliminate the comparison of (Right[i], Left[i+1]) and just compare
> Left[i] with Right[i], we can re-arrange Left and Right into:
>
>                 Left      Right
>                  0          4
>                  1          5
>                  2          6
>                  3          7
>
> We would then bundle (0,1), (2,3) and (4,5), (6,7) into vector loads, and
> then have vector adds of (01, 45) and (23, 67).
>
> However, notice that this disturbs the sequence of the additions.
> Originally, (0+1) and (2+3) would have been added together, and likewise
> (4+5) and (6+7). For integer addition this creates no issue, but for data
> types with precision concerns, such as floating point, there might be a
> problem.
>
> -ffast-math would have eliminated this precision concern, but it would
> also have re-associated the tree itself into (+(+(+(+(0,1)2)3....).
>
> Hence, in this patch we check for integer types, and only then skip the
> extra comparison of (Right[i], Left[i+1]).
>
> With this patch, we now vectorize the above type of tree for any length
> of consecutive loads of integer type.
>
>
> For test case:
>
>                 #include <arm_neon.h>
>                 int hadd(int *a) {
>                   return (a[0] + a[1]) + (a[2] + a[3]) +
>                          (a[4] + a[5]) + (a[6] + a[7]);
>                 }
>
> AArch64 assembly before this patch :
>
>                 ldp     w8, w9, [x0]
>                 ldp     w10, w11, [x0, #8]
>                 ldp     w12, w13, [x0, #16]
>                 ldp     w14, w15, [x0, #24]
>                 add     w8, w8, w9
>                 add     w9, w10, w11
>                 add     w10, w12, w13
>                 add     w11, w14, w15
>                 add     w8, w8, w9
>                 add     w9, w10, w11
>                 add     w0, w8, w9
>                 ret
>
> AArch64 assembly after this patch :
>
>                 ldp     d0, d1, [x0]
>                 ldp     d2, d3, [x0, #16]
>                 add     v0.2s, v0.2s, v2.2s
>                 add     v1.2s, v1.2s, v3.2s
>                 add     v0.2s, v0.2s, v1.2s
>                 fmov    w8, s0
>                 mov     w9, v0.s[1]
>                 add     w0, w8, w9
>                 ret
>
>
>
> Please help review this patch. I have not run LNT as of now, since this is
> just an enhancement to r224119. I will update with LNT results if required.
>
> Regards,
> Suyog
>
> REPOSITORY
>   rL LLVM
>
> http://reviews.llvm.org/D6675
>
> Files:
>   lib/Transforms/Vectorize/SLPVectorizer.cpp
>   test/Transforms/SLPVectorizer/AArch64/horizontaladd.ll
>
> Index: lib/Transforms/Vectorize/SLPVectorizer.cpp
> ===================================================================
> --- lib/Transforms/Vectorize/SLPVectorizer.cpp
> +++ lib/Transforms/Vectorize/SLPVectorizer.cpp
> @@ -1831,8 +1831,11 @@
>    for (unsigned i = 0, e = Left.size(); i < e - 1; ++i) {
>      if (!isa<LoadInst>(Left[i]) || !isa<LoadInst>(Right[i]))
>        return;
> -    if (!(isConsecutiveAccess(Left[i], Right[i]) &&
> -          isConsecutiveAccess(Right[i], Left[i + 1])))
> +    LoadInst *L = dyn_cast<LoadInst>(Left[i]);
> +    bool isInt = L->getType()->isIntegerTy();
> +    if (!(isConsecutiveAccess(Left[i], Right[i])))
> +      continue;
> +    else if (!isInt && !isConsecutiveAccess(Right[i], Left[i + 1]))
>        continue;
>      else
>        std::swap(Left[i + 1], Right[i]);
> Index: test/Transforms/SLPVectorizer/AArch64/horizontaladd.ll
> ===================================================================
> --- test/Transforms/SLPVectorizer/AArch64/horizontaladd.ll
> +++ test/Transforms/SLPVectorizer/AArch64/horizontaladd.ll
> @@ -25,3 +25,34 @@
>    %add5 = fadd float %add, %add4
>    ret float %add5
>  }
> +
> +; CHECK-LABEL: @hadd_int
> +; CHECK: load <2 x i32>*
> +; CHECK: add <2 x i32>
> +; CHECK: extractelement <2 x i32>
> +define i32 @hadd_int(i32* nocapture readonly %a) {
> +entry:
> +  %0 = load i32* %a, align 4
> +  %arrayidx1 = getelementptr inbounds i32* %a, i64 1
> +  %1 = load i32* %arrayidx1, align 4
> +  %arrayidx2 = getelementptr inbounds i32* %a, i64 2
> +  %2 = load i32* %arrayidx2, align 4
> +  %arrayidx3 = getelementptr inbounds i32* %a, i64 3
> +  %3 = load i32* %arrayidx3, align 4
> +  %arrayidx6 = getelementptr inbounds i32* %a, i64 4
> +  %4 = load i32* %arrayidx6, align 4
> +  %arrayidx7 = getelementptr inbounds i32* %a, i64 5
> +  %5 = load i32* %arrayidx7, align 4
> +  %arrayidx10 = getelementptr inbounds i32* %a, i64 6
> +  %6 = load i32* %arrayidx10, align 4
> +  %arrayidx11 = getelementptr inbounds i32* %a, i64 7
> +  %7 = load i32* %arrayidx11, align 4
> +  %add1 = add i32 %0, %1
> +  %add2 = add i32 %2, %3
> +  %add3 = add i32 %4, %5
> +  %add4 = add i32 %6, %7
> +  %add5 = add i32 %add1, %add2
> +  %add6 = add i32 %add3, %add4
> +  %add7 = add i32 %add5, %add6
> +  ret i32 %add7
> +}
>

