Hi suyog,<br><br>Yes, we could pattern match this at the DAG level, but it would be rather fragile I think? My previous suggestion was to use IR level intrinsics to model horizontal reductions , because it avoids the pattern matching and each backed could lower it into an efficient form without pattern matching. <br><br>As its you rather than me doing this work however I won't push hard for it :)<br><br>Cheers,<br><br>James <br><div class="gmail_quote">On Tue, 16 Dec 2014 at 08:00, Suyog Kamal Sarda <<a href="mailto:suyog.sarda@samsung.com">suyog.sarda@samsung.com</a>> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi James,<br>


<br>


Thanks for the review.<br>


<br>


Yes, I agree the code generated can be optimized further to have v0.4s (addition of 4 scalars at a time) instead of<br>


v0.2s (addition of 2 scalars at a time). So, basically we need to emit <4x> instead of <2x> vectors in the IR.<br>


<br>


AFAIK, the way we build the Bottom up tree in SLP ( for the kind of tree I have<br>


described in the description below ), when we try to bundle up loads for a vector binary operator, we always<br>


bundle up in pair of 2 loads.<br>


<br>


For ex (I am numbering the + operators for reference):<br>


<br>


                                                + 1<br>


                                              /    \<br>


                                            /       \<br>


                                          /          \<br>


                                         /            \<br>


                                        + 2           + 3<br>


                                      /   \            /  \<br>


                                     /     \          /    \<br>


                                    /       \        /      \<br>


                                  + 4     + 5    + 6     + 7<br>


                                 /  \      /  \    /  \     /  \<br>


                               0    1   2   3  4   5  6   7<br>


<br>


When visiting this tree, we encounter 2 and 3 + and we say that it can be vectorized. Now we visit, left of 2 and 3,<br>


and come to 4 and 6 + , which can again be vectorized. So we visit, Left and Right of 4 and 6 + . We now find, that<br>


<br>


Left -> a[0] and a[4]<br>


Right -> a[1] and a[5]<br>


<br>


With my patch, we re-arrange them as<br>


<br>


Left -> a[0] and a[1]<br>


Right -> a[4] and a[5].<br>


<br>


We see that both Left and Right now have consecutive loads and hence can be bundled into a vector load of <2x>.<br>


Note, that at this point, we are unaware of the other loads and 5 7 +. Hence, we are not emitting <4x> vector loads.<br>


<br>


This traversal of operators and operands was already in existing code, and I didn't disturb that :).<br>


<br>


May be we can put code to handle such type of IR's in DAG combine where if we encounter consecutive vector loads of 2 loads<br>


at a time, we can combine them into vector load of 4 loads.<br>


<br>


So basically, we need to reduce the tree :<br>


<br>


                                                             +<br>


                                                         /       \<br>


                                                       /          \<br>


                                                     +            +<br>


                                                   /   \         /    \<br>


                                              load   load    load  load<br>


                                             2x@0 2x@2  2x@4 2x@6<br>


<br>


to something like :<br>


<br>


                                                          +<br>


                                                       /     \<br>


                                                      /       \<br>


                                                    load     load<br>


                                                    4x@0   4x@4<br>


<br>


Feel free to correct me in my understanding :)<br>


I am trying to solve this type of problems in incremental steps.<br>


<br>


(Exciting thing is, as I was writing this mail, I got an idea for above type of reduction,<br>


where I can vectorize 2x loads into 4xloads :).<br>


Need to check if it already exist and come up with a patch if not.)<br>


<br>


Regards,<br>


Suyog<br>


<br>


 ------- Original Message -------<br>


Sender : James Molloy<<a href="mailto:james@jamesmolloy.co.uk" target="_blank">james@jamesmolloy.co.uk</a><u></u>><br>


Date : Dec 16, 2014 16:25 (GMT+09:00)<br>


Title : Re: [PATCH] [SLPVectorization] Enhance Ability to Vectorize Horizontal Reductions from Consecutive Loads<br>


<br>


Hi suyog,<br>


This is a good improvement, thanks for working on it!<br>


<br>


I'll take a closer look today, but for now I did notice that the generated aarch64 assembly isn't as optimal as it could be. I'd expect:<br>


<br>


Ldp q0, q1<br>


Add v0.4s, v0.4s, v1.4s<br>


Addv s0, v0.4s<br>


<br>


Cheers,<br>


<br>


James<br>


<br>


On Tue, 16 Dec 2014 at 05:29, suyog <<a href="mailto:suyog.sarda@samsung.com" target="_blank">suyog.sarda@samsung.com</a>> wrote:<br>


<br>


Hi nadav, aschwaighofer, jmolloy,<br>


<br>


This patch is enhancement to r224119 which vectorizes horizontal reductions from consecutive loads.<br>


<br>


Earlier in r224119, we handled tree :<br>


<br>


                                                  +<br>


                                                /    \<br>


                                              /       \<br>


                                            +         +<br>


                                           /  \       /  \<br>


                                         /     \     /    \<br>


                                     a[0]  a[1] a[2] a[3]<br>


<br>


where originally, we had<br>


                                      Left              Right<br>


                                      a[0]              a[1]<br>


                                      a[2]              a[3]<br>


<br>


In r224119, we compared, (Left[i], Right[i]) and (Right[i], Left[i+1])<br>


<br>


                                     Left        Right<br>


                                     a[0] ---> a[1]<br>


                                                /<br>


                                               /<br>


                                              /<br>


                                            \/<br>


                                     a[2]       a[3]<br>


<br>


<br>


And then rearrange it to<br>


                                     Left        Right<br>


                                     a[0]        a[2]<br>


                                     a[1]        a[3]<br>


so that, we can bundle left and right into vector of loads.<br>


<br>


However, with bigger tree,<br>


<br>


                                                +<br>


                                              /    \<br>


                                            /       \<br>


                                          /          \<br>


                                         /            \<br>


                                        +              +<br>


                                      /   \            /  \<br>


                                     /     \          /    \<br>


                                    /       \        /      \<br>


                                  +        +      +       +<br>


                                 /  \      /  \    /  \     /  \<br>


                               0    1   2   3  4   5  6   7<br>


<br>


<br>


                                  Left              Right<br>


                                   0                  1<br>


                                   4                  5<br>


                                   2                  3<br>


                                   6                  7<br>


<br>


In this case, Comparison of Right[i]  and Left[i+1] would fail, and code remains scalar.<br>


<br>


If we eliminate comparison Right[i] and Left[i+1], and just compare Left[i] with Right[i],<br>


we would be able to re-arrange Left and Right into :<br>


                               Left               Right<br>


                                0                    4<br>


                                1                    5<br>


                                2                    6<br>


                                3                    7<br>


<br>


And then would bundle (0,1) (4,5) and (2,3) (6,7) into vector loads.<br>


And then have vector adds of (01, 45) and (23, 67).<br>


<br>


However, notice that, this would disturb the sequence of addition.<br>


Originally, (01) and (23) should have been added. Same with (45) and (67).<br>


For integer type addition, this would not create any issue, but for other<br>


data types with precision concerns, there might be a problem.<br>


<br>


ffast-math would have eliminated this precision concern, but it would have<br>


re-associated the tree itself into (+(+(+(+(0,1)2)3....)<br>


<br>


Hence, in this patch we are checking for integer types and then only skipping<br>


the extra comparison of (Right[i], Left[i+1]).<br>


<br>


With this patch, we now vectorize above type of tree for any length of consecutive loads<br>


of integer type.<br>


<br>


<br>


For test case:<br>


<br>


                #include <arm_neon.h><br>


                int hadd(int* a){<br>


                    return (a[0] + a[1]) + (a[2] + a[3]) + (a[4] + a[5]) + (a[6] + a[7]);<br>


                }<br>


<br>


AArch64 assembly before this patch :<br>


<br>


                ldp      w8, w9, [x0]<br>


                ldp     w10, w11, [x0, #8]<br>


                ldp     w12, w13, [x0, #16]<br>


                ldp     w14, w15, [x0, #24]<br>


                add      w8, w8, w9<br>


                add      w9, w10, w11<br>


                add      w10, w12, w13<br>


                add      w11, w14, w15<br>


                add      w8, w8, w9<br>


                add      w9, w10, w11<br>


                add      w0, w8, w9<br>


                ret<br>


<br>


AArch64 assembly after this patch :<br>


<br>


                ldp      d0, d1, [x0]<br>


                ldp     d2, d3, [x0, #16]<br>


                add     v0.2s, v0.2s, v2.2s<br>


                add     v1.2s, v1.2s, v3.2s<br>


                add     v0.2s, v0.2s, v1.2s<br>


                fmov    w8, s0<br>


                mov             w9, v0.s[1]<br>


                add      w0, w8, w9<br>


                ret<br>


<br>


<br>


<br>


Please help in reviewing this patch. I did not run LNT as of now, since this is just enhancement<br>


to r224119. I will update with LNT results if required.<br>


<br>


Regards,<br>


Suyog<br>


<br>


REPOSITORY<br>


  rL LLVM<br>


<br>


<a href="http://reviews.llvm.org/D6675" target="_blank">http://reviews.llvm.org/D6675</a><br>


<br>


Files:<br>


  lib/Transforms/Vectorize/<u></u>SLPVectorizer.cpp<br>


  test/Transforms/SLPVectorizer/<u></u>AArch64/horizontaladd.ll<br>


<br>


Index: lib/Transforms/Vectorize/<u></u>SLPVectorizer.cpp<br>


==============================<u></u>==============================<u></u>=======<br>


--- lib/Transforms/Vectorize/<u></u>SLPVectorizer.cpp<br>


+++ lib/Transforms/Vectorize/<u></u>SLPVectorizer.cpp<br>


@@ -1831,8 +1831,11 @@<br>


   for (unsigned i = 0, e = Left.size(); i < e - 1; ++i) {<br>


     if (!isa<LoadInst>(Left[i]) || !isa<LoadInst>(Right[i]))<br>


       return;<br>


-    if (!(isConsecutiveAccess(Left[i]<u></u>, Right[i]) &&<br>


-          isConsecutiveAccess(Right[i], Left[i + 1])))<br>


+    LoadInst *L = dyn_cast<LoadInst>(Left[i]);<br>


+    bool isInt = L->getType()->isIntegerTy();<br>


+    if (!(isConsecutiveAccess(Left[i]<u></u>, Right[i])))<br>


+      continue;<br>


+    else if (!isInt && !isConsecutiveAccess(Right[i], Left[i + 1]))<br>


       continue;<br>


     else<br>


       std::swap(Left[i + 1], Right[i]);<br>


Index: test/Transforms/SLPVectorizer/<u></u>AArch64/horizontaladd.ll<br>


==============================<u></u>==============================<u></u>=======<br>


--- test/Transforms/SLPVectorizer/<u></u>AArch64/horizontaladd.ll<br>


+++ test/Transforms/SLPVectorizer/<u></u>AArch64/horizontaladd.ll<br>


@@ -25,3 +25,34 @@<br>


   %add5 = fadd float %add, %add4<br>


   ret float %add5<br>


 }<br>


+<br>


+; CHECK-LABEL: @hadd_int<br>


+; CHECK: load <2 x i32>*<br>


+; CHECK: add <2 x i32><br>


+; CHECK: extractelement <2 x i32><br>


+define i32 @hadd_int(i32* nocapture readonly %a) {<br>


+entry:<br>


+  %0 = load i32* %a, align 4<br>


+  %arrayidx1 = getelementptr inbounds i32* %a, i64 1<br>


+  %1 = load i32* %arrayidx1, align 4<br>


+  %arrayidx2 = getelementptr inbounds i32* %a, i64 2<br>


+  %2 = load i32* %arrayidx2, align 4<br>


+  %arrayidx3 = getelementptr inbounds i32* %a, i64 3<br>


+  %3 = load i32* %arrayidx3, align 4<br>


+  %arrayidx6 = getelementptr inbounds i32* %a, i64 4<br>


+  %4 = load i32* %arrayidx6, align 4<br>


+  %arrayidx7 = getelementptr inbounds i32* %a, i64 5<br>


+  %5 = load i32* %arrayidx7, align 4<br>


+  %arrayidx10 = getelementptr inbounds i32* %a, i64 6<br>


+  %6 = load i32* %arrayidx10, align 4<br>


+  %arrayidx11 = getelementptr inbounds i32* %a, i64 7<br>


+  %7 = load i32* %arrayidx11, align 4<br>


+  %add1 = add i32 %0, %1<br>


+  %add2 = add i32 %2, %3<br>


+  %add3 = add i32 %4, %5<br>


+  %add4 = add i32 %6, %7<br>


+  %add5 = add i32 %add1, %add2<br>


+  %add6 = add i32 %add3, %add4<br>


+  %add7 = add i32 %add5, %add6<br>


+  ret i32 %add7<br>


+}<br>


<br>


EMAIL PREFERENCES<br>


  <a href="http://reviews.llvm.org/settings/panel/emailpreferences/" target="_blank">http://reviews.llvm.org/<u></u>settings/panel/<u></u>emailpreferences/</a><br>


______________________________<u></u>_________________<br>


llvm-commits mailing list<br>


<a href="mailto:llvm-commits@cs.uiuc.edu" target="_blank">llvm-commits@cs.uiuc.edu</a><br>


<a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits" target="_blank">http://lists.cs.uiuc.edu/<u></u>mailman/listinfo/llvm-commits</a></blockquote></div>