[PATCH] [SLPVectorization] Enhance Ability to Vectorize Horizontal Reductions from Consecutive Loads

Suyog Kamal Sarda suyog.sarda at samsung.com
Tue Dec 16 00:00:32 PST 2014


Hi James,

Thanks for the review. 

Yes, I agree the generated code can be optimized further to use v0.4s (adding 4 scalars at a time) instead of
v0.2s (adding 2 scalars at a time). So basically, we need to emit <4 x> instead of <2 x> vectors in the IR.

AFAIK, given the way we build the bottom-up tree in SLP (for the kind of tree I have
described in the description below), when we try to bundle up loads for a vector binary operator, we always
bundle them in pairs of 2 loads.

For example (I am numbering the + operators for reference):

                                            + 1
                                          /     \
                                         /       \
                                      + 2         + 3
                                     /   \       /   \
                                  + 4   + 5   + 6   + 7
                                  / \   / \   / \   / \
                                 0   1 2   3 4   5 6   7

When visiting this tree, we encounter + 2 and + 3 and decide that they can be vectorized. Then we visit the left
operands of + 2 and + 3 and come to + 4 and + 6, which can again be vectorized. So we visit the Left and Right
operands of + 4 and + 6. We now find that

Left -> a[0] and a[4]     
Right -> a[1] and a[5]

With my patch, we re-arrange them as 

Left -> a[0] and a[1]
Right -> a[4] and a[5].

We see that both Left and Right now hold consecutive loads and hence can each be bundled into a <2 x> vector load.
Note that at this point we are unaware of the other loads under + 5 and + 7; hence we are not emitting <4 x> vector loads.

This traversal of operators and operands already existed in the code, and I didn't disturb it :).

Maybe we can add code to handle this type of IR in DAGCombine: if we encounter consecutive <2 x> vector loads,
we can combine them into a single <4 x> vector load.

So basically, we need to reduce the tree:

                                           +
                                         /   \
                                        /     \
                                      +         +
                                     / \       / \
                                  load load load load
                                  at 0 at 2 at 4 at 6
                               (each a <2 x> vector load)

to something like:

                                           +
                                         /   \
                                        /     \
                                     load     load
                                     at 0     at 4
                               (each a <4 x> vector load)

Feel free to correct my understanding :)
I am trying to solve this type of problem in incremental steps.

(The exciting thing is, as I was writing this mail, I got an idea for the above type of reduction,
where we can combine 2x loads into 4x loads :).
Need to check if it already exists and come up with a patch if not.)

Regards,
Suyog

 ------- Original Message -------
Sender : James Molloy<james at jamesmolloy.co.uk>
Date : Dec 16, 2014 16:25 (GMT+09:00)
Title : Re: [PATCH] [SLPVectorization] Enhance Ability to Vectorize Horizontal Reductions from Consecutive Loads

Hi Suyog,
This is a good improvement, thanks for working on it!

I'll take a closer look today, but for now I did notice that the generated aarch64 assembly isn't as optimal as it could be. I'd expect:

ldp  q0, q1
add  v0.4s, v0.4s, v1.4s
addv s0, v0.4s

Cheers,

James

On Tue, 16 Dec 2014 at 05:29, suyog <suyog.sarda at samsung.com> wrote:

Hi nadav, aschwaighofer, jmolloy,

This patch is an enhancement to r224119, which vectorizes horizontal reductions from consecutive loads.

Earlier, in r224119, we handled the tree:

                                            +
                                          /   \
                                         /     \
                                        +       +
                                       / \     / \
                                    a[0] a[1] a[2] a[3]

where originally, we had
                                      Left              Right
                                      a[0]              a[1]
                                      a[2]              a[3]

In r224119, we compared (Left[i], Right[i]) and (Right[i], Left[i+1]):

                                     Left        Right
                                     a[0]  --->  a[1]
                                                /
                                               /
                                              /
                                             v
                                     a[2]        a[3]


And then rearranged it to
                                     Left        Right
                                     a[0]        a[2]
                                     a[1]        a[3]
so that we can bundle Left and Right into vector loads.

However, with a bigger tree,

                                            +
                                          /     \
                                         /       \
                                       +           +
                                      /   \       /   \
                                    +     +     +     +
                                   / \   / \   / \   / \
                                  0   1 2   3 4   5 6   7


                                  Left              Right
                                   0                  1
                                   4                  5
                                   2                  3
                                   6                  7

In this case, the comparison of Right[i] and Left[i+1] would fail, and the code remains scalar.

If we eliminate the comparison of Right[i] and Left[i+1] and just compare Left[i] with Right[i],
we can re-arrange Left and Right into:
                               Left               Right
                                0                    4
                                1                    5
                                2                    6
                                3                    7

We would then bundle (0,1), (4,5) and (2,3), (6,7) into vector loads,
and then have vector adds of (01, 45) and (23, 67).

However, notice that this disturbs the order of addition.
Originally, (01) and (23) should have been added together, and likewise (45) and (67).
For integer addition this does not create any issue, but for other
data types with precision concerns, such as floating point, there might be a problem.

-ffast-math would have eliminated this precision concern, but it would also have
re-associated the tree itself into (+(+(+(+(0,1)2)3)...).

Hence, in this patch we check for integer types, and only then skip
the extra comparison of (Right[i], Left[i+1]).

With this patch, we now vectorize the above type of tree for any length of consecutive
loads of integer type.


For test case:

                #include <arm_neon.h>
                int hadd(int* a){
                    return (a[0] + a[1]) + (a[2] + a[3]) + (a[4] + a[5]) + (a[6] + a[7]);
                }

AArch64 assembly before this patch :

                ldp     w8, w9, [x0]
                ldp     w10, w11, [x0, #8]
                ldp     w12, w13, [x0, #16]
                ldp     w14, w15, [x0, #24]
                add     w8, w8, w9
                add     w9, w10, w11
                add     w10, w12, w13
                add     w11, w14, w15
                add     w8, w8, w9
                add     w9, w10, w11
                add     w0, w8, w9
                ret

AArch64 assembly after this patch :

                ldp     d0, d1, [x0]
                ldp     d2, d3, [x0, #16]
                add     v0.2s, v0.2s, v2.2s
                add     v1.2s, v1.2s, v3.2s
                add     v0.2s, v0.2s, v1.2s
                fmov    w8, s0
                mov     w9, v0.s[1]
                add     w0, w8, w9
                ret



Please help in reviewing this patch. I have not run LNT as of now, since this is just an enhancement
to r224119. I will update with LNT results if required.

Regards,
Suyog

REPOSITORY
  rL LLVM

http://reviews.llvm.org/D6675

Files:
  lib/Transforms/Vectorize/SLPVectorizer.cpp
  test/Transforms/SLPVectorizer/AArch64/horizontaladd.ll

Index: lib/Transforms/Vectorize/SLPVectorizer.cpp
===================================================================
--- lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -1831,8 +1831,11 @@
   for (unsigned i = 0, e = Left.size(); i < e - 1; ++i) {
     if (!isa<LoadInst>(Left[i]) || !isa<LoadInst>(Right[i]))
       return;
-    if (!(isConsecutiveAccess(Left[i], Right[i]) &&
-          isConsecutiveAccess(Right[i], Left[i + 1])))
+    LoadInst *L = dyn_cast<LoadInst>(Left[i]);
+    bool isInt = L->getType()->isIntegerTy();
+    if (!(isConsecutiveAccess(Left[i], Right[i])))
+      continue;
+    else if (!isInt && !isConsecutiveAccess(Right[i], Left[i + 1]))
       continue;
     else
       std::swap(Left[i + 1], Right[i]);
Index: test/Transforms/SLPVectorizer/AArch64/horizontaladd.ll
===================================================================
--- test/Transforms/SLPVectorizer/AArch64/horizontaladd.ll
+++ test/Transforms/SLPVectorizer/AArch64/horizontaladd.ll
@@ -25,3 +25,34 @@
   %add5 = fadd float %add, %add4
   ret float %add5
 }
+
+; CHECK-LABEL: @hadd_int
+; CHECK: load <2 x i32>*
+; CHECK: add <2 x i32>
+; CHECK: extractelement <2 x i32>
+define i32 @hadd_int(i32* nocapture readonly %a) {
+entry:
+  %0 = load i32* %a, align 4
+  %arrayidx1 = getelementptr inbounds i32* %a, i64 1
+  %1 = load i32* %arrayidx1, align 4
+  %arrayidx2 = getelementptr inbounds i32* %a, i64 2
+  %2 = load i32* %arrayidx2, align 4
+  %arrayidx3 = getelementptr inbounds i32* %a, i64 3
+  %3 = load i32* %arrayidx3, align 4
+  %arrayidx6 = getelementptr inbounds i32* %a, i64 4
+  %4 = load i32* %arrayidx6, align 4
+  %arrayidx7 = getelementptr inbounds i32* %a, i64 5
+  %5 = load i32* %arrayidx7, align 4
+  %arrayidx10 = getelementptr inbounds i32* %a, i64 6
+  %6 = load i32* %arrayidx10, align 4
+  %arrayidx11 = getelementptr inbounds i32* %a, i64 7
+  %7 = load i32* %arrayidx11, align 4
+  %add1 = add i32 %0, %1
+  %add2 = add i32 %2, %3
+  %add3 = add i32 %4, %5
+  %add4 = add i32 %6, %7
+  %add5 = add i32 %add1, %add2
+  %add6 = add i32 %add3, %add4
+  %add7 = add i32 %add5, %add6
+  ret i32 %add7
+}

_______________________________________________
llvm-commits mailing list
llvm-commits at cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits



