[PATCH] [SLPVectorization] Enhance Ability to Vectorize Horizontal Reductions from Consecutive Loads
Suyog Kamal Sarda
suyog.sarda at samsung.com
Tue Dec 16 00:00:32 PST 2014
Hi James,
Thanks for the review.
Yes, I agree the generated code can be optimized further to use v0.4s (adding 4 scalars at a time) instead of
v0.2s (adding 2 scalars at a time). So basically we need to emit <4 x> vectors instead of <2 x> vectors in the IR.
AFAIK, the way we build the bottom-up tree in SLP (for the kind of tree described in my original description below),
when we try to bundle up loads for a vector binary operator, we always bundle them in pairs of two loads.
For example (I am numbering the '+' operators for reference):
              +1
             / \
           /     \
         /         \
       /             \
      +2              +3
     / \             / \
    /   \           /   \
   /     \         /     \
  +4      +5      +6      +7
 / \     / \     / \     / \
0   1   2   3   4   5   6   7
When visiting this tree, we encounter '+' operators 2 and 3 and decide they can be vectorized. We then visit
the left operands of 2 and 3 and come to '+' operators 4 and 6, which can again be vectorized. So we visit the
Left and Right operands of 4 and 6, and find:
Left  -> a[0] and a[4]
Right -> a[1] and a[5]
With my patch, we rearrange them as
Left  -> a[0] and a[1]
Right -> a[4] and a[5]
Both Left and Right now contain consecutive loads, so each can be bundled into a <2 x> vector load.
Note that at this point we are unaware of the other loads and of '+' operators 5 and 7; hence we do not emit <4 x> vector loads.
This traversal of operators and operands was already present in the existing code, and I didn't disturb it :).
Maybe we can handle such IR in DAGCombine: whenever we encounter two consecutive <2 x> vector loads,
we can combine them into a single <4 x> vector load.
So basically, we need to reduce the tree :
              +
             / \
          /       \
       /             \
      +               +
     / \             / \
   /     \         /     \
 load    load    load    load
2x at 0 2x at 2 2x at 4 2x at 6
to something like:
       +
      / \
     /   \
  load   load
 4x at 0 4x at 4
Feel free to correct my understanding :)
I am trying to solve these problems in incremental steps.
(The exciting thing is, as I was writing this mail, I got an idea for the above type of reduction,
where the <2 x> loads could be vectorized into <4 x> loads :).
I need to check whether this already exists, and come up with a patch if not.)
Regards,
Suyog
------- Original Message -------
Sender : James Molloy<james at jamesmolloy.co.uk>
Date : Dec 16, 2014 16:25 (GMT+09:00)
Title : Re: [PATCH] [SLPVectorization] Enhance Ability to Vectorize Horizontal Reductions from Consecutive Loads
Hi suyog,
This is a good improvement, thanks for working on it!
I'll take a closer look today, but for now I did notice that the generated aarch64 assembly isn't as optimal as it could be. I'd expect:
ldp q0, q1
add v0.4s, v0.4s, v1.4s
addv s0, v0.4s
Cheers,
James
On Tue, 16 Dec 2014 at 05:29, suyog <suyog.sarda at samsung.com> wrote:
Hi nadav, aschwaighofer, jmolloy,
This patch is an enhancement to r224119, which vectorizes horizontal reductions from consecutive loads.
In r224119, we handled the tree:
         +
        / \
      /     \
    +         +
   / \       / \
a[0] a[1] a[2] a[3]
where originally we had
Left   Right
a[0]   a[1]
a[2]   a[3]
In r224119, we compared (Left[i], Right[i]) and (Right[i], Left[i+1]) for consecutive accesses:
Left        Right
a[0] -----> a[1]
           /
          /
         v
a[2]        a[3]
and then rearranged them to
Left   Right
a[0]   a[2]
a[1]   a[3]
so that we could bundle Left and Right into vectors of loads.
However, with a bigger tree:
              +
             / \
           /     \
         /         \
       /             \
      +               +
     / \             / \
    /   \           /   \
   /     \         /     \
  +       +       +       +
 / \     / \     / \     / \
0   1   2   3   4   5   6   7
Left   Right
  0      1
  4      5
  2      3
  6      7
In this case, the comparison of Right[i] with Left[i+1] fails, and the code remains scalar.
If we drop the (Right[i], Left[i+1]) comparison and just compare Left[i] with Right[i],
we can rearrange Left and Right into:
Left   Right
  0      4
  1      5
  2      6
  3      7
We can then bundle (0,1), (4,5), (2,3), and (6,7) into <2 x> vector loads,
and emit the vector adds (0,1)+(4,5) and (2,3)+(6,7).
However, notice that this disturbs the sequence of additions:
originally, (0,1) should have been added to (2,3), and likewise (4,5) to (6,7).
For integer addition this creates no issue, but for other
data types with precision concerns there might be a problem.
-ffast-math would eliminate this precision concern, but it would also
reassociate the tree itself into (+(+(+(+(0,1)2)3....)
Hence, in this patch we skip the extra (Right[i], Left[i+1]) comparison
only for integer types.
With this patch, we now vectorize the above type of tree for any number of consecutive loads
of integer type.
For test case:
#include <arm_neon.h>
int hadd(int *a) {
  return (a[0] + a[1]) + (a[2] + a[3]) + (a[4] + a[5]) + (a[6] + a[7]);
}
AArch64 assembly before this patch:
ldp w8, w9, [x0]
ldp w10, w11, [x0, #8]
ldp w12, w13, [x0, #16]
ldp w14, w15, [x0, #24]
add w8, w8, w9
add w9, w10, w11
add w10, w12, w13
add w11, w14, w15
add w8, w8, w9
add w9, w10, w11
add w0, w8, w9
ret
AArch64 assembly after this patch:
ldp d0, d1, [x0]
ldp d2, d3, [x0, #16]
add v0.2s, v0.2s, v2.2s
add v1.2s, v1.2s, v3.2s
add v0.2s, v0.2s, v1.2s
fmov w8, s0
mov w9, v0.s[1]
add w0, w8, w9
ret
Please help in reviewing this patch. I have not run LNT yet, since this is just an enhancement
to r224119. I will update with LNT results if required.
Regards,
Suyog
REPOSITORY
rL LLVM
http://reviews.llvm.org/D6675
Files:
lib/Transforms/Vectorize/SLPVectorizer.cpp
test/Transforms/SLPVectorizer/AArch64/horizontaladd.ll
Index: lib/Transforms/Vectorize/SLPVectorizer.cpp
===================================================================
--- lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -1831,8 +1831,11 @@
for (unsigned i = 0, e = Left.size(); i < e - 1; ++i) {
if (!isa<LoadInst>(Left[i]) || !isa<LoadInst>(Right[i]))
return;
- if (!(isConsecutiveAccess(Left[i], Right[i]) &&
- isConsecutiveAccess(Right[i], Left[i + 1])))
+ LoadInst *L = dyn_cast<LoadInst>(Left[i]);
+ bool isInt = L->getType()->isIntegerTy();
+ if (!(isConsecutiveAccess(Left[i], Right[i])))
+ continue;
+ else if (!isInt && !isConsecutiveAccess(Right[i], Left[i + 1]))
continue;
else
std::swap(Left[i + 1], Right[i]);
Index: test/Transforms/SLPVectorizer/AArch64/horizontaladd.ll
===================================================================
--- test/Transforms/SLPVectorizer/AArch64/horizontaladd.ll
+++ test/Transforms/SLPVectorizer/AArch64/horizontaladd.ll
@@ -25,3 +25,34 @@
%add5 = fadd float %add, %add4
ret float %add5
}
+
+; CHECK-LABEL: @hadd_int
+; CHECK: load <2 x i32>*
+; CHECK: add <2 x i32>
+; CHECK: extractelement <2 x i32>
+define i32 @hadd_int(i32* nocapture readonly %a) {
+entry:
+ %0 = load i32* %a, align 4
+ %arrayidx1 = getelementptr inbounds i32* %a, i64 1
+ %1 = load i32* %arrayidx1, align 4
+ %arrayidx2 = getelementptr inbounds i32* %a, i64 2
+ %2 = load i32* %arrayidx2, align 4
+ %arrayidx3 = getelementptr inbounds i32* %a, i64 3
+ %3 = load i32* %arrayidx3, align 4
+ %arrayidx6 = getelementptr inbounds i32* %a, i64 4
+ %4 = load i32* %arrayidx6, align 4
+ %arrayidx7 = getelementptr inbounds i32* %a, i64 5
+ %5 = load i32* %arrayidx7, align 4
+ %arrayidx10 = getelementptr inbounds i32* %a, i64 6
+ %6 = load i32* %arrayidx10, align 4
+ %arrayidx11 = getelementptr inbounds i32* %a, i64 7
+ %7 = load i32* %arrayidx11, align 4
+ %add1 = add i32 %0, %1
+ %add2 = add i32 %2, %3
+ %add3 = add i32 %4, %5
+ %add4 = add i32 %6, %7
+ %add5 = add i32 %add1, %add2
+ %add6 = add i32 %add3, %add4
+ %add7 = add i32 %add5, %add6
+ ret i32 %add7
+}
_______________________________________________
llvm-commits mailing list
llvm-commits at cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits