[PATCH] [SLPVectorization] Vectorize Reduction Chain feeding into a 'return' statement
suyog
suyog.sarda at samsung.com
Fri Nov 14 04:27:39 PST 2014
Hi all,
I ran the LLVM performance test suite on X86 for 10 iterations; the output is below.
{F244234}
Please NOTE - BASELINE is WITH THE PROPOSED PATCH and CURRENT is WITHOUT the patch.
(I tried swapping the baseline and the current so that current would contain the proposed patch;
strangely, the report always labels the run with the proposed patch as the baseline.) No regressions were observed.
Test cases marked in RED are improvements, though they are unrelated to the proposed patch.
This code will be reusable for future improvements in identifying consecutive memory accesses within the same subtree.
That will come in a separate patch, as it is unrelated to the objective of this one; a conceptual sketch of the consecutive-access idea is given below.
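To illustrate what "consecutive" means here (a hypothetical sketch, not code from the patch): two loads are consecutive when their addresses differ by exactly one element width.

/* Hypothetical illustration, not from the patch: two accesses are
 * consecutive when the pointers differ by exactly one element,
 * e.g. a[1] immediately follows a[0] in memory. */
int is_consecutive(const short *first, const short *second) {
    return (second - first) == 1;  /* pointer difference, in elements */
}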
I also checked the code generated for types smaller than 32 bits on AArch64.
Test case:
#include <arm_neon.h>

short hadd(short *a) {
    return ((a[0] + a[2]) + (a[1] + a[3]));
}
IR after -O1 (without SLP):
define i16 @hadd(i16* nocapture readonly %a) #0 {
entry:
%0 = load i16* %a, align 2, !tbaa !1
%conv13 = zext i16 %0 to i32
%arrayidx1 = getelementptr inbounds i16* %a, i64 2
%1 = load i16* %arrayidx1, align 2, !tbaa !1
%conv214 = zext i16 %1 to i32
%arrayidx3 = getelementptr inbounds i16* %a, i64 1
%2 = load i16* %arrayidx3, align 2, !tbaa !1
%conv415 = zext i16 %2 to i32
%arrayidx5 = getelementptr inbounds i16* %a, i64 3
%3 = load i16* %arrayidx5, align 2, !tbaa !1
%conv616 = zext i16 %3 to i32
%add7 = add nuw nsw i32 %conv214, %conv13
%add = add nuw nsw i32 %add7, %conv415
%add8 = add nuw nsw i32 %add, %conv616
%conv9 = trunc i32 %add8 to i16
ret i16 %conv9
}
Since extension/truncation operations are involved here, the current patch does not vectorize this code.
If we remove those extension and truncation operations, the IR becomes:
define i16 @hadd(i16* nocapture readonly %a) #0 {
entry:
%0 = load i16* %a, align 2, !tbaa !1
%arrayidx1 = getelementptr inbounds i16* %a, i64 2
%1 = load i16* %arrayidx1, align 2, !tbaa !1
%arrayidx3 = getelementptr inbounds i16* %a, i64 1
%2 = load i16* %arrayidx3, align 2, !tbaa !1
%arrayidx5 = getelementptr inbounds i16* %a, i64 3
%3 = load i16* %arrayidx5, align 2, !tbaa !1
%add7 = add nuw nsw i16 %0, %1
%add = add nuw nsw i16 %2, %3
%add8 = add nuw nsw i16 %add, %add7
ret i16 %add8
}
LLVM vectorizes this IR with the proposed patch.
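For comparison, a 32-bit version of the same source (a hypothetical example, not taken from the patch) produces IR with no zext/trunc in the first place, since C's integer promotions keep int arithmetic at i32, so the patch vectorizes such a reduction directly:

/* Hypothetical 32-bit variant: int arithmetic needs no widening, so
 * the resulting IR contains no zext/trunc and the reduction chain
 * feeding the return can be vectorized by the proposed patch. */
int hadd32(int *a) {
    return ((a[0] + a[2]) + (a[1] + a[3]));
}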
Assembly for the 16-bit case with extension/truncation after running the SLP pass (no vectorization is done in this case):
ldrh w8, [x0]        // load a[0]
ldrh w9, [x0, #4]    // load a[2]
ldrh w10, [x0, #2]   // load a[1]
ldrh w11, [x0, #6]   // load a[3]
add w8, w9, w8       // a[2] + a[0]
add w8, w8, w10      // + a[1]
add w0, w8, w11      // + a[3]
ret
Assembly for the 16-bit case without extension/truncation after running the SLP pass (vectorization is done in this case):
ldrh w8, [x0]        // load a[0]
ldrh w9, [x0, #2]    // load a[1]
ldrh w10, [x0, #4]   // load a[2]
ldrh w11, [x0, #6]   // load a[3]
fmov s0, w8          // v0.s[0] = a[0]
fmov s1, w10         // v1.s[0] = a[2]
ins v0.s[1], w9      // v0.s[1] = a[1]
ins v1.s[1], w11     // v1.s[1] = a[3]
add v0.2s, v0.2s, v1.2s   // {a[0]+a[2], a[1]+a[3]}
fmov w8, s0          // extract lane 0
mov w9, v0.s[1]      // extract lane 1
add w0, w9, w8       // final scalar add
ret
This seems like bad code for data types smaller than 32 bits: the four scalar loads and lane inserts outweigh the single vector add.
However, the current patch does not vectorize data smaller than 32 bits, since it skips vectorization whenever a truncation/extension is encountered.
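For reference, a near-ideal sequence for this function might look like the following (a hypothetical sketch, not output produced by this patch; it assumes the backend could select a single vector load plus a horizontal add):

ldr d0, [x0]         // load a[0..3] as one 4 x 16-bit vector
addv h0, v0.4h       // horizontal add across all four lanes
umov w0, v0.h[0]     // move the 16-bit sum to the return register
ret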
Please help review this patch.
Regards,
Suyog
http://reviews.llvm.org/D6227