[PATCH] [SLPVectorization] Enhance Ability to Vectorize Horizontal Reductions from Consecutive Loads
Michael Zolotukhin
mzolotukhin at apple.com
Wed Jan 7 17:37:06 PST 2015
Hi Suyog,
I've also just managed to construct an example in which we perform an incorrect transformation.
Here it is:
@a = common global [1000 x i32] zeroinitializer, align 16
@b = common global [1000 x i32] zeroinitializer, align 16
@c = common global [1000 x i32] zeroinitializer, align 16
; Function Attrs: nounwind ssp uwtable
define void @foo() #0 {
entry:
%a0 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 0), align 16, !tbaa !2
%a1 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 1), align 4, !tbaa !2
%a2 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 2), align 8, !tbaa !2
%a3 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 3), align 4, !tbaa !2
%a4 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 4), align 16, !tbaa !2
%a5 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 5), align 4, !tbaa !2
%a6 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 6), align 8, !tbaa !2
%a7 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 7), align 4, !tbaa !2
%b0 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 0), align 16, !tbaa !2
%b1 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 1), align 4, !tbaa !2
%b2 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 2), align 8, !tbaa !2
%b3 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 3), align 4, !tbaa !2
%b4 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 4), align 16, !tbaa !2
%b5 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 5), align 4, !tbaa !2
%b6 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 6), align 8, !tbaa !2
%b7 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 7), align 4, !tbaa !2
%add01 = add i32 %a0, %a1
%add02 = add i32 %a4, %b4
%add0 = add i32 %add01, %add02
%add11 = add i32 %b0, %b1
%add12 = add i32 %a5, %b5
%add1 = add i32 %add11, %add12
%add21 = add i32 %a2, %b2
%add22 = add i32 %a6, %b6
%add2 = add i32 %add21, %add22
%add31 = add i32 %a3, %b3
%add32 = add i32 %a7, %b7
%add3 = add i32 %add31, %add32
store i32 %add0, i32* getelementptr inbounds ([1000 x i32]* @c, i32 0, i64 0), align 16
store i32 %add1, i32* getelementptr inbounds ([1000 x i32]* @c, i32 0, i64 1), align 4
store i32 %add2, i32* getelementptr inbounds ([1000 x i32]* @c, i32 0, i64 2), align 8
store i32 %add3, i32* getelementptr inbounds ([1000 x i32]* @c, i32 0, i64 3), align 4
ret void
}
The code might look confusing, but it's actually pretty simple. I took the computation `c[0:3] = (a[0:3]+b[0:3]) + (a[4:7]+b[4:7])` and swapped `b[0]` and `a[1]` in it. The patched compiler incorrectly swaps these two operands back.
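To make the lane structure explicit, here is a hypothetical C-level equivalent of the IR above (a sketch of mine, not code taken from the test case; the function names are made up):

extern int a[1000], b[1000], c[1000];

void intended(void) {
  /* Intended lane-wise computation, before the swap: */
  for (int i = 0; i < 4; i++)
    c[i] = (a[i] + b[i]) + (a[i + 4] + b[i + 4]);
}

void as_written(void) {
  /* What the IR actually encodes: b[0] and a[1] are exchanged between
     lanes 0 and 1.  Undoing that exchange changes the values stored to
     c[0] and c[1], so it is not a legal rewrite of this function. */
  c[0] = (a[0] + a[1]) + (a[4] + b[4]);
  c[1] = (b[0] + b[1]) + (a[5] + b[5]);
  c[2] = (a[2] + b[2]) + (a[6] + b[6]);
  c[3] = (a[3] + b[3]) + (a[7] + b[7]);
}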
The problem happens because `reorderIfConsecutiveLoads` is currently called not only for reductions, but for store chains as well. While it's valid to swap operands within a reduction, it's illegal to do so across lanes in ordinary vector computations.
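The distinction can be shown with a small C sketch (again mine, not from the patch): in a horizontal reduction all lanes feed a single commutative, associative sum, so exchanging operands between lanes cannot change the result, whereas in a store chain each lane produces its own value:

extern int a[1000], b[1000], c[1000];

int reduction(void) {
  /* Every operand feeds one integer sum, so
     (a[0] + b[0]) + (a[1] + b[1]) == (a[0] + a[1]) + (b[0] + b[1]);
     reordering operands across lanes is harmless here. */
  return (a[0] + b[0]) + (a[1] + b[1]);
}

void store_chain(void) {
  /* Each lane is stored separately; exchanging b[0] and a[1] between
     the two lanes would change both c[0] and c[1]. */
  c[0] = a[0] + b[0];
  c[1] = a[1] + b[1];
}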
REPOSITORY
rL LLVM
http://reviews.llvm.org/D6675