[PATCH] [SLPVectorization] Enhance Ability to Vectorize Horizontal Reductions from Consecutive Loads

Michael Zolotukhin mzolotukhin at apple.com
Wed Jan 7 17:37:06 PST 2015


Hi Suyog,

I've also just managed to construct an example in which we perform an incorrect transformation.

Here it is:

  @a = common global [1000 x i32] zeroinitializer, align 16
  @b = common global [1000 x i32] zeroinitializer, align 16
  @c = common global [1000 x i32] zeroinitializer, align 16
  
  ; Function Attrs: nounwind ssp uwtable
  define void @foo() #0 {
  entry:
    %a0 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 0), align 16, !tbaa !2
    %a1 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 1), align 4, !tbaa !2
    %a2 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 2), align 8, !tbaa !2
    %a3 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 3), align 4, !tbaa !2
    %a4 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 4), align 16, !tbaa !2
    %a5 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 5), align 4, !tbaa !2
    %a6 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 6), align 8, !tbaa !2
    %a7 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 7), align 4, !tbaa !2
  
    %b0 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 0), align 16, !tbaa !2
    %b1 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 1), align 4, !tbaa !2
    %b2 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 2), align 8, !tbaa !2
    %b3 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 3), align 4, !tbaa !2
    %b4 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 4), align 16, !tbaa !2
    %b5 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 5), align 4, !tbaa !2
    %b6 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 6), align 8, !tbaa !2
    %b7 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 7), align 4, !tbaa !2
  
    %add01 = add i32 %a0, %a1
    %add02 = add i32 %a4, %b4
    %add0 = add i32 %add01, %add02
  
    %add11 = add i32 %b0, %b1
    %add12 = add i32 %a5, %b5
    %add1 = add i32 %add11, %add12
  
    %add21 = add i32 %a2, %b2
    %add22 = add i32 %a6, %b6
    %add2 = add i32 %add21, %add22
  
    %add31 = add i32 %a3, %b3
    %add32 = add i32 %a7, %b7
    %add3 = add i32 %add31, %add32
  
    store i32 %add0, i32* getelementptr inbounds ([1000 x i32]* @c, i32 0, i64 0), align 16
    store i32 %add1, i32* getelementptr inbounds ([1000 x i32]* @c, i32 0, i64 1), align 4
    store i32 %add2, i32* getelementptr inbounds ([1000 x i32]* @c, i32 0, i64 2), align 8
    store i32 %add3, i32* getelementptr inbounds ([1000 x i32]* @c, i32 0, i64 3), align 4
    ret void
  }

The code might look confusing, but it's actually pretty simple: I took the computation `c[0:3] = (a[0:3]+b[0:3]) + (a[4:7]+b[4:7])` and swapped `b[0]` and `a[1]` in it. The patched compiler incorrectly swaps these two operands back.
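
For reference, here is roughly the same computation written out in C. This is only a sketch of what the IR above corresponds to (hypothetical source, not the original test case):

  /* Hypothetical C-level version of the IR above: the intended,
     post-swap computation. */
  int a[1000], b[1000], c[1000];

  void foo(void) {
    c[0] = (a[0] + a[1]) + (a[4] + b[4]); /* a[1] sits where b[0] used to be... */
    c[1] = (b[0] + b[1]) + (a[5] + b[5]); /* ...and b[0] where a[1] used to be  */
    c[2] = (a[2] + b[2]) + (a[6] + b[6]);
    c[3] = (a[3] + b[3]) + (a[7] + b[7]);
  }

After the incorrect swap-back, the vectorized code effectively computes `c[0] = (a[0]+b[0]) + (a[4]+b[4])` and `c[1] = (a[1]+b[1]) + (a[5]+b[5])`, which is not equivalent to the code above.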

The problem happens because `reorderIfConsecutiveLoads` is currently called not only for reductions but for store chains as well. While it is valid to swap operands within a reduction, it is illegal to do so across lanes in an ordinary vector computation.
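
To make the distinction concrete, here is a minimal C sketch (illustrative only, not part of the patch or the test case) of why the reordering is sound for a horizontal reduction but not for a lane-wise store chain:

  /* Illustrative only: why swapping loads between lanes is safe for a
     reduction but not for a lane-wise computation. */
  int reduction(const int *a) {
    /* One scalar result: (a[0]+a[1]) + (a[2]+a[3]) gives the same integer
       sum no matter which lane each load ends up in, so reordering is fine. */
    return (a[0] + a[1]) + (a[2] + a[3]);
  }

  void store_chain(const int *a, const int *b, int *c) {
    /* Independent per-lane results: moving b[0] into lane 0 and a[1] into
       lane 1 would change the values stored to c[0] and c[1]. */
    c[0] = a[0] + a[1];
    c[1] = b[0] + b[1];
  }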


REPOSITORY
  rL LLVM

http://reviews.llvm.org/D6675
