[llvm] r285517 - [LoopVectorize] Make interleaved-accesses analysis less conservative about

Dorit Nuzman via llvm-commits llvm-commits at lists.llvm.org
Sun Oct 30 05:23:29 PDT 2016


Author: dorit
Date: Sun Oct 30 07:23:26 2016
New Revision: 285517

URL: http://llvm.org/viewvc/llvm-project?rev=285517&view=rev
Log:
[LoopVectorize] Make interleaved-accesses analysis less conservative about
possible pointer-wrap-around concerns, in some cases.

Before this patch, collectConstStridedAccesses (part of the
interleaved-accesses analysis) called getPtrStride with [Assume=false,
ShouldCheckWrap=true] when examining all candidate pointers. This is too
conservative. Instead, this patch makes collectConstStridedAccesses take an
optimistic approach, calling getPtrStride with [Assume=true,
ShouldCheckWrap=false], and then, once the candidate interleave groups have
been formed, revisits the pointer-wrapping analysis only where it matters:
namely, in groups that have gaps, and where the gaps are not at the very end
of the group (in which case the loop is peeled). This second call to
getPtrStride uses [Assume=false, ShouldCheckWrap=true]; it could be further
improved to use Assume=true once we also add logic to make sure we do not
exceed the threshold of runtime SCEV checks.

Differential Revision: https://reviews.llvm.org/D25276

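To illustrate the distinction the log message draws (a rough sketch in plain
C++, loosely based on the C sources embedded in the new tests below; the
function names are made up for the example): the first loop reads every field
of every element, so its interleave group is full and needs no wrap check,
while the second loop reads only every other float, so its group has a gap and
the deferred wrap check still applies.

  #include <cstddef>

  struct Complex { float re, im; };

  // Full interleave group (factor 2, both members present): every byte the
  // wide load would touch is already touched by the scalar loop, so wrapping
  // around the address space would mean accessing nullptr even without the
  // transformation.
  void copyComplex(Complex *__restrict__ out, const Complex *__restrict__ in,
                   std::size_t start, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
      out[start + i] = in[start + i];
  }

  // Interleave group with a gap (factor 2, only one member): the wide load
  // would also read the odd elements, which the scalar loop never touches,
  // so possible wrapping must still be ruled out.
  void copyEven(float *__restrict__ out, const float *__restrict__ in,
                std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
      out[i] = in[2 * i];
  }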

Added:
    llvm/trunk/test/Transforms/LoopVectorize/interleaved-accesses-1.ll
    llvm/trunk/test/Transforms/LoopVectorize/interleaved-accesses-2.ll
    llvm/trunk/test/Transforms/LoopVectorize/interleaved-accesses-3.ll
Modified:
    llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp
    llvm/trunk/test/Transforms/LoopVectorize/AArch64/gather-cost.ll
    llvm/trunk/test/Transforms/LoopVectorize/ARM/gather-cost.ll
    llvm/trunk/test/Transforms/LoopVectorize/X86/gather-cost.ll

Modified: llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp
URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp?rev=285517&r1=285516&r2=285517&view=diff
==============================================================================
--- llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp (original)
+++ llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp Sun Oct 30 07:23:26 2016
@@ -5734,7 +5734,15 @@ void InterleavedAccessInfo::collectConst
         continue;
 
       Value *Ptr = getPointerOperand(&I);
-      int64_t Stride = getPtrStride(PSE, Ptr, TheLoop, Strides);
+      // We don't check wrapping here because we don't know yet if Ptr will be
+      // part of a full group or a group with gaps. Checking wrapping for all
+      // pointers (even those that end up in groups with no gaps) would be
+      // overly conservative. For full groups wrapping is fine, since if we
+      // wrapped around the address space we would do a memory access at
+      // nullptr even without the transformation. The wrapping checks are
+      // therefore deferred until after the interleaved groups have been formed.
+      int64_t Stride = getPtrStride(PSE, Ptr, TheLoop, Strides,
+                                    /*Assume=*/true, /*ShouldCheckWrap=*/false);
 
       const SCEV *Scev = replaceSymbolicStrideSCEV(PSE, Strides, Ptr);
       PointerType *PtrTy = dyn_cast<PointerType>(Ptr->getType());
@@ -5938,6 +5946,56 @@ void InterleavedAccessInfo::analyzeInter
     if (Group->getNumMembers() != Group->getFactor())
       releaseGroup(Group);
 
+  // Remove interleaved groups with gaps (currently only loads) whose memory
+  // accesses may wrap around. We have to revisit the getPtrStride analysis,
+  // this time with ShouldCheckWrap=true, since collectConstStrideAccesses does
+  // not check wrapping (see the documentation there).
+  // FORNOW we use Assume=false;
+  // TODO: Change to Assume=true, while making sure we don't exceed the
+  // threshold of runtime SCEV assumption checks (thereby potentially failing
+  // to vectorize altogether).
+  // Additional optional optimizations:
+  // TODO: If we are peeling the loop and we know that the first pointer
+  // doesn't wrap, then we can deduce that all pointers in the group don't
+  // wrap. This means that we could forcefully peel the loop so that only the
+  // first pointer needs a no-wrap check. Once we change to Assume=true we will
+  // then need at most one runtime check per interleaved group.
+  //
+  for (InterleaveGroup *Group : LoadGroups) {
+
+    // Case 1: A full group. Can skip the checks; for full groups, if the wide
+    // load were to wrap around the address space we would do a memory access
+    // at nullptr even without the transformation.
+    if (Group->getNumMembers() == Group->getFactor()) 
+      continue;
+
+    // Case 2: If the first and last members of the group don't wrap, then
+    // neither do the pointers in between. So we only check group member 0
+    // (which is always guaranteed to exist) and group member Factor-1; if the
+    // latter doesn't exist we know the loop will be peeled, in which case
+    // checking the first member is enough.
+    Value *FirstMemberPtr = getPointerOperand(Group->getMember(0));
+    if (!getPtrStride(PSE, FirstMemberPtr, TheLoop, Strides, /*Assume=*/false, 
+                      /*ShouldCheckWrap=*/true)) {
+      DEBUG(dbgs() << "LV: Invalidate candidate interleaved group due to "
+                      "potential pointer wrapping.\n");
+      releaseGroup(Group);
+      continue;
+    }
+
+    if (Instruction *LastMember = Group->getMember(Group->getFactor() - 1)) {
+      Value *LastMemberPtr = getPointerOperand(LastMember);
+      if (!getPtrStride(PSE, LastMemberPtr, TheLoop, Strides, /*Assume=*/false, 
+                        /*ShouldCheckWrap=*/true)) {
+        DEBUG(dbgs() << "LV: Invalidate candidate interleaved group due to "
+                        "potential pointer wrapping.\n");
+        releaseGroup(Group);
+        continue;
+      }
+    }
+  }
+
   // If there is a non-reversed interleaved load group with gaps, we will need
   // to execute at least one scalar epilogue iteration. This will ensure that
   // we don't speculatively access memory out-of-bounds. Note that we only need

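A brief aside on why Case 2 in the new code only needs to look at the first
and last members (a back-of-the-envelope sketch, not part of the patch; the
helper below is illustrative and not the InterleaveGroup API): for a
non-reversed group with factor F and element size S, member k at iteration i
accesses Base + (i*F + k)*S, which for any fixed i lies between the addresses
accessed by member 0 and member F-1, so if neither extreme wraps around the
address space, no member in between can wrap either.

  #include <cstdint>

  // Illustrative address computation for member k of a positive-stride
  // interleave group with factor F and element size S at iteration i. For a
  // fixed i it is monotonically increasing in k, i.e. sandwiched between the
  // member-0 and member-(F-1) addresses.
  std::uint64_t memberAddress(std::uint64_t Base, std::uint64_t i, unsigned k,
                              unsigned F, unsigned S) {
    return Base + (i * F + k) * S;
  }
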
Modified: llvm/trunk/test/Transforms/LoopVectorize/AArch64/gather-cost.ll
URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/LoopVectorize/AArch64/gather-cost.ll?rev=285517&r1=285516&r2=285517&view=diff
==============================================================================
--- llvm/trunk/test/Transforms/LoopVectorize/AArch64/gather-cost.ll (original)
+++ llvm/trunk/test/Transforms/LoopVectorize/AArch64/gather-cost.ll Sun Oct 30 07:23:26 2016
@@ -1,4 +1,4 @@
-; RUN: opt -loop-vectorize -mtriple=arm64-apple-ios -S -mcpu=cyclone < %s | FileCheck %s
+; RUN: opt -loop-vectorize -mtriple=arm64-apple-ios -S -mcpu=cyclone -enable-interleaved-mem-accesses=false < %s | FileCheck %s
 target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-n32:64-S128"
 
 @kernel = global [512 x float] zeroinitializer, align 16

Modified: llvm/trunk/test/Transforms/LoopVectorize/ARM/gather-cost.ll
URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/LoopVectorize/ARM/gather-cost.ll?rev=285517&r1=285516&r2=285517&view=diff
==============================================================================
--- llvm/trunk/test/Transforms/LoopVectorize/ARM/gather-cost.ll (original)
+++ llvm/trunk/test/Transforms/LoopVectorize/ARM/gather-cost.ll Sun Oct 30 07:23:26 2016
@@ -1,4 +1,4 @@
-; RUN: opt -loop-vectorize -mtriple=thumbv7s-apple-ios6.0.0 -S < %s | FileCheck %s
+; RUN: opt -loop-vectorize -mtriple=thumbv7s-apple-ios6.0.0 -S -enable-interleaved-mem-accesses=false < %s | FileCheck %s
 
 target datalayout = "e-p:32:32:32-i1:8:32-i8:8:32-i16:16:32-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:32:64-v128:32:128-a0:0:32-n32-S32"
 

Modified: llvm/trunk/test/Transforms/LoopVectorize/X86/gather-cost.ll
URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/LoopVectorize/X86/gather-cost.ll?rev=285517&r1=285516&r2=285517&view=diff
==============================================================================
--- llvm/trunk/test/Transforms/LoopVectorize/X86/gather-cost.ll (original)
+++ llvm/trunk/test/Transforms/LoopVectorize/X86/gather-cost.ll Sun Oct 30 07:23:26 2016
@@ -1,4 +1,4 @@
-; RUN: opt -loop-vectorize -mtriple=x86_64-apple-macosx -S -mcpu=corei7-avx < %s | FileCheck %s
+; RUN: opt -loop-vectorize -mtriple=x86_64-apple-macosx -S -mcpu=corei7-avx -enable-interleaved-mem-accesses=false < %s | FileCheck %s
 target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
 
 @kernel = global [512 x float] zeroinitializer, align 16

Added: llvm/trunk/test/Transforms/LoopVectorize/interleaved-accesses-1.ll
URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/LoopVectorize/interleaved-accesses-1.ll?rev=285517&view=auto
==============================================================================
--- llvm/trunk/test/Transforms/LoopVectorize/interleaved-accesses-1.ll (added)
+++ llvm/trunk/test/Transforms/LoopVectorize/interleaved-accesses-1.ll Sun Oct 30 07:23:26 2016
@@ -0,0 +1,78 @@
+; RUN: opt -S -loop-vectorize -instcombine -force-vector-width=4 -force-vector-interleave=1 -enable-interleaved-mem-accesses=true < %s | FileCheck %s
+
+target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
+
+; Check that the interleaved-mem-access analysis identifies the access
+; to array 'in' as interleaved, despite the possibly wrapping unsigned
+; 'out_ix' index.
+;
+; In this test the interleave-groups are full (have no gaps), so no wrapping
+; checks are necessary. We can call getPtrStride with Assume=false and
+; ShouldCheckWrap=false to safely figure out that the stride is 2.
+
+; #include <stdlib.h>
+; class Complex {
+; private:
+;  float real_;
+;  float imaginary_;
+;
+;public:
+; Complex() : real_(0), imaginary_(0) { }
+; Complex(float real, float imaginary) : real_(real), imaginary_(imaginary) { }
+; Complex(const Complex &rhs) : real_(rhs.real()), imaginary_(rhs.imaginary()) { }
+;
+; inline float real() const { return real_; }
+; inline float imaginary() const { return imaginary_; }
+;};
+;
+;void test(Complex * __restrict__ out, Complex * __restrict__ in, size_t out_start, size_t size)
+;{
+;   for (size_t out_offset = 0; out_offset < size; ++out_offset)
+;     {
+;       size_t out_ix = out_start + out_offset;
+;       Complex t0 = in[out_ix];
+;       out[out_ix] = t0;
+;     }
+;}
+
+; CHECK: vector.body:
+; CHECK: %wide.vec = load <8 x i32>, <8 x i32>* {{.*}}, align 4
+; CHECK: shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
+; CHECK: shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
+
+%class.Complex = type { float, float }
+
+define void @_Z4testP7ComplexS0_mm(%class.Complex* noalias nocapture %out, %class.Complex* noalias nocapture readonly %in, i64 %out_start, i64 %size) local_unnamed_addr {
+entry:
+  %cmp9 = icmp eq i64 %size, 0
+  br i1 %cmp9, label %for.cond.cleanup, label %for.body.preheader
+
+for.body.preheader:
+  br label %for.body
+
+for.cond.cleanup.loopexit:
+  br label %for.cond.cleanup
+
+for.cond.cleanup:
+  ret void
+
+for.body:
+  %out_offset.010 = phi i64 [ %inc, %for.body ], [ 0, %for.body.preheader ]
+  %add = add i64 %out_offset.010, %out_start
+  %arrayidx = getelementptr inbounds %class.Complex, %class.Complex* %in, i64 %add
+  %0 = bitcast %class.Complex* %arrayidx to i32*
+  %1 = load i32, i32* %0, align 4
+  %imaginary_.i.i = getelementptr inbounds %class.Complex, %class.Complex* %in, i64 %add, i32 1
+  %2 = bitcast float* %imaginary_.i.i to i32*
+  %3 = load i32, i32* %2, align 4
+  %arrayidx1 = getelementptr inbounds %class.Complex, %class.Complex* %out, i64 %add
+  %4 = bitcast %class.Complex* %arrayidx1 to i64*
+  %t0.sroa.4.0.insert.ext = zext i32 %3 to i64
+  %t0.sroa.4.0.insert.shift = shl nuw i64 %t0.sroa.4.0.insert.ext, 32
+  %t0.sroa.0.0.insert.ext = zext i32 %1 to i64
+  %t0.sroa.0.0.insert.insert = or i64 %t0.sroa.4.0.insert.shift, %t0.sroa.0.0.insert.ext
+  store i64 %t0.sroa.0.0.insert.insert, i64* %4, align 4
+  %inc = add nuw i64 %out_offset.010, 1
+  %exitcond = icmp eq i64 %inc, %size
+  br i1 %exitcond, label %for.cond.cleanup.loopexit, label %for.body
+}

Added: llvm/trunk/test/Transforms/LoopVectorize/interleaved-accesses-2.ll
URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/LoopVectorize/interleaved-accesses-2.ll?rev=285517&view=auto
==============================================================================
--- llvm/trunk/test/Transforms/LoopVectorize/interleaved-accesses-2.ll (added)
+++ llvm/trunk/test/Transforms/LoopVectorize/interleaved-accesses-2.ll Sun Oct 30 07:23:26 2016
@@ -0,0 +1,58 @@
+; RUN: opt -S -loop-vectorize -instcombine -force-vector-width=4 -force-vector-interleave=1 -enable-interleaved-mem-accesses=true < %s | FileCheck %s
+
+target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
+
+; Check that the interleaved-mem-access analysis currently does not create an
+; interleave group for the strided access to array 'in', due to the possibly
+; wrapping '2 * out_offset' index.
+;
+; In this test the interleave-group of the loads is not full (has gaps), so 
+; the wrapping checks are necessary. Here this cannot be done statically so 
+; runtime checks are needed, but with Assume=false getPtrStride cannot add 
+; runtime checks and as a result we can't create the interleave-group.
+;
+; FIXME: This is currently a missed optimization until we can use Assume=true 
+; with proper threshold checks. Once we do that the candidate interleave-group
+; will not be invalidated by the wrapping checks.
+
+; #include <stdlib.h>
+; void test(float * __restrict__ out, float * __restrict__ in, size_t size)
+; {
+;    for (size_t out_offset = 0; out_offset < size; ++out_offset)
+;      {
+;        float t0 = in[2*out_offset];
+;        out[out_offset] = t0;
+;      }
+; }
+
+; CHECK: vector.body:
+; CHECK-NOT: %wide.vec = load <8 x i32>, <8 x i32>* {{.*}}, align 4
+; CHECK-NOT: shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
+
+define void @_Z4testPfS_m(float* noalias nocapture %out, float* noalias nocapture readonly %in, i64 %size) local_unnamed_addr {
+entry:
+  %cmp7 = icmp eq i64 %size, 0
+  br i1 %cmp7, label %for.cond.cleanup, label %for.body.preheader
+
+for.body.preheader:
+  br label %for.body
+
+for.cond.cleanup.loopexit:
+  br label %for.cond.cleanup
+
+for.cond.cleanup:
+  ret void
+
+for.body:
+  %out_offset.08 = phi i64 [ %inc, %for.body ], [ 0, %for.body.preheader ]
+  %mul = shl i64 %out_offset.08, 1
+  %arrayidx = getelementptr inbounds float, float* %in, i64 %mul
+  %0 = bitcast float* %arrayidx to i32*
+  %1 = load i32, i32* %0, align 4
+  %arrayidx1 = getelementptr inbounds float, float* %out, i64 %out_offset.08
+  %2 = bitcast float* %arrayidx1 to i32*
+  store i32 %1, i32* %2, align 4
+  %inc = add nuw i64 %out_offset.08, 1
+  %exitcond = icmp eq i64 %inc, %size
+  br i1 %exitcond, label %for.cond.cleanup.loopexit, label %for.body
+}

Added: llvm/trunk/test/Transforms/LoopVectorize/interleaved-accesses-3.ll
URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/LoopVectorize/interleaved-accesses-3.ll?rev=285517&view=auto
==============================================================================
--- llvm/trunk/test/Transforms/LoopVectorize/interleaved-accesses-3.ll (added)
+++ llvm/trunk/test/Transforms/LoopVectorize/interleaved-accesses-3.ll Sun Oct 30 07:23:26 2016
@@ -0,0 +1,57 @@
+; RUN: opt -S -loop-vectorize -instcombine -force-vector-width=4 -force-vector-interleave=1 -enable-interleaved-mem-accesses=true < %s | FileCheck %s
+
+target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
+
+; Check that the interleaved-mem-access analysis currently does not create an
+; interleave group for the access to array 'a' due to possible pointer
+; wrap-around.
+;
+; To begin with, in this test the candidate interleave group can be created 
+; only when getPtrStride is called with Assume=true. Next, because
+; the interleave-group of the loads is not full (has gaps), we also need to check 
+; for possible pointer wrapping. Here we currently use Assume=false and as a 
+; result cannot prove the transformation is safe and therefore invalidate the
+; candidate interleave group.
+;
+; FIXME: This is a missed optimization. Once we use Assume=true here, we will
+; not have to invalidate the group.
+
+; void func(unsigned * __restrict a, unsigned * __restrict b, unsigned char x, unsigned char y) {
+;  int i = 0;
+;  for (unsigned char index = x; i < y; index +=2, ++i)
+;    b[i] = a[index] * 2;
+;
+; }
+
+; CHECK: vector.body:
+; CHECK-NOT: %wide.vec = load <8 x i32>, <8 x i32>* {{.*}}, align 4
+; CHECK-NOT: shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
+
+define void @_Z4funcPjS_hh(i32* noalias nocapture readonly %a, i32* noalias nocapture %b, i8 zeroext %x, i8 zeroext %y) local_unnamed_addr {
+entry:
+  %cmp9 = icmp eq i8 %y, 0
+  br i1 %cmp9, label %for.cond.cleanup, label %for.body.preheader
+
+for.body.preheader:
+  %wide.trip.count = zext i8 %y to i64
+  br label %for.body
+
+for.cond.cleanup.loopexit:
+  br label %for.cond.cleanup
+
+for.cond.cleanup:
+  ret void
+
+for.body:
+  %indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %for.body.preheader ]
+  %index.011 = phi i8 [ %add, %for.body ], [ %x, %for.body.preheader ]
+  %idxprom = zext i8 %index.011 to i64
+  %arrayidx = getelementptr inbounds i32, i32* %a, i64 %idxprom
+  %0 = load i32, i32* %arrayidx, align 4
+  %mul = shl i32 %0, 1
+  %arrayidx2 = getelementptr inbounds i32, i32* %b, i64 %indvars.iv
+  store i32 %mul, i32* %arrayidx2, align 4
+  %add = add i8 %index.011, 2
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  %exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
+  br i1 %exitcond, label %for.cond.cleanup.loopexit, label %for.body
+}

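To make the wrap-around that this last test worries about concrete (the input
values below are made up for illustration and are not part of the test): an
8-bit induction variable stepping by 2 can overflow mid-loop, at which point
a[index] jumps back toward the start of the array instead of continuing the
stride-2 walk.

  #include <cstdio>

  int main() {
    // Hypothetical inputs for the loop from the C source quoted above.
    unsigned char x = 250, y = 10;
    int i = 0;
    for (unsigned char index = x; i < y; index += 2, ++i)
      std::printf("%u ", (unsigned)index); // 250 252 254 0 2 4 6 8 10 12
    std::printf("\n");
    return 0;
  }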