[llvm] r184684 - LoopVectorize: Add utility class for checking dependency among accesses

Mon Jul 1 11:18:16 PDT 2013

Hi Preston,

I have taken a high-level look at the implementation of the Dependence Analysis pass. Here are my observations so far.

- Using GetElementPtr during the analysis.

  Part of the current analysis depends on two geps having matching pointer types. I don’t think this is the right approach. Two differently typed GetElementPtrs can compute the same access function. A GetElementPtr is only a way to describe an address computation; it does not impose interesting constraints (see my next point) that are not already embodied in the ScalarEvolution of the pointer. And because the ScalarEvolution is in a canonical form, we would not need to depend on matching gep pointer types.
  I believe the analysis should work directly from, and directly interpret, the ScalarEvolution expression describing the access function.

 For the example I give below, the SCEVs of the two access functions are the following:

  {{            %A, +,1024}<nw><%for.cond1.preheader.us>,  +,4}<nw><%for.body3.us>
  {{((4 * %N) + %A),+,1024}<nw><%for.cond1.preheader.us>,  +,4}<%for.body3.us>

 Using the SCEV we can deduce that there is a dependence carried on the y-loop when N >= 256:

   (4 * N) + 1024 * y + 4*i == 1024 * Y + 4 * I
   N + 256  * y + i == 256 * Y + I

 We now either know that N is < 256 (and hence 0 <= i < 256), which allows us to assume independence, or we could insert a dynamic check, or we have to assume a dependence carried by the outer loop.
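
 A minimal C sketch of the equation above, for anyone who wants to probe it by hand. It is not part of LLVM; the function name and the size cap are my own assumptions. It exhaustively looks for a store A[y][i+N] and a load A[Y][I] from different outer iterations that hit the same linearized element, i.e. N + 256*y + i == 256*Y + I with y != Y:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: returns true if some store A[y][i+n] and some load
 * A[Y][I] with y != Y address the same element, where 0 <= y, Y < 128 and
 * 0 <= i, I < n. Elements are compared by linearized index. */
bool has_y_carried_dependence(long n) {
  assert(n <= 4096 && "keep the exhaustive search small");
  for (long y = 0; y < 128; ++y)
    for (long i = 0; i < n; ++i) {
      long store = n + 256 * y + i;   /* element written by iteration (y, i) */
      for (long Y = 0; Y < 128; ++Y) {
        long I = store - 256 * Y;     /* inner index the load would need */
        if (Y != y && I >= 0 && I < n)
          return true;                /* iteration (Y, I) reads that element */
      }
    }
  return false;
}
```

 For N = 64 the search finds nothing, and for N = 256 it finds a collision, as expected. It also reports a collision for some N below 256 (for instance N = 200, where i = 56 gives the store subscript i + N = 256), so what has to stay below 256 is the range of i + N and not only the range of i.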

- A GetElementPtr used to describe array accesses does not impose array dimension restrictions.

  The code currently assumes that two different indices of a GetElementPtr can be analyzed independently. This is not correct. The address contribution of a higher index may “overflow” into that of a lower index. An array type in a gep does not restrict the index range; only the type of the array elements is relevant for the address computation (see http://llvm.org/docs/GetElementPtr.html#what-happens-if-an-array-index-is-out-of-bounds).
  If we want to use a “multi-dimensional” array property (indices can be analyzed independently), we first have to show that this holds for the LLVM-IR in question. In my example below we have to make sure that N < 256; otherwise, we have to analyze the indices together.

  Let me give an example:
    void f(int A[256][256], long N) {
      for (long y = 0; y < 128; ++y)
        for (long i = 0; i < N; ++i)
          A[y][i+N] = 2 * A[y][i];
    }

  cat > test-2d-array.2.ll
===
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.8.0"

; Function Attrs: nounwind ssp uwtable
define void @f([256 x i32]* nocapture %A, i64 %N) #0 {
entry:
  %cmp218 = icmp sgt i64 %N, 0
  br i1 %cmp218, label %entry.split.us, label %entry.entry.split_crit_edge

entry.entry.split_crit_edge:                      ; preds = %entry
  br label %entry.split

entry.split.us:                                   ; preds = %entry
  br label %for.cond1.preheader.us

for.cond1.preheader.us:                           ; preds = %for.inc7.us, %entry.split.us
  %y.020.us = phi i64 [ 0, %entry.split.us ], [ %inc8.us, %for.inc7.us ]
  br i1 true, label %for.body3.lr.ph.us, label %for.inc7.us

for.inc7.us:                                      ; preds = %for.cond1.for.inc7_crit_edge.us, %for.cond1.preheader.us
  %inc8.us = add nsw i64 %y.020.us, 1
  %exitcond22 = icmp ne i64 %inc8.us, 128
  br i1 %exitcond22, label %for.cond1.preheader.us, label %for.end9.us-lcssa.us

for.body3.us:                                     ; preds = %for.body3.lr.ph.us, %for.body3.us
  %i.019.us = phi i64 [ 0, %for.body3.lr.ph.us ], [ %inc.us, %for.body3.us ]
  %arrayidx4.us = getelementptr inbounds [256 x i32]* %A, i64 %y.020.us, i64 %i.019.us
  %0 = load i32* %arrayidx4.us, align 4, !tbaa !0
  %mul.us = shl nsw i32 %0, 1
  %add.us = add nsw i64 %i.019.us, %N
  %arrayidx6.us = getelementptr inbounds [256 x i32]* %A, i64 %y.020.us, i64 %add.us
  store i32 %mul.us, i32* %arrayidx6.us, align 4, !tbaa !0
  %inc.us = add nsw i64 %i.019.us, 1
  %exitcond = icmp ne i64 %inc.us, %N
  br i1 %exitcond, label %for.body3.us, label %for.cond1.for.inc7_crit_edge.us

for.body3.lr.ph.us:                               ; preds = %for.cond1.preheader.us
  br label %for.body3.us

for.cond1.for.inc7_crit_edge.us:                  ; preds = %for.body3.us
  br label %for.inc7.us

for.end9.us-lcssa.us:                             ; preds = %for.inc7.us
  br label %for.end9

entry.split:                                      ; preds = %entry.entry.split_crit_edge
  br label %for.end9.us-lcssa

for.end9.us-lcssa:                                ; preds = %entry.split
  br label %for.end9

for.end9:                                         ; preds = %for.end9.us-lcssa.us, %for.end9.us-lcssa
  ret void
}
attributes #0 = { nounwind ssp uwtable "less-precise-fpmad"="false" "no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf"="true" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "unsafe-fp-math"="false" "use-soft-float"="false" }

!0 = metadata !{metadata !"int", metadata !1}
!1 = metadata !{metadata !"omnipotent char", metadata !2}
!2 = metadata !{metadata !"Simple C/C++ TBAA"}
===

  Release+Asserts/bin/opt -basicaa -analyze -da < test-2d-array.2.ll -debug-only=da

  If N is big enough (>= 256), there is a dependence between the accesses (it might not be valid C to have N > 255, but it is certainly valid under LLVM IR semantics). The current implementation treats the different getelementptr indices as independent and returns “none” as the dependence answer for the two accesses. This is not correct.
  A “getelementptr [256 x i32]* %A, 0, %i” does not imply that i must be < 256.
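
  To make the last point concrete, here is a small C sketch that models only the address arithmetic of the two-index gep in the test case. The function name is hypothetical, and the sketch deliberately ignores what "inbounds" says about out-of-range results; it just shows that the second index is never clamped to [0, 256):

```c
#include <assert.h>
#include <stdint.h>

/* Models the byte offset from %A computed by
 *   getelementptr [256 x i32]* %A, i64 y, i64 i
 * 1024 and 4 are the sizes of [256 x i32] and i32 under the data layout
 * of the test case. Hypothetical helper, not an LLVM API. */
int64_t gep_byte_offset(int64_t y, int64_t i) {
  assert(y >= 0 && y < 128 && i >= 0 && i < 1024 && "small demo range only");
  return y * 1024 + i * 4; /* i is scaled and added, never range-checked */
}
```

  With this model, gep_byte_offset(0, 256) and gep_byte_offset(1, 0) are the same offset, which is exactly the case an index-by-index analysis misses.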

- Overflow 
  It seems the current implementation does not handle overflow correctly. We must be very careful with cases where part of the access function might overflow.

;;  for (long unsigned i = 0; i < N; i++) {
;;    A[3*i+7] = i;
;;    *B++ = A[3*i];
;;  }

A dependence between the two accesses is possible due to integer wrapping, but the current implementation reports that there is none. I have not investigated why.

define void @overflow(i32* noalias %A , i32* noalias %B, i64 %n) {
entry:
  br label %for.body

for.body:                                         ; preds = %entry, %for.body
  %i.02 = phi i64 [ 0, %entry ], [ %inc, %for.body ]
  %B.addr.01 = phi i32* [ %B, %entry ], [ %incdec.ptr, %for.body ]
  %conv = trunc i64 %i.02 to i32
  %mul = mul i64 %i.02, 3
  %add = add i64 %mul, 7
  %arrayidx = getelementptr inbounds i32* %A, i64 %add
  store i32 %conv, i32* %arrayidx, align 4
  %mul1 = mul i64 %i.02, 3
  %arrayidx2 = getelementptr inbounds i32* %A, i64 %mul1
  %0 = load i32* %arrayidx2, align 4
  %incdec.ptr = getelementptr inbounds i32* %B.addr.01, i64 1
  store i32 %0, i32* %B.addr.01, align 4
  %inc = add i64 %i.02, 1
  %exitcond = icmp ne i64 %inc, %n
  br i1 %exitcond, label %for.body, label %for.end

for.end:                                          ; preds = %for.body
  ret void
}
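
As a rough illustration of the wrapping case in @overflow, the C sketch below evaluates the two index expressions with wrapping 64-bit arithmetic (the mul and add in the IR carry no nsw/nuw flags) and computes an iteration distance at which they coincide. The helper names are mine, and the sketch looks only at the index values; it ignores the scaling by the element size and the "inbounds" keyword on the geps.

```c
#include <assert.h>
#include <stdint.h>

/* Index expressions from @overflow; unsigned arithmetic wraps modulo 2^64. */
uint64_t store_index(uint64_t i) { return 3 * i + 7; }
uint64_t load_index(uint64_t i)  { return 3 * i; }

/* Hypothetical helper: the distance d with 3*d == 7 (mod 2^64).
 * 0xAAAAAAAAAAAAAAAB is the multiplicative inverse of 3 modulo 2^64. */
uint64_t wrapped_iteration_distance(void) {
  const uint64_t inv3 = 0xAAAAAAAAAAAAAAABULL;
  assert((uint64_t)(3 * inv3) == 1 && "inv3 must invert 3 modulo 2^64");
  return 7 * inv3;
}
```

The distance comes out as 0xAAAAAAAAAAAAAAAD, so the load in iteration i + d uses the same index value as the store in iteration i. The trip count needed to get there is enormous, but nothing in the IR rules it out.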

Thanks for pushing LLVM on this front!

Best,
Arnold

On Jun 24, 2013, at 1:08 PM, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:

> 
> On Jun 24, 2013, at 12:14 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> 
>> ----- Original Message -----
>>> 
>>> On Jun 24, 2013, at 8:14 AM, Benjamin Kramer <benny.kra at gmail.com>
>>> wrote:
>>> 
>>>> On 24.06.2013, at 05:58, Arnold Schwaighofer
>>>> <aschwaighofer at apple.com> wrote:
>>>> 
>>>>> Author: arnolds
>>>>> Date: Sun Jun 23 22:55:45 2013
>>>>> New Revision: 184684
>>>>> 
>>>>> URL: http://llvm.org/viewvc/llvm-project?rev=184684&view=rev
>>>>> Log:
>>>>> LoopVectorize: Add utility class for checking dependency among
>>>>> accesses
>>>>> 
>>>>> This class checks dependences by subtracting two Scalar Evolution
>>>>> access
>>>>> functions allowing us to catch very simple linear dependences.
>>>>> 
>>>>> The checker assumes source order in determining whether
>>>>> vectorization is safe.
>>>>> We currently don't reorder accesses.
>>>>> Positive true dependencies need to be a multiple of VF otherwise
>>>>> we impede
>>>>> store-load forwarding.
>>>> 
>>>> Any reason for not using the existing DependenceAnalysis?
>>> 
>>> We can use it once we are reasonably convinced of its correctness.
>>> At the moment I am hesitant to just drop it in.
>> 
>> As I recall, Preston did a good job with the base set of unit tests.
>> Maybe we could add a command-line argument that enables using it so that we can start some serious testing? I'd really like to see us, to the extent practical, consolidate around one primary solution to this problem.
>> 
> 
> 
> Yes, this is on my list of things I would like to do. I would like to gradually add DependenceAnalysis to the loop vectorizer, starting off with reviewing the simple tests and enabling them. I like to take an incremental approach, adding complexity as we go.
> 
> This is just a first step (i.e. we are exercising SCEV).
> 
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits