Reductions

Sat Jun 14 01:37:11 PDT 2014

On 13/06/2014 19:27, Johannes Doerfert wrote:
> Hey Tobias,
>
> I changed almost everything according to your comments (except the prefix in
> the test cases, I don't see the need to do this, but if you insist I will).
> I added another "only reduction like" test case and a floating point test
> case w and w/o fast math [= 4 test cases for in the first patch].

Nice.

> What is left in order to commit this one and get to the next patch?

Besides a couple of nitpicks in the test cases and some last questions 
about the comments, the only missing piece is the question if we need to 
check the read/write location to be identical.

> Comments inlined (formatting might be a bit off, sorry about that).

>>>> >> >+  /// @brief Flag to indicate reduction statements
>>>> >> >+ bool IsReductionLike = false;
>>> >>
>>> >>Why do you call this ReductionLike and not just reduction?

This question is just a minor thing, but you did not reply to it. Is the 
example you give in the comment the reason you call this reduction-like 
instead of reduction?

>>> >>
>>> >>Also, can you explain what it means to be a reduction. Is it associative
> _and_  commutative? Or just one of them? I would also mention the restriction
> on two memory accesses. We later need to extend this to the multiple memory
> access case.
> I added a descriptive comment to this Boolean flag. It talks about reduction
> like, associative, commutative, and the restriction.
> Lifting the restriction soon is one of mine priorities too, but I'd like
> positive feedback on these patches first.

Nice.

>>>> >> >   #include "isl/constraint.h"
>>>> >> >@@ -55,6 +55,12 @@ using namespace polly;
>>>> >> >   STATISTIC(ScopFound, "Number of valid Scops");
>>>> >> >   STATISTIC(RichScopFound, "Number of Scops containing a loop");
>>>> >> >
>>>> >> >+static cl::opt<bool>
>>>> >> >
> +DisableMultiplicativeReductions("polly-disable-multiplicative-reductions",
>>>> >> >+                                cl::desc("Disable multiplicative
> reductions"),
>>>> >> >+                                cl::Hidden, cl::ZeroOrMore,
> cl::init(false),
>>>> >> >+                                cl::cat(PollyCategory));
>>> >>
>>> >>Out of interest. Why would we want to disable those, but not the other
> possible operands?
> They overflow easily, add is pretty safe and bit operations are stable too.
> Is that OK. ?

Sure. Maybe add a comment why someone would like to disable 
multiplicative reductions.

>>>> >> >+  // Skip if there is not one binary operator between the load and
>>>> >> >+ the store  auto *BinOp =
>>>> >> >+ dyn_cast<BinaryOperator>(Store->getValueOperand());
>>>> >> >+  if (!BinOp || (BinOp->getOperand(0) != Load && BinOp->getOperand(1)
> != Load))
>>>> >> >+    return;
>>> >>Should we also check that the load and store instruction point to the
> same memory location (reference the very same pointer)?
> Not in the current setting. I wanted it to be a small first patch so I use
> reduction like and allow for example:
> for I
>    for j
>      B[i] = A[i] + 3;
> However there won't be any reduction dependences for this example so there
> is no real harm done.

I tested test example (attached):

for 0 <= i < 100:
   A[i+1] = A[i] + 1;

To my understanding your current dependence check in patch 2 would 
detect reduction dependences here (and does not verify correctly that
the in and out locations are identical).

(In fact you don't detect this reduction, but this seems to be a small bug:

     RED = isl_union_map_intersect(RED, isl_union_map_copy(RAW));
     RED = isl_union_map_intersect(RED, isl_union_map_copy(WAW));

I think you would like to compute

RED = intersect(RED, union(RAW, WAW)), instead of the current
RED = intersect(intersect(RED, RAW), WAW))

Or are reduction dependences required to be both RAW and WAW dependences?)

I think the simplest check for now is to verify that the addresses of 
the input/output location have the same LLVM-IR value. Or that the read 
and write access have identical access functions. We can then later 
discuss how to relax this.

> diff --git a/include/polly/ScopInfo.h b/include/polly/ScopInfo.h
> index e85ec13..f571b47 100644
> --- a/include/polly/ScopInfo.h
> +++ b/include/polly/ScopInfo.h
> @@ -270,6 +270,17 @@ class ScopStmt {
>     MemoryAccessVec MemAccs;
>     std::map<const Instruction *, MemoryAccess *> InstructionToAccess;
>
> +  /// @brief Flag to indicate reduction like statements
> +  ///
> +  /// A statement is reduction like if it contains exactly one load and one
> +  /// store with exactly one binary operator in between. The binary operator
> +  /// needs to be associative and commutative or easily convertable into a such

Should it be 'convertible'?

What does it mean to be easily convertible? Is this already implemented? 
Is there a test case/example I could look at?

> +  /// one. Not all reduction like statements are binary reductions in the common
> +  /// sence, e.g.,
> +  /// for (i = 0; i < 100; i++)
> +  ///   sum[100-i] += sum[i]
> +  /// which is reduction like but will not cause any reduction dependences.

This is because there are no dependences at all in this loop or because 
the individual iterations write to a different memory location?

Am I right that the point you want to make is that the flag in ScopInfo 
only gives information about the properties of an individual 
computation, but that the actual computation of reduction dependences 
will happen later?

> diff --git a/test/ScopInfo/reduction_only_reduction_like_access.ll b/test/ScopInfo/reduction_only_reduction_like_access.ll
> new file mode 100644
> index 0000000..30eef4d
> --- /dev/null
> +++ b/test/ScopInfo/reduction_only_reduction_like_access.ll
> @@ -0,0 +1,39 @@
> +; RUN: opt %loadPolly -polly-scops -analyze < %s | FileCheck %s --check-prefix=DETECTION
> +;
> +; DETECTION: Reduction like: 1
> +;
> +; void f(int *sum) {
> +;   int i;
> +;   for (i = 0; i < 100; i++)
> +;     sum[i] = sum[99-i] + i;
> +; }
I like this style of test case best as it does not add unnecessary 
reg2mem instructions, basic blocks as some of your other test cases do. ;-)

> +;
> +; ModuleID = 'test/ScopInfo/reduction_only_reduction_like_access.ll'

Remove this line.

> +target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-n32-S64"
> +
> +; Function Attrs: nounwind
> +define void @f(i32* %sum) {
> +entry:
> +  br label %for.cond
> +
> +for.cond:                                         ; preds = %for.inc, %entry
> +  %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
> +  %exitcond = icmp ne i32 %i.0, 100
> +  br i1 %exitcond, label %for.body, label %for.end
> +
> +for.body:                                         ; preds = %for.cond
> +  %sub = sub nsw i32 99, %i.0
> +  %arrayidx = getelementptr inbounds i32* %sum, i32 %sub
> +  %tmp = load i32* %arrayidx, align 4
> +  %add = add nsw i32 %tmp, %i.0
> +  %arrayidx1 = getelementptr inbounds i32* %sum, i32 %i.0
> +  store i32 %add, i32* %arrayidx1, align 4
> +  br label %for.inc
> +
> +for.inc:                                          ; preds = %for.body
> +  %inc = add nsw i32 %i.0, 1
> +  br label %for.cond
> +
> +for.end:                                          ; preds = %for.cond
> +  ret void
> +}
> diff --git a/test/ScopInfo/reduction_simple_fp.ll b/test/ScopInfo/reduction_simple_fp.ll
> new file mode 100644
> index 0000000..0eef154
> --- /dev/null
> +++ b/test/ScopInfo/reduction_simple_fp.ll
> @@ -0,0 +1,69 @@
> +; RUN: opt %loadPolly -polly-scops -analyze < %s | FileCheck %s --check-prefix=DETECTION
> +;
> +; DETECTION: Function: f_no_fast_math
> +; DETECTION: Reduction like: 0
> +; DETECTION: Function: f_fast_math
> +; DETECTION: Reduction like: 1
> +;
> +; void f(float *sum) {
> +;   int i;
> +;   for (i = 0; i < 100; i++)
> +;     *sum += 3.41 * i;
I like this style of test case best as it does not add unnecessary 
reg2mem instructions, basic blocks as some of your other test cases do. ;-)

> +; }
> +;
> +; ModuleID = 'test/ScopInfo/reduction_simple_fp.ll'
Remove this line.

> +target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-n32-S64"
> +
> +; Function Attrs: nounwind
Remove this line.

> +define void @f_no_fast_math(float* %sum) {
> +entry:
> +  br label %for.cond
> +
> +for.cond:                                         ; preds = %for.inc, %entry
> +  %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
> +  %exitcond = icmp ne i32 %i.0, 100
> +  br i1 %exitcond, label %for.body, label %for.end
> +
> +for.body:                                         ; preds = %for.cond
> +  %conv = sitofp i32 %i.0 to float
> +  %pi = fptrunc double 3.41 to float
> +  %mul = fmul float %conv, %pi
> +  %tmp = load float* %sum, align 4
> +  %add = fadd float %tmp, %mul
> +  store float %add, float* %sum, align 4
> +  br label %for.inc
> +
> +for.inc:                                          ; preds = %for.body
> +  %inc = add nsw i32 %i.0, 1
> +  br label %for.cond
> +
> +for.end:                                          ; preds = %for.cond
> +  ret void
> +}
> +
> +; Function Attrs: nounwind
Remove this line.

> +define void @f_fast_math(float* %sum) {
> +entry:
> +  br label %for.cond
> +
> +for.cond:                                         ; preds = %for.inc, %entry
> +  %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
> +  %exitcond = icmp ne i32 %i.0, 100
> +  br i1 %exitcond, label %for.body, label %for.end
> +
> +for.body:                                         ; preds = %for.cond
> +  %conv = sitofp i32 %i.0 to float
> +  %pi = fptrunc double 3.41 to float
> +  %mul = fmul fast float %conv, %pi
> +  %tmp = load float* %sum, align 4
> +  %add = fadd fast float %tmp, %mul
> +  store float %add, float* %sum, align 4
> +  br label %for.inc
> +
> +for.inc:                                          ; preds = %for.body
> +  %inc = add nsw i32 %i.0, 1
> +  br label %for.cond
> +
> +for.end:                                          ; preds = %for.cond
> +  ret void
> +}
> diff --git a/test/ScopInfo/reduction_simple_w_constant.ll b/test/ScopInfo/reduction_simple_w_constant.ll
> new file mode 100644
> index 0000000..1d0a5c1
> --- /dev/null
> +++ b/test/ScopInfo/reduction_simple_w_constant.ll
> @@ -0,0 +1,41 @@
> +; RUN: opt %loadPolly -polly-scops -analyze < %s | FileCheck %s --check-prefix=DETECTION
> +;
> +; DETECTION: Reduction like: 1
> +;
> +; int f(int sum) {
> +;   int i;
> +;   for (int i = 0; i <= 100; i++) {
> +;     sum += 3;
> +;   }
> +;   return sum;
> +; }
You could simplify this according to reduction_simple_fp.ll

> +;
> +; ModuleID = 'test/Reductions/simplest_binary_reduction.ll'
Remove this line.

> +target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-n32-S64"
> +
> +; Function Attrs: nounwind
Remove this line.

> +define i32 @f(i32 %sum) {
> +entry:
> +  %sum.reg2mem = alloca i32
> +  br label %entry.split1
> +
> +entry.split1:                                     ; preds = %entry
> +  br label %entry.split
> +
> +entry.split:                                      ; preds = %entry.split1
> +  store i32 %sum, i32* %sum.reg2mem
> +  br label %for.cond
> +
> +for.cond:                                         ; preds = %for.cond, %entry.split
> +  %i1.0 = phi i32 [ 0, %entry.split ], [ %inc, %for.cond ]
> +  %sum.reload = load i32* %sum.reg2mem
> +  %add = add nsw i32 %sum.reload, 3
> +  %inc = add nsw i32 %i1.0, 1
> +  store i32 %add, i32* %sum.reg2mem
> +  %cmp = icmp slt i32 %i1.0, 100
> +  br i1 %cmp, label %for.cond, label %for.end
> +
> +for.end:                                          ; preds = %for.cond
> +  %sum.reload.2 = load i32* %sum.reg2mem
> +  ret i32 %sum.reload.2
> +}
> diff --git a/test/ScopInfo/reduction_simple_w_iv.ll b/test/ScopInfo/reduction_simple_w_iv.ll
> new file mode 100644
> index 0000000..de5a8e1
> --- /dev/null
> +++ b/test/ScopInfo/reduction_simple_w_iv.ll
> @@ -0,0 +1,42 @@
> +; RUN: opt %loadPolly -polly-scops -analyze < %s | FileCheck %s --check-prefix=DETECTION
> +;
> +; DETECTION: Reduction like: 1
> +;
> +; int f(int sum) {
> +;   int i;
> +;   for (int i = 0; i <= 100; i++) {
> +;     sum += i * 3;
> +;   }
> +;   return sum;
> +; }
You could simplify this according to reduction_simple_fp.ll

> +;
> +; ModuleID = 'test/Reductions/simplest_binary_reduction.ll'
Remove this line.

> +target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-n32-S64"
> +
> +; Function Attrs: nounwind
Remove this line.

> +define i32 @f(i32 %sum) {
> +entry:
> +  %sum.reg2mem = alloca i32
> +  br label %entry.split1
> +
> +entry.split1:                                     ; preds = %entry
> +  br label %entry.split
> +
> +entry.split:                                      ; preds = %entry.split1
> +  store i32 %sum, i32* %sum.reg2mem
> +  br label %for.cond
> +
> +for.cond:                                         ; preds = %for.cond, %entry.split
> +  %i1.0 = phi i32 [ 0, %entry.split ], [ %inc, %for.cond ]
> +  %sum.reload = load i32* %sum.reg2mem
> +  %mul = mul nsw i32 %i1.0, 3
> +  %add = add nsw i32 %sum.reload, %mul
> +  %inc = add nsw i32 %i1.0, 1
> +  store i32 %add, i32* %sum.reg2mem
> +  %cmp = icmp slt i32 %i1.0, 100
> +  br i1 %cmp, label %for.cond, label %for.end
> +
> +for.end:                                          ; preds = %for.cond
> +  %sum.reload.2 = load i32* %sum.reg2mem
> +  ret i32 %sum.reload.2
> +}
> -- 1.8.2.1

-------------- next part --------------
; RUN: opt %loadPolly -polly-scops -analyze < %s | FileCheck %s --check-prefix=DETECTION
;
; DETECTION: Reduction like: 1
;
; void f(int *sum) {
;   int i;
;   for (i = 0; i < 100; i++)
;     sum[i+1] = sum[i] + i;
; }
;
target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-n32-S64"

; Function Attrs: nounwind
define void @f(i32* %sum) {
entry:
  br label %for.cond

for.cond:                                         ; preds = %for.inc, %entry
  %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
  %exitcond = icmp ne i32 %i.0, 100
  br i1 %exitcond, label %for.body, label %for.end

for.body:                                         ; preds = %for.cond
  %offset = add nsw i32 1, %i.0
  %arrayidx = getelementptr inbounds i32* %sum, i32 %i.0
  %tmp = load i32* %arrayidx, align 4
  %add = add nsw i32 %tmp, 1
  %arrayidx1 = getelementptr inbounds i32* %sum, i32 %offset
  store i32 %add, i32* %arrayidx1, align 4
  br label %for.inc

for.inc:                                          ; preds = %for.body
  %inc = add nsw i32 %i.0, 1
  br label %for.cond

for.end:                                          ; preds = %for.cond
  ret void
}