[Polly] Reduction dependences [V3]

Wed Jun 18 06:10:09 PDT 2014

On 17/06/2014 23:17, Johannes Doerfert wrote:
> Small changes to the version before:
>
>    + I found an easier way of computing the privatization dependences.
>
>   + More accurate comment
>
>
>
> Since we have the first patch in, tell me what you think about this one J.

Sure, let's keep going.

> diff --git a/include/polly/Dependences.h b/include/polly/Dependences.h
> index 9a704a8..d44a905 100755
> --- a/include/polly/Dependences.h
> +++ b/include/polly/Dependences.h
> @@ -105,6 +105,9 @@ private:
>     isl_union_map *WAR;
>     isl_union_map *WAW;
>
> +  /// @brief The map of reduction dependences
> +  isl_union_map *RED = nullptr;
> +
>     /// @brief Collect information about the SCoP.
>     void collectInfo(Scop &S, isl_union_map **Read, isl_union_map **Write,
>                      isl_union_map **MayWrite, isl_union_map **Schedule);
> diff --git a/lib/Analysis/Dependences.cpp b/lib/Analysis/Dependences.cpp
> index dc64b7e..1624b81 100644
> --- a/lib/Analysis/Dependences.cpp
> +++ b/lib/Analysis/Dependences.cpp
> @@ -25,6 +25,7 @@
>   #include "polly/Options.h"
>   #include "polly/ScopInfo.h"
>   #include "polly/Support/GICHelper.h"
> +#include "polly/CodeGen/CodeGeneration.h"
>   #include "llvm/Support/Debug.h"
>
>   #include <isl/aff.h>
> @@ -65,6 +66,84 @@ static cl::opt<enum AnalysisType> OptAnalysisType(
>       cl::cat(PollyCategory));
>
>   //===----------------------------------------------------------------------===//
> +
> +/// @brief Helper struct during the compuation of privatization dependences
                                        computation
> +struct PrivPayload {

I would suggest to avoid appreviations, except for very commonly used
code with an obvious appreviation.

> +  /// @brief All privatization dependences computed so far
Missing point at end of sentence

> +  isl_union_map *Priv;
> +  /// @brief All accesses which are part of actual reduction dependences
Missing point at end of sentence.

What about 'Accesses'?

> +};
> +
> +/// @brief Compute the privatization dependences for a given dependency @p Map
> +///
> +/// Privatization dependences are widened original dependences which originate
> +/// or end in a reduction access. To compute them we first look for those
> +/// original dependences. If found we widen them such that they do __not__ only
> +/// depend on a subset of the reduction memory accesses but on all of them.
> +/// For the input:
> +///
> +///  S0:   *sum = 0;
> +///        for (int i = 0; i < 1024; i++)
> +///  S1:     *sum += i;
> +///  S2:   *sum = *sum * 3;
> +///
> +/// we have the following dependences before we add privatization dependences:
> +///
> +///   RAW:
> +///     { S0[] -> S1[0]; S1[1023] -> S2[] }
> +///   WAR:
> +///     {  }
> +///   WAW:
> +///     { S0[] -> S1[0]; S1[1024] -> S2[] }
> +///   RED:
> +///     { S1[i0] -> S1[1 + i0] : i0 >= 0 and i0 <= 1022 }
> +///
> +/// and afterwards:
> +///
> +///   RAW:
> +///     { S0[] -> S1[i0] : i0 >= 0 and i0 <= 1023;
> +///       S1[i0] -> S2[] : i0 >= 0 and i0 <= 1023}
> +///   WAR:
> +///     {  }
> +///   WAW:
> +///     { S0[] -> S1[i0] : i0 >= 0 and i0 <= 1023;
> +///       S1[i0] -> S2[] : i0 >= 0 and i0 <= 1023}
> +///   RED:
> +///     { S1[i0] -> S1[1 + i0] : i0 >= 0 and i0 <= 1022 }

Very nice and clear comment!

This is for me currently the most important part to understand and to
get right.

I suggest to move this code here:

> +      isl_union_set *RedAccs = isl_union_map_domain(isl_union_map_copy(RED));
> +      RedAccs = isl_union_set_union(
> +          RedAccs, isl_union_map_range(isl_union_map_copy(RED)));
> +      RedAccs = isl_union_set_coalesce(RedAccs);
> +      isl_union_map **Maps[] = {&RAW, &WAW, &WAR};
> +      for (isl_union_map **MapPtr : Maps) {
> +        PrivPayload Pl = {isl_union_map_empty(isl_union_map_get_space(RAW)),
> +                          RedAccs};
> +        isl_union_map_foreach_map(*MapPtr, createPrivatizationDeps, &Pl);
> +        *MapPtr = isl_union_map_coalesce(isl_union_map_union(*MapPtr, Pl.Priv));
> +      }
> +      isl_union_set_free(RedAccs);

into its own function right below your comment.

As this algorithm is an essential part, it would be good to describe the
implementation in another (smaller) comment.

> +static int createPrivatizationDeps(__isl_take isl_map *Map, void *user) {
> +  isl_union_set *PDom = nullptr, *PRan = nullptr;
> +  isl_union_map *UMap, *DInter, *RInter;
> +  PrivPayload *Pl = (PrivPayload *)user;
> +  UMap = isl_union_map_from_map(Map);
> +
> +  DInter = isl_union_map_intersect_domain(isl_union_map_copy(UMap),
> +                                          isl_union_set_copy(Pl->RedAccs));
> +  if (!isl_union_map_is_empty(DInter)) {
> +    PDom = isl_union_set_copy(Pl->RedAccs);
> +    PRan = isl_union_map_range(isl_union_map_copy(UMap));
> +  }
> +
> +  RInter = isl_union_map_intersect_range(isl_union_map_copy(UMap),
> +                                         isl_union_set_copy(Pl->RedAccs));
> +  if (!isl_union_map_is_empty(RInter)) {
> +    assert(!PDom && !PRan && "Create privatization locations twice");
> +    PDom = isl_union_map_domain(isl_union_map_copy(UMap));
> +    PRan = isl_union_set_copy(Pl->RedAccs);
> +  }
> +
> +  if (PDom || PRan) {
> +    assert(PDom && PRan && "Inconsistent assignments");
> +    Pl->Priv = isl_union_map_union(
> +        Pl->Priv, isl_union_map_from_domain_and_range(PDom, PRan));
> +  }

It would be interesting to see the following test case:

S0:   *sum = 0;
       for (int i = 0; i < 1024; i++)
S1:     *sum += i;
S2:   *sum = *sum * 3;
       for (int i = 0; i < 1024; i++)
S4:     *sum += i;
S5:   *sum = *sum * 3;

>   Dependences::Dependences() : ScopPass(ID) { RAW = WAR = WAW = nullptr; }
>
>   void Dependences::collectInfo(Scop &S, isl_union_map **Read,
> @@ -169,6 +248,62 @@ void Dependences::calculateDependences(Scop &S) {
>     isl_ctx_reset_operations(S.getIslCtx());
>     isl_ctx_set_max_operations(S.getIslCtx(), MaxOpsOld);
>
> +  // FIXME: Code generation cannot produce privatization code yet, thus we need
> +  //        to check if we will produce non-parallel code anyway. If so, we do
> +  //        not need privatization, if not we turn off reduction support.
> +  if (PollyVectorizerChoice == VECTORIZER_NONE && !EnableOpenMP) {

That was not exactly what I imagined. Instead of checking if OpenMP code
generation is enabled, I was hoping that we store the dependence in a
way that passes like the scheduler can get the necessary dependences for
the sequential transformations and that we make sure that for 
parallelism checks
we use the additional reduction dependences.

Even with openmp code generation enabled, we should allow the scheduler
the freedom of reordering instructions. We only don't want to
incorrectly introduce parallel loops.

(Let's leave this out for now, the main point is the privatization
dependences computation.)

> +    // To handle reduction dependences follow we:

Grammar?

> +    // 1) Aggregate all possible reduction dependences, namely all self
> +    //    dependences on reduction like statements.
> +    // 2) Intersect them with the actual RAW & WAW dependences to the get the
> +    //    actual reduction dependences. This will ensure the load/store memory
> +    //    addresses were __identical__ in the two iterations of the statement.
> +    // 3) Relax the original RAW and WAW dependences by substracting the actual
> +    //    reduction dependences. Binary reductions cause both (and the same) RAW
> +    //    and WAW dependences, but unary reductions only WAW.

What are binary and what are unary reductions. Is this already explained
somewhere? Are there certain test cases that I can look at to see these
differences?

> +    // 4) Add the privatization dependences which are widened versions of
> +    //    already present dependences. They model the effect of manual
> +    //    privatization at the outermost possible place (namely after the last
> +    //    write and before the first access to a reduction location).
> +
> +    // Step 1)
> +    RED = isl_union_map_empty(isl_union_map_get_space(RAW));
> +    for (auto *Stmt : S) {
> +      if (!Stmt->isReductionLike())
> +        continue;
> +      isl_set *Domain = Stmt->getDomain();
> +      isl_map *Identity =
> +          isl_map_from_domain_and_range(isl_set_copy(Domain), Domain);
> +      RED = isl_union_map_add_map(RED, Identity);
> +    }
> +
> +    // Step 2)
> +    RED = isl_union_map_intersect(RED, isl_union_map_copy(RAW));
> +    RED = isl_union_map_intersect(RED, isl_union_map_copy(WAW));
> +
> +    if (!isl_union_map_is_empty(RED)) {
> +
> +      // Step 3)
> +      RAW = isl_union_map_subtract(RAW, isl_union_map_copy(RED));
> +      WAW = isl_union_map_subtract(WAW, isl_union_map_copy(RED));
> +
> +      // Step 4)
> +      isl_union_set *RedAccs = isl_union_map_domain(isl_union_map_copy(RED));

To my understanding 'RED' contains a mapping from statement iteration to
statement iteration. Why is the domain now an 'access'?

> +      RedAccs = isl_union_set_union(
> +          RedAccs, isl_union_map_range(isl_union_map_copy(RED)));
> +      RedAccs = isl_union_set_coalesce(RedAccs);

This is the set of statement iterations that appear in reduction
accesses, right?

> +
> +      isl_union_map **Maps[] = {&RAW, &WAW, &WAR};
> +      for (isl_union_map **MapPtr : Maps) {
> +        PrivPayload Pl = {isl_union_map_empty(isl_union_map_get_space(RAW)),
> +                          RedAccs};
> +        isl_union_map_foreach_map(*MapPtr, createPrivatizationDeps, &Pl);

In your comment, it would be interesting to understand why creating 
privatization
dependences per map is a good/valid approach.

> diff --git a/test/Dependences/reduction_only_reduction_like_access.ll b/test/Dependences/reduction_only_reduction_like_access.ll
> new file mode 100644
> index 0000000..e8a0ef6
> --- /dev/null
> +++ b/test/Dependences/reduction_only_reduction_like_access.ll
> @@ -0,0 +1,36 @@
> +; RUN: opt %loadPolly -polly-dependences -analyze < %s | FileCheck %s
> +;
> +; CHECK:  Reduction dependences:
> +; CHECK:    {  }
> +;
> +; void f(int *sum) {
> +;   for (int i = 0; i < 100; i++)
> +;     sum[i] = sum[99-i] + i;

Can you add a comment why this test cases is interesting/what it is
testing. This is not immediately clear to me.

> diff --git a/test/Dependences/reduction_only_reduction_like_access_2.ll b/test/Dependences/reduction_only_reduction_like_access_2.ll
> new file mode 100644
> index 0000000..3bdd896
> --- /dev/null
> +++ b/test/Dependences/reduction_only_reduction_like_access_2.ll
> @@ -0,0 +1,52 @@
> +; RUN: opt %loadPolly -polly-dependences -analyze < %s | FileCheck %s
> +;
> +; CHECK:  Reduction dependences:
> +; CHECK:    {  }
> +;
> +; void f(int *sum) {
> +;   for (int i = 0; i < 1024; i++)
> +;     for (int j = 0; j < 512; j++)
> +;       sum[i + j] = sum[i] + 3;

Similar here. Can you explain what exactly is tested here. Is it
different from the previous test case?

(Btw, the test cases are already very small/nicely reduced. This is
nice!)

> diff --git a/test/Dependences/reduction_simple_iv.ll b/test/Dependences/reduction_simple_iv.ll
> new file mode 100644
> index 0000000..01b0a51
> --- /dev/null
> +++ b/test/Dependences/reduction_simple_iv.ll
> @@ -0,0 +1,41 @@
> +; RUN: opt %loadPolly -polly-dependences -analyze < %s | FileCheck %s
> +;
> +; CHECK:  RAW dependences:
> +; CHECK:    {  }
> +; CHECK:  WAR dependences:
> +; CHECK:    {  }
> +; CHECK:  WAW dependences:
> +; CHECK:    {  }
> +; CHECK:  Reduction dependences:
> +; CHECK:    { Stmt_for_cond[i0] -> Stmt_for_cond[1 + i0] : i0 >= 0 and i0 <= 99 }
> +;
> +;
> +; void f(int* sum) {
> +;   for (int i = 0; i <= 100; i++)
> +;     sum += i * 3;
> +; }

This is a nice and minimal test case. I like it.

To make these test cases easier to understand, it could help if you add
labels to the examples:

void f(int* sum) {
       for (int i = 0; i <= 100; i++)
S1:     sum += i * 3;
}

And then name the basic blocks according to the labels. The resulting
dependences should then contain the names of the basic blocks, so
looking at the C code and the dependences makes sense.

> +++ b/test/Dependences/reduction_simple_privatization_deps.ll
> @@ -0,0 +1,58 @@
> +; RUN: opt %loadPolly -polly-dependences -analyze < %s | FileCheck %s
> +;
> +; CHECK: Printing analysis 'Polly - Calculate dependences' for region: 'entry.1 => f.end' in function 'f':
> +; CHECK:   RAW dependences:
> +; CHECK:     [N] -> { Stmt_entry_2[] -> Stmt_for_body[o0] : N >= 11 and o0 <= 1023 and o0 >= 0; Stmt_for_body[i0] -> Stmt_for_end[] : N >= 11 and i0 <= 1023 and i0 >= 0 }
> +; CHECK:   WAR dependences:
> +; CHECK:     [N] -> {  }
> +; CHECK:   WAW dependences:
> +; CHECK:     [N] -> { Stmt_entry_2[] -> Stmt_for_body[o0] : N >= 11 and o0 <= 1023 and o0 >= 0; Stmt_for_body[i0] -> Stmt_for_end[] : N >= 11 and i0 <= 1023 and i0 >= 0 }
> +; CHECK:   Reduction dependences:
> +; CHECK:     [N] -> { Stmt_for_body[i0] -> Stmt_for_body[1 + i0] : i0 >= 0 and i0 <= 1022 and N >= 11 }
> +;
> +; void f(int *sum, int N) {
> +;   if (N >= 10) {

Why is there an N and an if? Is this necessary for this test case?

> +;     *sum = 0;
> +;     for (int i = 0; i < 1024; i++)
> +;       *sum += i;
> +;     *sum = *sum * 3;
> +;   }
> +; }

Otherwise this is a very nice, minimal test case.

> +++ b/test/Dependences/reduction_simple_privatization_deps_2.ll
> @@ -0,0 +1,67 @@
> +; RUN: opt %loadPolly -polly-dependences -analyze < %s | FileCheck %s
> +;
> +; CHECK: Printing analysis 'Polly - Calculate dependences' for region: 'for.cond => for.end8' in function 'f':
> +; CHECK:  RAW dependences:
> +; CHECK:    { Stmt_for_body3[i0, i1] -> Stmt_for_end[o0] : i0 <= 99 and i0 >= 0 and i1 <= 99 and i1 >= 0 and o0 <= 99 and o0 >= 0; Stmt_for_body[i0] -> Stmt_for_body3[o0, o1] : i0 <= 99 and i0 >= 0 and o0 <= 99 and o0 >= 0 and o1 <= 99 and o1 >= 0; Stmt_for_end[i0] -> Stmt_for_body[1 + i0] : i0 >= 0 and i0 <= 98 }
> +; CHECK:  WAR dependences:
> +; CHECK:    {  }
> +; CHECK:  WAW dependences:
> +; CHECK:    { Stmt_for_body3[i0, i1] -> Stmt_for_end[o0] : i0 <= 99 and i0 >= 0 and i1 <= 99 and i1 >= 0 and o0 <= 99 and o0 >= 0; Stmt_for_body[i0] -> Stmt_for_body3[o0, o1] : i0 <= 99 and i0 >= 0 and o0 <= 99 and o0 >= 0 and o1 <= 99 and o1 >= 0; Stmt_for_end[i0] -> Stmt_for_body[1 + i0] : i0 >= 0 and i0 <= 98 }
> +; CHECK:  Reduction dependences:
> +; CHECK:    { Stmt_for_body3[i0, i1] -> Stmt_for_body3[i0, 1 + i1] : i0 >= 0 and i0 <= 99 and i1 >= 0 and i1 <= 98 }
> +;
> +; void f(int *sum) {
> +;   int i, j;
> +;   for (i = 0; i < 100; i++) {
> +;     *sum *= 42;
> +;     for (j = 0; j < 100; j++)
> +;       *sum += i * j;
> +;     *sum *= 7;
> +;   }
> +; }

And this is a very nice extension to multiple loops.

Regarding the privatization dependences I am slighly confused. You
currently have as dependences:

Stmt_for_body[i0] -> Stmt_for_body3[o0, o1] : i0 <= 99 and i0 >= 0 and 
o0 <= 99 and o0 >= 0 and o1 <= 99 and o1 >= 0;
Stmt_for_body3[i0, i1] -> Stmt_for_end[o0] : i0 <= 99 and i0 >= 0 and i1 
<= 99 and i1 >= 0 and o0 <= 99 and o0 >= 0
Stmt_for_end[i0] -> Stmt_for_body[1 + i0] : i0 >= 0 and i0 <= 98

This includes the following dependences:

Stmt_for_body[9] -> Stmt_for_body3[10,0]
Stmt_for_body3[10, 0] -> Stmt_for_end[8];
Stmt_for_end[8] -> Stmt_for_body[9]

If I did not miss something, this is a dependency cycle for which there
is no valid schedule that can be found. Right?

I would have expected something like the following instead:

Stmt_for_body3[i0, i1] -> Stmt_for_end[i0] : i0 <= 99 and i0 >= 0 and i1 
<= 99 and i1 >= 0

In case you believe a modification of your algorithm and you are unsure
where to look it may make sense to discuss the core algorithm on
isl-dev. One direction to look is isl's transitive closure function.
I tried the following with islpy:

$ m = islpy.Map("{ Stmt_for_body3[i0, i1] -> Stmt_for_body3[i0, 1 + i1] 
: i0 >= 0 and i0 <= 99 and i1 >= 0 and i1 <= 98 }")
$ m.transitive_closure()
(Map("{ Stmt_for_body3[i0, i1] -> Stmt_for_body3[i0, o1] : i0 >= 0 and 
i0 <= 99 and i1 >= 0 and i1 <= 98 and o1 >= 1 + i1 and o1 <= 99 and o1 
 >= 1 }"),
  1)
$ m.transitive_closure()[0].reverse().lexmin().reverse()
Map("{ Stmt_for_body3[i0, 0] -> Stmt_for_body3[i0, o1] : i0 <= 99 and i0 
 >= 0 and o1 <= 99 and o1 >= 1 }")

Maybe this is helpful.

Cheers,
Tobias