[PATCH] D114336: [Polly] Generalize the pattern matching to the case of tensor contractions.

Michael Kruse via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Jun 29 21:01:56 PDT 2022


Meinersbur added a comment.

In D114336#3576199 <https://reviews.llvm.org/D114336#3576199>, @gareevroman wrote:

>   void foo(int n, double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
>   for (int i = 0; i < 1024; i++)
>       for (int j = 0; j < 1024; j++)
>         for (int l = 0; l < 64; l++)
>           for (int w = 0; w < 64; ++w)
>              if (w != 0)
>                C[i][j] += A[i][l][w] * B[w][j][l];
>   }
>
> ScopBuilder generates the following memory accesses:
>
>   { Stmt3[i0, i1, i2, i3] -> MemRef2[o0, o1] : o0 = i0 and o1 = i1 }
>   { Stmt3[i0, i1, i2, i3] -> MemRef2[o0, o1] : o0 = i0 and o1 = i1 }
>   { Stmt3[i0, i1, i2, i3] -> MemRef1[o0, o1, o2] : o0 = 1 + i3 and o1 = i1 and o2 = i2 }
>   { Stmt3[i0, i1, i2, i3] -> MemRef0[o0, o1, o2] : o0 = i0 and o1 = i2 and o2 = 1 + i3 }
>
> In the context of the previous discussion, I meant that memory accesses are modified in comparison to the previous considered case.
>
>> Where does it come from?
>
> If we look at the domain for the i3 variable, we see that the value 0 from the domain of w-loop is excluded and the loop bounds are modified to start from 0. Memory accesses correspond to this.

Looks like some other optimization (maybe JumpThreading?) modifies the loop range. Ideally, the detection would be robust enough to not depend on the whether the domain space has an offset.

>> Could you choose more meaningful identifiers than I, J, P and J, or use them in the pattern described in isTCPattern? I think of something like:
>
> I’ve added a description of bundles I, J, and P to the description of the pattern. I hope it clarifies their purpose. I propose to apply the terminology, which is used in the paper [1] and it predecessors (e.g., [2]), to simplify the understanding of the code for their readers.

In a paper the names must be shorter to fit on the page and a described close to the figure. It also should not be necessary to have access to the paper to understand the algorithm. For narrowly scoped variables such as loop counters single letters might be ok because the definition is likely on the same screen (or for a paper: the same page), but globals should be more identifiable.  paper [1] also does not make the connection to the symbols it uses and what they correspond to in Polly's data structures.

However, I don't request  such a change atm.

>>   for (...) {
>>     ...
>>     for (...) {
>>       if (c)
>>         auto acc = C[P]; // ReadFromC
>>         auto a = A[Pa, I];
>>         auto b = B[Pb,J];
>>         auto arg = f(a,b);
>>         acc = acc op arg;
>>         C[P] = acc; // WriteToC
>>       }
>>     }
>>   }
>>   
>>   where P, I, J are sets of indices of the surrounding loops, Pa and Pb are subsets of P, I are the indices only occurring in the subscript of reading from A, J are the indices only occurring in the subscript for reading from B. There must be no indices not occuring in either P, I or J.



> I think the definition of the TC-like corresponds to the information about sets P, I, J. Probably, it’s redundant to introduce the terminology for subsets at this point. Could we do this in the description of the optimization, if it’d be needed?

It is also 'redundant' with the code itself, but it helps understanding it.

The description of `I` and `P` are "Input dimensions of the schedule space, which represent free indices of tensors." One has to know what "free indices" are, indices of what, etc.

The "definition" of `isTCPattern` is declarative, does not even explain what all those symbols are, and refers to our matmul paper which does not apply because it only is a special case of the TC pattern.

>> `op=` is a commutative operation ....,
>
> I think this a redundant condition. However, you can find it in the paper [1]. I believe that, if preserve the order of loops with indexes from the bundle P during the optimization, there would not be any violation.

This is exactly the sort of thin I would want to clarify. Associativity could be sufficient. We do check for associativity, right?

>> c is an affine condition usually just `true`.
>
> Could you elaborate on that? I’m not sure that I understand where such a condition is used.

It corresponds to a filter node in the TC body. You have used the `if (w != 0)` to illustrate where the access function deviates from the usually pattern which implied that it would be something you would like to support. If not, that would not be part of the pattern.

>> `f` is a side-effect free operation.
>
> If I’m not mistaken, according to, for example, ScopDetection.cpp, only side effect free functions calls can be located inside a Scop.

f is not necessarily a function call, but as mentioned a "operation" representing the calculation done in the the TC body.

Side-effect here means something different than ScopDetection. A write to an unrelated array `D` would be a side-effect for the TC, but accurately represented by a Scop.
We could allow unknown side-effects in polly in the future with a general "memory" dependency, an extension to what Polly already does with `-polly-allow-modref-calls`. These could just not be reordered relative to each other.

I really think the documentation should be better. I had a hard time fixing bugs in the matmul optimization just with understanding what the code is supposed to be doing after a long time and would prefer to no repeat that again. See rGcad9f98a2ad98fecf663e9ce39502b8e43676fc9 <https://reviews.llvm.org/rGcad9f98a2ad98fecf663e9ce39502b8e43676fc9> and rGa56bd7dec8da4348d847d53c96d8a30f4a821d36 <https://reviews.llvm.org/rGa56bd7dec8da4348d847d53c96d8a30f4a821d36>.



================
Comment at: polly/lib/Support/ISLTools.cpp:266
 
+isl::val polly::getConstant(isl::map Map, isl::dim Dim, int Pos) {
+  unsigned NumDims = unsignedFromIslSize(Map.dim(Dim));
----------------
nice


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1219
+  for (MemoryAccess *MemA : reverse(Accesses)) {
+    if (!MemA->isLatestArrayKind())
+      continue;
----------------
gareevroman wrote:
> Meinersbur wrote:
> > I can use JScopImport to set a scalar memory access to a partial write without adding additional dependencies; that is, I don't think this can be just ignored.
> > 
> > I suggest to have a single function that calls `getAccessesInOrder` and sort out which MemAccess is read/write in there, then analyze them.
> I am not sure whether modifications of implementations of tensor contractions, which contain read and write scalar memory accesses, are useful in practice. 
> 
> Moreover, since bundles of induction variables I, J, P can contain an unlimited number of dimensions, we possibly cannot follow the algorithm from the containsOnlyMatrMultAcc function, which permutes dimensions and checks that additional memory accesses have stride 0 in terms of dimensions MMM.i, MMM.j, and MMM.k. Consequently, such memory accesses can be treated as scalar memory accesses. I have not come up with an effective alternative yet.
> 
> That is why I do not consider scalar memory accesses in getWriteAccess and setReadAccesses functions. Could we mark it as TODO and do it future?
The concern is that I can modify what `isLatestArrayKind()` returns by simply importing a JScop. The `continue` just ignores such weirdness but I think it is safer to fail in this case.

You yourself mention that scalar accesses are likely not useful, so why not fail when one is found instead (`return null` instead of `continue`)? Some exceptions may be possible, such as read-only scalars (`VirtualUse::ReadOnly`, `VirtualUse::Synthesizable`)


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1465-1470
+    // In the case of the tensor contraction, the difference between image
+    // elements and domain elements lies on a hyperplane where a dimension
+    // has the fixed value one.
+    isl::set Intersection = DepDeltas.fix_si(isl::dim::set, i, 1);
+    if (Intersection.is_empty())
+      continue;
----------------
gareevroman wrote:
> Meinersbur wrote:
> > This is going to check whether each element out of `Intersection` is a contraction over dimension `i`. Don't we also need to check that every iteration out of the band `i` is contributing to that contraction?
> Could you clarify what do you mean by the band i? Are these indexes ki, which describe the dependencies?
> 
> isTcDep checks that the dependency has the form
> 
> /// S(..., ki, max(k(i + 1)), ..., max(kn), ...) -> S(..., ki + 1, min(k(i + 1)), ..., min(kn), …)
There is a check `Intersection.is_empty()` which is going to detect if a dependency is completely missing. But what detects that only some of the dependencies are present. Such as:
```
[p] -> { Stmt3[i0, i1, i2, i3] -> Stmt3[o0, o1, o2, o3] : .... and p != 0 }
```
or 
```
{ Stmt3[i0, i1, i2, i3] -> Stmt3[o0, o1, o2, o3] : .... and i0 < 42 }
```
(assuming contracting over `i=0`)

`isReductionCarriedOverDim` doesn't seem to check whether the dependency is over the complete domain either.


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1613
+  // The partial schedule should contain only one statement.
+  if (isl_union_set_n_set(Domain.get()) != 1)
+    return false;
----------------
gareevroman wrote:
> Meinersbur wrote:
> > This constraint should not be intrinsic to the algorithm, but I agree it to be easier to handle for now.
> Could we add a TODO comment for this?
Yes, that would be great.


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1619
+
+  // Check that all predecessors of the node contain all band nodes
+  // for the statement, which represents the tensor contraction.
----------------



================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1622
+  // Subsequently, such band nodes will be replaced by a single band node.
+  while (NodeType != isl_schedule_node_domain) {
+    if (HasFilterNode && (NodeType == isl_schedule_node_band))
----------------
gareevroman wrote:
> Meinersbur wrote:
> > This looks for the outermost node that is not a filter or band. Is it possible that while that outermost node is not a TC contraction, one of the inner ones might? What if the outermost node is a filter, looks like it would just `return false` in this case.
> If I am not mistaken, this only checks that all band nodes, which represent the statement, are not split by filter nodes. These accepts a straightforward implementation of TC with/without delicm. For example,
> 
>  domain: "{ Stmt_for_body8[i0, i1, i2]  : 0 <= i0 <= 1599 and
>                                           0 <= i1 <= 1799 and
>                                           0 <= i2 <= 2199;
>             Stmt_for_body3[i0, i1] :      0 <= i0 <= 1599 and
>                                           0 <= i1 <= 1799;
>             Stmt_for_body3_last[i0, i1] : 0 <= i0 <= 1599 and
>                                           0 <= i1 <= 1799 }"
>  child:
>   sequence:
>   - filter: "{ Stmt_for_body3[i0, i1] }"
>     child:
>       schedule: "[{ Stmt_for_body3[i0, i1] -> [(i0)] }, { Stmt_for_body3[i0, i1] -> [(i1)] }]"
>       permutable: 1
>       coincident: [ 1, 1 ]
>   - filter: "{ Stmt_for_body3_last[i0, i1] }"
>     child:
>       schedule: "[{ Stmt_for_body3_last[i0, i1] -> [(i0)] }, { Stmt_for_body3_last[i0, i1] -> [(i1)] }]"
>       permutable: 1
>       coincident: [ 1, 1 ]
>   - filter: "{ Stmt_for_body8[i0, i1, i2] }"
>     child:
>       schedule: "[{ Stmt_for_body8[i0, i1, i2] -> [(i0)] },
>                   { Stmt_for_body8[i0, i1, i2] -> [(i1)] },
>                   { Stmt_for_body8[i0, i1, i2] -> [(i2)] }]"
>       permutable: 1
>       coincident: [ 1, 1, 0 ]
> 
> domain: "{ Stmt2[i0, i1, i2] : 0 <= i0 <= 31 and 0 <= i1 <= 31 and 0 <= i2 <= 31 }"
>  child:
>   schedule: "[{ Stmt2[i0, i1, i2] -> [(i0)] }, { Stmt2[i0, i1, i2] -> [(i1)] }, { Stmt2[i0, i1, i2] -> [(i2)] }]"
>   permutable: 1
>   coincident: [ 1, 1, 0 ]
> 
> Sorry, I have not committed an updated version of the optimization of TC to my github repo. However, I believe that, if this is that case, we can safely replace all such nodes.
> 
> +  auto NodeType = isl_schedule_node_get_type(Node.get());
> +  while ((NodeType != isl_schedule_node_domain) &&
> +         (NodeType != isl_schedule_node_filter)) {
> +    assert((NodeType != isl_schedule_node_sequence) &&
> +           L"Prevent the undefined behavior");
> +    Node = Node.parent();
> +    NodeType = isl_schedule_node_get_type(Node.get());
> +  }
> +  Node = Node.child(0);
> +  Node = isl::manage(isl_schedule_node_cut(Node.release()));
> +  return Node.insert_partial_schedule(Dimensions);
> 
> I think taht the detection of a more sophisticated implementations of TC is a possible goal of a future research.
> 
> I have described this in the comment.
I think some info in the comment like "all surrounding band nodes are assumed to be part of the TC and must not be interleaved by filter nodes."

Since it is not checking for it, it seems to imply that all other nodes types are OK? (sequence, set, expansion, extension, marker). Maybe reject them too? (I think ignoring marker nodes might still be ok)



================
Comment at: polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm_2.ll:1-3
+; RUN: opt %loadPolly -polly-delicm -polly-simplify -polly-opt-isl \
+; RUN: -polly-pattern-matching-based-opts=true \
+; RUN: -polly-tc-opt=true -debug < %s 2>&1 | FileCheck %s
----------------
gareevroman wrote:
> Meinersbur wrote:
> > Since this is not FileCheck-ing the LLVM-IR output, suppress it with `-disable-output`
> Could we fix the existing test cases in a separate patch?
> 
> polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm_2.ll
> 
> ; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
> ; REQUIRES: asserts
> 
> polly/test/ScheduleOptimizer/pattern-matching-based-opts_16.ll
> 
> ; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
> ; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
> 
> polly/test/ScheduleOptimizer/pattern-matching-based-opts_17.ll
> 
> ; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
> ; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
> 
> polly/test/ScheduleOptimizer/pattern-matching-based-opts_18.ll
> 
> ; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
> ; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
> 
> polly/test/ScheduleOptimizer/pattern-matching-based-opts_19.ll
> 
> ; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
> ; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
> 
> polly/test/ScheduleOptimizer/pattern-matching-based-opts_20.ll
> 
> ; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
> ; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
> 
👍


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D114336/new/

https://reviews.llvm.org/D114336



More information about the llvm-commits mailing list