[PATCH] D114336: [Polly] Generalize the pattern matching to the case of tensor contractions.

Fri Jul 29 23:56:52 PDT 2022

gareevroman marked 4 inline comments as done.
gareevroman added inline comments.

================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1219
+  for (MemoryAccess *MemA : reverse(Accesses)) {
+    if (!MemA->isLatestArrayKind())
+      continue;
----------------
Meinersbur wrote:
> gareevroman wrote:
> > Meinersbur wrote:
> > > I can use JScopImport to set a scalar memory access to a partial write without adding additional dependencies; that is, I don't think this can be just ignored.
> > > 
> > > I suggest to have a single function that calls `getAccessesInOrder` and sort out which MemAccess is read/write in there, then analyze them.
> > I am not sure whether modifications of implementations of tensor contractions, which contain read and write scalar memory accesses, are useful in practice. 
> > 
> > Moreover, since bundles of induction variables I, J, P can contain an unlimited number of dimensions, we possibly cannot follow the algorithm from the containsOnlyMatrMultAcc function, which permutes dimensions and checks that additional memory accesses have stride 0 in terms of dimensions MMM.i, MMM.j, and MMM.k. Consequently, such memory accesses can be treated as scalar memory accesses. I have not come up with an effective alternative yet.
> > 
> > That is why I do not consider scalar memory accesses in getWriteAccess and setReadAccesses functions. Could we mark it as TODO and do it future?
> The concern is that I can modify what `isLatestArrayKind()` returns by simply importing a JScop. The `continue` just ignores such weirdness but I think it is safer to fail in this case.
> 
> You yourself mention that scalar accesses are likely not useful, so why not fail when one is found instead (`return null` instead of `continue`)? Some exceptions may be possible, such as read-only scalars (`VirtualUse::ReadOnly`, `VirtualUse::Synthesizable`)
I've added such a return statement to avoid scalar write memory accesses. Sorry, I was wrong. We need scalar read memory accesses. For example, in the case of the following matrix-matrix multiplication, a SCoP statement, which represents the body of the loop, contains the constant alpha.

C = alpha*A*B

Could we accept non-partial scalar read memory accesses? I think this is legal.

================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1465-1470
+    // In the case of the tensor contraction, the difference between image
+    // elements and domain elements lies on a hyperplane where a dimension
+    // has the fixed value one.
+    isl::set Intersection = DepDeltas.fix_si(isl::dim::set, i, 1);
+    if (Intersection.is_empty())
+      continue;
----------------
Meinersbur wrote:
> gareevroman wrote:
> > Meinersbur wrote:
> > > This is going to check whether each element out of `Intersection` is a contraction over dimension `i`. Don't we also need to check that every iteration out of the band `i` is contributing to that contraction?
> > Could you clarify what do you mean by the band i? Are these indexes ki, which describe the dependencies?
> > 
> > isTcDep checks that the dependency has the form
> > 
> > /// S(..., ki, max(k(i + 1)), ..., max(kn), ...) -> S(..., ki + 1, min(k(i + 1)), ..., min(kn), …)
> There is a check `Intersection.is_empty()` which is going to detect if a dependency is completely missing. But what detects that only some of the dependencies are present. Such as:
> ```
> [p] -> { Stmt3[i0, i1, i2, i3] -> Stmt3[o0, o1, o2, o3] : .... and p != 0 }
> ```
> or 
> ```
> { Stmt3[i0, i1, i2, i3] -> Stmt3[o0, o1, o2, o3] : .... and i0 < 42 }
> ```
> (assuming contracting over `i=0`)
> 
> `isReductionCarriedOverDim` doesn't seem to check whether the dependency is over the complete domain either.
Are dependencies determined by the form of memory accesses? In the isCorrectAccessMap function, we check that memory accesses aren't partial. Isn't it sufficient?

I've tried to check whether the dependency is over the complete domain though.

================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1613
+  // The partial schedule should contain only one statement.
+  if (isl_union_set_n_set(Domain.get()) != 1)
+    return false;
----------------
Meinersbur wrote:
> gareevroman wrote:
> > Meinersbur wrote:
> > > This constraint should not be intrinsic to the algorithm, but I agree it to be easier to handle for now.
> > Could we add a TODO comment for this?
> Yes, that would be great.
Ok. I've left that TODO comment.

================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1619
+
+  // Check that all predecessors of the node contain all band nodes
+  // for the statement, which represents the tensor contraction.
----------------
Meinersbur wrote:
> 
Looks like I missed that. Sorry. I will fix it in the next patch.

================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1622
+  // Subsequently, such band nodes will be replaced by a single band node.
+  while (NodeType != isl_schedule_node_domain) {
+    if (HasFilterNode && (NodeType == isl_schedule_node_band))
----------------
Meinersbur wrote:
> gareevroman wrote:
> > Meinersbur wrote:
> > > This looks for the outermost node that is not a filter or band. Is it possible that while that outermost node is not a TC contraction, one of the inner ones might? What if the outermost node is a filter, looks like it would just `return false` in this case.
> > If I am not mistaken, this only checks that all band nodes, which represent the statement, are not split by filter nodes. These accepts a straightforward implementation of TC with/without delicm. For example,
> > 
> >  domain: "{ Stmt_for_body8[i0, i1, i2]  : 0 <= i0 <= 1599 and
> >                                           0 <= i1 <= 1799 and
> >                                           0 <= i2 <= 2199;
> >             Stmt_for_body3[i0, i1] :      0 <= i0 <= 1599 and
> >                                           0 <= i1 <= 1799;
> >             Stmt_for_body3_last[i0, i1] : 0 <= i0 <= 1599 and
> >                                           0 <= i1 <= 1799 }"
> >  child:
> >   sequence:
> >   - filter: "{ Stmt_for_body3[i0, i1] }"
> >     child:
> >       schedule: "[{ Stmt_for_body3[i0, i1] -> [(i0)] }, { Stmt_for_body3[i0, i1] -> [(i1)] }]"
> >       permutable: 1
> >       coincident: [ 1, 1 ]
> >   - filter: "{ Stmt_for_body3_last[i0, i1] }"
> >     child:
> >       schedule: "[{ Stmt_for_body3_last[i0, i1] -> [(i0)] }, { Stmt_for_body3_last[i0, i1] -> [(i1)] }]"
> >       permutable: 1
> >       coincident: [ 1, 1 ]
> >   - filter: "{ Stmt_for_body8[i0, i1, i2] }"
> >     child:
> >       schedule: "[{ Stmt_for_body8[i0, i1, i2] -> [(i0)] },
> >                   { Stmt_for_body8[i0, i1, i2] -> [(i1)] },
> >                   { Stmt_for_body8[i0, i1, i2] -> [(i2)] }]"
> >       permutable: 1
> >       coincident: [ 1, 1, 0 ]
> > 
> > domain: "{ Stmt2[i0, i1, i2] : 0 <= i0 <= 31 and 0 <= i1 <= 31 and 0 <= i2 <= 31 }"
> >  child:
> >   schedule: "[{ Stmt2[i0, i1, i2] -> [(i0)] }, { Stmt2[i0, i1, i2] -> [(i1)] }, { Stmt2[i0, i1, i2] -> [(i2)] }]"
> >   permutable: 1
> >   coincident: [ 1, 1, 0 ]
> > 
> > Sorry, I have not committed an updated version of the optimization of TC to my github repo. However, I believe that, if this is that case, we can safely replace all such nodes.
> > 
> > +  auto NodeType = isl_schedule_node_get_type(Node.get());
> > +  while ((NodeType != isl_schedule_node_domain) &&
> > +         (NodeType != isl_schedule_node_filter)) {
> > +    assert((NodeType != isl_schedule_node_sequence) &&
> > +           L"Prevent the undefined behavior");
> > +    Node = Node.parent();
> > +    NodeType = isl_schedule_node_get_type(Node.get());
> > +  }
> > +  Node = Node.child(0);
> > +  Node = isl::manage(isl_schedule_node_cut(Node.release()));
> > +  return Node.insert_partial_schedule(Dimensions);
> > 
> > I think taht the detection of a more sophisticated implementations of TC is a possible goal of a future research.
> > 
> > I have described this in the comment.
> I think some info in the comment like "all surrounding band nodes are assumed to be part of the TC and must not be interleaved by filter nodes."
> 
> Since it is not checking for it, it seems to imply that all other nodes types are OK? (sequence, set, expansion, extension, marker). Maybe reject them too? (I think ignoring marker nodes might still be ok)
> 
> I think some info in the comment like "all surrounding band nodes are assumed to be part of the TC and must not be interleaved by filter nodes."

I've added it it.

> Since it is not checking for it, it seems to imply that all other nodes types are OK? (sequence, set, expansion, extension, marker). Maybe reject them too? (I think ignoring marker nodes might still be ok)

Sequence nodes could be necessary, if DeLICM was applied. Please, see the example inside the isTCPattern. Yes, I think other types except for marker nodes should be rejected. Additionally, as a precaution, I propose to check that a filter node has only a sequence and a domain nodes as its predecessors.  I've updated the patch.

================
Comment at: polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm_2.ll:1-3
+; RUN: opt %loadPolly -polly-delicm -polly-simplify -polly-opt-isl \
+; RUN: -polly-pattern-matching-based-opts=true \
+; RUN: -polly-tc-opt=true -debug < %s 2>&1 | FileCheck %s
----------------
Meinersbur wrote:
> gareevroman wrote:
> > Meinersbur wrote:
> > > Since this is not FileCheck-ing the LLVM-IR output, suppress it with `-disable-output`
> > Could we fix the existing test cases in a separate patch?
> > 
> > polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm_2.ll
> > 
> > ; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
> > ; REQUIRES: asserts
> > 
> > polly/test/ScheduleOptimizer/pattern-matching-based-opts_16.ll
> > 
> > ; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
> > ; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
> > 
> > polly/test/ScheduleOptimizer/pattern-matching-based-opts_17.ll
> > 
> > ; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
> > ; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
> > 
> > polly/test/ScheduleOptimizer/pattern-matching-based-opts_18.ll
> > 
> > ; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
> > ; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
> > 
> > polly/test/ScheduleOptimizer/pattern-matching-based-opts_19.ll
> > 
> > ; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
> > ; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
> > 
> > polly/test/ScheduleOptimizer/pattern-matching-based-opts_20.ll
> > 
> > ; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
> > ; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
> > 
> 👍
Ok. I've added this to my TODO list.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D114336/new/

https://reviews.llvm.org/D114336