[PATCH] D114336: [Polly] Generalize the pattern matching to the case of tensor contractions.

Michael Kruse via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Mar 28 18:37:05 PDT 2022


Meinersbur added a comment.
Herald added a project: All.

I am extremely sorry for the late review, I hope you could use the time to work on/polish the follow-up patch. Some things are a bit hard to understand independently. I promise to review in a more timely manner from now on, although I might always add that many comments.

Please add to the tests cases what they are supposed to be testing and maybe give them more meaningful filenames.

The following is successfully detected as tensor contraction. Is this intended?

  void foo(double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
       for (int i = 0; i < 1024; i++)
           for (int j = 0; j < 1024; j++)
             for (int l = 0; l < 64; ++l)
                if (l != 0)
                  for (int w = 0; w < 64; ++w)
                    C[i][j] += A[i][l][w] * B[w][j][l];
  }

It might be if the codegen part is able exclude the element 0. In contrast, this one is rejected:

  void foo(int n, double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
       for (int i = 0; i < 1024; i++)
           for (int j = 0; j < 1024; j++)
             for (int l = 0; l < 64; l++)
               for (int w = 0; w < 64; ++w)
                  if (w != 0)
                    C[i][j] += A[i][l][w] * B[w][j][l];
  }

or this:

  void foo(int n, double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
       for (int i = 0; i < 1024; i++)
           for (int j = 0; j < 1024; j++)
             for (int l = 0; l < 64; l+=2)
               for (int w = 0; w < 64; ++w)
                    C[i][j] += A[i][l][w] * B[w][j][l];
  }

I do get an assertion failure with this one:

  void foo(double C[64][64], double A[64][64][64], double B[64][64][64]) {
       for (int i = 0; i < 32; i++)
           for (int j = 0; j < 32; j++)
             for (int l = 0; l < 32; l++)
               for (int w = 0; w < 32; ++w)
                    C[i][j] += A[i][l][w] * B[w][j][i+3];
  }

Here, `i` occurs as indices for A, B, and C and detected as TC. Is this supported?

  void foo(double C[64][64], double A[64][64][64], double B[64][64][64]) {
      for (int i = 0; i < 32; i++)
      for (int j = 0; j < 32; j++)
      for (int l = 0; l < 32; l++)
      for (int w = 0; w < 32; w++)
          C[i][j] += A[i][l][w] * B[w][j][i];
  }

Is it possible to make it work with `-polly-reschedule=0` as well? AFAICS there is nothing that requires rescheduling and would allow us to test the detection independently. `isTCPattern` seems to already consider multiple bands; the hurdle at the moment seems to be that the bands are not marked as permutable.



================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:183-186
+  // Memory accesses that represent reading from tensors, which are operands of
+  // the tensor contraction.
+  MemoryAccess *A = nullptr;
+  MemoryAccess *B = nullptr;
----------------
Please use doxygen comments to describe class/struct members. `@{ @}` can be used to [[ https://stackoverflow.com/questions/5777982/doxygen-comment-multiple-variables-at-once  | group  ]] them.

Also add an empty line before a new comment begins.


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1136
+  // words, it is a Partial write and Partial writes must be rejected.
+  return AccMap.is_equal(PossibleTensor);
+}
----------------
I like the idea of verifying the correctness by reconstructing and comparing to the original.

Maybe do it at the end to verify that the entire `TCInfoTy` is correct? On the other size, earlier fail would be better. What do you think?


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1161-1164
+  isl::map CheckMap = isl::manage(AccMap.copy());
+  unsigned OutDimNum = unsignedFromIslSize(CheckMap.dim(isl::dim::out));
+  for (unsigned i = 0; i < OutDimNum; i++)
+    CheckMap = CheckMap.fix_si(isl::dim::out, i, i);
----------------
This is a weird way to find out which indices map to what other index, I guess the equivalent of `isMatMulOperandAcc`. It requires that the dimension number if part of the AccMap's range, and if there is any expression will fail (eg `Stmt[i][j] -> A[i-1][j+1]` matches the wrong dimensions), but at least there is the verification afterwards.

I am not sure I like this sort of cleverness; I'd rather expect some sort of introspection into the map's coefficients, but I also think this should work in nearly all relevant cases and should be save due to the verification, so lets keep it.

However, please document this better, eg. add an example on what is expected to happen.



 


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1171-1172
+  for (unsigned i : rangeIslSize(0, CheckMap.dim(isl::dim::in))) {
+    isl::val Val = isl::manage(
+        isl_map_plain_get_val_if_fixed(CheckMap.get(), isl_dim_in, i));
+    if (!Val.is_int())
----------------
The "plain" function are unfortunately not very robust, eg its result is different depending on the internal representation. I'd suggestion `getConstant` (from ISLTools) but only takes an pw_aff.

Could you extract uses of `plain_get_val_if_fixed` into such a function, and mark it as TODO to cope with it later?


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1219
+  for (MemoryAccess *MemA : reverse(Accesses)) {
+    if (!MemA->isLatestArrayKind())
+      continue;
----------------
I can use JScopImport to set a scalar memory access to a partial write without adding additional dependencies; that is, I don't think this can be just ignored.

I suggest to have a single function that calls `getAccessesInOrder` and sort out which MemAccess is read/write in there, then analyze them.


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1250-1252
+    // Probably IndexSet is a union of I and P sets.
+    if (intersect(IndexSet, TCI.P).size() != TCI.P.size())
+      return false;
----------------
Introduce a `is_superset` (etc) call?


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1270-1272
+    if (!(intersect(IndexSet, TCI.J).size() == TCI.J.size()) &&
+        (intersect(IndexSet, TCI.P).size() == TCI.P.size()) &&
+        (IndexSet.size() == TCI.P.size() + TCI.J.size()))
----------------
Could we add utility functions such that this becomes `unite(J, P) == IndexSet`?


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1374
+
+/// Check that dependency corresponds to the tensor contraction.
+///
----------------



================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1395
+///         and false, otherwise.
+static bool isTcDep(isl::set DepDelta, unsigned Pos, isl::set BoundDeltas,
+                    SmallDenseSet<int> *IndexSet) {
----------------
This seems to check setermine whether there is a reduction (contraction) carried by loop number `Pos`. The function name could be more meaningful. Suggestion: `isDepCarryingReductionOverDim` (not nice, but "TcDep" can mean anything)


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1396
+static bool isTcDep(isl::set DepDelta, unsigned Pos, isl::set BoundDeltas,
+                    SmallDenseSet<int> *IndexSet) {
+  // Check the difference between the image element and the domain element
----------------
Consider passing by const-reference.


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1399
+  // in the case of the index ki.
+  if (!DepDelta.plain_get_val_if_fixed(isl::dim::set, Pos).is_one())
+    return false;
----------------
`plain_get_val_if_fixed` is not really robust as it depends on the internal representation that can be different after eg simplify.

Since this just checks to a specific value, the best would be to create a new set where all the fixed dimensions are that value (here: 1), and check whether `DepDelta` is a subset of it.



================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1410-1412
+  isl::map DepDeltaNegToBoundDeltas = isl::map::from_domain_and_range(
+      isl::manage(isl_set_neg(DepDelta.copy())), BoundDeltas);
+  isl::set Complement = DepDeltaNegToBoundDeltas.deltas();
----------------
Why not `BoundDeltas.subtract()` instead of deltas?


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1451-1454
+  isl::union_map Dep = D->getDependences(Dependences::TYPE_RAW);
+  isl::union_map Red = D->getDependences(Dependences::TYPE_RED);
+  if (!Red.is_null())
+    Dep = Dep.unite(Red);
----------------
Should we also check whether WAW, RAW dependences are incompatible?


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1460-1461
+  unsigned DeltasDimNum = unsignedFromIslSize(DepDeltas.dim(isl::dim::set));
+  isl::pw_multi_aff LowerBound = Domain.lexmin_pw_multi_aff();
+  isl::pw_multi_aff BoundSub = Domain.lexmax_pw_multi_aff().sub(LowerBound);
+  auto BoundDeltas = isl::manage(isl_set_from_pw_multi_aff(BoundSub.release()));
----------------
lexmin/lexmax can be expensive. Wrap into a `IslMaxOperationsGuard`?


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1462
+  isl::pw_multi_aff BoundSub = Domain.lexmax_pw_multi_aff().sub(LowerBound);
+  auto BoundDeltas = isl::manage(isl_set_from_pw_multi_aff(BoundSub.release()));
+
----------------
You seem to assume an functional relationship from here on. If that's the case, you can keep the type a `pw_multi_aff` which supports more functions that you may have missed such as `pw_multi_aff::add`


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1464
+
+  for (int i = static_cast<int>(DeltasDimNum) - 1; i >= 0; i--) {
+    // In the case of the tensor contraction, the difference between image
----------------
Consider using `reverse(rangeIslSize(0,DeltasDimNum))` (from ISLTools.h).


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1465-1470
+    // In the case of the tensor contraction, the difference between image
+    // elements and domain elements lies on a hyperplane where a dimension
+    // has the fixed value one.
+    isl::set Intersection = DepDeltas.fix_si(isl::dim::set, i, 1);
+    if (Intersection.is_empty())
+      continue;
----------------
This is going to check whether each element out of `Intersection` is a contraction over dimension `i`. Don't we also need to check that every iteration out of the band `i` is contributing to that contraction?


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1574
+
+  orderDimensions(TCI);
+
----------------
This is not part of the pattern detection, but the optimization. Could we move it to the patch that does the actual optimization?


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1586-1587
+/// 2. there are exactly three input dimensions.
+/// 3. there are four memory accesses that represent accesses to tensor
+///    contraction operand.
+/// 4. all memory accesses of the statement except from the last one, are
----------------
Could you describe here what those 4 accesses are?


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1590-1591
+///    read memory accesses and the last one is a write memory access.
+/// 5. all subscripts of the last memory access of the statement do not
+///    contain the variable used in the inner loop.
+///
----------------
5. should not be necessary; any permutation of the surrounding loops can be valid. Eg, 

```
   for (w = 0; w < 64; ++w)
     for (l = 0; l < 64; ++l)
       for (i = 0; i < 1024; i++)
         for (j = 0; j < 1024; j++)
              C[i][j] += A[i][l][w] * B[w][j][l];
```
yields the same result.


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1592
+///    contain the variable used in the inner loop.
+///
+/// If this is the case, we could use an approach that is similar to
----------------
Could you add a high-level description how the algorithm actually works? I.e. dependencies used to determine contraction dimensions, etc.


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1602
+  Node = Node.child(0);
+  auto LeafType = isl_schedule_node_get_type(Node.get());
+  isl::union_map PartialSchedule = Node.get_prefix_schedule_union_map();
----------------
Prefer `Node.isa<isl::schedule_node_leaf>()` (and then typed subclass: `Node.as<isl_schedule_node_leaf>()`)


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1607-1610
+  // The innermost band node is expected.
+  if (LeafType != isl_schedule_node_leaf ||
+      isl_union_map_n_map(PartialSchedule.get()) != 1)
+    return false;
----------------
This condition is effectively identical to the next


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1613
+  // The partial schedule should contain only one statement.
+  if (isl_union_set_n_set(Domain.get()) != 1)
+    return false;
----------------
This constraint should not be intrinsic to the algorithm, but I agree it to be easier to handle for now.


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1616-1617
+
+  auto HasFilterNode = false;
+  auto NodeType = isl_schedule_node_get_type(Node.get());
+
----------------
[style] [[ https://llvm.org/docs/CodingStandards.html#use-auto-type-deduction-to-make-code-more-readable | No Almost-Always-Auto in LLVM's coding style. ]]


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1622
+  // Subsequently, such band nodes will be replaced by a single band node.
+  while (NodeType != isl_schedule_node_domain) {
+    if (HasFilterNode && (NodeType == isl_schedule_node_band))
----------------
This looks for the outermost node that is not a filter or band. Is it possible that while that outermost node is not a TC contraction, one of the inner ones might? What if the outermost node is a filter, looks like it would just `return false` in this case.


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1093
+  isl::map PossibleTensor = isl::manage(Universe.copy());
+  for (int i = 0; i < static_cast<int>(Dimensions.size()); i++) {
+    const int InPos = Dimensions[i];
----------------
gareevroman wrote:
> Meinersbur wrote:
> > or introduce `intFromIslSize`.
> As far I understand, Dimensions.size() returns a value of type size_t instead of a value of the type isl_size. So, in the new version I used the unsigned type to avoid the cast.
`rangeIslSize` should make it easier.


================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1261
+  TCI.ReadFromC = nullptr;
+  SmallVector<MemoryAccess *, 32> Accesses = getAccessesInOrder(*Stmt);
+  for (auto *MemA = Accesses.begin(); *MemA != TCI.WriteToC; MemA++) {
----------------
gareevroman wrote:
> Meinersbur wrote:
> > `getAccessesInOrder` requires `Stmt` to not be a RegionStmt. Please add a test for it.
> I’ve added a check to containsOnlyTCAcc. Could you clarify how the test case should look like? Should it be a region statement that contains a matrix multiplication with right order of memory accesses?
Test in `containsOnlyTCAcc` is exactly what I was looking for. A region statement could look like this:

```
c = C[i][j];
if (/*non-affine condition*/) {
  (void)A[i][k] + B[k][j];
} else {
  C[i][j] = c;
}
```
which has the correct order of accesses but is obviously not what we are looking for.



================
Comment at: polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm.ll:13
 ;
+; This test case generates the following schedule, which contans filters:
+;
----------------



================
Comment at: polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm_2.ll:1-3
+; RUN: opt %loadPolly -polly-delicm -polly-simplify -polly-opt-isl \
+; RUN: -polly-pattern-matching-based-opts=true \
+; RUN: -polly-tc-opt=true -debug < %s 2>&1 | FileCheck %s
----------------
Since this is not FileCheck-ing the LLVM-IR output, suppress it with `-disable-output`


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D114336/new/

https://reviews.llvm.org/D114336



More information about the llvm-commits mailing list