[PATCH] D114336: [Polly] Generalize the pattern matching to the case of tensor contractions.

Sun Jun 12 04:12:52 PDT 2022

gareevroman marked 6 inline comments as done.
gareevroman added a comment.

In D114336#3517540 <https://reviews.llvm.org/D114336#3517540>, @Meinersbur wrote:

> In D114336#3500858 <https://reviews.llvm.org/D114336#3500858>, @gareevroman wrote:
>
>> 1
>> Yes, it was intended. The transformation helps to optimize a class of programs, which is broader then a tensor contraction. However, it heavily depends on the codegen part. I think that the improvement of the detection can be the goal of the future work.
>
> Please document what pattern is intended to be recognized. I don't think the doc for `isTCPattern` is sufficient, it only mentioned what is checked. Documenting the intended pattern would help identifying if a check has been forgotten. E.g. for the statement domain.

I've added a description of the TC-like kernel, which is the intended pattern, to the doc for isTCPattern function. I've added additional remarks according to your comments and restrictions of the current implementation.

>> As far as I know, in these cases, the codegen modifies some memory accesses. Consequently, they are not correspond to the current pattern.
>>
>>   ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
>>       { Stmt3[i0, i1, i2, i3] -> MemRef0[i0, i2, 1 + i3] };
>>   ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
>>       { Stmt3[i0, i1, i2, i3] -> MemRef1[1 + i3, i1, i2] };
>>   ReadAccess :=	[Reduction Type: +] [Scalar: 0]
>>       { Stmt3[i0, i1, i2, i3] -> MemRef2[i0, i1] };
>>   MustWriteAccess :=	[Reduction Type: +] [Scalar: 0]
>>       { Stmt3[i0, i1, i2, i3] -> MemRef2[i0, i1] };
>
> What do you mean by "codegen modifies some memory accesses"? Polly's Codegen?

Sorry, I meant ScopBuilder. In the following case

  void foo(double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
  for (int i = 0; i < 1024; i++)
      for (int j = 0; j < 1024; j++)
        for (int l = 0; l < 64; ++l)
           if (l != 0)
             for (int w = 0; w < 64; ++w)
               C[i][j] += A[i][l][w] * B[w][j][l];
  }

ScopBuilder generates the following memory accesses, which correspond to the pattern:

  { Stmt4[i0, i1, i2, i3] -> MemRef0[o0, o1] : o0 = i0 and o1 = i1 }
  { Stmt4[i0, i1, i2, i3] -> MemRef3[o0, o1, o2] : o0 = i3 and o1 = i1 and o2 = i2 }
  { Stmt4[i0, i1, i2, i3] -> MemRef2[o0, o1, o2] : o0 = i0 and o1 = i2 and o2 = i3 }
  { Stmt4[i0, i1, i2, i3] -> MemRef0[o0, o1] : o0 = i0 and o1 = i1 }

If we changes that code a bit

  void foo(int n, double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
  for (int i = 0; i < 1024; i++)
      for (int j = 0; j < 1024; j++)
        for (int l = 0; l < 64; l++)
          for (int w = 0; w < 64; ++w)
             if (w != 0)
               C[i][j] += A[i][l][w] * B[w][j][l];
  }

ScopBuilder generates the following memory accesses:

  { Stmt3[i0, i1, i2, i3] -> MemRef2[o0, o1] : o0 = i0 and o1 = i1 }
  { Stmt3[i0, i1, i2, i3] -> MemRef2[o0, o1] : o0 = i0 and o1 = i1 }
  { Stmt3[i0, i1, i2, i3] -> MemRef1[o0, o1, o2] : o0 = 1 + i3 and o1 = i1 and o2 = i2 }
  { Stmt3[i0, i1, i2, i3] -> MemRef0[o0, o1, o2] : o0 = i0 and o1 = i2 and o2 = 1 + i3 }

In the context of the previous discussion, I meant that memory accesses are modified in comparison to the previous considered case.

> What is the check to exclude this?

They will be rejected by isTCOperandAcc. If we fix the output dimensions, values of output dimensions will not form a permutation of a subset of values of input dimensions. Please see comments inside this function.

> Is the `1 + i3` memory access expression?

Yes, I believe so.

> Where does it come from?

If we look at the domain for the i3 variable, we see that the value 0 from the domain of w-loop is excluded and the loop bounds are modified to start from 0. Memory accesses correspond to this.

domain: "{ Stmt3[i0, i1, i2, i3] : 0 <= i0 <= 1023 and 0 <= i1 <= 1023 and 0 <= i2 <= 63 and 0 <= i3 <= 62 }"

>> 4
>>
>>> Here, i occurs as indices for A, B, and C and detected as TC. Is this supported?
>>
>>
>>
>>> void foo(double C[64][64], double A[64][64][64], double B[64][64][64]) {
>>>
>>>   for (int i = 0; i < 32; i++)
>>>   for (int j = 0; j < 32; j++)
>>>   for (int l = 0; l < 32; l++)
>>>   for (int w = 0; w < 32; w++)
>>>       C[i][j] += A[i][l][w] * B[w][j][i];
>>>
>>> }
>>
>> For some reason, I cannot reproduce that. I have added a corresponding test case. As far as I understand, that should be detected because of the line 1365.
>
> Should or should not be detected?

That should not be detected, because the intersection of free and contracted indices should always be empty. We check this at the line "if (intersect(IandJIndexSet, TCI.P).size() != 0)".

> While debugging, it is now rejected because of this:
> ``
>
>   if (!TCI.B) {
>     // IndexSet should be a union of J and P sets.
>     if (unite(TCI.P, TCI.J) != IndexSet)
>       return false;
>
> ``

You’re right. Thank you. "if (intersect(IandJIndexSet, TCI.P).size() != 0)" doesn’t help in this case. In that example, we have dependencies of the form:

  { Stmt3[i0, i1, i2, i3] -> Stmt3[o0, o1, o2, o3] : (o0 = i0 and o1 = i1 and o2 = i2 and o3 = 1 + i3 and i0 >= 0 and i0 <= 31 and i1 >= 0 and i1 <= 31 and i2 >= 0 and i2 <= 31 and i3 >= 0 and i3 <= 30) or (i3 = 31 and o0 = i0 and o1 = i1 and o2 = 1 + i2 and o3 = 0 and i0 >= 0 and i0 <= 31 and i1 >= 0 and i1 <= 31 and i2 >= 0 and i2 <= 30) }

the isl ast has the form:

  { domain: "{ Stmt3[i0, i1, i2, i3] : 0 <= i0 <= 31 and 0 <= i1 <= 31 and 0 <= i2 <= 31 and 0 <= i3 <= 31 }", child: { mark: "Loop with Metadata", child: { schedule: "[{ Stmt3[i0, i1, i2, i3] -> [(i0)] }]", child: { mark: "Loop with Metadata", child: { schedule: "[{ Stmt3[i0, i1, i2, i3] -> [(i1)] }]", child: { mark: "Loop with Metadata", child: { schedule: "[{ Stmt3[i0, i1, i2, i3] -> [(i2)] }]", child: { mark: "Loop with Metadata", child: { schedule: "[{ Stmt3[i0, i1, i2, i3] -> [(i3)] }]" } } } } } } } } }
  if (1 && (&MemRef5[31][31][32] <= &MemRef0[0][0] || &MemRef0[31][32] <= &MemRef5[0][0][0]) && (&MemRef4[31][31][32] <= &MemRef0[0][0] || &MemRef0[31][32] <= &MemRef4[0][0][0]))

      // Loop with Metadata
      for (int c0 = 0; c0 <= 31; c0 += 1) {
        // Loop with Metadata
        for (int c1 = 0; c1 <= 31; c1 += 1) {
          // Loop with Metadata
          for (int c2 = 0; c2 <= 31; c2 += 1) {
            // Loop with Metadata
            for (int c3 = 0; c3 <= 31; c3 += 1)
              Stmt3(c0, c1, c2, c3);
          }
        }
      }

  else
      {  /* original code */ }

Consequently, only "l" and "w" are treated as "contracted indices", which are stored in TCI.P.  Sorry, I missed that.

If indexes of an operand of the tensor contraction don’t contain TCI.P, we don't accept the program. We check this in lines

  …
  if (!isSuperset(IndexSet, TCI.P))
        return false;
  …

  …
  if (unite(TCI.P, TCI.J) != IndexSet)
        return false;
  …

unite(TCI.P, TCI.J) still isn't equal to IndexSet in the considered case, because IndexSet doesn't contain the "l".

If we didn't have the "l-loop", unite(TCI.P, TCI.J) still would not be equal to IndexSet, which would contain i, j and w, because TCI.I would contain only i and TCI.J would contain only j.

test/ScheduleOptimizer/pattern-matching-based-opts_21.ll is the corresponding test case. Additionally, I've added test/ScheduleOptimizer/pattern-matching-based-opts_25.ll.

P.S.: I had to use the -fno-unroll-loops option of clang. Otherwise, one of the loops is optimized out on my machine.

> Could you choose more meaningful identifiers than I, J, P and J, or use them in the pattern described in isTCPattern? I think of something like:

I’ve added a description of bundles I, J, and P to the description of the pattern. I hope it clarifies their purpose. I propose to apply the terminology, which is used in the paper [1] and it predecessors (e.g., [2]), to simplify the understanding of the code for their readers.

[1] - Gareev R., Grosser T., Kruse M. High-Performance Generalized Tensor Operations: A Compiler-Oriented Approach // ACM Transactions Architecture and Code Optimization (TACO). 2018. Vol. 15, no. 3. P. 34:1–34:27. DOI: 10.1145/3235029.
[2] - Matthews D. High-Performance Tensor Contraction without BLAS // SIAM Journal on Scientific Computing. 2018. Vol. 40, no. 1. P. C 1—C 24. DOI: 110.1137/16m108968x.

>   for (...) {
>     ...
>     for (...) {
>       if (c)
>         auto acc = C[P]; // ReadFromC
>         auto a = A[Pa, I];
>         auto b = B[Pb,J];
>         auto arg = f(a,b);
>         acc = acc op arg;
>         C[P] = acc; // WriteToC
>       }
>     }
>   }
>   
>   where P, I, J are sets of indices of the surrounding loops, Pa and Pb are subsets of P, I are the indices only occurring in the subscript of reading from A, J are the indices only occurring in the subscript for reading from B. There must be no indices not occuring in either P, I or J.

I think the definition of the TC-like corresponds to the information about sets P, I, J. Probably, it’s redundant to introduce the terminology for subsets at this point. Could we do this in the description of the optimization, if it’d be needed?

> `op=` is a commutative operation ....,

I think this a redundant condition. However, you can find it in the paper [1]. I believe that, if preserve the order of loops with indexes from the bundle P during the optimization, there would not be any violation.

> c is an affine condition usually just `true`.

Could you elaborate on that? I’m not sure that I understand where such a condition is used.

> `f` is a side-effect free operation.

If I’m not mistaken, according to, for example, ScopDetection.cpp, only side effect free functions calls can be located inside a Scop.

>   (I don't whether this is correct, I want to understand whether the checked conditions are sufficient).
>   
>   > 5
>   > 
>   > I think that that it is redundant to require that bands are marked as permutable, since we check the form of dependencies and memory accesses. I propose to remove such checks for pattern matching optimizations.
>   
>   Ok.

================
Comment at: polly/lib/Transform/MatmulOptimizer.cpp:1223
+                   const std::set<int> &IandJIndexSet,
+                   SmallVector<int, 30> Dimensions, TCInfoTy &TCI) {
+  if (!TCI.A) {
----------------
Meinersbur wrote:
> gareevroman wrote:
> > Meinersbur wrote:
> > > 
> > As far as I understand, we cannot do this here because of the assignment to TCI.ADimensions and TCI.BDimensions
> The assignments should just make a copy of the array . With `Dimensions` being passed by-value, the caller has to make the copy which it should not need to.
> 
> `SmallVector` has no overload for being assigned an `ArrayRef`, but you could use `llvm::append_range` to insert all the values.
I think we can use llvm::replace to avoid clearing the vector and preserve the logic.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D114336/new/

https://reviews.llvm.org/D114336