[PATCH] D114336: [Polly] Generalize the pattern matching to the case of tensor contractions.

Fri Jul 29 23:48:57 PDT 2022

gareevroman added a comment.

In D114336#3621094 <https://reviews.llvm.org/D114336#3621094>, @Meinersbur wrote:

> In D114336#3576199 <https://reviews.llvm.org/D114336#3576199>, @gareevroman wrote:
>
>>   void foo(int n, double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
>>   for (int i = 0; i < 1024; i++)
>>       for (int j = 0; j < 1024; j++)
>>         for (int l = 0; l < 64; l++)
>>           for (int w = 0; w < 64; ++w)
>>              if (w != 0)
>>                C[i][j] += A[i][l][w] * B[w][j][l];
>>   }
>>
>> ScopBuilder generates the following memory accesses:
>>
>>   { Stmt3[i0, i1, i2, i3] -> MemRef2[o0, o1] : o0 = i0 and o1 = i1 }
>>   { Stmt3[i0, i1, i2, i3] -> MemRef2[o0, o1] : o0 = i0 and o1 = i1 }
>>   { Stmt3[i0, i1, i2, i3] -> MemRef1[o0, o1, o2] : o0 = 1 + i3 and o1 = i1 and o2 = i2 }
>>   { Stmt3[i0, i1, i2, i3] -> MemRef0[o0, o1, o2] : o0 = i0 and o1 = i2 and o2 = 1 + i3 }
>>
>> In the context of the previous discussion, I meant that memory accesses are modified in comparison to the previous considered case.
>>
>>> Where does it come from?
>>
>> If we look at the domain for the i3 variable, we see that the value 0 from the domain of w-loop is excluded and the loop bounds are modified to start from 0. Memory accesses correspond to this.
>
> Looks like some other optimization (maybe JumpThreading?) modifies the loop range. Ideally, the detection would be robust enough to not depend on the whether the domain space has an offset.

I agree. Improving the detection is a possible goal of a future work.

>>> Could you choose more meaningful identifiers than I, J, P and J, or use them in the pattern described in isTCPattern? I think of something like:
>>
>> I’ve added a description of bundles I, J, and P to the description of the pattern. I hope it clarifies their purpose. I propose to apply the terminology, which is used in the paper [1] and it predecessors (e.g., [2]), to simplify the understanding of the code for their readers.
>
> In a paper the names must be shorter to fit on the page and a described close to the figure. It also should not be necessary to have access to the paper to understand the algorithm. For narrowly scoped variables such as loop counters single letters might be ok because the definition is likely on the same screen (or for a paper: the same page), but globals should be more identifiable.  paper [1] also does not make the connection to the symbols it uses and what they correspond to in Polly's data structures.
>
> However, I don't request  such a change atm.

Ok. I've tried to make the description of the algorithm self-consistent.

>>>   for (...) {
>>>     ...
>>>     for (...) {
>>>       if (c)
>>>         auto acc = C[P]; // ReadFromC
>>>         auto a = A[Pa, I];
>>>         auto b = B[Pb,J];
>>>         auto arg = f(a,b);
>>>         acc = acc op arg;
>>>         C[P] = acc; // WriteToC
>>>       }
>>>     }
>>>   }
>>>   
>>>   where P, I, J are sets of indices of the surrounding loops, Pa and Pb are subsets of P, I are the indices only occurring in the subscript of reading from A, J are the indices only occurring in the subscript for reading from B. There must be no indices not occuring in either P, I or J.
>
>
>
>> I think the definition of the TC-like corresponds to the information about sets P, I, J. Probably, it’s redundant to introduce the terminology for subsets at this point. Could we do this in the description of the optimization, if it’d be needed?
>
> It is also 'redundant' with the code itself, but it helps understanding it.
>
> The description of `I` and `P` are "Input dimensions of the schedule space, which represent free indices of tensors." One has to know what "free indices" are, indices of what, etc.
>
> The "definition" of `isTCPattern` is declarative, does not even explain what all those symbols are, and refers to our matmul paper which does not apply because it only is a special case of the TC pattern.

I've tried to improve the description.

>>> `op=` is a commutative operation ....,
>>
>> I think this a redundant condition. However, you can find it in the paper [1]. I believe that, if preserve the order of loops with indexes from the bundle P during the optimization, there would not be any violation.
>
> This is exactly the sort of thin I would want to clarify.

We don't check for associativity, because it's difficult and not necessary for the optimization. The optimization doesn't change the order of loops with indexes from the bundle P during the optimization, even if you parallize the outermost loop. Hence, it doesn't violate anything. For the same reason, we don't check for associativity in the case of the optimization of the generalization of matrix-matrix multiplication, which is currently used in Polly.

>>> c is an affine condition usually just `true`.
>>
>> Could you elaborate on that? I’m not sure that I understand where such a condition is used.
>
> It corresponds to a filter node in the TC body. You have used the `if (w != 0)` to illustrate where the access function deviates from the usually pattern which implied that it would be something you would like to support. If not, that would not be part of the pattern.

Ok. Unfortunately, the current approach does't support this. So, it's not the part of the pattern.

>>> `f` is a side-effect free operation.
>>
>> If I’m not mistaken, according to, for example, ScopDetection.cpp, only side effect free functions calls can be located inside a Scop.
>
> f is not necessarily a function call, but as mentioned a "operation" representing the calculation done in the the TC body.
>
> Side-effect here means something different than ScopDetection. A write to an unrelated array `D` would be a side-effect for the TC, but accurately represented by a Scop.
> We could allow unknown side-effects in polly in the future with a general "memory" dependency, an extension to what Polly already does with `-polly-allow-modref-calls`. These could just not be reordered relative to each other.

I agree that only side-effect free operations are considered in the pattern. Nevertheless, I propose not to use terms that may require an additional specification. I've tried to improve the description of the pattern.

> I really think the documentation should be better. I had a hard time fixing bugs in the matmul optimization just with understanding what the code is supposed to be doing after a long time and would prefer to no repeat that again. See rGcad9f98a2ad98fecf663e9ce39502b8e43676fc9 <https://reviews.llvm.org/rGcad9f98a2ad98fecf663e9ce39502b8e43676fc9> and rGa56bd7dec8da4348d847d53c96d8a30f4a821d36 <https://reviews.llvm.org/rGa56bd7dec8da4348d847d53c96d8a30f4a821d36>.

Sure. Let's continue improving it.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D114336/new/

https://reviews.llvm.org/D114336