[llvm-dev] Determination of statements that contain only matrix multiplication

Tue May 17 07:37:24 PDT 2016

On 05/17/2016 01:47 PM, Michael Kruse wrote:
> 2016-05-16 19:52 GMT+02:00 Roman Gareev <gareevroman at gmail.com>:
>> Hi Tobias,
>>
>> could we use information about memory accesses of a SCoP statement and
>> def-use chains to determine statements, which don’t contain matrix
>> multiplication of the following form?
> 
> Assuming s/don't/do you want to pattern-match gemm kernels inside larger scops.
> 
> 
>> for (int i = 0; i < Upper Bound1; i++)
>>   for (int j = 0; j < Upper Bound2; j++)
>>     for (int k = 0; k < Upper Bound3; j++)
>>       C[i][j] += A[i][k] * B[k][j]
>>
>> We could probably check that memory access relations have the following form:
>>
>> "accesses" : [
>>   {
>>     "kind" : "read",
>>     "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_1[i0, i2] }"
>>   },
>>   {
>>     "kind" : "read",
>>     "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_2[i2, i1] }"
>>   },
>>   {
>>     "kind" : "read",
>>     "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_3[i0, i1] }"
>>   },
>>   {
>>     "kind" : "write",
>>     "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_4[i0, i1] }"
>>   }
>> ]
>>
>> and there are the following relations between access instructions of
>> the memory accesses:
>>
>> AccessInst1 --- def-use --- \
>> AccessInst2 --- def-use --- fmul
>>                                             |
>>                                        def-use
>>                                             |
>> AccessInst3 --- def-use --- fadd
>>                                             |
>>                                        def-use
>>                                             |
>>                                  store (AccessInst4)
>>
>> (fmul is a user of AccessInst1 and AccessInst2. fadd is a user of the
>> fmul and AccessInst3. store (AccessInst4) is a user of fadd)
> 
> You'd also have to check which access are not there. E.g. the value of
> C[i][j] might be used in the inner loop for some other purpose as
> well, which would make some transformations illegal.
> 
> Also notice that LICM will hoist the C[i][j] reduction out of the
> k-loop, i.e. the pattern is different. DeLICM (WIP) is intended to
> undo this again.

Very right.

>> Maybe it could be a temporary solution. I think that if the checks are
>> successfully passed and the basic block of the statement has exactly
>> 14 instructions, the statement contains matrix multiplication and can
>> be safely optimized with a generation of specific code, which takes
>> into account information about usage of SIMD registers.
> 
> What kind of transformation do you have in mind that cannot be done in
> a general way? If you want to take register pressure into account, why
> not also for other scops?

Good questions.

> Like the LoopIdiomPass, one could replace a detected gemm kernel my a
> call to a hand-optimized/external [sdcz]gemm function. However, I
> think the goal of Polly is to optimize such kernels by itself.

Right. This does not seem to be too interesting.

I am personally mostly interested in getting the pieces in place that
allow us to get close-to-peak performance for gemm, but agree with
Michael that we clearly want to generalize this in the future.

However, going with a vertical implementation first might be a good idea
(assuming the pattern matching is simple). In this case we could detect
a gemm kernel and use this logic to get all important optimizations we
need in place.

Best,
Tobias