[llvm-dev] Determination of statements that contain only matrix multiplication

Roman Gareev via llvm-dev llvm-dev at lists.llvm.org
Thu May 19 07:09:33 PDT 2016


Thank you for the comments and answers!

2016-05-17 19:37 GMT+05:00 Tobias Grosser <tobias at grosser.es>:
> On 05/17/2016 01:47 PM, Michael Kruse wrote:
>> 2016-05-16 19:52 GMT+02:00 Roman Gareev <gareevroman at gmail.com>:
>>> Hi Tobias,
>>>
>>> could we use information about memory accesses of a SCoP statement and
>>> def-use chains to determine statements, which don’t contain matrix
>>> multiplication of the following form?
>>
>> Assuming s/don't/do you want to pattern-match gemm kernels inside larger scops.
>>
>>
>>> for (int i = 0; i < Upper Bound1; i++)
>>>   for (int j = 0; j < Upper Bound2; j++)
>>>     for (int k = 0; k < Upper Bound3; j++)
>>>       C[i][j] += A[i][k] * B[k][j]
>>>
>>> We could probably check that memory access relations have the following form:
>>>
>>> "accesses" : [
>>>   {
>>>     "kind" : "read",
>>>     "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_1[i0, i2] }"
>>>   },
>>>   {
>>>     "kind" : "read",
>>>     "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_2[i2, i1] }"
>>>   },
>>>   {
>>>     "kind" : "read",
>>>     "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_3[i0, i1] }"
>>>   },
>>>   {
>>>     "kind" : "write",
>>>     "relation" : "{ Stmt_14[i0, i1, i2] -> MemRef_4[i0, i1] }"
>>>   }
>>> ]
>>>
>>> and there are the following relations between access instructions of
>>> the memory accesses:
>>>
>>> AccessInst1 --- def-use --- \
>>> AccessInst2 --- def-use --- fmul
>>>                                             |
>>>                                        def-use
>>>                                             |
>>> AccessInst3 --- def-use --- fadd
>>>                                             |
>>>                                        def-use
>>>                                             |
>>>                                  store (AccessInst4)
>>>
>>> (fmul is a user of AccessInst1 and AccessInst2. fadd is a user of the
>>> fmul and AccessInst3. store (AccessInst4) is a user of fadd)
>>
>> You'd also have to check which access are not there. E.g. the value of
>> C[i][j] might be used in the inner loop for some other purpose as
>> well, which would make some transformations illegal.
>>
>> Also notice that LICM will hoist the C[i][j] reduction out of the
>> k-loop, i.e. the pattern is different. DeLICM (WIP) is intended to
>> undo this again.
>
> Very right.

I agree with it. However, it could probably be used, if we were able
to apply only transformations, which can be done in general way.

>>> Maybe it could be a temporary solution. I think that if the checks are
>>> successfully passed and the basic block of the statement has exactly
>>> 14 instructions, the statement contains matrix multiplication and can
>>> be safely optimized with a generation of specific code, which takes
>>> into account information about usage of SIMD registers.
>>
> What kind of transformation do you have in mind that cannot be done in
> a general way? If you want to take register pressure into account, why
> not also for other stops?

I wasn’t sure whether llvm is able to hoist loads into SIMD registers
and sink stores from them. However, yesterday I found out that in some
cases llvm can perform such transformations.

For example, let’s consider the following code:

Loop 1 for ic = 0, ..., m−1 in steps of mc
Loop 2   for pc = 0, ..., k−1 in steps of kc
                A(ic : ic + mc − 1, pc : pc + kc − 1) → Ac // Pack into Ac
Loop 3     for jc = 0, ..., n−1 in steps of nc
                  B(pc : pc + kc − 1, jc : jc + nc − 1) → Bc // Pack into Bc
Loop 4       for ir = 0, ..., mc −1 in steps of mr
Loop 5         for jr = 0, ..., nc −1 in steps of nr
Loop 6           for pr = 0, ..., kc −1 in steps of 1
Loop 7             for ir' = 0, ..., mr −1 in steps of 1
Loop 8               for jr' = 0, ..., nr −1 in steps of 1
                            C[ic * mr + i * mc + ir'][jc * nr + j * nc
+ jr'] += Ac[mr * (pr + kc * ic) + ir'] * Bc[(pr + jc * kc) * nr +
jr'];
                            endfor
                          endfor
                        endfor
                      endfor
                    endfor
                  endfor
                endfor
              endfor

To get closer to an implementation of the algorithm from [1] for
matrices stored in row-major order, we can unroll loop 7 and loop 8
and perform vectorization with llvm (a corresponding code can be found
attached). According to the attached IR, llvm sinks and hoists stores
and loads related to matrix C.

Nevertheless, llvm can’t do it, if, for example, we call prefetch
within loop 6 or apply packing transformations in a way that is
similar to the one mentioned in [2] (corresponding implementations are
attached to the email). I haven’t found the reason yet.

Refs:

[1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
[2] - http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm

>> Like the LoopIdiomPass, one could replace a detected gemm kernel my a
>> call to a hand-optimized/external [sdcz]gemm function. However, I
>> think the goal of Polly is to optimize such kernels by itself.
>
> Right. This does not seem to be too interesting.
>
> I am personally mostly interested in getting the pieces in place that
> allow us to get close-to-peak performance for gemm, but agree with
> Michael that we clearly want to generalize this in the future.
>
> However, going with a vertical implementation first might be a good idea
> (assuming the pattern matching is simple). In this case we could detect
> a gemm kernel and use this logic to get all important optimizations we
> need in place.

Yes, it is planned for the future. Sorry, I should probably have
mentioned it in the proposal.

-- 
                                    Cheers, Roman Gareev.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gemm_SIMD.c
Type: text/x-csrc
Size: 8387 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160519/ce7cd76c/attachment.c>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: IR
Type: application/octet-stream
Size: 17577 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160519/ce7cd76c/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gemm_SIMD_with_prefetch.c
Type: text/x-csrc
Size: 8492 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160519/ce7cd76c/attachment-0001.c>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gemm_C_SIMD_another_packing.c
Type: text/x-csrc
Size: 7900 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160519/ce7cd76c/attachment-0002.c>


More information about the llvm-dev mailing list