# [llvm-dev] Determination of statements that contain only matrix multiplication

Tobias Grosser via llvm-dev llvm-dev at lists.llvm.org
Fri May 20 07:11:40 PDT 2016

```On 05/20/2016 03:07 PM, Roman Gareev wrote:
> 2016-05-20 14:17 GMT+05:00 Tobias Grosser <tobias at grosser.es>:
>>>>>> Maybe it could be a temporary solution. I think that if the checks are
>>>>>> successfully passed and the basic block of the statement has exactly
>>>>>> 14 instructions, the statement contains matrix multiplication and can
>>>>>> be safely optimized with a generation of specific code, which takes
>>>>>> into account information about usage of SIMD registers.
>>>>>
>>>> What kind of transformation do you have in mind that cannot be done in
>>>> a general way? If you want to take register pressure into account, why
>>>> not also for other stops?
>>>
>>> I wasn’t sure whether llvm is able to hoist loads into SIMD registers
>>> and sink stores from them. However, yesterday I found out that in some
>>> cases llvm can perform such transformations.
>>>
>>> For example, let’s consider the following code:
>>>
>>> Loop 1 for ic = 0, ..., m−1 in steps of mc
>>> Loop 2   for pc = 0, ..., k−1 in steps of kc
>>>                 A(ic : ic + mc − 1, pc : pc + kc − 1) → Ac // Pack into Ac
>>> Loop 3     for jc = 0, ..., n−1 in steps of nc
>>>                   B(pc : pc + kc − 1, jc : jc + nc − 1) → Bc // Pack into Bc
>>> Loop 4       for ir = 0, ..., mc −1 in steps of mr
>>> Loop 5         for jr = 0, ..., nc −1 in steps of nr
>>> Loop 6           for pr = 0, ..., kc −1 in steps of 1
>>> Loop 7             for ir' = 0, ..., mr −1 in steps of 1
>>> Loop 8               for jr' = 0, ..., nr −1 in steps of 1
>>>                             C[ic * mr + i * mc + ir'][jc * nr + j * nc
>>> + jr'] += Ac[mr * (pr + kc * ic) + ir'] * Bc[(pr + jc * kc) * nr +
>>> jr'];
>>>                             endfor
>>>                           endfor
>>>                         endfor
>>>                       endfor
>>>                     endfor
>>>                   endfor
>>>                 endfor
>>>               endfor
>>>
>>> To get closer to an implementation of the algorithm from [1] for
>>> matrices stored in row-major order, we can unroll loop 7 and loop 8
>>> and perform vectorization with llvm (a corresponding code can be found
>>> attached). According to the attached IR, llvm sinks and hoists stores
>>> and loads related to matrix C.
>>>
>>> Nevertheless, llvm can’t do it, if, for example, we call prefetch
>>> within loop 6
>>
>> This is likely because LLVM does not understand the side-effects
>> of the prefetch instruction.
>>
>>> or apply packing transformations in a way that is
>>> similar to the one mentioned in [2] (corresponding implementations are
>>> attached to the email). I haven’t found the reason yet.
>>
>> I assume this is also due to the unknown side-effects of the packing
>> implementation.
>> (LLVM does not understand that the packing function-call does not modify
>> the C array).
>>
>> Both can likely be addressed by improving aliasing-modref information,
>> adding relevant annotations or just hacking the code to ignore this
>> issue. Otherwise, we can probably also generate IR that makes it easier
>> for LLVM to perform this register promotion.
>
> I think that this is right. Thank you for the explanation!
>
>> I am not fully sure where you are heading at the moment. Do you have
>> identified a set of individual tasks/changes that you could implement?
>
> If I’m not mistaken, we decided that the pattern matching could be a
> good starting point. That’s why I’m planning to try to implement it in
> the near future. Implementation of tiling, interchanging and unrolling
> of specific loops based on the algorithm presented in [1] could be the
> next step. I could start working on the packing transformation, if the