# [llvm-dev] Determination of statements that contain only matrix multiplication

Tobias Grosser via llvm-dev llvm-dev at lists.llvm.org
Fri May 20 02:17:39 PDT 2016

```>>>> Maybe it could be a temporary solution. I think that if the checks are
>>>> successfully passed and the basic block of the statement has exactly
>>>> 14 instructions, the statement contains matrix multiplication and can
>>>> be safely optimized with a generation of specific code, which takes
>>>> into account information about usage of SIMD registers.
>>>
>> What kind of transformation do you have in mind that cannot be done in
>> a general way? If you want to take register pressure into account, why
>> not also for other stops?
>
> I wasn’t sure whether llvm is able to hoist loads into SIMD registers
> and sink stores from them. However, yesterday I found out that in some
> cases llvm can perform such transformations.
>
> For example, let’s consider the following code:
>
> Loop 1 for ic = 0, ..., m−1 in steps of mc
> Loop 2   for pc = 0, ..., k−1 in steps of kc
>                 A(ic : ic + mc − 1, pc : pc + kc − 1) → Ac // Pack into Ac
> Loop 3     for jc = 0, ..., n−1 in steps of nc
>                   B(pc : pc + kc − 1, jc : jc + nc − 1) → Bc // Pack into Bc
> Loop 4       for ir = 0, ..., mc −1 in steps of mr
> Loop 5         for jr = 0, ..., nc −1 in steps of nr
> Loop 6           for pr = 0, ..., kc −1 in steps of 1
> Loop 7             for ir' = 0, ..., mr −1 in steps of 1
> Loop 8               for jr' = 0, ..., nr −1 in steps of 1
>                             C[ic * mr + i * mc + ir'][jc * nr + j * nc
> + jr'] += Ac[mr * (pr + kc * ic) + ir'] * Bc[(pr + jc * kc) * nr +
> jr'];
>                             endfor
>                           endfor
>                         endfor
>                       endfor
>                     endfor
>                   endfor
>                 endfor
>               endfor
>
> To get closer to an implementation of the algorithm from [1] for
> matrices stored in row-major order, we can unroll loop 7 and loop 8
> and perform vectorization with llvm (a corresponding code can be found
> attached). According to the attached IR, llvm sinks and hoists stores
> and loads related to matrix C.
>
> Nevertheless, llvm can’t do it, if, for example, we call prefetch
> within loop 6

This is likely because LLVM does not understand the side-effects
of the prefetch instruction.

> or apply packing transformations in a way that is
> similar to the one mentioned in [2] (corresponding implementations are
> attached to the email). I haven’t found the reason yet.

I assume this is also due to the unknown side-effects of the packing
implementation.
(LLVM does not understand that the packing function-call does not modify
the C array).

Both can likely be addressed by improving aliasing-modref information,
adding relevant annotations or just hacking the code to ignore this
issue. Otherwise, we can probably also generate IR that makes it easier
for LLVM to perform this register promotion.

I am not fully sure where you are heading at the moment. Do you have
identified a set of individual tasks/changes that you could implement?

Best,
Tobias

```