[llvm-dev] Determination of statements that contain only matrix multiplication

Fri May 20 06:07:01 PDT 2016

2016-05-20 14:17 GMT+05:00 Tobias Grosser <tobias at grosser.es>:
>>>>> Maybe it could be a temporary solution. I think that if the checks are
>>>>> successfully passed and the basic block of the statement has exactly
>>>>> 14 instructions, the statement contains matrix multiplication and can
>>>>> be safely optimized with a generation of specific code, which takes
>>>>> into account information about usage of SIMD registers.
>>>>
>>> What kind of transformation do you have in mind that cannot be done in
>>> a general way? If you want to take register pressure into account, why
>>> not also for other stops?
>>
>> I wasn’t sure whether llvm is able to hoist loads into SIMD registers
>> and sink stores from them. However, yesterday I found out that in some
>> cases llvm can perform such transformations.
>>
>> For example, let’s consider the following code:
>>
>> Loop 1 for ic = 0, ..., m−1 in steps of mc
>> Loop 2   for pc = 0, ..., k−1 in steps of kc
>>                 A(ic : ic + mc − 1, pc : pc + kc − 1) → Ac // Pack into Ac
>> Loop 3     for jc = 0, ..., n−1 in steps of nc
>>                   B(pc : pc + kc − 1, jc : jc + nc − 1) → Bc // Pack into Bc
>> Loop 4       for ir = 0, ..., mc −1 in steps of mr
>> Loop 5         for jr = 0, ..., nc −1 in steps of nr
>> Loop 6           for pr = 0, ..., kc −1 in steps of 1
>> Loop 7             for ir' = 0, ..., mr −1 in steps of 1
>> Loop 8               for jr' = 0, ..., nr −1 in steps of 1
>>                             C[ic * mr + i * mc + ir'][jc * nr + j * nc
>> + jr'] += Ac[mr * (pr + kc * ic) + ir'] * Bc[(pr + jc * kc) * nr +
>> jr'];
>>                             endfor
>>                           endfor
>>                         endfor
>>                       endfor
>>                     endfor
>>                   endfor
>>                 endfor
>>               endfor
>>
>> To get closer to an implementation of the algorithm from [1] for
>> matrices stored in row-major order, we can unroll loop 7 and loop 8
>> and perform vectorization with llvm (a corresponding code can be found
>> attached). According to the attached IR, llvm sinks and hoists stores
>> and loads related to matrix C.
>>
>> Nevertheless, llvm can’t do it, if, for example, we call prefetch
>> within loop 6
>
> This is likely because LLVM does not understand the side-effects
> of the prefetch instruction.
>
>> or apply packing transformations in a way that is
>> similar to the one mentioned in [2] (corresponding implementations are
>> attached to the email). I haven’t found the reason yet.
>
> I assume this is also due to the unknown side-effects of the packing
> implementation.
> (LLVM does not understand that the packing function-call does not modify
> the C array).
>
> Both can likely be addressed by improving aliasing-modref information,
> adding relevant annotations or just hacking the code to ignore this
> issue. Otherwise, we can probably also generate IR that makes it easier
> for LLVM to perform this register promotion.

I think that this is right. Thank you for the explanation!

> I am not fully sure where you are heading at the moment. Do you have
> identified a set of individual tasks/changes that you could implement?

If I’m not mistaken, we decided that the pattern matching could be a
good starting point. That’s why I’m planning to try to implement it in
the near future. Implementation of tiling, interchanging and unrolling
of specific loops based on the algorithm presented in [1] could be the
next step. I could start working on the packing transformation, if the
previous tasks are implemented.

What do you think about it?

Refs:

[1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf

-- 
                                    Cheers, Roman Gareev.