[llvm-dev] Determination of statements that contain only matrix multiplication
Tobias Grosser via llvm-dev
llvm-dev at lists.llvm.org
Fri May 20 07:11:40 PDT 2016
On 05/20/2016 03:07 PM, Roman Gareev wrote:
> 2016-05-20 14:17 GMT+05:00 Tobias Grosser <tobias at grosser.es>:
>>>>>> Maybe it could be a temporary solution. I think that if the checks are
>>>>>> successfully passed and the basic block of the statement has exactly
>>>>>> 14 instructions, the statement contains matrix multiplication and can
>>>>>> be safely optimized with a generation of specific code, which takes
>>>>>> into account information about usage of SIMD registers.
>>>> What kind of transformation do you have in mind that cannot be done in
>>>> a general way? If you want to take register pressure into account, why
>>>> not also for other stops?
>>> I wasn’t sure whether llvm is able to hoist loads into SIMD registers
>>> and sink stores from them. However, yesterday I found out that in some
>>> cases llvm can perform such transformations.
>>> For example, let’s consider the following code:
>>> Loop 1 for ic = 0, ..., m−1 in steps of mc
>>> Loop 2 for pc = 0, ..., k−1 in steps of kc
>>> A(ic : ic + mc − 1, pc : pc + kc − 1) → Ac // Pack into Ac
>>> Loop 3 for jc = 0, ..., n−1 in steps of nc
>>> B(pc : pc + kc − 1, jc : jc + nc − 1) → Bc // Pack into Bc
>>> Loop 4 for ir = 0, ..., mc −1 in steps of mr
>>> Loop 5 for jr = 0, ..., nc −1 in steps of nr
>>> Loop 6 for pr = 0, ..., kc −1 in steps of 1
>>> Loop 7 for ir' = 0, ..., mr −1 in steps of 1
>>> Loop 8 for jr' = 0, ..., nr −1 in steps of 1
>>> C[ic * mr + i * mc + ir'][jc * nr + j * nc
>>> + jr'] += Ac[mr * (pr + kc * ic) + ir'] * Bc[(pr + jc * kc) * nr +
>>> To get closer to an implementation of the algorithm from  for
>>> matrices stored in row-major order, we can unroll loop 7 and loop 8
>>> and perform vectorization with llvm (a corresponding code can be found
>>> attached). According to the attached IR, llvm sinks and hoists stores
>>> and loads related to matrix C.
>>> Nevertheless, llvm can’t do it, if, for example, we call prefetch
>>> within loop 6
>> This is likely because LLVM does not understand the side-effects
>> of the prefetch instruction.
>>> or apply packing transformations in a way that is
>>> similar to the one mentioned in  (corresponding implementations are
>>> attached to the email). I haven’t found the reason yet.
>> I assume this is also due to the unknown side-effects of the packing
>> (LLVM does not understand that the packing function-call does not modify
>> the C array).
>> Both can likely be addressed by improving aliasing-modref information,
>> adding relevant annotations or just hacking the code to ignore this
>> issue. Otherwise, we can probably also generate IR that makes it easier
>> for LLVM to perform this register promotion.
> I think that this is right. Thank you for the explanation!
>> I am not fully sure where you are heading at the moment. Do you have
>> identified a set of individual tasks/changes that you could implement?
> If I’m not mistaken, we decided that the pattern matching could be a
> good starting point. That’s why I’m planning to try to implement it in
> the near future. Implementation of tiling, interchanging and unrolling
> of specific loops based on the algorithm presented in  could be the
> next step. I could start working on the packing transformation, if the
> previous tasks are implemented.
> What do you think about it?
More information about the llvm-dev