[llvm-dev] [RFC] Matrix support (take 2)

Wed Dec 19 14:07:02 PST 2018

> On Dec 19, 2018, at 1:31 PM, Stephen Canon <scanon at apple.com> wrote:
> 
>> On Dec 19, 2018, at 11:09 AM, Stephen Canon via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>> 
>>> On Dec 18, 2018, at 10:18 PM, Adam Nemet <anemet at apple.com> wrote:
>>> 
>>>> I don’t understand this.  What is the benefit of providing layout info to element-wise operations?  This defeats the goal of having simple lowering and representation: you are encoding an ND vector form into the IR in a really ugly way, and this will cause a proliferation of intrinsics that are redundant with the core ops.
>>> 
>>> We need that information so that, for example, we can lower an operation on a 3-element column into a 2-element vector op plus a scalar op.  This should be beneficial for power consumption: for a 3x3 matrix with a single element of padding per column, you’d operate on only 9 elements rather than 12 (vector ops consume more power than their scalar counterparts).
>>> 
>>> That said we should be able to remove these intrinsics in the long term.  Once we have masking on the core ops in the IR, we should be able to express the same semantics without dedicated intrinsics.
>> 
>> There may be some cases where this holds (maybe with 5x5 or something), but most of the time I would expect to get better power from doing a four-element vector op with one wasted lane than doing two arithmetic ops (plus possibly extracts and inserts, depending on physical layout details).
>> 
>> Explicit masking or arranging for zero in padding lanes seems like a better way forward to me.
>> – Steve
> 
> I spent some time chatting with Adam about this and have a better understanding of his concerns here. It seems to me that if having masking intrinsics is the long-term solution we want, we should do that now (for add and sub) rather than building arbitrary matrix layout info into intrinsics, since a mask has all the information that we actually need.

I think that sounds like a reasonable compromise.  We already have masked load/store intrinsics so adding add and sub just follows that precedent.  If the decision is made to move masking to the core operations, the new intrinsics would just move as well.

So an add->multiply for option B + masking intrinsics would look like this:

  %a = load <12 x float>, <12 x float>* %A, align 16
  %b = load <12 x float>, <12 x float>* %B, align 16
  %c = load <8 x float>, <8 x float>* %C, align 16

  %add = call <12 x float> @llvm.masked.fadd(<12 x float> %a, <12 x float> %b,
                                             ; mask: if false, the element is taken from passthrough
                                             <12 x i1> <i1 true, i1 true, i1 true, i1 false,
                                                        i1 true, i1 true, i1 true, i1 false,
                                                        i1 true, i1 true, i1 true, i1 false >,
                                             ; passthrough:
                                             <12 x float> <float undef, float undef, float undef, float undef,
                                                           float undef, float undef, float undef, float undef,
                                                           float undef, float undef, float undef, float undef >)

  %mul = call <8 x float> @llvm.matrix.multiply(<12 x float> %add, <8 x float> %c,
                                               ;     3 x 3             3 x 2  column-major:
                                                i32 3, i32 3,     i32 3, i32 2,     i1 true)
  store <8 x float> %mul, <8 x float>* %MUL, align 16
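
To make the intended semantics of the IR above concrete, here is a rough reference model in Python. It is only a sketch under assumptions: the function names are hypothetical, `None` stands in for an undef padding lane, and the column stride of 4 is inferred from the vector widths (12 lanes for 3 columns, 8 lanes for 2 columns) rather than passed explicitly as it might be in a real intrinsic.

```python
PAD = None  # hypothetical stand-in for an undef padding lane


def masked_fadd(a, b, mask, passthrough):
    """Lane-wise a + b where mask is true; otherwise take the passthrough lane.

    The sum is only evaluated for enabled lanes, so padding is never read.
    """
    return [x + y if m else p for x, y, m, p in zip(a, b, mask, passthrough)]


def matrix_multiply(lhs, rhs, m, k, n, stride):
    """Column-major (m x k) * (k x n); each column occupies `stride` lanes.

    Assumption: operands and result all use the same padded column stride.
    """
    out = [PAD] * (stride * n)
    for col in range(n):
        for row in range(m):
            out[col * stride + row] = sum(
                lhs[i * stride + row] * rhs[col * stride + i] for i in range(k)
            )
    return out


# 3x3 operands, column-major, one padding lane per column (12 lanes total).
a = [1.0, 2.0, 3.0, PAD, 4.0, 5.0, 6.0, PAD, 7.0, 8.0, 9.0, PAD]
b = [1.0, 1.0, 1.0, PAD, 1.0, 1.0, 1.0, PAD, 1.0, 1.0, 1.0, PAD]
# 3x2 operand (8 lanes) that selects the first two columns of its left operand.
c = [1.0, 0.0, 0.0, PAD, 0.0, 1.0, 0.0, PAD]

mask = [True, True, True, False] * 3
add = masked_fadd(a, b, mask, [PAD] * 12)
mul = matrix_multiply(add, c, m=3, k=3, n=2, stride=4)
```

The point of the model is that the mask alone is enough to keep the padding lanes out of the add, and the multiply only ever indexes the 3 live rows of each column, so no layout information has to be attached to the element-wise operation itself.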
