[llvm-dev] [RFC] Vector Predication
Simon Moll via llvm-dev
llvm-dev at lists.llvm.org
Mon Feb 4 14:04:54 PST 2019
On 2/4/19 10:40 PM, Robin Kruppe wrote:
>
>
> On Mon, 4 Feb 2019 at 22:04, Simon Moll <moll at cs.uni-saarland.de> wrote:
>
> On 2/4/19 9:18 PM, Robin Kruppe wrote:
>>
>>
>> On Mon, 4 Feb 2019 at 18:15, David Greene via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>>
>> Simon Moll <moll at cs.uni-saarland.de> writes:
>>
>> > You are referring to the sub-vector sizes, if I am understanding
>> > correctly. I'd assume that the mask sub-vector length always has to
>> > be either 1 or the same as the data sub-vector length. For example,
>> > this is ok:
>> >
>> > %result = call <scalable 3 x float> @llvm.evl.fsub.nxv3f32(
>> >               <scalable 3 x float> %x, <scalable 3 x float> %y,
>> >               <scalable 1 x i1> %M, i32 %L)
>>
>> What does <scalable 1 x i1> applied to <scalable 3 x float> mean? I
>> would expect a requirement of <scalable 3 x i1>. At least that's how
>> I understood the SVE proposal [1]. The n's in <scalable n x type>
>> have to match.
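>>
>> For illustration, the matching form I would expect (assuming the
>> usual type mangling) is:
>>
>> %result = call <scalable 3 x float> @llvm.evl.fsub.nxv3f32(
>>               <scalable 3 x float> %x, <scalable 3 x float> %y,
>>               <scalable 3 x i1> %M, i32 %L)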
>>
>>
>> I believe the idea is to allow each single mask bit to control
>> multiple consecutive lanes at once, effectively interpreting the
>> vector being operated on as "many short fixed-length vectors,
>> concatenated" rather than a single long vector of scalars. This is a
>> different interpretation of that type than usual, but it's not crazy;
>> e.g., a similar reinterpretation of vector types seems to be the
>> favored approach for adding matrix operations to LLVM IR. Discussing
>> this only for scalable vectors somewhat obscures the point: there is
>> no conceptual reason why one couldn't do the same with fixed-size
>> vectors.
>>
>> In fact, I would recommend against making almost any new feature or
>> intrinsic exclusive to scalable vectors, including this one: there
>> shouldn't be much extra code required to allow and support it, and
>> not doing so makes the IR less orthogonal. For example, if a
>> <scalable 4 x float> fadd with a <scalable 1 x i1> mask works, then a
>> <4 x float> fadd with a <1 x i1> mask, an <8 x float> fadd with a
>> <2 x i1> mask, etc. should also be possible overloads of the same
>> intrinsic.
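>>
>> A minimal sketch of such a fixed-width overload (hypothetical
>> mangling), where each mask bit governs four consecutive lanes:
>>
>> %r = call <8 x float> @llvm.evl.fadd.v8f32(<8 x float> %x,
>>          <8 x float> %y, <2 x i1> %M, i32 %L)
>> ; %M[0] covers lanes 0..3, %M[1] covers lanes 4..7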
> Yep. Doing the same for standard vector IR is on the radar:
> https://reviews.llvm.org/D57504#1380587.
>>
>> So far, so good. A bit odd, when I think about it, but if hardware
>> out there has that capability, maybe this is a good way to encode it
>> in IR (other options might work too, though). The crux, however, is
>> the interaction with the dynamic vector length: is it in terms of the
>> mask? the longer data vector? If the latter, what happens if it isn't
>> divisible by the mask length? There are multiple options and it's not
>> clear to me which one is "the right one", both for architectures with
>> native support (hopefully the one brought up here won't be the only
>> one) and for internal consistency of the IR. If there was an
>> established architecture with this kind of feature where people have
>> gathered lots of practical experience with it, we could use that to
>> inform the decision (just as we have for ordinary predication and
>> dynamic vector length). But I'm not aware of any architecture that
>> does this other than the one Jacob and lkcl are working on, and as
>> far as I know their project is still in the early stages.
>
> The current understanding is that the dynamic vector length
> operates at the granularity of the mask:
> https://reviews.llvm.org/D57504#1381211
>
> I do understand that this is what Jacob proposes based on the
> architecture he works on. However, it is not yet clear to me whether
> that is the most useful option overall, nor whether it is the only
> option that will lead to reasonable codegen for their architecture.
> But let's leave discussion of the details on Phab. I just want to
> highlight one issue that is not specific to Jacob's angle, as it
> relates to the interpretation of scalable vectors more generally:
>
> In unscaled IR types, this means VL masks each scalar result; in
> scaled types, VL masks sub-vectors. E.g., for %L == 1 the following
> call produces a pair of floats as the result:
>
> <scalable 2 x float> evl.fsub(<scalable 2 x float> %x, <scalable 2 x float> %y, <scalable 2 x i1> %M, i32 %L)
>
> As I wrote on Phab mere minutes before you sent this email, I do not
> think this is the right interpretation for any architecture I know
> about (I do not know anything about the things Jacob and Luke are
> working on), nor from the POV of the scalable vector types proposal. A
> scalable vector is not conventionally "a variable-length vector of
> fixed-size vectors"; it is simply an ordinary "flat" vector whose
> length happens to be mostly unknown at compile time. If some
> intrinsics want to interpret it differently, that is fine, but that's
> a property of those specific intrinsics -- similar to how proposed
> matrix intrinsics might interpret a 16-element vector as a 4x4 matrix.
On NEC SX-Aurora the vector length is always interpreted in 64-bit data
chunks. That is one example of a real architecture where the vscaled
interpretation of VL makes sense.
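
For illustration, a sketch under the vscaled reading (same hypothetical
intrinsic naming as above):

    %r = call <scalable 2 x float> @llvm.evl.fsub.nxv2f32(
             <scalable 2 x float> %x, <scalable 2 x float> %y,
             <scalable 2 x i1> %M, i32 3)

Here each pair of floats occupies one 64-bit chunk, so %L == 3 selects
three sub-vectors, i.e. six floats.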
> I agree that we should only consider the tied sub-vector case for
> this first version and keep discussing the unconstrained version.
> It is seductively easy to allow this but impossible to take it back.
>
> ---
>
> The story is different when we talk only(!) about memory accesses
> and having different vector sizes in the operands and the
> transferred type (result type for loads, value operand type for
> stores):
>
> E.g. on AVX, this call could turn into a 64-bit gather operation of
> pairs of floats:
>
> <16 x float> llvm.evl.gather.v16f32(<8 x float*> %Ptr,
>     <8 x i1> %M, i32 8) ; %M = mask, 8 = vlen
>
> Is that IR you'd expect someone to generate (or a backend to consume)
> for this operation? It seems like a rather unnatural or "magical" way
> to represent the intent (load 64b each from 8 pointers), at least with
> the way I'm thinking about it. I'd expect a gather of 8xi64 and a bitcast.
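>
> Roughly (hypothetical mangling), what I have in mind is:
>
>     %Ptr64  = bitcast <8 x float*> %Ptr to <8 x i64*>
>     %wide   = call <8 x i64> @llvm.evl.gather.v8i64(
>                   <8 x i64*> %Ptr64, <8 x i1> %M, i32 8)
>     %result = bitcast <8 x i64> %wide to <16 x float>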
>
> And there is a native 16 x 16 element load (VLD2D) on SX-Aurora,
> which may be represented as:
>
> <scalable 256 x double> llvm.evl.gather.nxv16f64(
>     <scalable 16 x double*> %Ptr, <scalable 16 x i1> %M,
>     i32 16) ; %M = mask, 16 = vlen
>
> In contrast to the above I can't very well say one should write this
> as a gather of i1024, but it also seems like a rather specialized
> instruction (presumably used for blocked processing of matrices?) so I
> can't say that this on its own motivates me to complicate a proposed
> core IR construct.
It actually reduces complexity by shifting it from the address
computation into the instruction. This would cover all three cases:
VLD2D, the <2 x float> gather on AVX, and the <W x float> loads for the
early RISC-V-based architecture that Jacob and lkcl are working on.
However, this is not a top priority and we can leave it out of the
first version.
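
To make the shifted complexity concrete, a minimal fixed-width sketch
(hypothetical intrinsic names and mangling): with sub-vector transfers,
two pointers suffice to load two consecutive floats each, whereas a
flat gather needs every lane address spelled out:

    ; sub-vector transfer: 2 pointers, 2 consecutive floats per pointer
    %v = call <4 x float> @llvm.evl.gather.v4f32.v2p0f32(
             <2 x float*> %Ptr, <2 x i1> %M, i32 2)

    ; flat gather of the same data: expand and interleave the addresses
    %Ptr1   = getelementptr float, <2 x float*> %Ptr, i32 1
    %PtrAll = shufflevector <2 x float*> %Ptr, <2 x float*> %Ptr1,
                            <4 x i32> <i32 0, i32 2, i32 1, i32 3>
    %v2 = call <4 x float> @llvm.evl.gather.v4f32.v4p0f32(
              <4 x float*> %PtrAll, <4 x i1> %M4, i32 4)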
>
> Cheers,
> Robin
>
- Simon
--
Simon Moll
Researcher / PhD Student
Compiler Design Lab (Prof. Hack)
Saarland University, Computer Science
Building E1.3, Room 4.31
Tel. +49 (0)681 302-57521 : moll at cs.uni-saarland.de
Fax. +49 (0)681 302-3065 : http://compilers.cs.uni-saarland.de/people/moll