[llvm-dev] [RFC] Vector Predication
Simon Moll via llvm-dev
llvm-dev at lists.llvm.org
Mon Feb 4 14:04:54 PST 2019
On 2/4/19 10:40 PM, Robin Kruppe wrote:
>
>
> On Mon, 4 Feb 2019 at 22:04, Simon Moll <moll at cs.uni-saarland.de> wrote:
>
> On 2/4/19 9:18 PM, Robin Kruppe wrote:
>>
>>
>> On Mon, 4 Feb 2019 at 18:15, David Greene via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>>
>> Simon Moll <moll at cs.uni-saarland.de> writes:
>>
>> > You are referring to the sub-vector sizes, if I am understanding
>> > correctly. I'd assume that the mask sub-vector length always has to
>> > be either 1 or the same as the data sub-vector length. For example,
>> > this is ok:
>> >
>> > %result = call <scalable 3 x float> @llvm.evl.fsub.nxv3f32(
>> >               <scalable 3 x float> %x, <scalable 3 x float> %y,
>> >               <scalable 1 x i1> %M, i32 %L)
>>
>> What does <scalable 1 x i1> applied to <scalable 3 x float> mean? I
>> would expect a requirement of <scalable 3 x i1>. At least that's how
>> I understood the SVE proposal [1]. The n's in <scalable n x type>
>> have to match.
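>>
>> For illustration, the matching form I would expect (assuming the
>> usual type mangling) is:
>>
>> %result = call <scalable 3 x float> @llvm.evl.fsub.nxv3f32(
>>               <scalable 3 x float> %x, <scalable 3 x float> %y,
>>               <scalable 3 x i1> %M, i32 %L)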
>>
>>
>> I believe the idea is to allow each single mask bit to control
>> multiple consecutive lanes at once, effectively interpreting the
>> vector being operated on as "many short fixed-length vectors,
>> concatenated" rather than a single long vector of scalars. This is a
>> different interpretation of that type than usual, but it's not crazy;
>> e.g., a similar reinterpretation of vector types seems to be the
>> favored approach for adding matrix operations to LLVM IR. Discussing
>> this only for scalable vectors somewhat obscures the point: there is
>> no conceptual reason why one couldn't do the same with fixed-size
>> vectors.
>>
>> In fact, I would recommend against making almost any new feature or
>> intrinsic exclusive to scalable vectors, including this one: there
>> shouldn't be much extra code required to allow and support it, and
>> not doing so makes the IR less orthogonal. For example, if a
>> <scalable 4 x float> fadd with a <scalable 1 x i1> mask works, then a
>> <4 x float> fadd with a <1 x i1> mask, an <8 x float> fadd with a
>> <2 x i1> mask, etc. should also be possible overloads of the same
>> intrinsic.
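>>
>> A minimal sketch of such a fixed-width overload (hypothetical
>> mangling), where each mask bit governs four consecutive lanes:
>>
>> %r = call <8 x float> @llvm.evl.fadd.v8f32(<8 x float> %x,
>>          <8 x float> %y, <2 x i1> %M, i32 %L)
>> ; %M[0] covers lanes 0..3, %M[1] covers lanes 4..7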
> Yep. Doing the same for standard vector IR is on the radar:
> https://reviews.llvm.org/D57504#1380587.
>>
>> So far, so good. A bit odd, when I think about it, but if hardware
>> out there has that capability, maybe this is a good way to encode it
>> in IR (other options might work too, though). The crux, however, is
>> the interaction with the dynamic vector length: is it in terms of the
>> mask? the longer data vector? If the latter, what happens if it isn't
>> divisible by the mask length? There are multiple options and it's not
>> clear to me which one is "the right one", both for architectures with
>> native support (hopefully the one brought up here won't be the only
>> one) and for internal consistency of the IR. If there was an
>> established architecture with this kind of feature where people have
>> gathered lots of practical experience with it, we could use that to
>> inform the decision (just as we have for ordinary predication and
>> dynamic vector length). But I'm not aware of any architecture that
>> does this other than the one Jacob and lkcl are working on, and as
>> far as I know their project is still in the early stages.
>
> The current understanding is that the dynamic vector length
> operates at the granularity of the mask:
> https://reviews.llvm.org/D57504#1381211
>
> I do understand that this is what Jacob proposes based on the
> architecture he works on. However, it is not yet clear to me whether
> that is the most useful option overall, nor whether it is the only
> option that will lead to reasonable codegen for their architecture.
> But let's leave discussion of the details on Phab. I just want to
> highlight one issue that is not specific to Jacob's angle, as it
> relates to the interpretation of scalable vectors more generally:
>
> In unscaled IR types, this means VL masks each scalar result; in
> scaled types, VL masks sub-vectors. E.g., for %L == 1 the following
> call produces a pair of floats as the result:
>
> <scalable 2 x float> evl.fsub(<scalable 2 x float> %x, <scalable 2 x float> %y, <scalable 2 x i1> %M, i32 %L)
>
> As I wrote on Phab mere minutes before you sent this email, I do not
> think this is the right interpretation for any architecture I know
> about (I do not know anything about the things Jacob and Luke are
> working on), nor from the POV of the scalable vector types proposal. A
> scalable vector is not conventionally "a variable-length vector of
> fixed-size vectors"; it is simply an ordinary "flat" vector whose
> length happens to be mostly unknown at compile time. If some
> intrinsics want to interpret it differently, that is fine, but that's
> a property of those specific intrinsics -- similar to how proposed
> matrix intrinsics might interpret a 16-element vector as a 4x4 matrix.
On NEC SX-Aurora the vector length is always interpreted in 64-bit data
chunks. That is one example of a real architecture where the vscaled
interpretation of VL makes sense.
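
For illustration, a sketch under the vscaled reading (same hypothetical
intrinsic naming as above):

    %r = call <scalable 2 x float> @llvm.evl.fsub.nxv2f32(
             <scalable 2 x float> %x, <scalable 2 x float> %y,
             <scalable 2 x i1> %M, i32 3)

Here each pair of floats occupies one 64-bit chunk, so %L == 3 selects
three sub-vectors, i.e. six floats.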
> I agree that we should only consider the tied sub-vector case for
> this first version and keep discussing the unconstrained version.
> It is seductively easy to allow this but impossible to take it back.
>
> ---
>
> The story is different when we talk only(!) about memory accesses
> and having different vector sizes in the operands and the
> transferred type (result type for loads, value operand type for
> stores):
>
> E.g. on AVX, this call could turn into a 64-bit gather operation of
> pairs of floats:
>
> <16 x float> llvm.evl.gather.v16f32(<8 x float*> %Ptr,
>     <8 x i1> %M, i32 8) ; %M = mask, 8 = vlen
>
> Is that IR you'd expect someone to generate (or a backend to consume)
> for this operation? It seems like a rather unnatural or "magical" way
> to represent the intent (load 64b each from 8 pointers), at least with
> the way I'm thinking about it. I'd expect a gather of 8xi64 and a bitcast.
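>
> Roughly (hypothetical mangling), what I have in mind is:
>
>     %Ptr64  = bitcast <8 x float*> %Ptr to <8 x i64*>
>     %wide   = call <8 x i64> @llvm.evl.gather.v8i64(
>                   <8 x i64*> %Ptr64, <8 x i1> %M, i32 8)
>     %result = bitcast <8 x i64> %wide to <16 x float>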
>
> And there is a native 16 x 16 element load (VLD2D) on SX-Aurora,
> which may be represented as:
>
> <scalable 256 x double> llvm.evl.gather.nxv16f64(
>     <scalable 16 x double*> %Ptr, <scalable 16 x i1> %M,
>     i32 16) ; %M = mask, 16 = vlen
>
> In contrast to the above I can't very well say one should write this
> as a gather of i1024, but it also seems like a rather specialized
> instruction (presumably used for blocked processing of matrices?) so I
> can't say that this on its own motivates me to complicate a proposed
> core IR construct.
It actually reduces complexity by shifting it from the address
computation into the instruction. This would cover all three cases:
VLD2D, the <2 x float> gather on AVX, and the <W x float> loads for the
early RISC-V-based architecture that Jacob and lkcl are working on.
However, this is not a top priority and we can leave it out of the
first version.
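
To make the shifted complexity concrete, a minimal fixed-width sketch
(hypothetical intrinsic names and mangling): with sub-vector transfers,
two pointers suffice to load two consecutive floats each, whereas a
flat gather needs every lane address spelled out:

    ; sub-vector transfer: 2 pointers, 2 consecutive floats per pointer
    %v = call <4 x float> @llvm.evl.gather.v4f32.v2p0f32(
             <2 x float*> %Ptr, <2 x i1> %M, i32 2)

    ; flat gather of the same data: expand and interleave the addresses
    %Ptr1   = getelementptr float, <2 x float*> %Ptr, i32 1
    %PtrAll = shufflevector <2 x float*> %Ptr, <2 x float*> %Ptr1,
                            <4 x i32> <i32 0, i32 2, i32 1, i32 3>
    %v2 = call <4 x float> @llvm.evl.gather.v4f32.v4p0f32(
              <4 x float*> %PtrAll, <4 x i1> %M4, i32 4)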
>
> Cheers,
> Robin
>
- Simon
--
Simon Moll
Researcher / PhD Student
Compiler Design Lab (Prof. Hack)
Saarland University, Computer Science
Building E1.3, Room 4.31
Tel. +49 (0)681 302-57521 : moll at cs.uni-saarland.de
Fax. +49 (0)681 302-3065 : http://compilers.cs.uni-saarland.de/people/moll