[llvm-dev] [RFC] Vector Predication

Jacob Lifshay via llvm-dev llvm-dev at lists.llvm.org
Mon Feb 4 14:26:34 PST 2019


On Mon, Feb 4, 2019, 14:23 Jacob Lifshay <programmerjake at gmail.com> wrote:

> The architecture Luke and I are working on, assuming it goes the way I
> think it will, will have instructions like:
> fmadd.s.vvss rd, rs1, rs2, rs3, len=N*VL, pred=rp
> where N is from 1 to 4
> which has the following pseudo-code:
> constexpr auto f32_per_reg = 2; // two f32 elements per 64-bit FP register
> union FReg
> {
>     double f64[1];
>     float f32[f32_per_reg];
>     _Float16 f16[4];
> };
> union Reg
> {
>     uint64_t i64[1];
>     uint32_t i32[2];
>     uint16_t i16[4];
>     uint8_t i8[8];
> };
> // registers
> FReg fregs[128];
> Reg regs[128];
> uint64_t vl;
> // instruction fields
> int rd, rs1, rs2, rs3, rp, N;
> // main code
> for(uint64_t i = 0; i < vl * N; i++)
> {
>     if(regs[rp].i64[0] & (1ULL << i / N)) // mask bit i/N gates a whole group of N results
>     {
>         auto rv = i / f32_per_reg;        // register offset for the rd/rs1 (vector) operands
>         auto sv = i % f32_per_reg;        // f32 lane within that register
>         auto rs = (i % N) / f32_per_reg;  // register offset for the rs2/rs3 (sub-vector) operands
>         auto ss = (i % N) / f32_per_reg;  // f32 lane within that register (see correction below)
>
should have been:
auto ss = (i % N) % f32_per_reg;

>         // *+ is contracted into fma
>         fregs[rd + rv].f32[sv] = fregs[rs1 + rv].f32[sv] * fregs[rs2 +
> rs].f32[ss] + fregs[rs3 + rs].f32[ss];
>     }
> }
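(to make the intended shape concrete: with the corrected ss, N=2 and VL=3
give 6 loop iterations; mask bit 0 gates result elements 0-1, bit 1 gates
2-3, and bit 2 gates 4-5, while rd/rs1 step through all 6 f32 lanes and the
same 2-element sub-vector at rs2/rs3 is reused for every group of 2)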
>
> So it would be handy for the vector length on the evl intrinsics to be in
> units of the mask length, so we don't have to pattern-match a division in
> the backend. We could have two variants of the vector length argument, one
> in terms of the data vector and one in terms of the mask vector. For the
> architectures that need it, we could legalize the mask-vector variant by
> pulling the multiplication out and switching to the data-vector variant.
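>
> A rough sketch of that legalization (the two intrinsic name suffixes here
> are made up just to distinguish the variants): a mask-granularity call such
> as
>
>     <scalable 8 x float> evl.fadd.masklen(<scalable 8 x float> %x, <scalable 8 x float> %y, <scalable 2 x i1> %M, i32 %L)
>
> where each mask bit covers 4 data elements, could be rewritten for targets
> that count data elements as
>
>     %Ldata = mul i32 %L, 4
>     <scalable 8 x float> evl.fadd.datalen(<scalable 8 x float> %x, <scalable 8 x float> %y, <scalable 2 x i1> %M, i32 %Ldata)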
>
> Jacob
>
> On Mon, Feb 4, 2019, 13:41 Robin Kruppe via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>>
>>
>> On Mon, 4 Feb 2019 at 22:04, Simon Moll <moll at cs.uni-saarland.de> wrote:
>>
>>> On 2/4/19 9:18 PM, Robin Kruppe wrote:
>>>
>>>
>>>
>>> On Mon, 4 Feb 2019 at 18:15, David Greene via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>> Simon Moll <moll at cs.uni-saarland.de> writes:
>>>>
>>>> > You are referring to the sub-vector sizes, if I am understanding
>>>> > correctly. I'd assume that the mask sub-vector length always has to be
>>>> > either 1 or the same as the data sub-vector length. For example, this
>>>> > is ok:
>>>> >
>>>> > %result = call <scalable 3 x float> @llvm.evl.fsub.nxv3f32(<scalable 3 x
>>>> > float> %x, <scalable 3 x float> %y, <scalable 1 x i1> %M, i32 %L)
>>>>
>>>> What does <scalable 1 x i1> applied to <scalable 3 x float> mean?  I
>>>> would expect a requirement of <scalable 3 x i1>.  At least that's how I
>>>> understood the SVE proposal [1].  The n's in <scalable n x type> have to
>>>> match.
>>>>
>>>
>>> I believe the idea is to allow each single mask bit to control multiple
>>> consecutive lanes at once, effectively interpreting the vector being
>>> operated on as "many short fixed-length vectors, concatenated" rather than
>>> a single long vector of scalars. This is a different interpretation of that
>>> type than usual, but it's not crazy; e.g., a similar reinterpretation of
>>> vector types seems to be the favored approach for adding matrix operations
>>> to LLVM IR. It somewhat obscures the point to discuss this only for
>>> scalable vectors, since there's no conceptual reason why one couldn't do the
>>> same with fixed-size vectors.
>>>
>>> In fact, I would recommend against making almost any new feature or
>>> intrinsic exclusive to scalable vectors, including this one: there
>>> shouldn't be much extra code required to allow and support it, and not
>>> doing so makes the IR less orthogonal. For example, if a <scalable 4 x
>>> float> fadd with a <scalable 1 x i1> mask works, then a <4 x float> fadd
>>> with a <1 x i1> mask, an <8 x float> fadd with a <2 x i1> mask, etc. should
>>> also be possible overloads of the same intrinsic.
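>>>
>>> A purely illustrative spelling of one such fixed-width overload, where
>>> each of the two mask bits would control four consecutive lanes:
>>>
>>>     <8 x float> llvm.evl.fadd.v8f32(<8 x float> %x, <8 x float> %y, <2 x i1> %M, i32 %L)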
>>>
>>> Yep. Doing the same for standard vector IR is on the radar:
>>> https://reviews.llvm.org/D57504#1380587.
>>>
>>>
>>> So far, so good. A bit odd, when I think about it, but if hardware out
>>> there has that capability, maybe this is a good way to encode it in IR
>>> (other options might work too, though). The crux, however, is the
>>> interaction with the dynamic vector length: is it in terms of the mask? the
>>> longer data vector? If the latter, what happens if it isn't divisible by
>>> the mask length? There are multiple options, and it's not clear to me which
>>> one is "the right one", both for architectures with native support
>>> (hopefully the one brought up here won't be the only one) and for internal
>>> consistency of the IR. If there were an established architecture with this
>>> kind of feature where people have gathered lots of practical experience
>>> with it, we could use that to inform the decision (just as we have for
>>> ordinary predication and dynamic vector length). But I'm not aware of any
>>> architecture that does this other than the one Jacob and lkcl are working
>>> on, and as far as I know their project is still in the early stages.
>>>
>>> The current understanding is that the dynamic vector length operates at
>>> the granularity of the mask: https://reviews.llvm.org/D57504#1381211
>>>
>> I do understand that this is what Jacob proposes based on the
>> architecture he works on. However, it is not yet clear to me whether that
>> is the most useful option overall, nor that it is the only option that will
>> lead to reasonable codegen for their architecture. But let's leave
>> discussion of the details on Phab. I just want to highlight one issue that
>> is not specific to Jacob's angle, as it relates to the interpretation of
>> scalable vectors more generally:
>>
>>> For unscaled IR types, this means VL masks each scalar result; for scaled
>>> types, VL masks sub-vectors. E.g. for %L == 1 the following call produces a
>>> pair of floats as the result:
>>>
>>>    <scalable 2 x float> evl.fsub(<scalable 2 x float> %x, <scalable 2 x float> %y, <scalable 2 x i1> %M, i32 %L)
>>>
>> As I wrote on Phab mere minutes before you sent this email, I do not
>> think this is the right interpretation for any architecture I know about (I
>> do not know anything about the things Jacob and Luke are working on) nor
>> from the POV of the scalable vector types proposal. A scalable vector is
>> not conventionally "a variable-length vector of fixed-size vectors"; it is
>> simply an ordinary "flat" vector whose length happens to be mostly unknown
>> at compile time. If some intrinsics want to interpret it differently, that
>> is fine, but that's a property of those specific intrinsics -- similar to
>> how proposed matrix intrinsics might interpret a 16 element vector as a 4x4
>> matrix.
>>
>>> I agree that we should only consider the tied sub-vector case for this
>>> first version and keep discussing the unconstrained version. It is
>>> seductively easy to allow this but impossible to take it back.
>>>
>>> ---
>>>
>>> The story is different when we talk only(!) about memory accesses and
>>> having different vector sizes in the operands and the transferred type
>>> (result type for loads, value operand type for stores):
>>>
>>> E.g. on AVX, this call could turn into a 64-bit gather operation of pairs
>>> of floats:
>>>
>>>     <16 x float> llvm.evl.gather.v16f32(<8 x float*> %Ptr, <8 x i1> mask %M, i32 vlen 8)
>>>
>> Is that IR you'd expect someone to generate (or a backend to consume)
>> for this operation? It seems like a rather unnatural or "magical" way to
>> represent the intent (load 64b each from 8 pointers), at least with the way
>> I'm thinking about it. I'd expect a gather of 8xi64 and a bitcast.
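>>
>> Roughly, and keeping the informal syntax used in this thread (the exact
>> gather overload here is only illustrative):
>>
>>     <8 x i64> llvm.evl.gather.v8i64(<8 x i64*> %Ptr, <8 x i1> mask %M, i32 vlen 8)
>>
>> followed by a bitcast of the <8 x i64> result to <16 x float>.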
>>
>>> And there is a native 16 x 16 element load (VLD2D) on SX-Aurora, which
>>> may be represented as:
>>>
>>>     <scalable 256 x double> llvm.evl.gather.nxv16f64(<scalable 16 x double*> %Ptr, <scalable 16 x i1> mask %M, i32 vlen 16)
>>>
>> In contrast to the above I can't very well say one should write this as
>> a gather of i1024, but it also seems like a rather specialized instruction
>> (presumably used for blocked processing of matrices?) so I can't say that
>> this on its own motivates me to complicate a proposed core IR construct.
>>
>> Cheers,
>> Robin
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>