[llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics
Simon Moll via llvm-dev
llvm-dev at lists.llvm.org
Mon Apr 8 03:37:41 PDT 2019
Hi,
On 4/5/19 10:47 AM, Simon Pilgrim via llvm-dev wrote:
> On 05/04/2019 09:37, Simon Pilgrim via llvm-dev wrote:
>> On 04/04/2019 14:11, Sander De Smalen wrote:
>>> Proposed change:
>>>
>>> ----------------------------
>>>
>>> In this RFC I propose changing the intrinsics for
>>> llvm.experimental.vector.reduce.fadd and
>>> llvm.experimental.vector.reduce.fmul (see options A and B). I also
>>> propose renaming the 'accumulator' operand to 'start value' because
>>> for fmul this is the start value of the reduction, rather than a
>>> value into which the fmul reduction is accumulated.
>>>
Note that the LLVM-VP proposal also changes the way reductions are
handled in IR (https://reviews.llvm.org/D57504). This could be an
opportunity to avoid the "v2" suffix issue: LLVM-VP moves the intrinsics
to the "llvm.vp.*" namespace, and we can fix the reduction semantics in
the process.
Btw, if you are at EuroLLVM, there is a BoF at 2pm today on LLVM-VP.
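Just to illustrate, a predicated fadd reduction under LLVM-VP might look
something like the sketch below; the intrinsic name and operand order
here are my own guess, not necessarily what D57504 ends up specifying:

  ; operands: start value, data vector, mask, explicit vector length
  declare float @llvm.vp.reduce.fadd.v4f32(float %start_value,
                                           <4 x float> %vec,
                                           <4 x i1> %mask, i32 %evl)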
>>> [Option A] Always using the start value operand in the reduction
>>> (https://reviews.llvm.org/D60261)
>>>
>>> declare float
>>> @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float
>>> %start_value, <4 x float> %vec)
>>>
>>> This means that if the start value is 'undef', the result will be
>>> undef and all code creating such a reduction will need to ensure it
>>> has a sensible start value (e.g. 0.0 for fadd, 1.0 for fmul). When
>>> using 'fast' or 'reassoc' on the call it will be implemented using
>>> an unordered reduction; otherwise it will be implemented with an
>>> ordered reduction. Note that a new intrinsic is required to capture
>>> the new semantics. In this proposal the intrinsic is prefixed with a
>>> 'v2' for the time being, with the expectation this will be dropped
>>> when we remove 'experimental' from the reduction intrinsics in the
>>> future.
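For illustration, with [Option A] a caller would always supply a start
value; this is just a sketch of the proposed form, using the neutral
element 0.0 for fadd as suggested above:

  ; ordered: computes ((((0.0 + a0) + a1) + a2) + a3)
  %ord = call float
      @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float 0.0,
                                                         <4 x float> %vec)
  ; with 'reassoc' (or 'fast') the same call may be lowered as an
  ; unordered (tree) reduction
  %fast = call reassoc float
      @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float 0.0,
                                                         <4 x float> %vec)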
>>>
>>> [Option B] Having separate ordered and unordered intrinsics
>>> (https://reviews.llvm.org/D60262).
>>>
>>> declare float
>>> @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float
>>> %start_value, <4 x float> %vec)
>>>
>>> declare float
>>> @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x
>>> float> %vec)
>>>
>>> This will mean that the behaviour is explicit from the intrinsic and
>>> the use of 'fast' or 'reassoc' on the call has no effect on how that
>>> intrinsic is lowered. The ordered reduction intrinsic will take a
>>> scalar start-value operand, whereas the unordered reduction intrinsic
>>> will only take a vector operand.
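To make the contrast concrete, under [Option B] the two forms would be
called roughly like this (again just a sketch):

  ; ordered: fast-math flags on the call do not change the lowering
  %ord = call float
      @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(
          float 0.0, <4 x float> %vec)
  ; unordered: no start value, the reduction is free to reassociate
  %fast = call float
      @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(
          <4 x float> %vec)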
>>>
>>> Both options auto-upgrade the IR to use the new (version of the)
>>> intrinsics. I'm personally slightly in favour of [Option B], because
>>> it better aligns with the definition of the SelectionDAG nodes and
>>> is more explicit in its semantics. We also avoid having to use an
>>> artificial 'v2'-like prefix to denote the new behaviour of the
>>> intrinsic.
>>>
>> Do we have any targets with instructions that can actually use the
>> start value? TBH I'd be tempted to suggest we just make the initial
>> extractelement/fadd/insertelement pattern a manual extra stage and
>> avoid having that argument entirely.
>>
NEC SX-Aurora has reduction instructions that take a start value in a
scalar register. We are hoping to upstream the backend:
http://lists.llvm.org/pipermail/llvm-dev/2019-April/131580.html
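For reference, the caller-side pattern Simon describes (folding the
start value into lane 0 and calling a start-value-free intrinsic) might
look roughly like this; the intrinsic signature without the scalar
operand is only illustrative:

  %lane0  = extractelement <4 x float> %vec, i32 0
  %merged = fadd float %start, %lane0
  %vec2   = insertelement <4 x float> %vec, float %merged, i32 0
  %red    = call float
      @llvm.experimental.vector.reduce.fadd.v4f32(<4 x float> %vec2)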
>>
>>> Further efforts:
>>>
>>> ----------------------------
>>>
>>> Here is a non-exhaustive list of items that I think work towards making the
>>> intrinsics non-experimental:
>>>
>>> * Adding SelectionDAG legalization for the _STRICT reduction
>>> SDNodes. After some great work from Nikita in D58015, unordered
>>> reductions are now legalized/expanded in SelectionDAG, so if we
>>> add expansion in SelectionDAG for strict reductions, this would
>>> make the ExpandReductionsPass redundant.
>>> * Better enforcing the constraints of the intrinsics (see
>>> https://reviews.llvm.org/D60260 ).
>>> * I think we'll also want to be able to overload the result
>>> operand based on the vector element type for the intrinsics
>>> that have the constraint that the result type must match the
>>> vector element type, e.g. dropping the redundant 'i32' in:
>>> i32 @llvm.experimental.vector.reduce.and.i32.v4i32(<4 x i32> %a)
>>> => i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32> %a)
>>>
>>> since i32 is implied by <4 x i32>. This would have the added benefit
>>> that LLVM would automatically check for the operands to match.
>>>
>> Won't this cause issues with overflow? Isn't the point of an add (or
>> mul, ...) reduction of, say, <64 x i8> giving a larger (i32 or i64)
>> result so we don't lose anything? I agree for bitop reductions it
>> doesn't make sense though.
>>
> Sorry - I forgot to add: which raises the question - should we be
> considering signed/unsigned add/mul and possibly saturating reductions?
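For what it's worth, with element-typed overloads a widening reduction
could still be expressed by widening the input first, e.g. (a sketch,
assuming the overloaded naming proposed above):

  %wide = zext <64 x i8> %v to <64 x i32>
  %sum  = call i32
      @llvm.experimental.vector.reduce.add.v64i32(<64 x i32> %wide)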
--
Simon Moll
Researcher / PhD Student
Compiler Design Lab (Prof. Hack)
Saarland University, Computer Science
Building E1.3, Room 4.31
Tel. +49 (0)681 302-57521 : moll at cs.uni-saarland.de
Fax. +49 (0)681 302-3065 : http://compilers.cs.uni-saarland.de/people/moll