[llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

Simon Pilgrim via llvm-dev llvm-dev at lists.llvm.org
Fri Apr 5 01:47:43 PDT 2019


On 05/04/2019 09:37, Simon Pilgrim via llvm-dev wrote:
> On 04/04/2019 14:11, Sander De Smalen wrote:
>> Proposed change:
>>
>> ----------------------------
>>
>> In this RFC I propose changing the intrinsics for 
>> llvm.experimental.vector.reduce.fadd and 
>> llvm.experimental.vector.reduce.fmul (see options A and B). I also 
>> propose renaming the 'accumulator' operand to 'start value' because 
>> for fmul this is the start value of the reduction, rather than a 
>> value into which the fmul reduction is accumulated.
>>
>> [Option A] Always using the start value operand in the reduction 
>> (https://reviews.llvm.org/D60261)
>>
>>   declare float 
>> @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float 
>> %start_value, <4 x float> %vec)
>>
>> This means that if the start value is 'undef', the result will be 
>> undef and all code creating such a reduction will need to ensure it 
>> has a sensible start value (e.g. 0.0 for fadd, 1.0 for fmul). When 
>> using 'fast' or 'reassoc' on the call it will be implemented using an 
>> unordered reduction, otherwise it will be implemented with an ordered 
>> reduction. Note that a new intrinsic is required to capture the new 
>> semantics. In this proposal the intrinsic is prefixed with a 'v2' for 
>> the time being, with the expectation this will be dropped when we 
>> remove 'experimental' from the reduction intrinsics in the future.
>>
>> [Option B] Having separate ordered and unordered intrinsics 
>> (https://reviews.llvm.org/D60262).
>>
>>   declare float 
>> @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float 
>> %start_value, <4 x float> %vec)
>>
>>   declare float 
>> @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float> 
>> %vec)
>>
>> This will mean that the behaviour is explicit from the intrinsic and 
>> the use of 'fast' or 'reassoc' on the call has no effect on how that 
>> intrinsic is lowered. The ordered reduction intrinsic will take a 
>> scalar start-value operand, where the unordered reduction intrinsic 
>> will only take a vector operand.
>>
>> Both options auto-upgrade the IR to use the new (version of the) 
>> intrinsics. I'm personally slightly in favour of [Option B], because 
>> it better aligns with the definition of the SelectionDAG nodes and is 
>> more explicit in its semantics. We also avoid having to use an 
>> artificial 'v2' like prefix to denote the new behaviour of the intrinsic.
>>
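To make the ordered/unordered distinction in the two options concrete, here is a small Python model (not LLVM code). The function names are made up for illustration, the pairwise tree is just one of the evaluation orders that reassociation would permit, and Python floats are doubles rather than f32, so the constants are picked for double precision.

```python
def reduce_ordered(start, vec):
    """Strict left-to-right reduction: ((start + v0) + v1) + ..."""
    acc = start
    for x in vec:
        acc = acc + x
    return acc


def reduce_unordered(vec):
    """One possible reassociated order: a pairwise tree, no start value.

    Assumes len(vec) is a power of two, as for a <4 x float> operand.
    """
    vals = list(vec)
    while len(vals) > 1:
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0]


# 1e16 + 1.0 is not representable in a double and rounds back to 1e16,
# so the two evaluation orders disagree on this input.
VEC = [1e16, 1.0, -1e16, 1.0]

ordered_result = reduce_ordered(0.0, VEC)   # keeps the last 1.0
unordered_result = reduce_unordered(VEC)    # both 1.0 terms are absorbed
```

The ordered form gives 1.0 and the tree form gives 0.0 here, which is the kind of difference the 'fast'/'reassoc' flag (Option A) or the intrinsic name (Option B) would be selecting between.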
> Do we have any targets with instructions that can actually use the 
> start value? TBH I'd be tempted to suggest we just make the initial 
> extractelement/fadd/insertelement pattern a manual extra stage and 
> avoid having that argument entirely.
>
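As a rough illustration of the manual extra stage, here is a Python model (not LLVM code) of one way to read it: add the start value into lane 0 first, then run an ordered reduction that takes no start operand. All function names are hypothetical, and Python floats are doubles rather than f32.

```python
def reduce_ordered_with_start(start, vec):
    """Ordered reduction with a start operand: ((start + v0) + v1) + ..."""
    acc = start
    for x in vec:
        acc = acc + x
    return acc


def fold_start_into_lane0(start, vec):
    """Models extractelement lane 0 / fadd start / insertelement lane 0."""
    out = list(vec)
    out[0] = start + out[0]
    return out


def reduce_ordered_no_start(vec):
    """Ordered reduction without a start operand: (v0 + v1) + v2 + ..."""
    acc = vec[0]
    for x in vec[1:]:
        acc = acc + x
    return acc


START = 0.5
LANES = [1e16, 1.0, -1e16, 1.0]

with_operand = reduce_ordered_with_start(START, LANES)
with_extra_stage = reduce_ordered_no_start(fold_start_into_lane0(START, LANES))
```

Both paths perform the same additions in the same order, so in this model they agree exactly, rounding included. Whether the same holds once a target lowers the reduction is a separate question that this sketch does not address.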
>> Further efforts:
>>
>> ----------------------------
>>
>> Here is a non-exhaustive list of items that I think work towards making 
>> the intrinsics non-experimental:
>>
>>   * Adding SelectionDAG legalization for the _STRICT reduction
>>     SDNodes. After some great work from Nikita in D58015, unordered
>>     reductions are now legalized/expanded in SelectionDAG, so if we
>>     add expansion in SelectionDAG for strict reductions this would
>>     make the ExpandReductionsPass redundant.
>>   * Better enforcing the constraints of the intrinsics (see
>>     https://reviews.llvm.org/D60260).
>>   * I think we'll also want to be able to overload the result operand
>>     based on the vector element type for the intrinsics having the
>>     constraint that the result type must match the vector element
>>     type. e.g. dropping the redundant 'i32' in:
>>     i32 @llvm.experimental.vector.reduce.and.i32.v4i32(<4 x i32> %a)
>>     => i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32> %a)
>>
>> since i32 is implied by <4 x i32>. This would have the added benefit 
>> that LLVM would automatically check that the operands match.
>>
> Won't this cause issues with overflow? Isn't the point of an add (or 
> mul...) reduction of, say, <64 x i8> to give a larger (i32 or i64) 
> result so we don't lose anything? I agree that for bitop reductions it 
> doesn't make sense, though.
>
Sorry - I forgot to add: this raises the question of whether we should be 
considering signed/unsigned add/mul and possibly saturating reductions.
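
For a concrete feel of the overflow concern, here is a small Python model (not LLVM code) of an add reduction over 64 i8 lanes. The helper names are hypothetical, the lane value is picked purely for illustration, and saturation is not modelled.

```python
def wrap_unsigned(value, bits):
    """Keep only the low `bits` bits, like integer wraparound."""
    return value & ((1 << bits) - 1)


def to_signed(value, bits):
    """Reinterpret the low `bits` bits as a two's complement integer."""
    value = wrap_unsigned(value, bits)
    return value - (1 << bits) if value >= 1 << (bits - 1) else value


def add_reduce_i8(lanes):
    """Add reduction whose result type matches the i8 element type."""
    acc = 0
    for x in lanes:
        acc = wrap_unsigned(acc + x, 8)
    return acc


def add_reduce_zext(lanes):
    """Add reduction after widening each lane as unsigned."""
    return sum(wrap_unsigned(x, 8) for x in lanes)


def add_reduce_sext(lanes):
    """Add reduction after widening each lane as signed."""
    return sum(to_signed(x, 8) for x in lanes)


# 64 lanes holding the bit pattern 0xC8: 200 as unsigned, -56 as signed.
LANES_I8 = [0xC8] * 64

narrow = add_reduce_i8(LANES_I8)             # 0
wide_unsigned = add_reduce_zext(LANES_I8)    # 12800
wide_signed = add_reduce_sext(LANES_I8)      # -3584
```

The i8 result wraps to 0, while the widened results differ depending on whether the lanes are treated as signed or unsigned, which is why the signedness would need to be part of the reduction if the result is wider than the element type.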