[llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

Thu May 16 12:56:41 PDT 2019

I am fine with either option, really.

I think we do want to remove the result type restriction.

I think I'm fine with removing experimental but do we need more time
with the new versions to be sure?

                           -David

Sander De Smalen <Sander.DeSmalen at arm.com> writes:

> Hello again,
>
> I've been meaning to follow up on this thread for the last couple of weeks, my apologies for the delay.
>
> To summarise the feedback on the proposal for vector.reduce.fadd/fmul:
>
> There seems to be consensus to keep the explicit start value to better
> accommodate chained reductions (as opposed to generating IR that
> performs the reduction of the first element using extract/fadd/insert
> pattern). An important use-case for these reductions is to work inside
> vectorized loops, where chaining happens through the reduction value's
> PHI node (i.e. the scalar reduction value from one iteration will be
> the input to the next iteration). This intrinsic would also naturally
> match reduction instructions of ARM SVE and NEC SX-aurora.
>
> For Option A (https://reviews.llvm.org/D60261), there is an argument
> that code creating or operating on these intrinsics can treat ordered
> and unordered reductions the same (in that they have the same
> arguments). Fast-math flags determine whether or not the intrinsic
> needs to be evaluated in strict order. Codegen for non-strict
> reductions should be able to fold away the identity-value.
>
> For Option B (https://reviews.llvm.org/D60262), David made the
> argument that making the reduction-order explicit (as opposed to
> deducing this from fast-math flags) would ensure the ordering is
> always as expected, even when FMF on the call sites are dropped for
> some reason. 
>
>
> Is it correct that I sensed a slight preference for Option A?
> i.e. Renaming the intrinsics and keeping the same signature, but
> dropping the special-cased behaviour for the identity-value with
> non-strict reductions. For David's argument, I think that although the
> extra expressiveness would be nice to have, LLVM normally depends on
> the FMF being propagated correctly to produce faster code so this
> should also be sufficient for reductions.
>
> If we go for Option A, I suggest we drop the 'experimental' prefix
> from experimental.vector.reduce.fadd/fmul to avoid having to add an
> awkward '.v2' suffix to the new intrinsic. When we implement all the
> suggestions from this proposal (possibly including the one mentioned
> below), I wouldn't really know what other features we could add other
> than predication (which would be covered by the LLVM-VP proposal and
> thus require another renaming), or possibly adding 'constrained'
> variants which I assume would have separate intrinsics. So we might as
> well drop the 'experimental' prefix.
>
> Finally, do we want to remove the restriction that the result type
> must always match the vector-element type? A wider result type would
> then allow the reduction to be performed in the wider type.
>
> Thanks,
>
> Sander
>
>> On 10 Apr 2019, at 18:56, Amara Emerson <aemerson at apple.com> wrote:
>> 
>> I’m fine with the direction this is going, but let’s keep renaming
>> to a minimum. They’ve been experimental long enough now that we
>> should be able to now jump to a final form after all the feedback.
>> 
>> Amara
>> 
>>> On Apr 10, 2019, at 5:59 AM, Sander De Smalen via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>> 
>>>> 
>>>> On 8 Apr 2019, at 11:37, Simon Moll <moll at cs.uni-saarland.de> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> On 4/5/19 10:47 AM, Simon Pilgrim via llvm-dev wrote:
>>>>> On 05/04/2019 09:37, Simon Pilgrim via llvm-dev wrote:
>>>>>> On 04/04/2019 14:11, Sander De Smalen wrote:
>>>>>>> Proposed change:
>>>>>>> ----------------------------
>>>>>>> In this RFC I propose changing the intrinsics for
>>>>>>> llvm.experimental.vector.reduce.fadd and
>>>>>>> llvm.experimental.vector.reduce.fmul (see options A and B). I
>>>>>>> also propose renaming the 'accumulator' operand to 'start
>>>>>>> value' because for fmul this is the start value of the
>>>>>>> reduction, rather than a value to which the fmul reduction is
>>>>>>> accumulated into.
>>>> Note that the LLVM-VP proposal also changes the way reductions are
>>>> handled in IR (https://reviews.llvm.org/D57504). This could be an
>>>> opportunity to avoid the "v2" suffix issue: LLVM-VP moves the
>>>> intrinsic to the "llvm.vp.*" namespace and we can fix the
>>>> reduction semantics in the progress.
>>> Thanks for pointing out Simon. I think for now we should keep this
>>> proposal separate from LLVM-VP as they serve different purposes and
>>> have different scope. But yes we can easily rename the intrinsics
>>> again when the VP proposal lands.
>>> 
>>>> 
>>>> Btw, if you are at EuroLLVM. There is a BoF at 2pm today on LLVM-VP.
>>>> 
>>>>>>> 
>>>>>>> [Option A] Always using the start value operand in the reduction (https://reviews.llvm.org/D60261)>>>>>> 
>>>>>>>  declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>>>>>>> 
>>>>>>> This means that if the start value is 'undef', the result will
>>>>>>> be undef and all code creating such a reduction will need to
>>>>>>> ensure it has a sensible start value (e.g. 0.0 for fadd, 1.0
>>>>>>> for fmul). When using 'fast' or ‘reassoc’ on the call it will
>>>>>>> be implemented using an unordered reduction, otherwise it will
>>>>>>> be implemented with an ordered reduction. Note that a new
>>>>>>> intrinsic is required to capture the new semantics. In this
>>>>>>> proposal the intrinsic is prefixed with a 'v2' for the time
>>>>>>> being, with the expectation this will be dropped when we remove
>>>>>>> 'experimental' from the reduction intrinsics in the future.
>>>>>>> 
>>>>>>> [Option B] Having separate ordered and unordered intrinsics (https://reviews.llvm.org/D60262).>>>>>> 
>>>>>>>  declare float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>>>>>>>  declare float @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float> %vec)
>>>>>>> 
>>>>>>> This will mean that the behaviour is explicit from the
>>>>>>> intrinsic and the use of 'fast' or ‘reassoc’ on the call has no
>>>>>>> effect on how that intrinsic is lowered. The ordered reduction
>>>>>>> intrinsic will take a scalar start-value operand, where the
>>>>>>> unordered reduction intrinsic will only take a vector operand.
>>>>>>> 
>>>>>>> Both options auto-upgrade the IR to use the new (version of
>>>>>>> the) intrinsics. I'm personally slightly in favour of [Option
>>>>>>> B], because it better aligns with the definition of the
>>>>>>> SelectionDAG nodes and is more explicit in its semantics. We
>>>>>>> also avoid having to use an artificial 'v2' like prefix to
>>>>>>> denote the new behaviour of the intrinsic.
>>>>>> Do we have any targets with instructions that can actually use
>>>>>> the start value? TBH I'd be tempted to suggest we just make the
>>>>>> initial extractelement/fadd/insertelement pattern a manual extra
>>>>>> stage and avoid having having that argument entirely.
>>>>>> 
>>>> NEC SX-Aurora has reduction instructions that take in a start
>>>> value in a scalar register. We are hoping to upstream the backend:
>>>> http://lists.llvm.org/pipermail/llvm-dev/2019-April/131580.html>> Great, I think combined with the argument for chaining of ordered
>>> reductions (often inside vectorized loops) and two architectures
>>> (ARM SVE and SX-Aurora) taking a scalar start register, this is
>>> enough of an argument to keep the explicit operand for the ordered
>>> reductions.
>>> 
>>>>>> 
>>>>>>> Further efforts:
>>>>>>> ----------------------------
>>>>>>> Here a non-exhaustive list of items I think work towards making the intrinsics non-experimental:
>>>>>>> 
>>>>>>> 	• Adding SelectionDAG legalization for the _STRICT reduction
>>>>>>> SDNodes. After some great work from Nikita in D58015, unordered
>>>>>>> reductions are now legalized/expanded in SelectionDAG, so if we
>>>>>>> add expansion in SelectionDAG for strict reductions this would
>>>>>>> make the ExpandReductionsPass redundant.
>>>>>>> 	• Better enforcing the constraints of the intrinsics (see https://reviews.llvm.org/D60260 ).
>>>>>>> 
>>>>>>> 	• I think we'll also want to be able to overload the result
>>>>>>> operand based on the vector element type for the intrinsics
>>>>>>> having the constraint that the result type must match the
>>>>>>> vector element type. e.g. dropping the redundant 'i32' in:
>>>>>>>  i32 @llvm.experimental.vector.reduce.and.i32.v4i32(<4 x i32>
>>>>>>> %a) => i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32>
>>>>>>> %a)
>>>>>>> since i32 is implied by <4 x i32>. This would have the added
>>>>>>> benefit that LLVM would automatically check for the operands to
>>>>>>> match.
>>>>>>> 
>>>>>> Won't this cause issues with overflow? Isn't the point of an add
>>>>>> (or mul....) reduction of say, <64 x i8> giving a larger (i32 or
>>>>>> i64) result so we don't lose anything? I agree for bitop
>>>>>> reductions it doesn't make sense though.
>>>>>> 
>>>>> Sorry - I forgot to add: which asks the question - should we be
>>>>> considering signed/unsigned add/mul and possibly saturation
>>>>> reductions?
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> 
>>>>> llvm-dev at lists.llvm.org
>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>>> -- 
>>>> 
>>>> Simon Moll
>>>> Researcher / PhD Student
>>>> 
>>>> Compiler Design Lab (Prof. Hack)
>>>> Saarland University, Computer Science
>>>> Building E1.3, Room 4.31
>>>> 
>>>> Tel. +49 (0)681 302-57521 : 
>>>> moll at cs.uni-saarland.de
>>>> 
>>>> Fax. +49 (0)681 302-3065  : 
>>>> http://compilers.cs.uni-saarland.de/people/moll>> 
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>