[cfe-dev] RFC: Add New Set of Vector Math Builtins

Florian Hahn via cfe-dev cfe-dev at lists.llvm.org
Fri Oct 1 05:55:17 PDT 2021



> On Sep 29, 2021, at 04:48, Craig Topper <craig.topper at gmail.com> wrote:
> 
> 
> 
> On Tue, Sep 28, 2021 at 2:10 AM Florian Hahn <florian_hahn at apple.com> wrote:
> Hi Craig,
> 
> > On Sep 27, 2021, at 23:54, Craig Topper <craig.topper at gmail.com> wrote:
> > 
> > Hi Florian,
> > 
> > I have a few questions about the reduction builtins.
> > 
> 
> Thanks for taking a look!
> 
> > llvm.reduce.fadd is currently defined as ordered unless the reassociate fast math flag is present. Are you proposing to change that to make it pairwise? 
> > 
> 
> That’s a good point and I forgot to explicitly call this out! The reduction builtin unfortunately cannot express pairwise reductions, and the reassociate flag would be too permissive. An initial lowering in Clang could just generate the pairwise reduction tree directly, but down the road I anticipate improving the reduction builtin to allow expressing pairwise reductions. This would probably be helpful for parts of the middle-end too, which at the moment manually emit pairwise reduction trees (e.g. in the epilogue of vector loops with reductions).
> 
> I didn't think the vectorizers used pairwise reductions. The cost modelling flag for it was removed in https://reviews.llvm.org/D105484
>  

I was referring to the code in fixReduction (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L4427).

It now emits a call to the reduction intrinsic, but the ExpandReductions pass should lower this to a reduction tree that matches the specification.
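For reference, without the reassoc flag llvm.vector.reduce.fadd is specified as a strictly sequential reduction starting from the scalar start value. Roughly, in C (a minimal sketch; the function name is made up):

    float ordered_fadd_v4(float acc, const float v[4]) {
      /* ((((acc + v0) + v1) + v2) + v3), no re-association allowed. */
      for (int i = 0; i < 4; ++i)
        acc += v[i];
      return acc;
    }

Any expansion that matches the specification has to preserve exactly that order.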

> FWIW, the X86 backend barely uses the pairwise reduction instructions like haddps. They have a suboptimal implementation on most CPUs that makes them not good for reducing over a single register.

Yes, thanks for raising that. I realized the original proposal was a bit ambiguous about exactly which pairs are added. The original intention was to use even/odd pairs, because that is what multiple architectures provide instructions for. Unfortunately the performance of those instructions varies across hardware implementations, especially on X86, as you mentioned.

It seems to me that whatever order we choose, we will leave performance on the table on some architectures or hardware implementations. The goal is to allow users to write high-performance code for the platforms they care about, so I think it would be best to let users pick the evaluation order explicitly. That way, they can make an informed decision and get the best performance on the targets that matter to them. One way to do that would be to expose 3 different versions of the fadd reduction builtin: 1) __builtin_reduce_unordered_fadd, 2) __builtin_reduce_adjacent_pairs_fadd and 3) __builtin_reduce_low_high_pairs_fadd. Alternatively we could add an extra parameter, but that seems less explicit than encoding the order in the name.
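To make the two pairwise variants concrete, here is what the orders look like for a 4-wide float vector {v0, v1, v2, v3}, hand-expanded in C. This is a sketch for illustration only; the actual builtins would of course operate on vector types:

    /* Adjacent (even/odd) pairs -- the order instructions like
       haddps produce: (v0 + v1) + (v2 + v3). */
    float reduce_adjacent_pairs_v4(const float v[4]) {
      return (v[0] + v[1]) + (v[2] + v[3]);
    }

    /* Low/high pairs -- add the low half to the high half
       element-wise, then recurse on the narrower vector:
       (v0 + v2) + (v1 + v3). */
    float reduce_low_high_pairs_v4(const float v[4]) {
      return (v[0] + v[2]) + (v[1] + v[3]);
    }

The unordered variant would permit any association, leaving the choice to the optimizer.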

What do you think? Do you have an alternative suggestion in mind?

For the other builtins, the evaluation order should not really matter.

> 
> 
> 
> > llvm.reduce.fmin/fmax change behavior based on the nonans fast math flag. And I think they always imply no signed zeros regardless of whether the fast math flag is present. The vectorizers check the fast math flags before creating the intrinsics today. What are the semantics of the proposed builtin?
> 
> 
> I tried to specify NaN handling in the `Special Values` section. At the moment it says "If exactly one argument is a NaN, return the other argument. If both arguments are NaNs, return a NaN". This should match the NaN handling of both llvm.minnum and libm’s fmin(f). Note that in the original email, the Special Values section still includes a mention of fmax; that reference should be removed.
> 
> The current proposal does not specifically talk about signed zeros, but I am not sure we have to. The proposal defines min/max as returning the smaller/larger value. Since -0 and +0 compare equal, either can be returned. I think this again matches the behavior of libm’s fmin(f) and llvm.minnum, although llvm.minnum’s definition spells this out by stating explicitly what happens when called with equal arguments. Should the proposed definitions also spell that out?
> 
> I just noticed that the ExpandReductions pass uses fcmp+select for expanding llvm.reduce.fmin/fmax with nonans. But SelectionDAG expands it using ISD::FMAXNUM and ISD::FMINNUM. Earlier I had only looked at ExpandReductions and seen the nonans check there, but didn't realize it was using fcmp+select.
>  

It looks like most backends do not directly lower the reduction intrinsics and still rely on the ExpandReductions pass. The AArch64 backend supports lowering the intrinsics directly and produces the expected results. For backends to get the most out of the proposed builtins, I think they need to make sure they lower the reduction intrinsics as well as they can.
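To spell out the scalar difference behind the fcmp+select vs. FMINNUM expansions, per element the two behave like the following C sketch (the function names are made up; fminf is the libm function whose NaN behavior the proposed `Special Values` rules mirror):

    #include <math.h>

    /* fcmp olt + select: any NaN operand makes the compare false, so
       a NaN in `a` is silently dropped while a NaN in `b` propagates.
       Only a valid lowering under nonans. */
    float min_via_select(float a, float b) { return a < b ? a : b; }

    /* llvm.minnum / libm fminf, and what the proposed builtin
       specifies per element: if exactly one argument is a NaN,
       return the other one. */
    float min_via_minnum(float a, float b) { return fminf(a, b); }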

Cheers,
Florian