[cfe-dev] RFC: Add New Set of Vector Math Builtins

Mon Oct 11 03:20:22 PDT 2021

> On Oct 8, 2021, at 17:06, Steve (Numerics) Canon <scanon at apple.com> wrote:
> 
>> On Oct 1, 2021, at 8:55 AM, Florian Hahn <florian_hahn at apple.com> wrote:
>> 
>>> FWIW, the X86 backend barely uses the pairwise reduction instructions like haddps. They have a suboptimal implementation on most CPUs that makes them not good for reducing over a single register.
>> 
>> Yes, thanks for raising that. I realized the original proposal was a bit ambiguous on which pairs exactly are being added. Originally the intention was to use even/odd pairs because this is what multiple architectures provide instructions for. Unfortunately the performance of those instructions varies across hardware implementations, especially on X86 as you mentioned.
>> 
>> It seems to me like whatever order we choose, we will leave performance on the table on some architectures/hardware implementations. The goal is to allow users to write high-performance code for platforms they care about, so I think it would be best to allow users to pick the evaluation order. That way, they can make an informed decision and get the best performance on the targets they care about. One way to do that could be to actually expose 3 different versions of the fadd reduction builtin: 1) __builtin_reduce_unordered_fadd, 2) __builtin_reduce_adjacent_pairs_fadd and 3) __builtin_reduce_low_high_pairs_fadd. Alternatively we could add an additional parameter, but that seems less explicit than encoding it in the name.
> 
> Some notes on the subject of SIMD reductions:
> 
> 1. It’s desirable to have a third option other than the existing ordered and unordered. Ordered is too slow to be desirable on most implementations, and unordered blocks portability. Either obvious binary tree reduction (even/odd or high/low) is strictly more desirable for explicit SIMD code (which is what these builtins are meant to support). Even when one option is “worse” for a given platform (e.g. HADD on x86) it’s still vastly better than ordered while still allowing us to get the same result everywhere if we need to.
> 
> 2. even/odd (pairwise reduction) has the slight virtue that it probably maps more nicely to an unknown variable-vector length ISA. This is quite minor, though, because we’re talking about explicit SIMD code.
> 
> If it’s worth having a C builtin reduce for explicit SIMD types, I think it’s worth having it use a defined reduction tree. Even/odd is probably the more forward-thinking option, but is somewhat suboptimal on generic x86. Hi/lo is somewhat suboptimal on generic arm64, but not as much so as even/odd on x86.
> 
> Doing both at the C level seems slightly silly to me; I would rather just lower to undefined evaluation order when reassoc is set, I think, for people who want “whatever is fastest”.
> 
> 

Thanks Steve! 

I tried to write up the proposed builtins in a concise way and put up a patch which hopefully makes it easier to track & incorporate any further updates: https://reviews.llvm.org/D111529

For now I went with even/odd pairing for reductions. It should be straightforward for users to allow reassociation at builtin call sites using `#pragma clang fp reassociate(on)`, which hopefully provides enough flexibility to keep the set of reduction builtins compact.
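
To make the difference between the orders discussed in this thread concrete, here is a rough scalar model in plain C. It is only a sketch: the function names (reduce_ordered, reduce_even_odd, reduce_low_high, reduce_reassoc) are made up for illustration and are not the builtins from the patch, and the model works on a plain float array rather than on vector types. It assumes the element count is a power of two no larger than 64. The last function shows one way the reassociation pragma could be applied to a block. The pragma is guarded so that other compilers simply skip it.

```c
#include <assert.h>
#include <stddef.h>

/* Ordered: ((v[0] + v[1]) + v[2]) + ... strictly left to right. */
float reduce_ordered(const float *v, size_t n) {
    float acc = v[0];
    for (size_t i = 1; i < n; ++i)
        acc += v[i];
    return acc;
}

/* Even/odd (adjacent pairs): v[0]+v[1], v[2]+v[3], ... and then the
   same step again on the partial sums until one value is left.
   Assumption of this sketch: n is a power of two, 1 <= n <= 64. */
float reduce_even_odd(const float *v, size_t n) {
    float tmp[64];
    assert(n >= 1 && n <= 64 && (n & (n - 1)) == 0);
    for (size_t i = 0; i < n; ++i)
        tmp[i] = v[i];
    for (size_t len = n; len > 1; len /= 2)
        for (size_t i = 0; i < len / 2; ++i)
            tmp[i] = tmp[2 * i] + tmp[2 * i + 1];
    return tmp[0];
}

/* Low/high: add the upper half onto the lower half, then repeat on
   the remaining lower half. Same size assumption as above. */
float reduce_low_high(const float *v, size_t n) {
    float tmp[64];
    assert(n >= 1 && n <= 64 && (n & (n - 1)) == 0);
    for (size_t i = 0; i < n; ++i)
        tmp[i] = v[i];
    for (size_t len = n; len > 1; len /= 2)
        for (size_t i = 0; i < len / 2; ++i)
            tmp[i] = tmp[i] + tmp[i + len / 2];
    return tmp[0];
}

/* "Whatever is fastest": with Clang, the pragma lets the compiler
   reassociate the additions in this block, so the evaluation order
   is no longer defined. Other compilers ignore the guarded pragma
   and evaluate the loop in order. */
float reduce_reassoc(const float *v, size_t n) {
#if defined(__clang__)
#pragma clang fp reassociate(on)
#endif
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += v[i];
    return acc;
}
```

For the input {1e8f, 1.0f, -1e8f, 1.0f}, with plain IEEE single-precision arithmetic, the ordered model returns 1, the even/odd model returns 0 and the low/high model returns 2. Any one fixed tree returns the same value on every target, which is the portability argument for defining the order. The reassociated version promises no particular one of these results.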

So far it sounds like there are no concerns/objections to adding a new set of vector builtins in general, but I am going to add a few additional people explicitly to the thread & patch for extra visibility.

Cheers,
Florian