<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><br class=""><div><br class=""><blockquote type="cite" class=""><div class="">On Oct 8, 2021, at 17:06, Steve (Numerics) Canon <<a href="mailto:scanon@apple.com" class="">scanon@apple.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><meta http-equiv="Content-Type" content="text/html; charset=utf-8" class=""><div style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><div class=""><blockquote type="cite" class=""><div class="">On Oct 1, 2021, at 8:55 AM, Florian Hahn <<a href="mailto:florian_hahn@apple.com" class="">florian_hahn@apple.com</a>> wrote:</div></blockquote><blockquote type="cite" class=""><br class=""></blockquote><blockquote type="cite" class=""><div class=""><blockquote type="cite" class="" style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; text-decoration: none;"><div class=""><div class="gmail_quote" style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;"><div class="">FWIW, the X86 backend barely uses the pairwise reduction instructions like haddps. They have a suboptimal implementation on most CPUs that makes them not good for reducing over a single register.</div></div></div></blockquote><div style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;" class=""><br class=""></div><div style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;" class="">Yes, thanks for raising that. I realized the original proposal was a bit ambiguous on which pairs exactly are being added. Originally the intention was to use even/odd pairs because this is what multiple architectures provide instructions for. Unfortunately the performance of those instructions varies across hardware implementations, especially on X86 as you mentioned.</div><div style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;" class=""><br class=""></div><div style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;" class="">It seems to me like whatever order we choose, we will leave performance on the table on some architectures/hardware implementations. The goal is to allow users to write high-performance code for platforms they care about, so I think it would be best to allow users to pick the evolution order.  That way, they can make an informed decision and get the best performance on the targets they care about. One way to do that could be to actually expose 3 different versions of the fadd reduction builtin: 1) __builtin_reduce_unordered_fadd, 2) __builtin_reduce_adjecent_pairs_fadd and 3) __builtin_reduce_low_high_pairs_fadd. Alternatively we could add an additional parameter, but that seems less explicit than encoding it in the name.</div></div></blockquote></div><br class=""><div class="">Some notes on the subject of SIMD reductions:</div><div class=""><br class=""></div><div class="">1. It’s desirable to have a third option other than the existing ordered and unordered. Ordered is too slow to be desirable on most implementations, and unordered blocks portability. Either obvious binary tree reduction (even/odd or high/low) is strictly more desirable for explicit SIMD code (which is what these builtins are meant to support). Even when one option is “worse” for a given platform (e.g. HADD on x86) it’s still vastly better than ordered while still allowing us to get the same result everywhere if we need to.</div><div class=""><br class=""></div><div class="">2. even/odd (pairwise reduction) has the slight virtue that it probably maps more nicely to an unknown variable-vector length ISA. This is quite minor, though, because we’re talking about explicit SIMD code.</div><div class=""><br class=""></div><div class="">If it’s worth having a C builtin reduce for explicit SIMD types, I think it’s worth having it use a defined reduction tree. Even/odd is probably the more forward-thinking option, but is somewhat suboptimal on generic x86. Hi/lo is somewhat suboptimal on generic arm64, but not as much so as even/odd on x86.</div><div class=""><br class=""></div><div class="">Doing both at the C level seems slightly silly to me; I would rather just lower to undefined evaluation order when reassoc is set, I think, for people who want “whatever is fastest”.</div><div class=""><br class=""></div><div class=""><br class=""></div></div></div></blockquote><br class=""></div><div><br class=""></div><div>Thanks Steve! </div><div><br class=""></div><div>I tried to write up the proposed builtins in a concise way and put up a patch which hopefully makes it easier to track & incorporate any further updates: <a href="https://reviews.llvm.org/D111529" class="">https://reviews.llvm.org/D111529</a> </div><div><br class=""></div><span class="">For now I went with even/odd pairing for reductions. It should be straight-forward for users to allow reassociation at builtin call sites using ` #pragma clang fp reassociate(on)`, which hopefully provides enough flexibility to keep the set of reduction builtins compact.<br class=""></span><span class=""><br class=""></span><div>So far it sounds like there are no concerns/objections to adding a new set of vector builtins in general, but I am going to add a few additional people explicitly to the thread & patch for extra visibility.</div><div><br class=""></div><div>Cheers,</div><div>Florian</div></body></html>