[PATCH] D117829: [Clang] Add integer add/mul reduction builtins

Fri Jan 21 03:44:19 PST 2022

fhahn added a comment.

In D117829#3260772 <https://reviews.llvm.org/D117829#3260772>, @RKSimon wrote:

> I'm happy to continue with this just for integers or wait until we have a plan for floats as well. I guess we need to decide if we want to support the starting value in the fadd/fmul intrinsics from the builtin or not? If we don't then adding float support to the add/mul reduction builtins now (or later on) would just involve using default starting values, if we do then we probably need a separate fadd/fmul builtin.

I'm not sure if there's a major benefit of specifying the start value? Unless there is, I think we should not add a start value argument.

> Or we could add starting values to the add/mul reduction builtins as well and we manually insert a scalar post-reduction add/mul instruction in cgbuiltin?
>
> wrt float orders, currently the avx512f reductions attach fmf attributes when the builtin is translated to the intrinsic: https://github.com/llvm/llvm-project/blob/d2012d965d60c3258b3a69d024491698f8aec386/clang/lib/CodeGen/CGBuiltin.cpp#L14070

I am not familiar how exactly `ia32_reduce_fadd_pd512` & co are specified, but it looks like adding `reassoicate` to the intrinsic call there might be incorrect technically, i.e. calming the order does not matter, while I assume the C builtin guarantees a particular order? It might not result in an actual mis-compile on X86, because the X86 backend kind of guarantees that reductions with `reassoicate` are lowered exactly as the C builtin requires (relying on some kind of secret handshake between frontend & backend).

As discussed in  D117480 <https://reviews.llvm.org/D117480> this seems like a major weakness in how `llvm.vector.reduce.fadd` is specified. As long as it is only used for target specific builtins, things might work out fine in most cases, but different targets might lower `reassoicate` reductions differently. Also, things might fall apart if the middle end uses the `reassoicate` property during a transform that changes the reduction order.

> We might be able to get away with just expecting users to handle this with pragmas in code?

If the user allows reassoication via a pragma/flag, we can use the reduction builtin at the moment without problems. The motivation behind specifying a well defined order that can be lowered efficiently by targets is to guarantee portable results by default. This is important for uses cases where the builtins are used in code that's executed on different platforms.

> I was planning to see if that would work for the avx512f fmin/fmax reduction intrinsics so I can use `__builtin_reduce_min/max`.

There shouldn't be any problem with fmin/fmax, as the result should be independent of the evaluation order IIUC.

> I'm mainly interested in `__builtin_reduce_mul` as avx512f requires it - the (few) cases I've see it used have always involved char/short pixel data, extended to int/long before the mul reduction to address the overflow issues.

Sounds reasonable!  As mentioned earlier, it would be good to do that as separate patch, also updating LanguageExtensions.rst.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D117829/new/

https://reviews.llvm.org/D117829