[PATCH] D116736: [Clang] Add __builtin_reduce_or and __builtin_reduce_and

Thu Jan 13 07:28:58 PST 2022

fhahn added a comment.

@junaire did you already get commit access or should I commit this change on your behalf?

In D116736#3239530 <https://reviews.llvm.org/D116736#3239530>, @junaire wrote:

> In D116736#3230040 <https://reviews.llvm.org/D116736#3230040>, @fhahn wrote:
>
>> LGTM, thanks!
>>
>>> The last __builtin_reduce_add will be separated into another one.
>>
>> Are you planning on putting up a patch for this one as well? What makes add a bit different is that `‘llvm.vector.reduce.fadd.*’` can only perform reductions either in the original order or in an unspecified order. For the extension, we need a particular evaluation order (reduction tree adding adjacent element pairs). Technically this order is required for all reduction builtins, but for integers the order doesn't matter, same for min/max.
>
> Sorry about the late response. Yeah, I'm trying to work on this builtin too, but actually, I don't know if I can do this as all my previous work is like kind of boilerplate or something? I have read the whole discussion in the mailing list and the related LLVM IR reference, but I still get confused a little bit.
>
> So the difference of this builtin is because LLVM intrinsic declare it like:
>
>   declare float @llvm.vector.reduce.fadd.v4f32(float %start_value, <4 x float> %a)
>   declare double @llvm.vector.reduce.fadd.v2f64(double %start_value, <2 x double> %a)
>
> And it performs sequential reduction which is not what we want right? We need it to reduce like:
>
>   [e3, e2, e1, e0] => (e3, e2) + (e1, e0)
>
> Does it mean we should do something like a for loop? or like recursive calls? or like changing the order of the elements in the vector?

One way to go about it would be to extend the  `@llvm.vector.reduce.fadd` to take another integer or boolean argument indicating the order to apply.

Targets that support such horizontal add instructions, like AArch64, can then lower the intrinsic call directly to the right instructions. Otherwise we can generate the right instruction sequence for the reduction. We know the number of vector elements, so there should be no for a loop or recursion, we can just generate instructions for the full tree (extra the lanes using shuffle vector & add them).

> And another thing that confuses me is that `pad identity elements after the last element to widen the vector out to a power 2 `. According to the IR reference, is the neutral value just zero?
>
> The last confusing point is `%start_value`, we just simply consider it is `0`, isn't it?

For fadd reductions it should be -0.0 I think.

> I would appreciate it if you can give me any hints, which I think is very helpful to my LLVM learning :-)

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D116736/new/

https://reviews.llvm.org/D116736