[llvm] [AArch64][SVE] Fold zero-extend into add reduction. (PR #102325)

Sander de Smalen via llvm-commits llvm-commits at lists.llvm.org
Tue Aug 20 07:27:10 PDT 2024


sdesmalen-arm wrote:

> Thanks for this. All the added tests look good to me, but could this turn `i64 vecreduce(sext(v4i32))` into SVE `saddv d0, p0, z0.s`, as opposed to NEON `saddlv d0, v0.4s`? Can we get it to not do that one?
> 
> ```
> define i64 @add_v4i32_v4i64_sext(<4 x i32> %x) {
> ; CHECK-LABEL: add_v4i32_v4i64_sext:
> ; CHECK:       // %bb.0: // %entry
> ; CHECK-NEXT:    saddlv d0, v0.4s
> ; CHECK-NEXT:    fmov x0, d0
> ; CHECK-NEXT:    ret
> entry:
>   %xx = sext <4 x i32> %x to <4 x i64>
>   %z = call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> %xx)
>   ret i64 %z
> }
> ```
> 
> It might be worth trying the examples from `bin/llc -mtriple aarch64-none-eabi -mattr=+sve2 ../llvm/test/CodeGen/AArch64/vecreduce-add.ll -o -` and comparing the output to how it was before this patch. There are quite a few examples in that test file now.

Thanks, that's a good point. It's actually quite tricky to come up with a condition under which we always generate the best code, because for NEON the zero-extend can be folded into more instructions than when we fold it into the vecreduce_add (note that this is not specific to NEON, but rather that we've not put in similar work for SVE to use the top/bottom instructions). There's an example in that vecreduce-add.ll file that adds several sums of absolute differences together, where, somewhat by fluke of type legalisation, LLVM is able to fold part of the zero-extend into the `uabd` and part of it into the add, resulting in slightly better code for NEON (see the sketch below).
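
A minimal sketch of that kind of sum-of-absolute-differences pattern, with illustrative types and a made-up function name (the actual tests in vecreduce-add.ll differ):

```llvm
define i64 @sad_v16i8(<16 x i8> %x, <16 x i8> %y) {
entry:
  ; Widen both inputs, take the absolute difference, then reduce.
  ; Type legalisation lets NEON fold parts of the extends into
  ; uabd/uabal-style instructions rather than into the final reduction.
  %xe = zext <16 x i8> %x to <16 x i64>
  %ye = zext <16 x i8> %y to <16 x i64>
  %d  = sub <16 x i64> %xe, %ye
  %a  = call <16 x i64> @llvm.abs.v16i64(<16 x i64> %d, i1 true)
  %z  = call i64 @llvm.vector.reduce.add.v16i64(<16 x i64> %a)
  ret i64 %z
}
```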

This PR now only improves the cases where we are certain to do better (i.e. scalable vectors, streaming mode, or vectors wider than 256 bits); the sketch below shows the scalable case. I may create a follow-up PR to also improve some 128-bit vector cases where folding the extend into the vecreduce_add results in better code than NEON produces.
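
For example, a scalable-vector analogue of the earlier test (a hypothetical illustration, not one of the tests added in the PR) has no NEON alternative, so folding the extend into the reduction is a clear win, e.g. something like a single `uaddv d0, p0, z0.s` instead of first extending to `<vscale x 4 x i64>` (assumed codegen; the exact output depends on legalisation):

```llvm
define i64 @add_nxv4i32_nxv4i64_zext(<vscale x 4 x i32> %x) {
entry:
  %xx = zext <vscale x 4 x i32> %x to <vscale x 4 x i64>
  %z = call i64 @llvm.vector.reduce.add.nxv4i64(<vscale x 4 x i64> %xx)
  ret i64 %z
}
```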

https://github.com/llvm/llvm-project/pull/102325
