[llvm] [AArch64] Improve code generation of bool vector reduce operations (PR #115713)

Fri Nov 29 10:27:03 PST 2024

================
@@ -20,11 +20,11 @@ define i1 @test_redand_v1i1(<1 x i1> %a) {
 define i1 @test_redand_v2i1(<2 x i1> %a) {
 ; CHECK-LABEL: test_redand_v2i1:
 ; CHECK:       // %bb.0:
+; CHECK-NEXT:    mvn v0.8b, v0.8b
 ; CHECK-NEXT:    shl v0.2s, v0.2s, #31
 ; CHECK-NEXT:    cmlt v0.2s, v0.2s, #0
-; CHECK-NEXT:    uminp v0.2s, v0.2s, v0.2s
-; CHECK-NEXT:    fmov w8, s0
-; CHECK-NEXT:    and w0, w8, #0x1
+; CHECK-NEXT:    fcmp d0, #0.0
----------------
Il-Capitano wrote:

To measure the performance difference, I originally tested a simple loop similar to this:
```cpp
#include <arm_neon.h>
#include <cstddef>

void test(bool *dest, float32x4_t *p, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dest[i] = __builtin_reduce_or(p[i] < 0.0);
    }
}
```
My change typically resulted in a 10-15% improvement on various CPUs.
My original motivating use case was something like this:
```cpp
float32x4_t x = ...;
// ...
x = f(x);
if (__builtin_reduce_or(x < 0.0)) return;
x = g(x);
if (__builtin_reduce_or(x < 0.0)) return;
// ...
```
which should benefit a bit more from this change, since the reduction result is used for control flow.

For the case of `or(reduce_and(x), reduce_and(y))`, you make a good point. Currently LLVM generates [this](https://godbolt.org/z/n5KKvnTrh):
```
        uminv   b0, v0.8b
        uminv   b1, v1.8b
        fmov    w8, s0
        fmov    w9, s1
        orr     w8, w8, w9
        and     w0, w8, #0x1
        ret
```
With my change LLVM generates something like this:
```
        fcmp d0, #0.0
        cset w8, eq
        fcmp d1, #0.0
        csinc w0, w8, wzr, ne
        ret
```
In this example both snippets have a maximum dependency chain length of 4, but if we were to `or` together more `reduce_and` operations, the generated code would get worse after this change, which can be a concern, although I'm not sure how common such a pattern is in real code.
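
To make the dependency argument concrete, here is a rough scalar model of the `fcmp`/`cset`/`csinc` sequence above. This is only a sketch: the helper names (`cset`, `csinc`, `or_of_two`, `or_chain`) are made up for illustration, and the `bool` inputs stand in for the `eq` outcome of each `fcmp dN, #0.0`, which I'm assuming corresponds to that `reduce_and` being true.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// CSET Wd, cond: Wd = cond ? 1 : 0
constexpr std::uint32_t cset(bool cond) { return cond ? 1u : 0u; }

// CSINC Wd, Wn, Wm, cond: Wd = cond ? Wn : Wm + 1
constexpr std::uint32_t csinc(std::uint32_t wn, std::uint32_t wm, bool cond) {
    return cond ? wn : wm + 1u;
}

// Models the four-instruction sequence for or(reduce_and(x), reduce_and(y)).
// d0_is_zero / d1_is_zero are the (assumed) `eq` results of the two fcmps.
constexpr std::uint32_t or_of_two(bool d0_is_zero, bool d1_is_zero) {
    std::uint32_t w8 = cset(d0_is_zero);   // fcmp d0, #0.0 ; cset  w8, eq
    return csinc(w8, 0u, !d1_is_zero);     // fcmp d1, #0.0 ; csinc w0, w8, wzr, ne
}

// Hypothetical extension to more operands: every step after the first needs
// the previous accumulator, so the csinc instructions form one serial chain.
inline std::uint32_t or_chain(std::initializer_list<bool> is_zero) {
    std::uint32_t acc = 0u;
    bool first = true;
    for (bool z : is_zero) {
        acc = first ? cset(z) : csinc(acc, 0u, !z);
        first = false;
    }
    return acc;
}
```

In this model `or_of_two(a, b)` equals `a || b` for all four input combinations, and `or_chain` grows by one dependent `csinc` per extra operand, which is the shape of the regression described above.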

Similar patterns like `and(reduce_and(x), reduce_and(y))` get folded to `reduce_and(and(x, y))` by instcombine, so that shouldn't be an issue.
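
As a quick sanity check of why the `and` case folds while the `or` case does not, here is a small portable sketch over 4-lane boolean masks. The `Mask` type and all helper names are invented for this example and have nothing to do with the actual instcombine implementation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Mask = std::array<bool, 4>;  // stand-in for a <4 x i1> vector

bool reduce_and(const Mask& m) {
    bool r = true;
    for (bool b : m) r = r && b;
    return r;
}

Mask lane_and(const Mask& a, const Mask& b) {
    Mask r{};
    for (std::size_t i = 0; i < r.size(); ++i) r[i] = a[i] && b[i];
    return r;
}

Mask lane_or(const Mask& a, const Mask& b) {
    Mask r{};
    for (std::size_t i = 0; i < r.size(); ++i) r[i] = a[i] || b[i];
    return r;
}

// Builds a mask from the low four bits of `bits`.
Mask from_bits(unsigned bits) {
    Mask r{};
    for (std::size_t i = 0; i < r.size(); ++i) r[i] = ((bits >> i) & 1u) != 0;
    return r;
}

// Exhaustively checks and(reduce_and(x), reduce_and(y)) == reduce_and(and(x, y)).
bool and_fold_holds_everywhere() {
    for (unsigned x = 0; x < 16; ++x)
        for (unsigned y = 0; y < 16; ++y) {
            Mask a = from_bits(x), b = from_bits(y);
            if ((reduce_and(a) && reduce_and(b)) != reduce_and(lane_and(a, b)))
                return false;
        }
    return true;
}

// The analogous rewrite for `or` is not valid: with x = 0b0101 and y = 0b1010,
// neither mask is all-true, yet their lane-wise `or` is.
bool or_fold_has_counterexample() {
    Mask a = from_bits(0b0101), b = from_bits(0b1010);
    return (reduce_and(a) || reduce_and(b)) != reduce_and(lane_or(a, b));
}
```

So the `and` form can always be collapsed into a single reduction, whereas `or(reduce_and(x), reduce_and(y))` has to keep both reductions and combine the scalar results.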

I guess the question is how common the pattern you bring up is in real-world code, and whether the potential regression in that case outweighs the improvements in other cases. I don't really have a good answer for this, though.

https://github.com/llvm/llvm-project/pull/115713