[llvm] [AArch64] Improve code generation of bool vector reduce operations (PR #115713)
Csanád Hajdú via llvm-commits
llvm-commits at lists.llvm.org
Fri Nov 29 10:27:03 PST 2024
================
@@ -20,11 +20,11 @@ define i1 @test_redand_v1i1(<1 x i1> %a) {
define i1 @test_redand_v2i1(<2 x i1> %a) {
; CHECK-LABEL: test_redand_v2i1:
; CHECK: // %bb.0:
+; CHECK-NEXT: mvn v0.8b, v0.8b
; CHECK-NEXT: shl v0.2s, v0.2s, #31
; CHECK-NEXT: cmlt v0.2s, v0.2s, #0
-; CHECK-NEXT: uminp v0.2s, v0.2s, v0.2s
-; CHECK-NEXT: fmov w8, s0
-; CHECK-NEXT: and w0, w8, #0x1
+; CHECK-NEXT: fcmp d0, #0.0
----------------
Il-Capitano wrote:
To measure the performance difference, I originally tested a simple loop similar to this:
```cpp
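#include <arm_neon.h>
#include <cstddef>

// Store, for each vector, whether any of its lanes is negative.
void test(bool *dest, float32x4_t *p, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    dest[i] = __builtin_reduce_or(p[i] < 0.0);
  }
}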
```
My change typically resulted in a 10-15% improvement on various CPUs.
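For reference, the per-element operation in that loop can also be written without NEON types. This is only a portable sketch of the semantics, not the benchmark itself: `Vec4` and `any_negative` are illustrative names I'm making up here, and the scalar version says nothing about the generated AArch64 code.

```cpp
#include <array>

// Hypothetical stand-in for float32x4_t: four float lanes.
using Vec4 = std::array<float, 4>;

// Scalar equivalent of __builtin_reduce_or(v < 0.0):
// true if at least one lane compares less than zero.
bool any_negative(const Vec4 &v) {
  bool result = false;
  for (float lane : v) {
    result = result || (lane < 0.0f);
  }
  return result;
}
```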
My original motivating use case was something like this:
```cpp
float32x4_t x = ...;
// ...
x = f(x);
if (__builtin_reduce_or(x < 0.0)) return;
x = g(x);
if (__builtin_reduce_or(x < 0.0)) return;
// ...
```
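which should benefit a bit more from this change, since the reduction result is used for control flow: the branch can use the flags set by `fcmp` directly instead of first moving the result into a general-purpose register.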
For the case of `or(reduce_and(x), reduce_and(y))`, you make a good point. Currently LLVM generates [this](https://godbolt.org/z/n5KKvnTrh):
```
uminv b0, v0.8b
uminv b1, v1.8b
fmov w8, s0
fmov w9, s1
orr w8, w8, w9
and w0, w8, #0x1
ret
```
With my change LLVM generates something like this:
```
fcmp d0, #0.0
cset w8, eq
fcmp d1, #0.0
csinc w0, w8, wzr, ne
ret
```
In this example both snippets have a maximum dependency chain length of 4, but if we were to `or` together more `reduce_and` operations, the generated code would get worse after this change, which could be a concern. That said, I'm not sure how common such a pattern is in real code.
Similar patterns like `and(reduce_and(x), reduce_and(y))` get folded to `reduce_and(and(x, y))` by instcombine, so that shouldn't be an issue.
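The reason the `and` case folds into a single reduction while the `or` case does not can be checked exhaustively on small bool vectors. The sketch below is plain portable C++ rather than LLVM IR; `Mask`, `reduce_and`, `from_bits` and the two `*_fold_holds` functions are illustrative names, not LLVM APIs, and the 4-lane width is an arbitrary assumption.

```cpp
#include <array>
#include <cstddef>

// Illustrative 4-lane bool vector; not an LLVM type.
constexpr std::size_t kLanes = 4;
using Mask = std::array<bool, kLanes>;

// True only if every lane is true.
bool reduce_and(const Mask &m) {
  bool result = true;
  for (bool lane : m) result = result && lane;
  return result;
}

// Build a mask from the low kLanes bits of `bits`.
Mask from_bits(unsigned bits) {
  Mask m{};
  for (std::size_t i = 0; i < kLanes; ++i) m[i] = ((bits >> i) & 1u) != 0;
  return m;
}

// Checks and(reduce_and(x), reduce_and(y)) == reduce_and(and(x, y))
// for every pair of masks.
bool and_fold_holds() {
  for (unsigned xb = 0; xb < (1u << kLanes); ++xb) {
    for (unsigned yb = 0; yb < (1u << kLanes); ++yb) {
      Mask x = from_bits(xb), y = from_bits(yb), both{};
      for (std::size_t i = 0; i < kLanes; ++i) both[i] = x[i] && y[i];
      if ((reduce_and(x) && reduce_and(y)) != reduce_and(both)) return false;
    }
  }
  return true;
}

// Same check for or(reduce_and(x), reduce_and(y)) == reduce_and(or(x, y)).
bool or_fold_holds() {
  for (unsigned xb = 0; xb < (1u << kLanes); ++xb) {
    for (unsigned yb = 0; yb < (1u << kLanes); ++yb) {
      Mask x = from_bits(xb), y = from_bits(yb), either{};
      for (std::size_t i = 0; i < kLanes; ++i) either[i] = x[i] || y[i];
      if ((reduce_and(x) || reduce_and(y)) != reduce_and(either)) return false;
    }
  }
  return true;
}
```

Here `and_fold_holds()` returns true and `or_fold_holds()` returns false. For example, with x = {1, 0, 1, 0} and y = {0, 1, 0, 1}, `reduce_and(or(x, y))` is true while both individual reductions are false, so the `or` of two `reduce_and` results has to keep two separate reductions.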
I guess the question is how common the pattern you bring up is in real-world code, and whether the potential regression in that case outweighs the improvements in other cases. I don't really have a good answer for this, though.
https://github.com/llvm/llvm-project/pull/115713