[llvm] [AArch64] Improve code generation of bool vector reduce operations (PR #115713)
David Sherwood via llvm-commits
llvm-commits at lists.llvm.org
Fri Nov 29 08:52:26 PST 2024
Csanád Hajdú <csanad.hajdu at arm.com>
In-Reply-To: <llvm.org/llvm/llvm-project/pull/115713 at github.com>
================
@@ -20,11 +20,11 @@ define i1 @test_redand_v1i1(<1 x i1> %a) {
define i1 @test_redand_v2i1(<2 x i1> %a) {
; CHECK-LABEL: test_redand_v2i1:
; CHECK: // %bb.0:
+; CHECK-NEXT: mvn v0.8b, v0.8b
; CHECK-NEXT: shl v0.2s, v0.2s, #31
; CHECK-NEXT: cmlt v0.2s, v0.2s, #0
-; CHECK-NEXT: uminp v0.2s, v0.2s, v0.2s
-; CHECK-NEXT: fmov w8, s0
-; CHECK-NEXT: and w0, w8, #0x1
+; CHECK-NEXT: fcmp d0, #0.0
----------------
david-arm wrote:
I think the lowering is quite clever here, but is there now an issue with serialisation if you have multiple reductions? Suppose your IR looks like this:
```
%or_result1 = call i1 @llvm.vector.reduce.and.v2i1(<2 x i1> %a)
%or_result2 = call i1 @llvm.vector.reduce.and.v2i1(<2 x i1> %b)
%or_result = or i1 %or_result1, %or_result2
ret i1 %or_result
```
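To try this out, the fragment can be wrapped in a self-contained function. This is only a sketch: the function name `@two_reductions` and the value names are made up for illustration and are not part of the patch or its tests.

```llvm
; Hypothetical stand-alone test case; the function name is illustrative only.
; Two independent i1 AND-reductions whose results are combined with a scalar
; 'or', so neither result feeds a branch directly.
define i1 @two_reductions(<2 x i1> %a, <2 x i1> %b) {
  %r1 = call i1 @llvm.vector.reduce.and.v2i1(<2 x i1> %a)
  %r2 = call i1 @llvm.vector.reduce.and.v2i1(<2 x i1> %b)
  %r = or i1 %r1, %r2
  ret i1 %r
}

declare i1 @llvm.vector.reduce.and.v2i1(<2 x i1>)
```

Running something like `llc -mtriple=aarch64 -o - two_reductions.ll` (the file name is arbitrary) with and without the patch should show whether the two reductions still interleave.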
I haven't checked the code with and without this patch, but I imagine previously we could quite happily have interleaved the instructions like this:
```
uminp v0 ...
uminp v1 ...
fmov w8, s0
fmov w9, s1
and w0, w8, #0x1
and w1, w9, #0x1
orr w0, w0, w1
```
whereas now, because there is only a single condition flags register (NZCV), we have to serialise:
```
fcmp d0, #0.0
cset w0, eq
fcmp d1, #0.0
cset w1, eq
orr w0, w0, w1
...
```
However, I can see how this new version is efficient if the result is then used for control flow:
```
%or_result = call i1 @llvm.vector.reduce.and.v2i1(<2 x i1> %a)
br i1 %or_result, ...
```
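A complete version of the control-flow case might look like the sketch below. The function name, block labels, and the `@on_all_true` / `@on_some_false` callees are hypothetical, added only so that the branch has two distinct targets.

```llvm
; Hypothetical stand-alone test case; all names here are illustrative only.
; The reduction result is consumed directly by a conditional branch.
define void @branch_on_reduction(<2 x i1> %a) {
entry:
  %all = call i1 @llvm.vector.reduce.and.v2i1(<2 x i1> %a)
  br i1 %all, label %if.then, label %if.else

if.then:
  call void @on_all_true()
  ret void

if.else:
  call void @on_some_false()
  ret void
}

declare i1 @llvm.vector.reduce.and.v2i1(<2 x i1>)
declare void @on_all_true()
declare void @on_some_false()
```

With this shape the comparison result can in principle feed a conditional branch directly, without being materialised in a general-purpose register first; I have not checked the actual output.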
Do you have any examples showing where this patch helps improve performance?
https://github.com/llvm/llvm-project/pull/115713