[llvm] [AArch64] Improve code generation of bool vector reduce operations (PR #115713)

Fri Nov 29 08:52:26 PST 2024

Csanád Hajdú <csanad.hajdu at arm.com>
In-Reply-To: <llvm.org/llvm/llvm-project/pull/115713 at github.com>


================
@@ -20,11 +20,11 @@ define i1 @test_redand_v1i1(<1 x i1> %a) {
 define i1 @test_redand_v2i1(<2 x i1> %a) {
 ; CHECK-LABEL: test_redand_v2i1:
 ; CHECK:       // %bb.0:
+; CHECK-NEXT:    mvn v0.8b, v0.8b
 ; CHECK-NEXT:    shl v0.2s, v0.2s, #31
 ; CHECK-NEXT:    cmlt v0.2s, v0.2s, #0
-; CHECK-NEXT:    uminp v0.2s, v0.2s, v0.2s
-; CHECK-NEXT:    fmov w8, s0
-; CHECK-NEXT:    and w0, w8, #0x1
+; CHECK-NEXT:    fcmp d0, #0.0
----------------
david-arm wrote:

I think the lowering is quite clever here, but is there now an issue with serialisation if you have multiple reductions? Suppose your IR looks like this:

```
  %or_result1 = call i1 @llvm.vector.reduce.and.v2i1(<2 x i1> %a)
  %or_result2 = call i1 @llvm.vector.reduce.and.v2i1(<2 x i1> %b)
  %or_result = or i1 %or_result1, %or_result2
  ret i1 %or_result
```

I haven't checked the code with and without this patch, but I imagine previously we could quite happily have interleaved the instructions like this:

```
  uminp v0 ...
  uminp v1 ...
  fmov w8, s0
  fmov w9, s1
  and w0, w8, #0x1
  and w1, w9, #0x1
  orr w0, w0, w1
```

whereas now, because both comparisons write the single condition flags (NZCV) register, we have to serialise:

```
  fcmp d0, #0.0
  cset w0, eq
  fcmp d1, #0.0
  cset w1, eq
  orr w0, w0, w1
...
```

However, I can see how this new version is efficient if the result is then used for control flow:

```
  %or_result = call i1 @llvm.vector.reduce.and.v2i1(<2 x i1> %a)
  br i1 %or_result, ...
```

Do you have any examples showing where this patch helps improve performance?
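
For reference, here is a rough scalar model in C++ of what I understand the new sequence in the diff (`mvn`, `shl #31`, `cmlt #0`, `fcmp #0.0`) to compute for the `<2 x i1>` case. The function names are made up for illustration. The final `fcmp d0, #0.0` is modelled as a plain 64-bit zero test, which simplifies the floating-point compare, so treat this as a sketch rather than a description of the actual lowering:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Reference semantics of llvm.vector.reduce.and on <2 x i1>.
bool reduce_and_reference(const std::array<bool, 2>& lanes) {
    return lanes[0] && lanes[1];
}

// Hypothetical scalar model of the sequence in the diff.
bool reduce_and_zero_compare(const std::array<bool, 2>& lanes) {
    std::array<std::uint32_t, 2> v{};
    for (std::size_t i = 0; i < v.size(); ++i) {
        std::uint32_t lane = lanes[i] ? 1u : 0u; // only bit 0 is meaningful
        lane = ~lane;                            // mvn: invert the lane
        lane <<= 31;                             // shl #31: bit 0 becomes the sign bit
        v[i] = (lane >> 31) ? 0xFFFFFFFFu : 0u;  // cmlt #0: all-ones if negative
    }
    // Assumption: model fcmp d0, #0.0 + "eq" as "all 64 bits are zero".
    static_assert(sizeof(v) == sizeof(std::uint64_t), "two 32-bit lanes");
    std::uint64_t bits = 0;
    std::memcpy(&bits, v.data(), sizeof bits);
    return bits == 0; // zero only if no inverted lane was set, i.e. all lanes true
}

// Exhaustively compare the model against the reference for all four inputs.
void check_models_agree() {
    for (int a = 0; a < 2; ++a) {
        for (int b = 0; b < 2; ++b) {
            const std::array<bool, 2> lanes{a != 0, b != 0};
            assert(reduce_and_zero_compare(lanes) == reduce_and_reference(lanes));
        }
    }
}
```

The point of the model is only that, after inversion, "all lanes true" becomes "whole register is zero", so a single compare against zero produces the result directly in the flags.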

https://github.com/llvm/llvm-project/pull/115713