[llvm] [AArch64][SVE] Fold zero-extend into add reduction. (PR #102325)
David Sherwood via llvm-commits
llvm-commits at lists.llvm.org
Wed Aug 21 03:01:42 PDT 2024
================
@@ -103,17 +103,12 @@ define i32 @add_i32(<vscale x 8 x i32> %a, <vscale x 4 x i32> %b) {
define i16 @add_ext_i16(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) {
; CHECK-LABEL: add_ext_i16:
; CHECK: // %bb.0:
-; CHECK-NEXT: uunpkhi z2.h, z0.b
-; CHECK-NEXT: uunpklo z0.h, z0.b
-; CHECK-NEXT: uunpkhi z3.h, z1.b
-; CHECK-NEXT: uunpklo z1.h, z1.b
-; CHECK-NEXT: ptrue p0.h
-; CHECK-NEXT: add z0.h, z0.h, z2.h
-; CHECK-NEXT: add z1.h, z1.h, z3.h
-; CHECK-NEXT: add z0.h, z0.h, z1.h
-; CHECK-NEXT: uaddv d0, p0, z0.h
-; CHECK-NEXT: fmov x0, d0
-; CHECK-NEXT: // kill: def $w0 killed $w0 killed $x0
+; CHECK-NEXT: ptrue p0.b
+; CHECK-NEXT: uaddv d0, p0, z0.b
+; CHECK-NEXT: uaddv d1, p0, z1.b
+; CHECK-NEXT: fmov w8, s0
----------------
david-arm wrote:
Not for this patch, but I wonder if we can improve this further with:
```
uaddv d0, p0, z0.b
uaddv d1, p0, z1.b
add v0.4s, v0.4s, v1.4s
fmov w0, s0
```
This needs only a single fmov. The NEON add has much higher throughput than fmov, and its latency is about the same.
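As a sanity check on the suggested sequence, here is a minimal Python sketch that models both orderings and compares them with the i16 reduction the test computes. It rests on several assumptions. The helper names (`uaddv_b`, `via_gpr_add`, `via_neon_add`, `reference_i16`) are made up for illustration. The predicate is taken to be all-true. The quoted hunk is cut off after `fmov w8, s0`, so the rest of the current output is assumed here to be a second `fmov` followed by a scalar `add`. The sketch models values only and says nothing about throughput or latency.

```python
MASK16 = (1 << 16) - 1
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def uaddv_b(lanes):
    """Model `uaddv dN, p0, zN.b` with an all-true predicate (assumption):
    zero-extend every byte lane and sum into a 64-bit scalar."""
    assert all(0 <= x <= 0xFF for x in lanes)
    return sum(lanes) & MASK64


def via_gpr_add(a, b):
    """Assumed shape of the current output: move both sums to GPRs, then add."""
    w8 = uaddv_b(a) & MASK32      # fmov w8, s0
    w9 = uaddv_b(b) & MASK32      # fmov w9, s1 (assumed, not visible in the hunk)
    return (w8 + w9) & MASK32     # add  w0, w8, w9 (assumed)


def via_neon_add(a, b):
    """Suggested sequence: add in the vector register file, then one fmov."""
    d0 = uaddv_b(a)
    d1 = uaddv_b(b)
    lane0 = ((d0 & MASK32) + (d1 & MASK32)) & MASK32  # add v0.4s, v0.4s, v1.4s (lane 0)
    return lane0                                      # fmov w0, s0


def reference_i16(a, b):
    """Zero-extend to i16, add lane-wise, then add-reduce with i16 wraparound."""
    total = 0
    for x, y in zip(a, b):
        total = (total + ((x + y) & MASK16)) & MASK16
    return total
```

Only the low 16 bits of `w0` matter for an `i16` return value, so both sequences should be compared against the reference after masking with `MASK16`. Even at the maximum SVE vector length of 2048 bits (256 byte lanes) each `uaddv` sum is at most 256 * 255, which fits comfortably in a 32-bit lane, so the `.4s` add does not lose any bits that the scalar add would keep.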
https://github.com/llvm/llvm-project/pull/102325