[llvm] [AArch64] Lower bfloat FADD/SUB to BFMLAL top/bottom instructions (PR #174814)
Benjamin Maxwell via llvm-commits
llvm-commits at lists.llvm.org
Wed Feb 4 09:57:03 PST 2026
================
@@ -83,28 +83,36 @@ define <vscale x 4 x bfloat> @fadd_nxv4bf16(<vscale x 4 x bfloat> %a, <vscale x
}
define <vscale x 8 x bfloat> @fadd_nxv8bf16(<vscale x 8 x bfloat> %a, <vscale x 8 x bfloat> %b) {
-; NOB16B16-LABEL: fadd_nxv8bf16:
-; NOB16B16: // %bb.0:
-; NOB16B16-NEXT: uunpkhi z2.s, z1.h
-; NOB16B16-NEXT: uunpkhi z3.s, z0.h
-; NOB16B16-NEXT: uunpklo z1.s, z1.h
-; NOB16B16-NEXT: uunpklo z0.s, z0.h
-; NOB16B16-NEXT: ptrue p0.s
-; NOB16B16-NEXT: lsl z2.s, z2.s, #16
-; NOB16B16-NEXT: lsl z3.s, z3.s, #16
-; NOB16B16-NEXT: lsl z1.s, z1.s, #16
-; NOB16B16-NEXT: lsl z0.s, z0.s, #16
-; NOB16B16-NEXT: fadd z2.s, z3.s, z2.s
-; NOB16B16-NEXT: fadd z0.s, z0.s, z1.s
-; NOB16B16-NEXT: bfcvt z1.h, p0/m, z2.s
-; NOB16B16-NEXT: bfcvt z0.h, p0/m, z0.s
-; NOB16B16-NEXT: uzp1 z0.h, z0.h, z1.h
-; NOB16B16-NEXT: ret
+; NOB16B16-NONSTREAMING-LABEL: fadd_nxv8bf16:
+; NOB16B16-NONSTREAMING: // %bb.0:
+; NOB16B16-NONSTREAMING-NEXT: movi v2.2d, #0000000000000000
+; NOB16B16-NONSTREAMING-NEXT: fmov z3.h, #1.87500000
+; NOB16B16-NONSTREAMING-NEXT: ptrue p0.s
+; NOB16B16-NONSTREAMING-NEXT: trn1 z4.h, z2.h, z0.h
+; NOB16B16-NONSTREAMING-NEXT: trn2 z2.h, z2.h, z0.h
+; NOB16B16-NONSTREAMING-NEXT: bfmlalb z4.s, z1.h, z3.h
+; NOB16B16-NONSTREAMING-NEXT: bfmlalt z2.s, z1.h, z3.h
+; NOB16B16-NONSTREAMING-NEXT: bfcvt z0.h, p0/m, z4.s
+; NOB16B16-NONSTREAMING-NEXT: bfcvtnt z0.h, p0/m, z2.s
----------------
MacDue wrote:
Maybe? I guess that would be something like (not tested):
```
trn1 z4.h, z2.h, z0.h
trn1 z5.h, z2.h, z1.h
trn2 z6.h, z2.h, z1.h
trn2 z2.h, z2.h, z0.h
fadd z4.s, z4.s, z5.s
fadd z2.s, z2.s, z6.s
bfcvt z0.h, p0/m, z4.s
bfcvtnt z0.h, p0/m, z2.s
```
vs
```
trn1 z4.h, z2.h, z0.h
trn2 z2.h, z2.h, z0.h
bfmlalb z4.s, z1.h, z3.h
bfmlalt z2.s, z1.h, z3.h
bfcvt z0.h, p0/m, z4.s
bfcvtnt z0.h, p0/m, z2.s
```
`llvm-mca` seems to think the latter is slightly cheaper (RThroughput 1.0 vs 1.5), but I don't know how much that can be trusted.
https://github.com/llvm/llvm-project/pull/174814