[llvm] [AArch64] Lower bfloat FADD/SUB to BFMLAL top/bottom instructions (PR #174814)

Benjamin Maxwell via llvm-commits llvm-commits at lists.llvm.org
Wed Feb 4 09:57:03 PST 2026


================
@@ -83,28 +83,36 @@ define <vscale x 4 x bfloat> @fadd_nxv4bf16(<vscale x 4 x bfloat> %a, <vscale x
 }
 
 define <vscale x 8 x bfloat> @fadd_nxv8bf16(<vscale x 8 x bfloat> %a, <vscale x 8 x bfloat> %b) {
-; NOB16B16-LABEL: fadd_nxv8bf16:
-; NOB16B16:       // %bb.0:
-; NOB16B16-NEXT:    uunpkhi z2.s, z1.h
-; NOB16B16-NEXT:    uunpkhi z3.s, z0.h
-; NOB16B16-NEXT:    uunpklo z1.s, z1.h
-; NOB16B16-NEXT:    uunpklo z0.s, z0.h
-; NOB16B16-NEXT:    ptrue p0.s
-; NOB16B16-NEXT:    lsl z2.s, z2.s, #16
-; NOB16B16-NEXT:    lsl z3.s, z3.s, #16
-; NOB16B16-NEXT:    lsl z1.s, z1.s, #16
-; NOB16B16-NEXT:    lsl z0.s, z0.s, #16
-; NOB16B16-NEXT:    fadd z2.s, z3.s, z2.s
-; NOB16B16-NEXT:    fadd z0.s, z0.s, z1.s
-; NOB16B16-NEXT:    bfcvt z1.h, p0/m, z2.s
-; NOB16B16-NEXT:    bfcvt z0.h, p0/m, z0.s
-; NOB16B16-NEXT:    uzp1 z0.h, z0.h, z1.h
-; NOB16B16-NEXT:    ret
+; NOB16B16-NONSTREAMING-LABEL: fadd_nxv8bf16:
+; NOB16B16-NONSTREAMING:       // %bb.0:
+; NOB16B16-NONSTREAMING-NEXT:    movi v2.2d, #0000000000000000
+; NOB16B16-NONSTREAMING-NEXT:    fmov z3.h, #1.87500000
+; NOB16B16-NONSTREAMING-NEXT:    ptrue p0.s
+; NOB16B16-NONSTREAMING-NEXT:    trn1 z4.h, z2.h, z0.h
+; NOB16B16-NONSTREAMING-NEXT:    trn2 z2.h, z2.h, z0.h
+; NOB16B16-NONSTREAMING-NEXT:    bfmlalb z4.s, z1.h, z3.h
+; NOB16B16-NONSTREAMING-NEXT:    bfmlalt z2.s, z1.h, z3.h
+; NOB16B16-NONSTREAMING-NEXT:    bfcvt z0.h, p0/m, z4.s
+; NOB16B16-NONSTREAMING-NEXT:    bfcvtnt z0.h, p0/m, z2.s
----------------
MacDue wrote:

Maybe? I guess that would be something like (not tested):

```
trn1 z4.h, z2.h, z0.h
trn2 z5.h, z2.h, z0.h
trn1 z6.h, z2.h, z1.h
trn2 z7.h, z2.h, z1.h
fadd z4.s, z4.s, z6.s
fadd z5.s, z5.s, z7.s
bfcvt z0.h, p0/m, z4.s
bfcvtnt z0.h, p0/m, z5.s
```
vs
```
trn1 z4.h, z2.h, z0.h
trn2 z2.h, z2.h, z0.h
bfmlalb z4.s, z1.h, z3.h
bfmlalt z2.s, z1.h, z3.h
bfcvt z0.h, p0/m, z4.s
bfcvtnt z0.h, p0/m, z2.s
```

`llvm-mca` seems to think the latter is slightly cheaper (RThroughput 1.0 vs 1.5), but I don't know how much that can be trusted.
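
For reference, the per-lane arithmetic behind the second sequence can be modelled roughly in Python. This is only a sketch, not code from the PR: the helper names are made up, round-to-nearest-even is assumed for `bfcvt`, and NaNs and overflow are ignored. It also shows why the constant is printed as `fmov z3.h, #1.875`: that fp16 immediate has the same bit pattern (0x3F80) as bfloat 1.0, so `bfmlalb`/`bfmlalt` end up computing `a + b * 1.0` in f32.

```python
import struct


def bf16_bits_to_float(bits):
    """Widen a bf16 bit pattern to f32 by placing it in the top 16 bits.

    This mirrors what trn1/trn2 against a zero vector achieve per .s lane.
    """
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def float_to_bf16_bits(x):
    """Round a finite, in-range value to bf16 bits.

    Assumption: round-to-nearest-even; NaN and overflow are not handled.
    """
    u = struct.unpack("<I", struct.pack("<f", x))[0]
    u += 0x7FFF + ((u >> 16) & 1)  # bias so that the truncation below rounds to even
    return (u >> 16) & 0xFFFF


def fp16_bits(x):
    """Bit pattern of x encoded as IEEE half precision."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def fadd_bf16_via_mla(a_bits, b_bits):
    """Model one lane of the BFMLAL-based sequence: acc = a; acc += b * 1.0."""
    one = bf16_bits_to_float(fp16_bits(1.875))  # 0x3F80 read as bf16 is 1.0
    acc = bf16_bits_to_float(a_bits)
    # Round the accumulated result to f32 before narrowing, as the hardware
    # accumulates in f32 lanes.
    acc = struct.unpack("<f", struct.pack("<f", acc + bf16_bits_to_float(b_bits) * one))[0]
    return float_to_bf16_bits(acc)


if __name__ == "__main__":
    print(hex(fp16_bits(1.875)))
    print(hex(fadd_bf16_via_mla(0x3F80, 0x4000)))  # 1.0 + 2.0
```

Since multiplying by 1.0 is exact, the model gives the same result as a plain f32 `fadd` followed by `bfcvt`, which is the property the lowering relies on.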

https://github.com/llvm/llvm-project/pull/174814

