[llvm] [AMDGPU] Fix vector legalization for bf16 valu ops (PR #158439)

Wed Sep 24 02:35:06 PDT 2025

================
@@ -46780,6 +46925,13 @@ define <4 x bfloat> @v_fma_v4bf16(<4 x bfloat> %a, <4 x bfloat> %b, <4 x bfloat>
 ; GFX11FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_2)
 ; GFX11FAKE16-NEXT:    v_perm_b32 v1, v4, v1, 0x7060302
 ; GFX11FAKE16-NEXT:    s_setpc_b64 s[30:31]
+; GFX1250-LABEL: v_fma_v4bf16:
+; GFX1250:       ; %bb.0:
+; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
+; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    v_pk_fma_bf16 v0, v0, v2, v4
+; GFX1250-NEXT:    v_pk_fma_bf16 v1, v1, v3, v5
+; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
   %op = call <4 x bfloat> @llvm.fma.v4bf16(<4 x bfloat> %a, <4 x bfloat> %b, <4 x bfloat> %c)
   ret <4 x bfloat> %op
 }
----------------
arsenm wrote:

This test doesn't seem to cover the FMA sizes > 4

https://github.com/llvm/llvm-project/pull/158439