[llvm] [AMDGPU] Fix GFX11 WMMA intrinsic lowering regression for compute kernels (PR #164036)

Tue Oct 21 23:50:50 PDT 2025

mcgrof wrote:

I'm closing this PR. After feedback, I've root-caused the issue to incorrect operand sizes in the calling code, not LLVM.

I found that the "Cannot select intrinsic" errors were caused by passing incorrectly sized operands to the WMMA intrinsics, not by a lowering issue in LLVM. I was passing 8-element vectors for the A/B operands, but the llvm.amdgcn.wmma.f32.16x16x16.f16 intrinsic requires 16-element vectors (<16 x half>). LLVM was correctly rejecting the call because the operand types didn't match the intrinsic signature. After fixing my fragment loaders to distribute the 16×16 input matrices across wave lanes with the correct per-lane fragment sizes, the issue is resolved:

  - A/B: 16 fp16 elements per lane (matching <16 x half> intrinsic signature)
  - C/D: 8 fp32 elements per lane for wave32 mode (matching <8 x float>)
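
For anyone hitting the same error, here is a minimal host-side C++ sketch of the per-lane sizes listed above. It is not the code from my loaders. All names (`kTile`, `operands_match_signature`, `acc_row`, and so on) are hypothetical. The lane-to-element mapping is my assumption of the GFX11 wave32 layout (A/B replicated across the two half-waves, C/D rows interleaved between them) and should be checked against AMD's ISA documentation before relying on it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Shape and wave parameters for the wave32, f32 <- f16, 16x16x16 case.
constexpr std::size_t kTile = 16;                 // M = N = K = 16
constexpr std::size_t kWaveSize = 32;             // wave32
constexpr std::size_t kHalfWave = kWaveSize / 2;  // 16 lanes per half-wave

// A/B: each lane holds one full row (A) or column (B): 16 fp16 values.
constexpr std::size_t kInputPerLane = kTile;                    // <16 x half>
// C/D: 256 fp32 accumulators spread evenly over 32 lanes: 8 each.
constexpr std::size_t kAccPerLane = kTile * kTile / kWaveSize;  // <8 x float>

// Hypothetical check mirroring the type match the intrinsic signature requires.
constexpr bool operands_match_signature(std::size_t a_len, std::size_t b_len,
                                        std::size_t c_len) {
    return a_len == kInputPerLane && b_len == kInputPerLane &&
           c_len == kAccPerLane;
}

// Assumed layout: lanes 16..31 load the same row/column as lanes 0..15.
constexpr std::size_t input_index_for_lane(std::size_t lane) {
    return lane % kHalfWave;
}

// Assumed layout: accumulator slot i of a lane holds row 2*i (lower half-wave)
// or row 2*i + 1 (upper half-wave), at column lane % 16.
constexpr std::size_t acc_row(std::size_t lane, std::size_t slot) {
    return 2 * slot + lane / kHalfWave;
}
constexpr std::size_t acc_col(std::size_t lane) { return lane % kHalfWave; }

// Sanity check: under the assumed layout, every element of the 16x16
// C/D tile is owned by exactly one (lane, slot) pair.
bool acc_layout_covers_tile() {
    std::array<int, kTile * kTile> hits{};
    for (std::size_t lane = 0; lane < kWaveSize; ++lane)
        for (std::size_t slot = 0; slot < kAccPerLane; ++slot)
            ++hits[acc_row(lane, slot) * kTile + acc_col(lane)];
    for (int h : hits)
        if (h != 1) return false;
    return true;
}
```

With these definitions, `operands_match_signature(8, 8, 8)` is false, which is the mismatch I originally had, and `operands_match_signature(16, 16, 8)` is true.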

So with correctly sized operands, LLVM's existing patterns work as expected and no changes are needed. The bug was in my codebase, which was passing undersized vectors. Apologies for the noise, and thanks for any review time spent on this!

https://github.com/llvm/llvm-project/pull/164036