[llvm] [NVPTX] support packed f32 instructions for sm_100+ (PR #126337)

Tue Jul 22 11:35:23 PDT 2025

Prince781 wrote:

The problem is that inline ASM does not adequately model the semantics of `wgmma.mma_async`. You have [`extractelement`s between your `wgmma.fence` and the `wgmma.mma_async`](https://gist.github.com/npanchen/f3c5b5b657cb2a04a17c0b5467090383#file-bad-ll-L96-L99), which breaks WGMMA semantics. I think an LLVM IR intrinsic for `wgmma.mma_async` would fix this issue. Is the original source code for this kernel in CUDA C++?

https://github.com/llvm/llvm-project/pull/126337