[llvm] [NVPTX] support packed f32 instructions for sm_100+ (PR #126337)
Princeton Ferro via llvm-commits
llvm-commits at lists.llvm.org
Tue Jul 22 11:35:23 PDT 2025
Prince781 wrote:
The problem is that inline ASM does not adequately model the semantics of `wgmma.mma_async`. You have [`extractelement`s between your `wgmma.fence` and the `wgmma.mma_async`](https://gist.github.com/npanchen/f3c5b5b657cb2a04a17c0b5467090383#file-bad-ll-L96-L99), which breaks WGMMA semantics. I think implementing `wgmma.mma_async` as an intrinsic in LLVM IR would fix this issue. Is the original source code for this kernel in CUDA C++?
https://github.com/llvm/llvm-project/pull/126337