[llvm] [NVPTX] support packed f32 instructions for sm_100+ (PR #126337)
Princeton Ferro via llvm-commits
llvm-commits at lists.llvm.org
Tue Jul 22 11:28:40 PDT 2025
Prince781 wrote:
@npanchen I compared the "good" and "bad" PTX, and I notice there are now `mov.b64`s inside your async wgmma pipeline (between `wgmma.mma_async` calls). Register accesses inside the pipeline are disallowed[^1], and ptxas will force `wgmma` to be synchronous.
[^1]: See [PTX ISA for `wgmma.mma_async`](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-instructions-wgmma-mma): "wgmma.fence instruction must be used to fence the register accesses of wgmma.mma_async instruction from their prior accesses. Otherwise, the behavior is undefined."
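For anyone who wants to repeat the comparison on their own PTX dumps, here is a rough sketch of a scanner that lists `mov.b64` instructions sitting between the first and last `wgmma.mma_async` in a PTX text. It is not part of the PR: the function name `find_movs_in_wgmma_pipeline` and its `opcode_prefix` parameter are made up for illustration, and the heuristic assumes one instruction per line and treats the whole input as a single pipeline (it does not track function boundaries or `wgmma.wait_group`).

```python
import re

# Leading guard predicate such as "@%p1 " or "@!%p1 " (stripped before matching opcodes).
_PREDICATE = re.compile(r"^@!?%\w+\s+")


def _opcode_text(line: str) -> str:
    """Return the line with surrounding whitespace and any guard predicate removed."""
    return _PREDICATE.sub("", line.strip())


def find_movs_in_wgmma_pipeline(ptx: str, opcode_prefix: str = "mov.b64"):
    """Hypothetical helper: list (1-based line number, text) for every instruction
    starting with `opcode_prefix` that lies strictly between the first and the
    last `wgmma.mma_async` line of `ptx`.

    Heuristic only: assumes one instruction per line and a single pipeline.
    """
    lines = ptx.splitlines()
    wgmma_idx = [
        i for i, line in enumerate(lines)
        if _opcode_text(line).startswith("wgmma.mma_async")
    ]
    if len(wgmma_idx) < 2:
        return []  # nothing can sit "between" fewer than two calls
    first, last = wgmma_idx[0], wgmma_idx[-1]
    return [
        (i + 1, lines[i].strip())
        for i in range(first + 1, last)
        if _opcode_text(lines[i]).startswith(opcode_prefix)
    ]
```

Running it over the "good" and "bad" PTX separately and diffing the two result lists should show whether extra moves ended up inside the pipeline. A hit is only a hint that ptxas may serialize the `wgmma` operations, not proof.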
https://github.com/llvm/llvm-project/pull/126337