[llvm] [NVPTX] support packed f32 instructions for sm_100+ (PR #126337)

Tue Jul 22 11:28:40 PDT 2025

Prince781 wrote:

@npanchen I compared the "good" and "bad" PTX and noticed that there are now `mov.b64` instructions inside your async wgmma pipeline (between `wgmma.mma_async` calls). Register accesses inside the pipeline are disallowed[^1], so ptxas will force the `wgmma` operations to be synchronous.
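
If it helps with comparing PTX dumps, here is a rough sketch of a scanner for this pattern. It is only a textual heuristic: the function name and the sample PTX (with abbreviated, made-up operands) are hypothetical, and it assumes a pipeline starts at a `wgmma.mma_async` and ends at the next `wgmma.commit_group` or `wgmma.wait_group`. It does not check whether a flagged `mov.b64` actually touches registers used by the wgmma, so treat hits as candidates to inspect by hand.

```python
import re

# Optional predicate guard such as "@%p1 " or "@!%p1 " in front of an instruction.
_PREDICATE = re.compile(r"^@!?%\w+\s+")


def find_movs_in_wgmma_pipeline(ptx: str):
    """Return (line_number, text) for each mov.b64 found after a
    wgmma.mma_async and before the next wgmma.commit_group/wait_group."""
    hits = []
    in_pipeline = False
    for lineno, raw in enumerate(ptx.splitlines(), start=1):
        # Drop trailing // comments and surrounding whitespace.
        line = raw.split("//", 1)[0].strip()
        if not line:
            continue
        instr = _PREDICATE.sub("", line)
        if instr.startswith("wgmma.mma_async"):
            in_pipeline = True
        elif instr.startswith(("wgmma.commit_group", "wgmma.wait_group")):
            in_pipeline = False
        elif in_pipeline and re.match(r"mov\.b64\b", instr):
            hits.append((lineno, line))
    return hits


# Abbreviated, hypothetical PTX: operands are placeholders, not real output.
sample = """
wgmma.fence.sync.aligned;
wgmma.mma_async.sync.aligned.m64n8k16.f32.f16.f16 {ACC}, DESC_A, DESC_B;
mov.b64 %rd1, {%f1, %f2};
wgmma.mma_async.sync.aligned.m64n8k16.f32.f16.f16 {ACC}, DESC_A, DESC_B;
wgmma.commit_group.sync.aligned;
mov.b64 %rd2, {%f3, %f4};
"""

hits = find_movs_in_wgmma_pipeline(sample)
for lineno, text in hits:
    print(f"line {lineno}: {text}")
```

Only the first `mov.b64` in the sample is reported, since the second one comes after `wgmma.commit_group`.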

[^1]: See [PTX ISA for `wgmma.mma_async`](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-instructions-wgmma-mma): "wgmma.fence instruction must be used to fence the register accesses of wgmma.mma_async instruction from their prior accesses. Otherwise, the behavior is undefined."

https://github.com/llvm/llvm-project/pull/126337