[llvm] [NVPTX] support packed f32 instructions for sm_100+ (PR #126337)

Tue Jul 22 11:45:05 PDT 2025

Prince781 wrote:

@npanchen if the original source was CUDA C++, then you can use the [same trick CUTLASS uses](https://github.com/NVIDIA/cutlass/blob/main/include/cute/arch/mma_sm90_gmma.hpp#L86-L95), where you pass the operand through an empty `asm volatile` statement to get the desired anti-dependency with your `wgmma.mma_async` inline asm call.
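
For illustration, here is a minimal sketch of that kind of fence, assuming the accumulators live in `float` or `uint32_t` registers. The names `FENCE_HOST_DEVICE`, `fence_operand`, and `fence_accumulators` are hypothetical and not taken from CUTLASS or LLVM; the linked CUTLASS helper is the authoritative version. The empty asm body emits no instructions, but the `"+f"`/`"+r"` read-write constraint plus the `"memory"` clobber make the compiler treat the register as both used and redefined at that point, so it should not move accesses to it across the fence.

```cpp
#include <cstdint>

// Hypothetical macro so the sketch builds under nvcc and a plain host compiler.
#if defined(__CUDACC__)
#define FENCE_HOST_DEVICE __host__ __device__
#else
#define FENCE_HOST_DEVICE
#endif

// Pass a float register through an empty asm volatile statement.
// Only emitted in device code; on the host this is a no-op.
FENCE_HOST_DEVICE inline void fence_operand(float& reg) {
#if defined(__CUDA_ARCH__)
  asm volatile("" : "+f"(reg) :: "memory");
#else
  (void)reg;
#endif
}

// Same idea for a 32-bit integer register.
FENCE_HOST_DEVICE inline void fence_operand(std::uint32_t& reg) {
#if defined(__CUDA_ARCH__)
  asm volatile("" : "+r"(reg) :: "memory");
#else
  (void)reg;
#endif
}

// Hypothetical usage: fence every accumulator register. In a real kernel
// this would be called right before and right after the inline asm that
// issues wgmma.mma_async (not shown here).
FENCE_HOST_DEVICE inline void fence_accumulators(float (&acc)[4]) {
  for (float& a : acc) {
    fence_operand(a);
  }
}
```

The fence leaves the values unchanged; it only constrains how the compiler may schedule code around the async MMA.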

https://github.com/llvm/llvm-project/pull/126337