[llvm] [LLVM][NVPTX] Add support for tensormap.cp_fenceproxy (PR #107555)
via llvm-commits
llvm-commits at lists.llvm.org
Thu Sep 12 04:39:55 PDT 2024
================
@@ -311,7 +311,37 @@ The ``@llvm.nvvm.fence.proxy.tensormap_generic.*`` is a uni-directional fence us
``@llvm.nvvm.fence.proxy.tensormap_generic.acquire.*`` ``fence.proxy.tensormap::generic.acquire.* [addr], size``
====================================================== =========================================================
-The address operand ``addr`` and the operand ``size`` together specify the memory range ``[addr, addr+size)`` on which the ordering guarantees on the memory accesses across the proxies is to be provided. The only supported value for the ``size`` operand is ``128`` and must be an immediate. Generic Addressing is used unconditionally, and the address specified by the operand addr must fall within the ``.global`` state space. Otherwise, the behavior is undefined. For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_.
+The address operand ``addr`` and the operand ``size`` together specify the memory range ``[addr, addr+size)`` on which the ordering guarantees for memory accesses across the proxies are provided. The only supported value for the ``size`` operand is ``128``, and it must be an immediate. Generic Addressing is used unconditionally, and the address specified by the operand ``addr`` must fall within the ``.global`` state space. Otherwise, the behavior is undefined. For more information, see PTX ISA `<https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_.
+
+'``llvm.nvvm.tensormap.cp_fenceproxy.global.shared.tensormap_generic.release.*.sync.aligned``'
----------------
gonzalobg wrote:
I understand that, for those who know what an intrinsic does, the longer names are painful. For those who do not, parts of the longer names (or call arguments) highlight deviations from the defaults (generic, release, .sync.aligned) and aid searchability in the PTX spec, which is, in the end, the reference documentation for these intrinsics. There may also be future tooling advantages to consider (like LLVM->PTX verification, auto-generating some of these from a machine-readable PTX spec, etc.).
I feel like we could debate what to do here forever, so maybe we should zoom out a bit: this is not the longest instruction there is, and it is "short" compared with some PTX instructions that may be coming.
We could spend engineering effort trying to shorten every intrinsic for every new PTX instruction, which:
- slows down support for new HW in NVPTX (NVPTX is missing hundreds of intrinsics), and
- risks spending even more engineering effort later, once PTX adds a new modifier: un-shortening the names, implementing AutoUpgrade, and possibly updating frontends.
I think it may actually be more valuable to instead prioritize adding support for new HW to NVPTX as soon as possible, e.g., by exposing PTX instructions 1:1 as intrinsics (modulo some simple mechanical transformations). That would free up cycles both for adding more intrinsics and for providing better exposure of the more general-purpose intrinsics, simplifying adoption in frontends and MLIR.
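For concreteness, here is a sketch of how the acquire-side fence documented in the diff above might be used from LLVM IR. The ``.cta`` scope suffix and the exact ``(ptr, i32)`` signature are assumptions inferred from the documented semantics, not taken from this patch:

```llvm
; Hedged sketch, not from the patch: fence the 128-byte tensormap at %tmap
; before consuming it. Assumes a cta-scoped variant with a (ptr, i32)
; signature; per the docs, size must be the immediate 128 and %tmap must
; point into the .global state space.
declare void @llvm.nvvm.fence.proxy.tensormap_generic.acquire.cta(ptr, i32)

define void @consume_tensormap(ptr %tmap) {
entry:
  call void @llvm.nvvm.fence.proxy.tensormap_generic.acquire.cta(ptr %tmap, i32 128)
  ret void
}
```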
https://github.com/llvm/llvm-project/pull/107555