[llvm] [LLVM][NVPTX] Add support for tensormap.cp_fenceproxy (PR #107555)

via llvm-commits llvm-commits at lists.llvm.org
Thu Sep 12 13:40:01 PDT 2024


================
@@ -311,7 +311,37 @@ The ``@llvm.nvvm.fence.proxy.tensormap_generic.*`` is a uni-directional fence us
   ``@llvm.nvvm.fence.proxy.tensormap_generic.acquire.*`` ``fence.proxy.tensormap::generic.acquire.* [addr], size``
   ====================================================== =========================================================
 
-The address operand ``addr`` and the operand ``size`` together specify the memory range ``[addr, addr+size)`` on which the ordering guarantees on the memory accesses across the proxies is to be provided. The only supported value for the ``size`` operand is ``128`` and must be an immediate. Generic Addressing is used unconditionally, and the address specified by the operand addr must fall within the ``.global`` state space. Otherwise, the behavior is undefined. For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_.
+The address operand ``addr`` and the operand ``size`` together specify the memory range ``[addr, addr+size)`` on which the ordering guarantees on the memory accesses across the proxies is to be provided. The only supported value for the ``size`` operand is ``128`` and must be an immediate. Generic Addressing is used unconditionally, and the address specified by the operand addr must fall within the ``.global`` state space. Otherwise, the behavior is undefined. For more information, see PTX ISA `<https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_.
+
+'``llvm.nvvm.tensormap.cp_fenceproxy.global.shared.tensormap_generic.release.*.sync.aligned``'
----------------
gonzalobg wrote:

Agree that we should move this to the forums, since this discussion goes beyond the scope of this one intrinsic and more into strategy.
Trying to get this back on track to discuss the spelling changes that we could do for this one intrinsic: 

- Shortening the intrinsic by removing `.global.shared`: I think this is fine as long as we raise a user error when the `addrspace` of the `src` and `dst` pointers does not fulfill the requirements, to prevent undefined behavior.
- Shortening the intrinsic by removing `.sync.aligned`: I think that incorrect use of `.sync.aligned` is one of the main sources of undefined behavior in LLVM IR. I don't know to what degree we can produce a user error if this requirement is not fulfilled, but if we cannot reliably error out, I think there needs to be something in the intrinsic name that conveys that this is a `.sync.aligned` intrinsic. Does LLVM IR have a better name for this? (`convergent` seems to be subtly different.)
- Shortening the intrinsic by removing `.release`: this intrinsic performs multiple operations, one of which is a `fence` instruction. When a memory ordering is omitted, the default is not `release` (it is `seq_cst` in C++ and `acq_rel` in PTX), which differs from the ordering used by this fence. The ordering reveals which other memory instructions this instruction builds a release pattern with, so I think we should keep it. If we wanted to shorten it, an option is to provide an `i32` order argument (e.g. `i32 5` for release), which is shorter than `.release`.
- Shortening the intrinsic by removing `.tensormap_generic`: the fence in this instruction is a cross-proxy fence (`fence.proxy`) from the `generic` proxy to the `tensormap` proxy, and in contrast with other cross-proxy fence instructions (e.g. `fence.proxy.alias`) it is not a bidirectional `acq_rel` cross-proxy fence, but a uni-directional one from the `generic` proxy to the `tensormap` proxy only. `.tensormap_generic` brings all of this with it. If we wanted to shorten it, my preference would be to add two arguments, `i32 dstp, i32 srcp`, that specify the destination and source proxies for uni-directional proxy fences (for bi-directional proxy fences we could have just one `i32`).
- Shortening the scope: use an `i32 NVPTX::Scope` argument instead.

So applying the shortenings, it could look like: 

```llvm
call void @llvm.nvvm.tensormap.cp_fenceproxy.sync.aligned(
  ptr addrspace(1) %dst, ; dst (global)
  ptr addrspace(3) %src, ; src (shared)
  i32 128,               ; size, immediate
  i32 5,                 ; order: release
  i32 1,                 ; to-proxy: tensormap
  i32 0,                 ; from-proxy: generic
  i32 %sco)              ; scope: NVPTX::Scope value
```
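
For comparison, a call using the per-name spelling documented in this PR would look roughly like the sketch below; the operand types and order (dst, src, size) are my assumption for illustration, mirroring the operands above, with `.gpu` standing in for the scope suffix:

```llvm
; Assumed prototype, for illustration only: dst (global), src (shared), size.
call void @llvm.nvvm.tensormap.cp_fenceproxy.global.shared.tensormap_generic.release.gpu.sync.aligned(
  ptr addrspace(1) %dst, ptr addrspace(3) %src, i32 128)
```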

I think this exposure is still a 1:1 PTX exposure (and therefore, per the prior comment, I think the pros outweigh the cons); it just mechanically picks a different syntax than the one this PR picks:
- Use `addrspace` instead of `.statespace` modifiers.
- Use `NVPTX::Order` enum instead of `.sem` modifier.
- Use a (new) `NVPTX::Proxy` enum instead of mapping `.to_proxy::from_proxy` to `.to_proxy_from_proxy`.
- Use `NVPTX::Scope` enum instead of a `.scope` modifier.
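
Either way, the operation exposed is the same PTX instruction. Under my reading of the PTX ISA, both spellings would lower to roughly the following (exact qualifiers such as `.shared::cta` and the `.gpu` scope are shown for illustration):

```
tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [dst], [src], 128;
```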

https://github.com/llvm/llvm-project/pull/107555

