[llvm] [LLVM][NVPTX] Add support for tensormap.cp_fenceproxy (PR #107555)

Artem Belevich via llvm-commits llvm-commits at lists.llvm.org
Tue Sep 10 14:52:28 PDT 2024


================
@@ -311,7 +311,37 @@ The ``@llvm.nvvm.fence.proxy.tensormap_generic.*`` is a uni-directional fence us
   ``@llvm.nvvm.fence.proxy.tensormap_generic.acquire.*`` ``fence.proxy.tensormap::generic.acquire.* [addr], size``
   ====================================================== =========================================================
 
-The address operand ``addr`` and the operand ``size`` together specify the memory range ``[addr, addr+size)`` on which the ordering guarantees on the memory accesses across the proxies is to be provided. The only supported value for the ``size`` operand is ``128`` and must be an immediate. Generic Addressing is used unconditionally, and the address specified by the operand addr must fall within the ``.global`` state space. Otherwise, the behavior is undefined. For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_.
+The address operand ``addr`` and the operand ``size`` together specify the memory range ``[addr, addr+size)`` on which the ordering guarantees on the memory accesses across the proxies is to be provided. The only supported value for the ``size`` operand is ``128`` and must be an immediate. Generic Addressing is used unconditionally, and the address specified by the operand addr must fall within the ``.global`` state space. Otherwise, the behavior is undefined. For more information, see PTX ISA `<https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_.
+
+'``llvm.nvvm.tensormap.cp_fenceproxy.global.shared.tensormap_generic.release.*.sync.aligned``'
----------------
Artem-B wrote:

I don't mean one generic intrinsic handling everything, but rather to skip mentioning parts that never change.

E.g. until we have variants that support accesses other than `global.shared`, there's no benefit in encoding them into the name. The 1:1 mapping is still in place.

If/when we need to introduce other variants, e.g. `shared.shared`, then we can add intrinsics with the address space encoded in the name, and auto-upgrade the abbreviated one to `global.shared`.

Another option is to use less verbose mnemonics. E.g. instead of `llvm.nvvm.tensormap.cp_fenceproxy.global.shared.tensormap_generic.release.*.sync.aligned` we could use `llvm.nvvm.tm.cfp.g.s.tmg.release.*.sync.aligned`.

Right now, according to the PTX spec, a lot of the instruction name components are fixed:

```
tensormap.cp_fenceproxy.cp_qualifiers.fence_qualifiers.sync.aligned  [dst], [src], size;

.cp_qualifiers    = { .global.shared::cta }
.fence_qualifiers = { .to_proxy::from_proxy.release.scope }
.to_proxy::from_proxy  = { .tensormap::generic }
.scope            = { .cta, .cluster, .gpu , .sys }
```

Once this convoluted structure is expanded, only the `.scope` value appears to change in the name.
One can guess, based on this syntax structure, that NVIDIA may have plans to extend it in the future, but with nothing else to go on, it's hard to predict how, when, or whether it will actually change.
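For concreteness, substituting the only allowed qualifier values from the grammar above (picking `.gpu` as an example `.scope`, and `128` as the documented `size`) yields something like:

```
tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned  [dst], [src], 128;
```

Everything but the scope (`.cta`/`.cluster`/`.gpu`/`.sys`) is fixed once the grammar is expanded.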

IMO, a naming scheme with abbreviated or omitted fields would be somewhat more user-friendly.

That said, it's a cosmetic issue. While I'm not happy with the excessively long names, I can live with them.

https://github.com/llvm/llvm-project/pull/107555