[llvm] [LLVM][NVPTX] Add support for tensormap.cp_fenceproxy (PR #107555)

Artem Belevich via llvm-commits llvm-commits at lists.llvm.org
Thu Sep 12 11:11:08 PDT 2024


================
@@ -311,7 +311,37 @@ The ``@llvm.nvvm.fence.proxy.tensormap_generic.*`` is a uni-directional fence us
   ``@llvm.nvvm.fence.proxy.tensormap_generic.acquire.*`` ``fence.proxy.tensormap::generic.acquire.* [addr], size``
   ====================================================== =========================================================
 
-The address operand ``addr`` and the operand ``size`` together specify the memory range ``[addr, addr+size)`` on which the ordering guarantees on the memory accesses across the proxies is to be provided. The only supported value for the ``size`` operand is ``128`` and must be an immediate. Generic Addressing is used unconditionally, and the address specified by the operand addr must fall within the ``.global`` state space. Otherwise, the behavior is undefined. For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_.
+The address operand ``addr`` and the operand ``size`` together specify the memory range ``[addr, addr+size)`` on which the ordering guarantees on the memory accesses across the proxies is to be provided. The only supported value for the ``size`` operand is ``128`` and must be an immediate. Generic Addressing is used unconditionally, and the address specified by the operand addr must fall within the ``.global`` state space. Otherwise, the behavior is undefined. For more information, see PTX ISA `<https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_.
+
+'``llvm.nvvm.tensormap.cp_fenceproxy.global.shared.tensormap_generic.release.*.sync.aligned``'
----------------
Artem-B wrote:

See above:
> That said, it's a cosmetic issue. While I'm not happy with the excessively long names, I can live with them.

That said, I'll address your points:
> For those that do not, parts of the longer names (or call arguments) highlight deviation from defaults (generic, release, .sync.aligned), and help with search ability in the PTX spec, which at the end is the reference documentation for these intrinsics. There may be future tooling advantages to consider (like LLVM->PTX verification, auto-generating some of these from a machine readable PTX spec, etc.).

A comment next to the intrinsic implementation, spelling out the exact instruction it will lower to, addresses the searchability issue without cluttering the IR with excessively long names. The names are not intended to spell out all the details in full sentences; we've got documentation for that. The purpose of the intrinsic name is to provide an understandable, consistent mapping to the underlying instruction. Whether we can come up with such a mapping that's better than the default exact one is what I'm trying to figure out. Naming is hard.
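To make it concrete (the shortened intrinsic name and signature below are purely hypothetical, just to illustrate the idea), a declaration that carries the exact PTX mapping in a comment could look like this:

```llvm
; Hypothetical shortened intrinsic; the comment preserves the exact PTX mapping,
; so the name itself doesn't have to spell it out.
; Lowers to: tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [dst], [src], 128;
declare void @llvm.nvvm.tensormap.cp_fenceproxy.release.gpu(ptr addrspace(1), ptr addrspace(3), i32)
```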

> I feel like we could debate what to do here forever, so maybe we should zoom out a bit: this is not the longest instruction there is, and it is "short" compared with some PTX instructions that may be coming.

"It can be worse" is a weak counterargument. The "it *will* be worse" part does concern me a bit and suggests that we should try to come up with a better name, because those longer instructions will percolate into LLVM IR. While we can't do much about NVIDIA's choices of instruction names, we do not have to follow them, if we can do better.

> We could spend engineering effort trying to shorten every intrinsic for every new PTX instruction,

We're not doing that. As I said above, I'm all for a straightforward 1:1 mapping, but this specific case (and, as you mention, the incoming instructions) suggests that we may need to consider other options before we proceed. We should probably discuss it on LLVM's Discourse forum. If there's general consensus that N-line-long names are fine, so be it.

> - slows down support for new HW in NVPTX (NVPTX is missing hundreds of intrinsics), and

Rushing is rarely a good choice, particularly when there are open questions and no existing users for the new intrinsics. Specifically for intrinsics, a delay in their implementation is almost never a showstopper, as there's always the option of using inline asm as a workaround. So, yes, resolving the design issues/concerns is part of the review process, and it does take time, and no, the delay is not a good argument for short-circuiting the review process.
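For example, a rough sketch of that workaround for the existing fence (the PTX string and constraints here are illustrative, taken from the ISA documentation, not from a tested build):

```llvm
; Sketch of the inline-asm workaround, usable before any intrinsic exists.
define void @fence_tensormap_acquire(ptr %addr) {
  %a = ptrtoint ptr %addr to i64
  call void asm sideeffect "fence.proxy.tensormap::generic.acquire.gpu [$0], 128;", "l,~{memory}"(i64 %a)
  ret void
}
```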

> - risks spending more engineering effort once PTX adds a new modifier to unshorten them, then implementing AutoUpgrade, and maybe upgrading frontends.

What is your point? Regardless of what we do now, we will need to make changes in the future, refactor existing code, etc. AutoUpgrade changes are largely mechanical and not a big deal in practice. Besides, they may or may not be needed, depending on which naming scheme we settle on, so this point is something to consider when we make that decision.

> I think it may actually be more valuable to, instead, prioritize adding support for new HW to NVPTX as soon as possible,

This is a false dichotomy, IMO. Yes, we need to add support for the new instructions and corresponding intrinsics, but we also need to consider how we do that. It's not an either/or proposition. We need both.







https://github.com/llvm/llvm-project/pull/107555


More information about the llvm-commits mailing list