[clang] [llvm] [NVPTX] Add tcgen05 alloc/dealloc intrinsics (PR #124961)

Thu Jan 30 12:38:14 PST 2025

================
@@ -962,6 +962,109 @@ The ``griddepcontrol`` intrinsics allows the dependent grids and prerequisite gr
 For more information, refer 
 `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-griddepcontrol>`__.
 
+TCGEN05 family of Intrinsics
+----------------------------
+
+The llvm.nvvm.tcgen05.* intrinsics model the TCGEN05 family of instructions
+exposed by PTX. These intrinsics use 'Tensor Memory' (henceforth ``tmem``).
+NVPTX represents this memory using ``addrspace(6)`` and is always 32-bits.
+
+For more information, refer PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory>`_.
+
+The tensor-memory pointers may only be used with the tcgen05 intrinsics.
+There are specialized load/store instructions provided (tcgen05.ld/st) to
+work with tensor-memory.
+
+For more information on tensor-memory load/store instructions, refer
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory-and-register-load-store-instructions>`_.
+
+All tcgen05 intrinsics use a ``null`` pointer in tmem address
+space as their last operand. This helps to preserve ordering among the tcgen05
+operations especially when the intrinsic lacks any tmem operands. This
+last operand is dropped during Codegen.
----------------
Artem-B wrote:

After reading PTX docs here's my understanding of the situation.
- there's a new kind of memory, so creating a separate AS for tmem is reasonable.
- tcgen05.alloc returns allocation result indirectly, by storing it in a shared memory. So LLVM has no direct indication that the intrinsic operates on tmem and affects both shared memory and tmem
- it's not clear from PTX docs what's the input for tcgen05.dealloc. It just says "The operand taddr must point to a previous [Tensor Memory](https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory) allocation" but I can't tell if that means the previous location in the shared memory where it stored a tmem pointer, or the tmem pointer itself. Judging by the proposed intrinsic signature, it's the latter. In this case LLVM knows that we're touching tmem.
- relinquish_alloc_permit blocks subsequent allocations, so it must not be reordered vs allocs.

So, the only odd thing is the allocation returning the result indirectly.

Proposed design adds artificial tmem pointer to let LLVM know that all tcgen05 intrinsics operate on tmem and we can give LLVM sufficient hints on how they should be ordered. However, the dummy argument is a crutch. 

The gist of the problem here is that LLVM's existing intrinsic annotation is not flexible enough to describe what we have here, exactly. I.e. there's no way to tell LLVM that alloc and relinquish_alloc_permit operate on tmem. 
Our current options are to either make all intrinsics conservatively with `HasSideEffects` or, with a more relaxed "IntrInaccessibleMemOnly". I think the latter would be a reasonable trade-off for the time being.

A longer-term approach would be to add a new intrinsic property allowing to specify specific AS accessed by the intrinsic. E.g. we may extend existing `IntrWriteMem` and `IntrWriteMem`  to allow narrowing the scope to particular AS, and allow specifying more than one. E.g. alloc would indicate that it writes both shared and tmem.
I think that would be a useful addition to a handful of other intrinsics we already have, not just in NVPTX, but in the other back-ends that need to deal with multiple AS.


https://github.com/llvm/llvm-project/pull/124961