[llvm] [LLVM][NVPTX] Add NVPTX codegen support for fence.proxy.tensormap (PR #100748)

via llvm-commits llvm-commits at lists.llvm.org
Thu Aug 1 10:28:59 PDT 2024


================
@@ -251,6 +251,34 @@ Overview:
 The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
 instruction, equivalent to the ``__syncthreads()`` call in CUDA.
 
+Membar/Fences
+-------------
+
+
+'``llvm.nvvm.fence.proxy.tensormap.*``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare void @llvm.nvvm.fence.proxy.tensormap.release.cta()
+  declare void @llvm.nvvm.fence.proxy.tensormap.release.cluster()
+  declare void @llvm.nvvm.fence.proxy.tensormap.release.gpu()
+  declare void @llvm.nvvm.fence.proxy.tensormap.release.sys()
+
+  declare void @llvm.nvvm.fence.proxy.tensormap.acquire.cta(ptr %addr, i32 %size)
+  declare void @llvm.nvvm.fence.proxy.tensormap.acquire.cluster(ptr %addr, i32 %size)
+  declare void @llvm.nvvm.fence.proxy.tensormap.acquire.gpu(ptr %addr, i32 %size)
+  declare void @llvm.nvvm.fence.proxy.tensormap.acquire.sys(ptr %addr, i32 %size)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.fence.proxy.tensormap.release.*``' intrinsic emits ``fence.proxy.tensormap::generic.release.*`` and '``@llvm.nvvm.fence.proxy.tensormap.acquire.*``' intrinsic emits ``fence.proxy.tensormap::generic.acquire.* [addr], size;``. ``nvvm.fence.proxy.tensormap*`` is a uni-directional fence used to establish ordering between memory accesses that may happen through different proxies. ``nvvm.fence.proxy.tensormap.release`` can form a release sequence that synchronizes with an acquire sequence that contains the ``nvvm.fence.proxy.tensormap.acquire`` proxy fence
+
+The address operand ``addr`` and the operand ``size`` together specifies the memory range ``[addr, addr+size-1]`` on which the ordering guarantees on the memory accesses across the proxies is to be provided. The only supported value for the ``size`` operand is ``128`` and must be an immediate. For more information, see `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-membar>`_.
----------------
gonzalobg wrote:

I think this should call out that address must either be `global` or `generic` pointing to global.

Do we have a way to annotate an intrinsic with that, to help addrspace propagation? (i.e. such that if someone uses this intrinsic on a pointer, we propagate through out that the pointer, and any pointer it was derived from / derived from it, must point to global.

https://github.com/llvm/llvm-project/pull/100748


More information about the llvm-commits mailing list