[llvm] e86a429 - [NVPTX] Add missing mbarrier intrinsics (#164864)
via llvm-commits
llvm-commits at lists.llvm.org
Mon Oct 27 01:53:45 PDT 2025
Author: Durgadoss R
Date: 2025-10-27T14:23:40+05:30
New Revision: e86a42940a2c2c58ba5280ae2d54d58140a42875
URL: https://github.com/llvm/llvm-project/commit/e86a42940a2c2c58ba5280ae2d54d58140a42875
DIFF: https://github.com/llvm/llvm-project/commit/e86a42940a2c2c58ba5280ae2d54d58140a42875.diff
LOG: [NVPTX] Add missing mbarrier intrinsics (#164864)
This patch adds a few more mbarrier intrinsics,
completing support for all the mbarrier variants
up to Blackwell architecture.
* Docs are updated in NVPTXUsage.rst.
* lit tests are added for all the variants.
* lit tests are verified with PTXAS from CUDA-12.8 toolkit.
Signed-off-by: Durgadoss R <durgadossr at nvidia.com>
Added:
llvm/test/CodeGen/NVPTX/mbarrier_arr.ll
llvm/test/CodeGen/NVPTX/mbarrier_arr_relaxed.ll
llvm/test/CodeGen/NVPTX/mbarrier_tx.ll
llvm/test/CodeGen/NVPTX/mbarrier_wait_sm80_ptx70.ll
llvm/test/CodeGen/NVPTX/mbarrier_wait_sm80_ptx71.ll
llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx78.ll
llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx80.ll
llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx86.ll
Modified:
llvm/docs/NVPTXUsage.rst
llvm/include/llvm/IR/IntrinsicsNVVM.td
llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
Removed:
################################################################################
diff --git a/llvm/docs/NVPTXUsage.rst b/llvm/docs/NVPTXUsage.rst
index e8dceb836f98a..5ad8f9ab07e40 100644
--- a/llvm/docs/NVPTXUsage.rst
+++ b/llvm/docs/NVPTXUsage.rst
@@ -322,6 +322,450 @@ aligned '``@llvm.nvvm.barrier.cta.*``' instruction should only be used if it is
known that all threads in the CTA evaluate the condition identically, otherwise
behavior is undefined.
+MBarrier family of Intrinsics
+-----------------------------
+
+Overview:
+^^^^^^^^^
+
+An ``mbarrier`` is a barrier created in shared memory that supports:
+
+* Synchronizing any subset of threads within a CTA.
+* One-way synchronization of threads across CTAs of a cluster.
+ Threads can perform only ``arrive`` operations but not ``*_wait`` on an
+ mbarrier located in shared::cluster space.
+* Waiting for completion of asynchronous memory operations initiated by a
+ thread and making them visible to other threads.
+
+Unlike ``bar{.cta}/barrier{.cta}`` instructions, which can access only a
+limited number of barriers per CTA, ``mbarrier`` objects are user-defined
+and are limited only by the total available shared memory.
+
+An mbarrier object is an opaque, 8-byte aligned object in shared
+memory. It keeps track of:
+
+* Current phase of the mbarrier object
+* Count of pending arrivals for the current phase of the mbarrier object
+* Count of expected arrivals for the next phase of the mbarrier object
+* Count of pending asynchronous memory operations (or transactions)
+ tracked by the current phase of the mbarrier object. This is also
+ referred to as ``tx-count``. The unit of ``tx-count`` is specified
+ by the asynchronous memory operation (for example,
+ ``llvm.nvvm.cp.async.bulk.tensor.g2s.*``).
+
+The ``phase`` of an mbarrier object is the number of times the mbarrier
+object has been used to synchronize threads or track asynchronous operations.
+In each phase, threads perform:
+
+* arrive/expect-tx/complete-tx operations to progress the current phase.
+* test_wait/try_wait operations to check for completion of the current phase.
+
+An mbarrier object completes the current phase when:
+
+* The count of the pending arrivals has reached zero AND
+* The tx-count has reached zero.
+
+When an mbarrier object completes the current phase, the following
+actions are performed atomically:
+
+* The mbarrier object transitions to the next phase.
+* The pending arrival count is reinitialized to the expected arrival count.
+
+For more information, refer to the PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-mbarrier>`_.
+
+'``llvm.nvvm.mbarrier.init``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+ declare void @llvm.nvvm.mbarrier.init(ptr %addr, i32 %count)
+ declare void @llvm.nvvm.mbarrier.init.shared(ptr addrspace(3) %addr, i32 %count)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.mbarrier.init.*``' intrinsics are used to initialize
+an mbarrier object located at ``addr`` with the value ``count``.
+``count`` is a 32-bit unsigned integer value and must be within
+the range [1...2^20-1]. During initialization:
+
+* The tx-count and the current phase of the mbarrier object are set to 0.
+* The expected and pending arrival counts are set to ``count``.
+
+Semantics:
+""""""""""
+
+The ``.shared`` variant explicitly uses shared memory address space for
+the ``addr`` operand. If the ``addr`` does not fall within the
+shared::cta space, then the behavior of this intrinsic is undefined.
+Performing ``mbarrier.init`` on a memory location that already holds a
+valid mbarrier object results in undefined behavior; use
+``mbarrier.inval`` to invalidate the object before reusing its memory
+for another mbarrier or for any other purpose.
+
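+For example, a minimal lifecycle of an mbarrier object in shared::cta
+memory may look as follows (the global ``@mbar`` and the count of 32
+are illustrative assumptions):
+
+.. code-block:: llvm
+
+  @mbar = internal addrspace(3) global i64 0, align 8
+
+  define void @mbar_lifecycle() {
+    ; Expect 32 arrivals per phase.
+    call void @llvm.nvvm.mbarrier.init.shared(ptr addrspace(3) @mbar, i32 32)
+    ; ... use the mbarrier to synchronize ...
+    ; Invalidate before reusing the memory for any other purpose.
+    call void @llvm.nvvm.mbarrier.inval.shared(ptr addrspace(3) @mbar)
+    ret void
+  }
+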
+'``llvm.nvvm.mbarrier.inval``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+ declare void @llvm.nvvm.mbarrier.inval(ptr %addr)
+ declare void @llvm.nvvm.mbarrier.inval.shared(ptr addrspace(3) %addr)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.mbarrier.inval.*``' intrinsics invalidate the mbarrier
+object at the address specified by ``addr``.
+
+Semantics:
+""""""""""
+
+The ``.shared`` variant explicitly uses shared memory address space for
+the ``addr`` operand. If the ``addr`` does not fall within the
+shared::cta space, then the behavior of this intrinsic is undefined.
+It is expected that ``addr`` was previously initialized using
+``mbarrier.init``; otherwise, the behavior is undefined.
+
+'``llvm.nvvm.mbarrier.expect.tx``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+ declare void @llvm.nvvm.mbarrier.expect.tx.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
+ declare void @llvm.nvvm.mbarrier.expect.tx.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
+ declare void @llvm.nvvm.mbarrier.expect.tx.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
+ declare void @llvm.nvvm.mbarrier.expect.tx.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.mbarrier.expect.tx.*``' intrinsics increase the transaction
+count of the mbarrier object at ``%addr`` by ``%tx_count``. The ``%tx_count``
+is a 32-bit unsigned integer value.
+
+Semantics:
+""""""""""
+
+The ``.space.{cta/cluster}`` indicates the address space where the mbarrier
+object resides.
+
+The ``.scope.{cta/cluster}`` denotes the set of threads that can directly
+observe the synchronizing effect of the mbarrier operation. When scope is
+"cta", all threads executing in the same CTA (as the current thread) can
+directly observe the effect of the ``expect.tx`` operation. Similarly,
+when scope is "cluster", all threads executing in the same Cluster
+(as the current thread) can directly observe the effect of the operation.
+
+If the ``addr`` does not fall within shared::cta or shared::cluster space,
+then the behavior of this intrinsic is undefined. This intrinsic has
+``relaxed`` semantics and hence does not provide any memory ordering
+or visibility guarantees.
+
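+For example, a thread expecting a 256-byte asynchronous transaction on
+a CTA-scoped mbarrier in shared::cta space may be written as follows
+(the ``%mbar`` operand and the 256-byte tx-count are illustrative):
+
+.. code-block:: llvm
+
+  ; Raise the tx-count of the current phase by 256.
+  call void @llvm.nvvm.mbarrier.expect.tx.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 256)
+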
+'``llvm.nvvm.mbarrier.complete.tx``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+ declare void @llvm.nvvm.mbarrier.complete.tx.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
+ declare void @llvm.nvvm.mbarrier.complete.tx.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
+ declare void @llvm.nvvm.mbarrier.complete.tx.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
+ declare void @llvm.nvvm.mbarrier.complete.tx.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.mbarrier.complete.tx.*``' intrinsics decrease the transaction
+count of the mbarrier object at ``%addr`` by ``%tx_count``. The ``%tx_count``
+is a 32-bit unsigned integer value. As a result of this decrement,
+the mbarrier can potentially complete its current phase and transition
+to the next phase.
+
+Semantics:
+""""""""""
+
+The semantics of these intrinsics are identical to those of the
+``llvm.nvvm.mbarrier.expect.tx.*`` intrinsics described above.
+
+'``llvm.nvvm.mbarrier.arrive``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+ declare i64 @llvm.nvvm.mbarrier.arrive.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %count)
+ declare i64 @llvm.nvvm.mbarrier.arrive.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %count)
+ declare void @llvm.nvvm.mbarrier.arrive.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %count)
+ declare void @llvm.nvvm.mbarrier.arrive.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %count)
+
+ declare i64 @llvm.nvvm.mbarrier.arrive.relaxed.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %count)
+ declare i64 @llvm.nvvm.mbarrier.arrive.relaxed.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %count)
+ declare void @llvm.nvvm.mbarrier.arrive.relaxed.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %count)
+ declare void @llvm.nvvm.mbarrier.arrive.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %count)
+
+Overview:
+"""""""""
+
+The ``@llvm.nvvm.mbarrier.arrive.*`` intrinsics signal the arrival of the
+executing thread or completion of an asynchronous instruction associated with
+an arrive operation on the mbarrier object at ``%addr``. This operation
+decrements the pending arrival count by ``%count``, a 32-bit unsigned integer,
+potentially completing the current phase and triggering a transition to the
+next phase.
+
+Semantics:
+""""""""""
+
+The ``.space.{cta/cluster}`` indicates the address space where the mbarrier
+object resides. When the mbarrier is in shared::cta space, the intrinsics
+return an opaque 64-bit value capturing the phase of the mbarrier object
+*prior* to this arrive operation. This value can be used with a try_wait
+or test_wait operation to check for the completion of the mbarrier.
+
+The ``.scope.{cta/cluster}`` denotes the set of threads that can directly
+observe the synchronizing effect of the mbarrier operation. When scope is
+"cta", all threads executing in the same CTA (as the current thread) can
+directly observe the effect of the ``arrive`` operation. Similarly,
+when scope is "cluster", all threads executing in the same Cluster
+(as the current thread) can directly observe the effect of the operation.
+
+If the ``addr`` does not fall within shared::cta or shared::cluster space,
+then the behavior of this intrinsic is undefined.
+
+These intrinsics have ``release`` semantics by default. The release semantics
+ensure ordering of operations that occur in program order *before* this arrive
+instruction, making their effects visible to subsequent operations in other
+threads of the CTA (or cluster, depending on scope). Threads performing
+corresponding acquire operations (such as mbarrier.test.wait) synchronize
+with this release. The ``relaxed`` variants of these intrinsics do not
+provide any memory ordering or visibility guarantees.
+
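+For example, an arrive on a CTA-scoped mbarrier in shared::cta space
+returns an opaque state token that can later be passed to a
+``test.wait`` or ``try.wait`` intrinsic (the ``%mbar`` operand is
+illustrative):
+
+.. code-block:: llvm
+
+  ; Signal one arrival and capture the pre-arrive phase state.
+  %state = call i64 @llvm.nvvm.mbarrier.arrive.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 1)
+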
+'``llvm.nvvm.mbarrier.arrive.expect.tx``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+ declare i64 @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
+ declare i64 @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
+ declare void @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
+ declare void @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
+
+ declare i64 @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
+ declare i64 @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
+ declare void @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
+ declare void @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
+
+Overview:
+"""""""""
+
+The ``@llvm.nvvm.mbarrier.arrive.expect.tx.*`` intrinsics are similar to
+the ``@llvm.nvvm.mbarrier.arrive`` intrinsics except that they also
+perform an ``expect-tx`` operation *prior* to the ``arrive`` operation.
+The ``%tx_count`` specifies the transaction count for the ``expect-tx``
+operation and the count for the ``arrive`` operation is assumed to be 1.
+
+Semantics:
+""""""""""
+
+The semantics of these intrinsics are identical to those of the
+``llvm.nvvm.mbarrier.arrive.*`` intrinsics described above.
+
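+A typical use is to publish the expected size of an asynchronous bulk
+copy and signal the arrival in a single step (the 4096-byte tx-count
+and the ``%mbar`` operand are illustrative):
+
+.. code-block:: llvm
+
+  ; expect-tx of 4096 bytes followed by an arrive with a count of 1.
+  %state = call i64 @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 4096)
+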
+'``llvm.nvvm.mbarrier.arrive.drop``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+ declare i64 @llvm.nvvm.mbarrier.arrive.drop.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %count)
+ declare i64 @llvm.nvvm.mbarrier.arrive.drop.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %count)
+ declare void @llvm.nvvm.mbarrier.arrive.drop.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %count)
+ declare void @llvm.nvvm.mbarrier.arrive.drop.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %count)
+
+ declare i64 @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %count)
+ declare i64 @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %count)
+ declare void @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %count)
+ declare void @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %count)
+
+Overview:
+"""""""""
+
+The ``@llvm.nvvm.mbarrier.arrive.drop.*`` intrinsics decrement the
+expected arrival count of the mbarrier object at ``%addr`` by
+``%count`` and then perform an ``arrive`` operation with ``%count``.
+The ``%count`` is a 32-bit unsigned integer.
+
+Semantics:
+""""""""""
+
+The semantics of these intrinsics are identical to those of the
+``llvm.nvvm.mbarrier.arrive.*`` intrinsics described above.
+
+'``llvm.nvvm.mbarrier.arrive.drop.expect.tx``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+ declare i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
+ declare i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
+ declare void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
+ declare void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
+
+ declare i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
+ declare i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
+ declare void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
+ declare void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
+
+Overview:
+"""""""""
+
+The ``@llvm.nvvm.mbarrier.arrive.drop.expect.tx.*`` intrinsics perform
+the following operations on the mbarrier object located at ``%addr``:
+
+* Perform an ``expect-tx`` operation, i.e., increase the transaction count
+ of the mbarrier by ``%tx_count``, a 32-bit unsigned integer value.
+* Decrement the expected arrival count of the mbarrier by 1.
+* Perform an ``arrive`` operation on the mbarrier with a value of 1.
+
+Semantics:
+""""""""""
+
+The semantics of these intrinsics are identical to those of the
+``llvm.nvvm.mbarrier.arrive.*`` intrinsics described above.
+
+'``llvm.nvvm.mbarrier.test.wait``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+ declare i1 @llvm.nvvm.mbarrier.test.wait.scope.cta.space.cta(ptr addrspace(3) %addr, i64 %state)
+ declare i1 @llvm.nvvm.mbarrier.test.wait.scope.cluster.space.cta(ptr addrspace(3) %addr, i64 %state)
+ declare i1 @llvm.nvvm.mbarrier.test.wait.parity.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %phase)
+ declare i1 @llvm.nvvm.mbarrier.test.wait.parity.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %phase)
+
+ declare i1 @llvm.nvvm.mbarrier.test.wait.relaxed.scope.cta.space.cta(ptr addrspace(3) %addr, i64 %state)
+ declare i1 @llvm.nvvm.mbarrier.test.wait.relaxed.scope.cluster.space.cta(ptr addrspace(3) %addr, i64 %state)
+ declare i1 @llvm.nvvm.mbarrier.test.wait.parity.relaxed.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %phase)
+ declare i1 @llvm.nvvm.mbarrier.test.wait.parity.relaxed.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %phase)
+
+Overview:
+"""""""""
+
+The ``@llvm.nvvm.mbarrier.test.wait.*`` intrinsics test for the completion
+of the current or the immediately preceding phase of an mbarrier object at
+``%addr``. The test for completion can be done with either the ``state`` or
+the ``phase-parity`` of the mbarrier object.
+
+* When done through the ``i64 %state`` operand, the state must have been
+  returned by an ``llvm.nvvm.mbarrier.arrive.*`` operation on the *same*
+  mbarrier object.
+* The ``.parity`` variants of these intrinsics test for completion
+  of the phase indicated by the operand ``i32 %phase``, which is
+  the integer parity of either the current phase or the immediately
+  preceding phase of the mbarrier object. An even phase has integer
+  parity 0 and an odd phase has integer parity 1; hence, the valid
+  values for phase-parity are 0 and 1.
+
+Semantics:
+""""""""""
+
+The ``.scope.{cta/cluster}`` denotes the set of threads that the
+test_wait operation can directly synchronize with.
+
+If the ``addr`` does not fall within shared::cta space, then the
+behavior of this intrinsic is undefined.
+
+These intrinsics have ``acquire`` semantics by default. This acquire
+pattern establishes memory ordering for operations occurring in program
+order after this ``test_wait`` instruction by making operations from
+other threads in the CTA (or cluster, depending on scope) visible to
+subsequent operations in the current thread. When this wait completes,
+it synchronizes with the corresponding release pattern from the
+``mbarrier.arrive`` operation. The ``relaxed`` variants of these intrinsics
+do not provide any memory ordering or visibility guarantees.
+
+These ``test.wait`` intrinsics are non-blocking and immediately return
+the completion status without suspending the executing thread.
+
+The boolean return value indicates:
+
+* True: The immediately preceding phase has completed
+* False: The current phase is still incomplete
+
+When this wait returns true, the following ordering guarantees hold:
+
+* All memory accesses (except async operations) requested prior to
+ ``mbarrier.arrive`` having release semantics by participating
+ threads of a CTA (or cluster, depending on scope) are visible to
+ the executing thread.
+* All ``cp.async`` operations requested prior to ``cp.async.mbarrier.arrive``
+ by participating threads of a CTA are visible to the executing thread.
+* All ``cp.async.bulk`` operations using the same mbarrier object requested
+ prior to ``mbarrier.arrive`` having release semantics by participating CTA
+ threads are visible to the executing thread.
+* Memory accesses requested after this wait are not visible to memory
+ accesses performed prior to ``mbarrier.arrive`` by other participating
+ threads.
+* No ordering guarantee exists for memory accesses by the same thread
+ between an ``mbarrier.arrive`` and this wait.
+
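+Since ``test.wait`` is non-blocking, a common pattern is to poll it in
+a loop until the phase completes (a sketch; ``%state`` is assumed to
+come from a prior arrive on the same mbarrier object):
+
+.. code-block:: llvm
+
+  loop:
+    %done = call i1 @llvm.nvvm.mbarrier.test.wait.scope.cta.space.cta(ptr addrspace(3) %mbar, i64 %state)
+    br i1 %done, label %exit, label %loop
+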
+'``llvm.nvvm.mbarrier.try.wait``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+ declare i1 @llvm.nvvm.mbarrier.try.wait{.relaxed}.scope.cta.space.cta(ptr addrspace(3) %addr, i64 %state)
+ declare i1 @llvm.nvvm.mbarrier.try.wait{.relaxed}.scope.cluster.space.cta(ptr addrspace(3) %addr, i64 %state)
+
+ declare i1 @llvm.nvvm.mbarrier.try.wait.parity{.relaxed}.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %phase)
+ declare i1 @llvm.nvvm.mbarrier.try.wait.parity{.relaxed}.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %phase)
+
+ declare i1 @llvm.nvvm.mbarrier.try.wait.tl{.relaxed}.scope.cta.space.cta(ptr addrspace(3) %addr, i64 %state, i32 %timelimit)
+ declare i1 @llvm.nvvm.mbarrier.try.wait.tl{.relaxed}.scope.cluster.space.cta(ptr addrspace(3) %addr, i64 %state, i32 %timelimit)
+
+ declare i1 @llvm.nvvm.mbarrier.try.wait.parity.tl{.relaxed}.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %phase, i32 %timelimit)
+ declare i1 @llvm.nvvm.mbarrier.try.wait.parity.tl{.relaxed}.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %phase, i32 %timelimit)
+
+Overview:
+"""""""""
+
+The ``@llvm.nvvm.mbarrier.try.wait.*`` intrinsics test for the completion of
+the current or immediately preceding phase of an mbarrier object at ``%addr``.
+Unlike the ``test.wait`` intrinsics, which perform a non-blocking test, these
+intrinsics may block the executing thread until the specified phase completes
+or a system-dependent time limit expires. Suspended threads resume execution
+when the phase completes or the time limit elapses. This time limit is
+configurable through the ``.tl`` variants of these intrinsics, where the
+``%timelimit`` operand (an unsigned integer) specifies the limit in
+nanoseconds. Other semantics are identical to those of the ``test.wait``
+intrinsics described above.
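+
+For example, a ``try_wait`` with a caller-specified time limit of 1000
+nanoseconds (the operand values are illustrative):
+
+.. code-block:: llvm
+
+  ; May block until the phase completes or the time limit elapses.
+  %done = call i1 @llvm.nvvm.mbarrier.try.wait.tl.scope.cta.space.cta(ptr addrspace(3) %mbar, i64 %state, i32 1000)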
+
Electing a thread
-----------------
diff --git a/llvm/include/llvm/IR/IntrinsicsNVVM.td b/llvm/include/llvm/IR/IntrinsicsNVVM.td
index c9df6c43fd396..719181a09f475 100644
--- a/llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ b/llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -1866,6 +1866,73 @@ let IntrProperties = [IntrConvergent, IntrNoCallback] in {
def int_nvvm_mbarrier_pending_count : NVVMBuiltin,
Intrinsic<[llvm_i32_ty], [llvm_i64_ty], [IntrNoMem, IntrConvergent, IntrNoCallback]>;
+// mbarrier.{expect_tx/complete_tx}
+foreach op = ["expect_tx", "complete_tx"] in {
+ foreach scope = ["scope_cta", "scope_cluster"] in {
+ foreach space = ["space_cta", "space_cluster"] in {
+ defvar suffix = StrJoin<"_", [op, scope, space]>.ret;
+ defvar mbar_addr_ty = !if(!eq(space, "space_cta"),
+ llvm_shared_ptr_ty, llvm_shared_cluster_ptr_ty);
+
+ def int_nvvm_mbarrier_ # suffix :
+ Intrinsic<[], [mbar_addr_ty, llvm_i32_ty],
+ [IntrConvergent, IntrArgMemOnly, IntrNoCallback]>;
+ } // space
+ } // scope
+} // op
+
+// mbarrier.arrive and mbarrier.arrive.expect_tx
+// mbarrier.arrive_drop and mbarrier.arrive_drop.expect_tx
+foreach op = ["arrive", "arrive_expect_tx",
+ "arrive_drop", "arrive_drop_expect_tx"] in {
+ foreach scope = ["scope_cta", "scope_cluster"] in {
+ foreach space = ["space_cta", "space_cluster"] in {
+ defvar suffix = StrJoin<"_", [scope, space]>.ret;
+ defvar mbar_addr_ty = !if(!eq(space, "space_cta"),
+ llvm_shared_ptr_ty, llvm_shared_cluster_ptr_ty);
+ defvar args_ty = [mbar_addr_ty, // mbar_address_ptr
+ llvm_i32_ty]; // tx-count
+
+ // mbarriers in shared_cluster space cannot return any value.
+ defvar mbar_ret_ty = !if(!eq(space, "space_cta"),
+ [llvm_i64_ty], []<LLVMType>);
+
+ def int_nvvm_mbarrier_ # op # "_" # suffix:
+ Intrinsic<mbar_ret_ty, args_ty,
+ [IntrConvergent, IntrNoCallback]>;
+ def int_nvvm_mbarrier_ # op # "_relaxed_" # suffix :
+ Intrinsic<mbar_ret_ty, args_ty,
+ [IntrConvergent, IntrArgMemOnly, IntrNoCallback]>;
+ } // space
+ } // scope
+} // op
+
+// mbarrier.{test_wait and try_wait}
+foreach op = ["test_wait", "try_wait"] in {
+ foreach scope = ["scope_cta", "scope_cluster"] in {
+ foreach parity = [true, false] in {
+ foreach time_limit = !if(!eq(op, "try_wait"), [true, false], [false]) in {
+ defvar base_args = [llvm_shared_ptr_ty]; // mbar_ptr
+ defvar parity_args = !if(parity, [llvm_i32_ty], [llvm_i64_ty]);
+ defvar tl_args = !if(time_limit, [llvm_i32_ty], []<LLVMType>);
+ defvar args = !listconcat(base_args, parity_args, tl_args);
+ defvar tmp_op = StrJoin<"_", [op,
+ !if(parity, "parity", ""),
+ !if(time_limit, "tl", "")]>.ret;
+ defvar suffix = StrJoin<"_", [scope, "space_cta"]>.ret;
+
+ def int_nvvm_mbarrier_ # tmp_op # "_" # suffix :
+ Intrinsic<[llvm_i1_ty], args,
+ [IntrConvergent, NoCapture<ArgIndex<0>>, IntrNoCallback]>;
+ def int_nvvm_mbarrier_ # tmp_op # "_relaxed_" # suffix :
+ Intrinsic<[llvm_i1_ty], args,
+ [IntrConvergent, NoCapture<ArgIndex<0>>, IntrNoCallback,
+ IntrArgMemOnly, IntrReadMem]>;
+ } // tl
+ } // parity
+ } // scope
+} // op
+
// Generated within nvvm. Use for ldu on sm_20 or later. Second arg is the
// pointer's alignment.
let IntrProperties = [IntrReadMem, IntrArgMemOnly, IntrNoCallback, IntrWillReturn, NoCapture<ArgIndex<0>>] in {
@@ -3000,4 +3067,4 @@ foreach sp = [0, 1] in {
}
}
-} // let TargetPrefix = "nvvm"
\ No newline at end of file
+} // let TargetPrefix = "nvvm"
diff --git a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
index 598735f5972bc..c923f0ec907e7 100644
--- a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -1082,6 +1082,161 @@ let Predicates = [hasPTX<70>, hasSM<80>] in {
"mbarrier.pending_count.b64",
[(set i32:$res, (int_nvvm_mbarrier_pending_count i64:$state))]>;
}
+
+class MBAR_UTIL<string op, string scope,
+ string space = "", string sem = "",
+ bit tl = 0, bit parity = 0> {
+ // The mbarrier instructions in PTX ISA are of the general form:
+ // mbarrier.op.semantics.scope.space.b64 arg1, arg2 ...
+ // where:
+ // op -> arrive, expect_tx, complete_tx, arrive.expect_tx etc.
+ // semantics -> acquire, release, relaxed (default depends on the op)
+ // scope -> cta or cluster (default is cta-scope)
+ // space -> shared::cta or shared::cluster (default is shared::cta)
+ //
+ // The 'semantics' and 'scope' go together. If one is specified,
+ // then the other _must_ be specified. For example:
+ // (A) mbarrier.arrive <args> (valid, release and cta are default)
+ // (B) mbarrier.arrive.release.cta <args> (valid, sem/scope mentioned explicitly)
+ // (C) mbarrier.arrive.release <args> (invalid, needs scope)
+ // (D) mbarrier.arrive.cta <args> (invalid, needs order)
+ //
+ // Wherever possible, we prefer form (A) to (B) since it is available
+ // from early PTX versions. In most cases, explicitly specifying the
+ // scope requires a later version of PTX.
+ string _scope_asm = !cond(
+ !eq(scope, "scope_cluster") : "cluster",
+ !eq(scope, "scope_cta") : !if(!empty(sem), "", "cta"),
+ true : scope);
+ string _space_asm = !cond(
+ !eq(space, "space_cta") : "shared",
+ !eq(space, "space_cluster") : "shared::cluster",
+ true : space);
+
+ string _parity = !if(parity, "parity", "");
+ string asm_str = StrJoin<".", ["mbarrier", op, _parity,
+ sem, _scope_asm, _space_asm, "b64"]>.ret;
+
+ string _intr_suffix = StrJoin<"_", [!subst(".", "_", op), _parity,
+ !if(tl, "tl", ""),
+ sem, scope, space]>.ret;
+ string intr_name = "int_nvvm_mbarrier_" # _intr_suffix;
+
+ // Predicate checks:
+ // These are used only for the "test_wait/try_wait" variants as they
+ // have evolved since sm80 and are complex. The predicates for the
+ // remaining instructions are straightforward and have already been
+ // applied directly.
+ Predicate _sm_pred = !cond(!or(
+ !eq(op, "try_wait"),
+ !eq(scope, "scope_cluster"),
+ !eq(sem, "relaxed")) : hasSM<90>,
+ true : hasSM<80>);
+ Predicate _ptx_pred = !cond(
+ !eq(sem, "relaxed") : hasPTX<86>,
+ !ne(_scope_asm, "") : hasPTX<80>,
+ !eq(op, "try_wait") : hasPTX<78>,
+ parity : hasPTX<71>,
+ true : hasPTX<70>);
+ list<Predicate> preds = [_ptx_pred, _sm_pred];
+}
+
+foreach op = ["expect_tx", "complete_tx"] in {
+ foreach scope = ["scope_cta", "scope_cluster"] in {
+ foreach space = ["space_cta", "space_cluster"] in {
+ defvar intr = !cast<Intrinsic>(MBAR_UTIL<op, scope, space>.intr_name);
+ defvar suffix = StrJoin<"_", [op, scope, space]>.ret;
+ def mbar_ # suffix : BasicNVPTXInst<(outs), (ins ADDR:$addr, B32:$tx_count),
+ MBAR_UTIL<op, scope, space, "relaxed">.asm_str,
+ [(intr addr:$addr, i32:$tx_count)]>,
+ Requires<[hasPTX<80>, hasSM<90>]>;
+ } // space
+ } // scope
+} // op
+
+multiclass MBAR_ARR_INTR<string op, string scope, string sem,
+ list<Predicate> pred = []> {
+ // When either of sem or scope is non-default, both have to
+ // be explicitly specified. So, explicitly state that
+ // sem is `release` when scope is `cluster`.
+ defvar asm_sem = !if(!and(!empty(sem), !eq(scope, "scope_cluster")),
+ "release", sem);
+
+ defvar asm_cta = MBAR_UTIL<op, scope, "space_cta", asm_sem>.asm_str;
+ defvar intr_cta = !cast<Intrinsic>(MBAR_UTIL<op, scope,
+ "space_cta", sem>.intr_name);
+
+ defvar asm_cluster = MBAR_UTIL<op, scope, "space_cluster", asm_sem>.asm_str;
+ defvar intr_cluster = !cast<Intrinsic>(MBAR_UTIL<op, scope,
+ "space_cluster", sem>.intr_name);
+
+ def _CTA : NVPTXInst<(outs B64:$state),
+ (ins ADDR:$addr, B32:$tx_count),
+ asm_cta # " $state, [$addr], $tx_count;",
+ [(set i64:$state, (intr_cta addr:$addr, i32:$tx_count))]>,
+ Requires<pred>;
+ def _CLUSTER : NVPTXInst<(outs),
+ (ins ADDR:$addr, B32:$tx_count),
+ asm_cluster # " _, [$addr], $tx_count;",
+ [(intr_cluster addr:$addr, i32:$tx_count)]>,
+ Requires<pred>;
+}
+foreach op = ["arrive", "arrive.expect_tx",
+ "arrive_drop", "arrive_drop.expect_tx"] in {
+ foreach scope = ["scope_cta", "scope_cluster"] in {
+ defvar suffix = !subst(".", "_", op) # scope;
+ defm mbar_ # suffix # _release : MBAR_ARR_INTR<op, scope, "", [hasPTX<80>, hasSM<90>]>;
+ defm mbar_ # suffix # _relaxed : MBAR_ARR_INTR<op, scope, "relaxed", [hasPTX<86>, hasSM<90>]>;
+ } // scope
+} // op
+
+multiclass MBAR_WAIT_INTR<string op, string scope, string sem, bit time_limit> {
+ // When either of sem or scope is non-default, both have to
+ // be explicitly specified. So, explicitly state that the
+ // semantics is `acquire` when the scope is `cluster`.
+ defvar asm_sem = !if(!and(!empty(sem), !eq(scope, "scope_cluster")),
+ "acquire", sem);
+
+ defvar asm_parity = MBAR_UTIL<op, scope, "space_cta", asm_sem,
+ time_limit, 1>.asm_str;
+ defvar pred_parity = MBAR_UTIL<op, scope, "space_cta", asm_sem,
+ time_limit, 1>.preds;
+ defvar intr_parity = !cast<Intrinsic>(MBAR_UTIL<op, scope, "space_cta",
+ sem, time_limit, 1>.intr_name);
+
+ defvar asm_state = MBAR_UTIL<op, scope, "space_cta", asm_sem,
+ time_limit>.asm_str;
+ defvar pred_state = MBAR_UTIL<op, scope, "space_cta", asm_sem,
+ time_limit>.preds;
+ defvar intr_state = !cast<Intrinsic>(MBAR_UTIL<op, scope, "space_cta",
+ sem, time_limit>.intr_name);
+
+ defvar ins_tl_dag = !if(time_limit, (ins B32:$tl), (ins));
+ defvar tl_suffix = !if(time_limit, ", $tl;", ";");
+ defvar intr_state_dag = !con((intr_state addr:$addr, i64:$state),
+ !if(time_limit, (intr_state i32:$tl), (intr_state)));
+ defvar intr_parity_dag = !con((intr_parity addr:$addr, i32:$phase),
+ !if(time_limit, (intr_parity i32:$tl), (intr_parity)));
+
+ def _STATE : NVPTXInst<(outs B1:$res), !con((ins ADDR:$addr, B64:$state), ins_tl_dag),
+ asm_state # " $res, [$addr], $state" # tl_suffix,
+ [(set i1:$res, intr_state_dag)]>,
+ Requires<pred_state>;
+ def _PARITY : NVPTXInst<(outs B1:$res), !con((ins ADDR:$addr, B32:$phase), ins_tl_dag),
+ asm_parity # " $res, [$addr], $phase" # tl_suffix,
+ [(set i1:$res, intr_parity_dag)]>,
+ Requires<pred_parity>;
+}
+foreach op = ["test_wait", "try_wait"] in {
+ foreach scope = ["scope_cta", "scope_cluster"] in {
+ foreach time_limit = !if(!eq(op, "try_wait"), [true, false], [false]) in {
+ defvar suffix = StrJoin<"_", [op, scope, !if(time_limit, "tl", "")]>.ret;
+ defm mbar_ # suffix # "_acquire" : MBAR_WAIT_INTR<op, scope, "", time_limit>;
+ defm mbar_ # suffix # "_relaxed" : MBAR_WAIT_INTR<op, scope, "relaxed", time_limit>;
+ } // time_limit
+ } // scope
+} // op
+
//-----------------------------------
// Math Functions
//-----------------------------------
diff --git a/llvm/test/CodeGen/NVPTX/mbarrier_arr.ll b/llvm/test/CodeGen/NVPTX/mbarrier_arr.ll
new file mode 100644
index 0000000000000..c440caaf98aba
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/mbarrier_arr.ll
@@ -0,0 +1,165 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80| FileCheck --check-prefixes=CHECK-PTX64 %s
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80 --nvptx-short-ptr| FileCheck --check-prefixes=CHECK-PTX-SHARED32 %s
+; RUN: %if ptxas-sm_90 && ptxas-isa-8.0 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80| %ptxas-verify -arch=sm_90 %}
+; RUN: %if ptxas-sm_90 && ptxas-isa-8.0 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80 --nvptx-short-ptr| %ptxas-verify -arch=sm_90 %}
+
+; === space_cta (addrspace 3) ===
+define void @test_mbarrier_arrive_scope_cta_space_cta(ptr addrspace(3) %mbar, i32 %tx) {
+; CHECK-PTX64-LABEL: test_mbarrier_arrive_scope_cta_space_cta(
+; CHECK-PTX64: {
+; CHECK-PTX64-NEXT: .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT: .reg .b64 %rd<6>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT: // %bb.0:
+; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [test_mbarrier_arrive_scope_cta_space_cta_param_0];
+; CHECK-PTX64-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_scope_cta_space_cta_param_1];
+; CHECK-PTX64-NEXT: mbarrier.arrive.shared.b64 %rd2, [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.arrive.expect_tx.shared.b64 %rd3, [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.arrive_drop.shared.b64 %rd4, [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.arrive_drop.expect_tx.shared.b64 %rd5, [%rd1], %r1;
+; CHECK-PTX64-NEXT: ret;
+;
+; CHECK-PTX-SHARED32-LABEL: test_mbarrier_arrive_scope_cta_space_cta(
+; CHECK-PTX-SHARED32: {
+; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-NEXT: .reg .b64 %rd<5>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT: // %bb.0:
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_scope_cta_space_cta_param_0];
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [test_mbarrier_arrive_scope_cta_space_cta_param_1];
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.shared.b64 %rd1, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.expect_tx.shared.b64 %rd2, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.shared.b64 %rd3, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.expect_tx.shared.b64 %rd4, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: ret;
+ %r1 = call i64 @llvm.nvvm.mbarrier.arrive.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+ %r2 = call i64 @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+ %r3 = call i64 @llvm.nvvm.mbarrier.arrive.drop.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+ %r4 = call i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+ ret void
+}
+
+define void @test_mbarrier_arrive_scope_cluster_space_cta(ptr addrspace(3) %mbar, i32 %tx) {
+; CHECK-PTX64-LABEL: test_mbarrier_arrive_scope_cluster_space_cta(
+; CHECK-PTX64: {
+; CHECK-PTX64-NEXT: .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT: .reg .b64 %rd<6>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT: // %bb.0:
+; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [test_mbarrier_arrive_scope_cluster_space_cta_param_0];
+; CHECK-PTX64-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_scope_cluster_space_cta_param_1];
+; CHECK-PTX64-NEXT: mbarrier.arrive.release.cluster.shared.b64 %rd2, [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.arrive.expect_tx.release.cluster.shared.b64 %rd3, [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.arrive_drop.release.cluster.shared.b64 %rd4, [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.arrive_drop.expect_tx.release.cluster.shared.b64 %rd5, [%rd1], %r1;
+; CHECK-PTX64-NEXT: ret;
+;
+; CHECK-PTX-SHARED32-LABEL: test_mbarrier_arrive_scope_cluster_space_cta(
+; CHECK-PTX-SHARED32: {
+; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-NEXT: .reg .b64 %rd<5>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT: // %bb.0:
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_scope_cluster_space_cta_param_0];
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [test_mbarrier_arrive_scope_cluster_space_cta_param_1];
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.release.cluster.shared.b64 %rd1, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.expect_tx.release.cluster.shared.b64 %rd2, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.release.cluster.shared.b64 %rd3, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.expect_tx.release.cluster.shared.b64 %rd4, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: ret;
+ %r1 = call i64 @llvm.nvvm.mbarrier.arrive.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+ %r2 = call i64 @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+ %r3 = call i64 @llvm.nvvm.mbarrier.arrive.drop.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+ %r4 = call i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+ ret void
+}
+
+; === space_cluster (addrspace 7) ===
+define void @test_mbarrier_arrive_scope_cta_space_cluster(ptr addrspace(7) %mbar, i32 %tx) {
+; CHECK-PTX64-LABEL: test_mbarrier_arrive_scope_cta_space_cluster(
+; CHECK-PTX64: {
+; CHECK-PTX64-NEXT: .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT: .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT: // %bb.0:
+; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [test_mbarrier_arrive_scope_cta_space_cluster_param_0];
+; CHECK-PTX64-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_scope_cta_space_cluster_param_1];
+; CHECK-PTX64-NEXT: mbarrier.arrive.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.arrive.expect_tx.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.arrive_drop.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.arrive_drop.expect_tx.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT: ret;
+;
+; CHECK-PTX-SHARED32-LABEL: test_mbarrier_arrive_scope_cta_space_cluster(
+; CHECK-PTX-SHARED32: {
+; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT: // %bb.0:
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_scope_cta_space_cluster_param_0];
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [test_mbarrier_arrive_scope_cta_space_cluster_param_1];
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.expect_tx.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.expect_tx.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: ret;
+ call void @llvm.nvvm.mbarrier.arrive.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+ call void @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+ call void @llvm.nvvm.mbarrier.arrive.drop.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+ call void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+ ret void
+}
+
+define void @test_mbarrier_arrive_scope_cluster_space_cluster(ptr addrspace(7) %mbar, i32 %tx) {
+; CHECK-PTX64-LABEL: test_mbarrier_arrive_scope_cluster_space_cluster(
+; CHECK-PTX64: {
+; CHECK-PTX64-NEXT: .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT: .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT: // %bb.0:
+; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [test_mbarrier_arrive_scope_cluster_space_cluster_param_0];
+; CHECK-PTX64-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_scope_cluster_space_cluster_param_1];
+; CHECK-PTX64-NEXT: mbarrier.arrive.release.cluster.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.arrive.expect_tx.release.cluster.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.arrive_drop.release.cluster.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.arrive_drop.expect_tx.release.cluster.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT: ret;
+;
+; CHECK-PTX-SHARED32-LABEL: test_mbarrier_arrive_scope_cluster_space_cluster(
+; CHECK-PTX-SHARED32: {
+; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT: // %bb.0:
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_scope_cluster_space_cluster_param_0];
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [test_mbarrier_arrive_scope_cluster_space_cluster_param_1];
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.release.cluster.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.expect_tx.release.cluster.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.release.cluster.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.expect_tx.release.cluster.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: ret;
+ call void @llvm.nvvm.mbarrier.arrive.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+ call void @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+ call void @llvm.nvvm.mbarrier.arrive.drop.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+ call void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+ ret void
+}
+
+declare i64 @llvm.nvvm.mbarrier.arrive.scope.cta.space.cta(ptr addrspace(3), i32)
+declare i64 @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cta.space.cta(ptr addrspace(3), i32)
+declare i64 @llvm.nvvm.mbarrier.arrive.drop.scope.cta.space.cta(ptr addrspace(3), i32)
+declare i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cta.space.cta(ptr addrspace(3), i32)
+
+declare i64 @llvm.nvvm.mbarrier.arrive.scope.cluster.space.cta(ptr addrspace(3), i32)
+declare i64 @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cluster.space.cta(ptr addrspace(3), i32)
+declare i64 @llvm.nvvm.mbarrier.arrive.drop.scope.cluster.space.cta(ptr addrspace(3), i32)
+declare i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cluster.space.cta(ptr addrspace(3), i32)
+
+declare void @llvm.nvvm.mbarrier.arrive.scope.cta.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cta.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.arrive.drop.scope.cta.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cta.space.cluster(ptr addrspace(7), i32)
+
+declare void @llvm.nvvm.mbarrier.arrive.scope.cluster.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cluster.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.arrive.drop.scope.cluster.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cluster.space.cluster(ptr addrspace(7), i32)
diff --git a/llvm/test/CodeGen/NVPTX/mbarrier_arr_relaxed.ll b/llvm/test/CodeGen/NVPTX/mbarrier_arr_relaxed.ll
new file mode 100644
index 0000000000000..e4d2aa21f7def
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/mbarrier_arr_relaxed.ll
@@ -0,0 +1,165 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx86| FileCheck --check-prefixes=CHECK-PTX64 %s
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx86 --nvptx-short-ptr| FileCheck --check-prefixes=CHECK-PTX-SHARED32 %s
+; RUN: %if ptxas-sm_90 && ptxas-isa-8.6 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx86| %ptxas-verify -arch=sm_90 %}
+; RUN: %if ptxas-sm_90 && ptxas-isa-8.6 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx86 --nvptx-short-ptr| %ptxas-verify -arch=sm_90 %}
+
+; === space_cta (addrspace 3) ===
+define void @test_mbarrier_arrive_relaxed_scope_cta_space_cta(ptr addrspace(3) %mbar, i32 %tx) {
+; CHECK-PTX64-LABEL: test_mbarrier_arrive_relaxed_scope_cta_space_cta(
+; CHECK-PTX64: {
+; CHECK-PTX64-NEXT: .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT: .reg .b64 %rd<6>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT: // %bb.0:
+; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [test_mbarrier_arrive_relaxed_scope_cta_space_cta_param_0];
+; CHECK-PTX64-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_relaxed_scope_cta_space_cta_param_1];
+; CHECK-PTX64-NEXT: mbarrier.arrive.relaxed.cta.shared.b64 %rd2, [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.arrive.expect_tx.relaxed.cta.shared.b64 %rd3, [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.arrive_drop.relaxed.cta.shared.b64 %rd4, [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.arrive_drop.expect_tx.relaxed.cta.shared.b64 %rd5, [%rd1], %r1;
+; CHECK-PTX64-NEXT: ret;
+;
+; CHECK-PTX-SHARED32-LABEL: test_mbarrier_arrive_relaxed_scope_cta_space_cta(
+; CHECK-PTX-SHARED32: {
+; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-NEXT: .reg .b64 %rd<5>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT: // %bb.0:
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_relaxed_scope_cta_space_cta_param_0];
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [test_mbarrier_arrive_relaxed_scope_cta_space_cta_param_1];
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.relaxed.cta.shared.b64 %rd1, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.expect_tx.relaxed.cta.shared.b64 %rd2, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.relaxed.cta.shared.b64 %rd3, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.expect_tx.relaxed.cta.shared.b64 %rd4, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: ret;
+ %r1 = call i64 @llvm.nvvm.mbarrier.arrive.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+ %r2 = call i64 @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+ %r3 = call i64 @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+ %r4 = call i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+ ret void
+}
+
+define void @test_mbarrier_arrive_relaxed_scope_cluster_space_cta(ptr addrspace(3) %mbar, i32 %tx) {
+; CHECK-PTX64-LABEL: test_mbarrier_arrive_relaxed_scope_cluster_space_cta(
+; CHECK-PTX64: {
+; CHECK-PTX64-NEXT: .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT: .reg .b64 %rd<6>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT: // %bb.0:
+; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [test_mbarrier_arrive_relaxed_scope_cluster_space_cta_param_0];
+; CHECK-PTX64-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_relaxed_scope_cluster_space_cta_param_1];
+; CHECK-PTX64-NEXT: mbarrier.arrive.relaxed.cluster.shared.b64 %rd2, [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.arrive.expect_tx.relaxed.cluster.shared.b64 %rd3, [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.arrive_drop.relaxed.cluster.shared.b64 %rd4, [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.arrive_drop.expect_tx.relaxed.cluster.shared.b64 %rd5, [%rd1], %r1;
+; CHECK-PTX64-NEXT: ret;
+;
+; CHECK-PTX-SHARED32-LABEL: test_mbarrier_arrive_relaxed_scope_cluster_space_cta(
+; CHECK-PTX-SHARED32: {
+; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-NEXT: .reg .b64 %rd<5>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT: // %bb.0:
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_relaxed_scope_cluster_space_cta_param_0];
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [test_mbarrier_arrive_relaxed_scope_cluster_space_cta_param_1];
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.relaxed.cluster.shared.b64 %rd1, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.expect_tx.relaxed.cluster.shared.b64 %rd2, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.relaxed.cluster.shared.b64 %rd3, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.expect_tx.relaxed.cluster.shared.b64 %rd4, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: ret;
+ %r1 = call i64 @llvm.nvvm.mbarrier.arrive.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+ %r2 = call i64 @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+ %r3 = call i64 @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+ %r4 = call i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+ ret void
+}
+
+; === space_cluster (addrspace 7) ===
+define void @test_mbarrier_arrive_relaxed_scope_cta_space_cluster(ptr addrspace(7) %mbar, i32 %tx) {
+; CHECK-PTX64-LABEL: test_mbarrier_arrive_relaxed_scope_cta_space_cluster(
+; CHECK-PTX64: {
+; CHECK-PTX64-NEXT: .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT: .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT: // %bb.0:
+; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [test_mbarrier_arrive_relaxed_scope_cta_space_cluster_param_0];
+; CHECK-PTX64-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_relaxed_scope_cta_space_cluster_param_1];
+; CHECK-PTX64-NEXT: mbarrier.arrive.relaxed.cta.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.arrive.expect_tx.relaxed.cta.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.arrive_drop.relaxed.cta.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.arrive_drop.expect_tx.relaxed.cta.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT: ret;
+;
+; CHECK-PTX-SHARED32-LABEL: test_mbarrier_arrive_relaxed_scope_cta_space_cluster(
+; CHECK-PTX-SHARED32: {
+; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT: // %bb.0:
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_relaxed_scope_cta_space_cluster_param_0];
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [test_mbarrier_arrive_relaxed_scope_cta_space_cluster_param_1];
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.relaxed.cta.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.expect_tx.relaxed.cta.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.relaxed.cta.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.expect_tx.relaxed.cta.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: ret;
+ call void @llvm.nvvm.mbarrier.arrive.relaxed.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+ call void @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+ call void @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+ call void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+ ret void
+}
+
+define void @test_mbarrier_arrive_relaxed_scope_cluster_space_cluster(ptr addrspace(7) %mbar, i32 %tx) {
+; CHECK-PTX64-LABEL: test_mbarrier_arrive_relaxed_scope_cluster_space_cluster(
+; CHECK-PTX64: {
+; CHECK-PTX64-NEXT: .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT: .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT: // %bb.0:
+; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [test_mbarrier_arrive_relaxed_scope_cluster_space_cluster_param_0];
+; CHECK-PTX64-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_relaxed_scope_cluster_space_cluster_param_1];
+; CHECK-PTX64-NEXT: mbarrier.arrive.relaxed.cluster.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.arrive.expect_tx.relaxed.cluster.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.arrive_drop.relaxed.cluster.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.arrive_drop.expect_tx.relaxed.cluster.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT: ret;
+;
+; CHECK-PTX-SHARED32-LABEL: test_mbarrier_arrive_relaxed_scope_cluster_space_cluster(
+; CHECK-PTX-SHARED32: {
+; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT: // %bb.0:
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [test_mbarrier_arrive_relaxed_scope_cluster_space_cluster_param_0];
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [test_mbarrier_arrive_relaxed_scope_cluster_space_cluster_param_1];
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.relaxed.cluster.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive.expect_tx.relaxed.cluster.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.relaxed.cluster.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.arrive_drop.expect_tx.relaxed.cluster.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: ret;
+ call void @llvm.nvvm.mbarrier.arrive.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+ call void @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+ call void @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+ call void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+ ret void
+}
+
+declare i64 @llvm.nvvm.mbarrier.arrive.relaxed.scope.cta.space.cta(ptr addrspace(3), i32)
+declare i64 @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cta.space.cta(ptr addrspace(3), i32)
+declare i64 @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cta.space.cta(ptr addrspace(3), i32)
+declare i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cta.space.cta(ptr addrspace(3), i32)
+
+declare i64 @llvm.nvvm.mbarrier.arrive.relaxed.scope.cluster.space.cta(ptr addrspace(3), i32)
+declare i64 @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cluster.space.cta(ptr addrspace(3), i32)
+declare i64 @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cluster.space.cta(ptr addrspace(3), i32)
+declare i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cluster.space.cta(ptr addrspace(3), i32)
+
+declare void @llvm.nvvm.mbarrier.arrive.relaxed.scope.cta.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cta.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cta.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cta.space.cluster(ptr addrspace(7), i32)
+
+declare void @llvm.nvvm.mbarrier.arrive.relaxed.scope.cluster.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cluster.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cluster.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cluster.space.cluster(ptr addrspace(7), i32)
diff --git a/llvm/test/CodeGen/NVPTX/mbarrier_tx.ll b/llvm/test/CodeGen/NVPTX/mbarrier_tx.ll
new file mode 100644
index 0000000000000..441ade3351206
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/mbarrier_tx.ll
@@ -0,0 +1,87 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80| FileCheck --check-prefixes=CHECK-PTX64 %s
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80 --nvptx-short-ptr| FileCheck --check-prefixes=CHECK-PTX-SHARED32 %s
+; RUN: %if ptxas-sm_90 && ptxas-isa-8.0 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80| %ptxas-verify -arch=sm_90 %}
+; RUN: %if ptxas-sm_90 && ptxas-isa-8.0 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80 --nvptx-short-ptr| %ptxas-verify -arch=sm_90 %}
+
+declare void @llvm.nvvm.mbarrier.expect.tx.scope.cta.space.cta(ptr addrspace(3), i32)
+declare void @llvm.nvvm.mbarrier.expect.tx.scope.cluster.space.cta(ptr addrspace(3), i32)
+declare void @llvm.nvvm.mbarrier.complete.tx.scope.cta.space.cta(ptr addrspace(3), i32)
+declare void @llvm.nvvm.mbarrier.complete.tx.scope.cluster.space.cta(ptr addrspace(3), i32)
+
+declare void @llvm.nvvm.mbarrier.expect.tx.scope.cta.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.expect.tx.scope.cluster.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.complete.tx.scope.cta.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.complete.tx.scope.cluster.space.cluster(ptr addrspace(7), i32)
+
+define void @test_mbarrier_tx_space_cta(ptr addrspace(3) %mbar, i32 %tx_count) {
+; CHECK-PTX64-LABEL: test_mbarrier_tx_space_cta(
+; CHECK-PTX64: {
+; CHECK-PTX64-NEXT: .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT: .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT: // %bb.0:
+; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [test_mbarrier_tx_space_cta_param_0];
+; CHECK-PTX64-NEXT: ld.param.b32 %r1, [test_mbarrier_tx_space_cta_param_1];
+; CHECK-PTX64-NEXT: mbarrier.expect_tx.relaxed.cta.shared.b64 [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.expect_tx.relaxed.cluster.shared.b64 [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.complete_tx.relaxed.cta.shared.b64 [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.complete_tx.relaxed.cluster.shared.b64 [%rd1], %r1;
+; CHECK-PTX64-NEXT: ret;
+;
+; CHECK-PTX-SHARED32-LABEL: test_mbarrier_tx_space_cta(
+; CHECK-PTX-SHARED32: {
+; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT: // %bb.0:
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [test_mbarrier_tx_space_cta_param_0];
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [test_mbarrier_tx_space_cta_param_1];
+; CHECK-PTX-SHARED32-NEXT: mbarrier.expect_tx.relaxed.cta.shared.b64 [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.expect_tx.relaxed.cluster.shared.b64 [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.complete_tx.relaxed.cta.shared.b64 [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.complete_tx.relaxed.cluster.shared.b64 [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: ret;
+ tail call void @llvm.nvvm.mbarrier.expect.tx.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx_count)
+ tail call void @llvm.nvvm.mbarrier.expect.tx.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx_count)
+
+ tail call void @llvm.nvvm.mbarrier.complete.tx.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx_count)
+ tail call void @llvm.nvvm.mbarrier.complete.tx.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx_count)
+
+ ret void
+}
+
+define void @test_mbarrier_tx_space_cluster(ptr addrspace(7) %mbar, i32 %tx_count) {
+; CHECK-PTX64-LABEL: test_mbarrier_tx_space_cluster(
+; CHECK-PTX64: {
+; CHECK-PTX64-NEXT: .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT: .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT: // %bb.0:
+; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [test_mbarrier_tx_space_cluster_param_0];
+; CHECK-PTX64-NEXT: ld.param.b32 %r1, [test_mbarrier_tx_space_cluster_param_1];
+; CHECK-PTX64-NEXT: mbarrier.expect_tx.relaxed.cta.shared::cluster.b64 [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.expect_tx.relaxed.cluster.shared::cluster.b64 [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.complete_tx.relaxed.cta.shared::cluster.b64 [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.complete_tx.relaxed.cluster.shared::cluster.b64 [%rd1], %r1;
+; CHECK-PTX64-NEXT: ret;
+;
+; CHECK-PTX-SHARED32-LABEL: test_mbarrier_tx_space_cluster(
+; CHECK-PTX-SHARED32: {
+; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT: // %bb.0:
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [test_mbarrier_tx_space_cluster_param_0];
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [test_mbarrier_tx_space_cluster_param_1];
+; CHECK-PTX-SHARED32-NEXT: mbarrier.expect_tx.relaxed.cta.shared::cluster.b64 [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.expect_tx.relaxed.cluster.shared::cluster.b64 [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.complete_tx.relaxed.cta.shared::cluster.b64 [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.complete_tx.relaxed.cluster.shared::cluster.b64 [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: ret;
+ tail call void @llvm.nvvm.mbarrier.expect.tx.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx_count)
+ tail call void @llvm.nvvm.mbarrier.expect.tx.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx_count)
+
+ tail call void @llvm.nvvm.mbarrier.complete.tx.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx_count)
+ tail call void @llvm.nvvm.mbarrier.complete.tx.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx_count)
+
+ ret void
+}
diff --git a/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm80_ptx70.ll b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm80_ptx70.ll
new file mode 100644
index 0000000000000..5130ae2bfea67
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm80_ptx70.ll
@@ -0,0 +1,35 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx70| FileCheck --check-prefixes=CHECK-PTX64 %s
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx70 --nvptx-short-ptr| FileCheck --check-prefixes=CHECK-PTX-SHARED32 %s
+; RUN: %if ptxas-sm_80 && ptxas-isa-7.0 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx70| %ptxas-verify -arch=sm_80 %}
+; RUN: %if ptxas-sm_80 && ptxas-isa-7.0 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx70 --nvptx-short-ptr| %ptxas-verify -arch=sm_80 %}
+
+declare i1 @llvm.nvvm.mbarrier.test.wait.scope.cta.space.cta(ptr addrspace(3), i64)
+
+define void @mbar_test_wait(ptr addrspace(3) %mbar, i64 %state) {
+; CHECK-PTX64-LABEL: mbar_test_wait(
+; CHECK-PTX64: {
+; CHECK-PTX64-NEXT: .reg .pred %p<2>;
+; CHECK-PTX64-NEXT: .reg .b64 %rd<3>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT: // %bb.0:
+; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [mbar_test_wait_param_0];
+; CHECK-PTX64-NEXT: ld.param.b64 %rd2, [mbar_test_wait_param_1];
+; CHECK-PTX64-NEXT: mbarrier.test_wait.shared.b64 %p1, [%rd1], %rd2;
+; CHECK-PTX64-NEXT: ret;
+;
+; CHECK-PTX-SHARED32-LABEL: mbar_test_wait(
+; CHECK-PTX-SHARED32: {
+; CHECK-PTX-SHARED32-NEXT: .reg .pred %p<2>;
+; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<2>;
+; CHECK-PTX-SHARED32-NEXT: .reg .b64 %rd<2>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT: // %bb.0:
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [mbar_test_wait_param_0];
+; CHECK-PTX-SHARED32-NEXT: ld.param.b64 %rd1, [mbar_test_wait_param_1];
+; CHECK-PTX-SHARED32-NEXT: mbarrier.test_wait.shared.b64 %p1, [%r1], %rd1;
+; CHECK-PTX-SHARED32-NEXT: ret;
+ %ret0 = call i1 @llvm.nvvm.mbarrier.test.wait.scope.cta.space.cta(ptr addrspace(3) %mbar, i64 %state)
+
+ ret void
+}
diff --git a/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm80_ptx71.ll b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm80_ptx71.ll
new file mode 100644
index 0000000000000..9327e7908cabd
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm80_ptx71.ll
@@ -0,0 +1,36 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx71| FileCheck --check-prefixes=CHECK-PTX64 %s
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx71 --nvptx-short-ptr| FileCheck --check-prefixes=CHECK-PTX-SHARED32 %s
+; RUN: %if ptxas-sm_80 && ptxas-isa-7.1 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx71| %ptxas-verify -arch=sm_80 %}
+; RUN: %if ptxas-sm_80 && ptxas-isa-7.1 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx71 --nvptx-short-ptr| %ptxas-verify -arch=sm_80 %}
+
+; --- test.wait.parity ---
+declare i1 @llvm.nvvm.mbarrier.test.wait.parity.scope.cta.space.cta(ptr addrspace(3), i32)
+
+define void @mbar_test_wait(ptr addrspace(3) %mbar, i32 %parity) {
+; CHECK-PTX64-LABEL: mbar_test_wait(
+; CHECK-PTX64: {
+; CHECK-PTX64-NEXT: .reg .pred %p<2>;
+; CHECK-PTX64-NEXT: .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT: .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT: // %bb.0:
+; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [mbar_test_wait_param_0];
+; CHECK-PTX64-NEXT: ld.param.b32 %r1, [mbar_test_wait_param_1];
+; CHECK-PTX64-NEXT: mbarrier.test_wait.parity.shared.b64 %p1, [%rd1], %r1;
+; CHECK-PTX64-NEXT: ret;
+;
+; CHECK-PTX-SHARED32-LABEL: mbar_test_wait(
+; CHECK-PTX-SHARED32: {
+; CHECK-PTX-SHARED32-NEXT: .reg .pred %p<2>;
+; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT: // %bb.0:
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [mbar_test_wait_param_0];
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [mbar_test_wait_param_1];
+; CHECK-PTX-SHARED32-NEXT: mbarrier.test_wait.parity.shared.b64 %p1, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: ret;
+ %ret0 = call i1 @llvm.nvvm.mbarrier.test.wait.parity.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %parity)
+
+ ret void
+}
diff --git a/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx78.ll b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx78.ll
new file mode 100644
index 0000000000000..9b19ad5f26026
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx78.ll
@@ -0,0 +1,83 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx78| FileCheck --check-prefixes=CHECK-PTX64 %s
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx78 --nvptx-short-ptr| FileCheck --check-prefixes=CHECK-PTX-SHARED32 %s
+; RUN: %if ptxas-sm_90 && ptxas-isa-7.8 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx78| %ptxas-verify -arch=sm_90 %}
+; RUN: %if ptxas-sm_90 && ptxas-isa-7.8 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx78 --nvptx-short-ptr| %ptxas-verify -arch=sm_90 %}
+
+; --- try.wait without timelimit ---
+declare i1 @llvm.nvvm.mbarrier.try.wait.scope.cta.space.cta(ptr addrspace(3), i64)
+declare i1 @llvm.nvvm.mbarrier.try.wait.parity.scope.cta.space.cta(ptr addrspace(3), i32)
+
+; --- try.wait with timelimit ---
+declare i1 @llvm.nvvm.mbarrier.try.wait.tl.scope.cta.space.cta(ptr addrspace(3), i64, i32)
+declare i1 @llvm.nvvm.mbarrier.try.wait.parity.tl.scope.cta.space.cta(ptr addrspace(3), i32, i32)
+
+define void @mbar_try_wait(ptr addrspace(3) %mbar, i64 %state, i32 %parity) {
+; CHECK-PTX64-LABEL: mbar_try_wait(
+; CHECK-PTX64: {
+; CHECK-PTX64-NEXT: .reg .pred %p<3>;
+; CHECK-PTX64-NEXT: .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT: .reg .b64 %rd<3>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT: // %bb.0:
+; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [mbar_try_wait_param_0];
+; CHECK-PTX64-NEXT: ld.param.b64 %rd2, [mbar_try_wait_param_1];
+; CHECK-PTX64-NEXT: mbarrier.try_wait.shared.b64 %p1, [%rd1], %rd2;
+; CHECK-PTX64-NEXT: ld.param.b32 %r1, [mbar_try_wait_param_2];
+; CHECK-PTX64-NEXT: mbarrier.try_wait.parity.shared.b64 %p2, [%rd1], %r1;
+; CHECK-PTX64-NEXT: ret;
+;
+; CHECK-PTX-SHARED32-LABEL: mbar_try_wait(
+; CHECK-PTX-SHARED32: {
+; CHECK-PTX-SHARED32-NEXT: .reg .pred %p<3>;
+; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-NEXT: .reg .b64 %rd<2>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT: // %bb.0:
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [mbar_try_wait_param_0];
+; CHECK-PTX-SHARED32-NEXT: ld.param.b64 %rd1, [mbar_try_wait_param_1];
+; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.shared.b64 %p1, [%r1], %rd1;
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [mbar_try_wait_param_2];
+; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.parity.shared.b64 %p2, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: ret;
+ %ret0 = call i1 @llvm.nvvm.mbarrier.try.wait.scope.cta.space.cta(ptr addrspace(3) %mbar, i64 %state)
+ %ret1 = call i1 @llvm.nvvm.mbarrier.try.wait.parity.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %parity)
+
+ ret void
+}
+
+define void @mbar_try_wait_tl(ptr addrspace(3) %mbar, i64 %state, i32 %parity, i32 %tl) {
+; CHECK-PTX64-LABEL: mbar_try_wait_tl(
+; CHECK-PTX64: {
+; CHECK-PTX64-NEXT: .reg .pred %p<3>;
+; CHECK-PTX64-NEXT: .reg .b32 %r<3>;
+; CHECK-PTX64-NEXT: .reg .b64 %rd<3>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT: // %bb.0:
+; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [mbar_try_wait_tl_param_0];
+; CHECK-PTX64-NEXT: ld.param.b64 %rd2, [mbar_try_wait_tl_param_1];
+; CHECK-PTX64-NEXT: ld.param.b32 %r1, [mbar_try_wait_tl_param_3];
+; CHECK-PTX64-NEXT: mbarrier.try_wait.shared.b64 %p1, [%rd1], %rd2, %r1;
+; CHECK-PTX64-NEXT: ld.param.b32 %r2, [mbar_try_wait_tl_param_2];
+; CHECK-PTX64-NEXT: mbarrier.try_wait.parity.shared.b64 %p2, [%rd1], %r2, %r1;
+; CHECK-PTX64-NEXT: ret;
+;
+; CHECK-PTX-SHARED32-LABEL: mbar_try_wait_tl(
+; CHECK-PTX-SHARED32: {
+; CHECK-PTX-SHARED32-NEXT: .reg .pred %p<3>;
+; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<4>;
+; CHECK-PTX-SHARED32-NEXT: .reg .b64 %rd<2>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT: // %bb.0:
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [mbar_try_wait_tl_param_0];
+; CHECK-PTX-SHARED32-NEXT: ld.param.b64 %rd1, [mbar_try_wait_tl_param_1];
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [mbar_try_wait_tl_param_3];
+; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.shared.b64 %p1, [%r1], %rd1, %r2;
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r3, [mbar_try_wait_tl_param_2];
+; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.parity.shared.b64 %p2, [%r1], %r3, %r2;
+; CHECK-PTX-SHARED32-NEXT: ret;
+ %ret0 = call i1 @llvm.nvvm.mbarrier.try.wait.tl.scope.cta.space.cta(ptr addrspace(3) %mbar, i64 %state, i32 %tl)
+ %ret1 = call i1 @llvm.nvvm.mbarrier.try.wait.parity.tl.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %parity, i32 %tl)
+
+ ret void
+}
diff --git a/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx80.ll b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx80.ll
new file mode 100644
index 0000000000000..034953ddb3072
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx80.ll
@@ -0,0 +1,123 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80| FileCheck --check-prefixes=CHECK-PTX64 %s
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80 --nvptx-short-ptr| FileCheck --check-prefixes=CHECK-PTX-SHARED32 %s
+; RUN: %if ptxas-sm_90 && ptxas-isa-8.0 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80| %ptxas-verify -arch=sm_90 %}
+; RUN: %if ptxas-sm_90 && ptxas-isa-8.0 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80 --nvptx-short-ptr| %ptxas-verify -arch=sm_90 %}
+
+; With sm_90 and ptx80, the cluster-scope variants of these intrinsics are supported.
+
+; --- test.wait ---
+declare i1 @llvm.nvvm.mbarrier.test.wait.parity.scope.cluster.space.cta(ptr addrspace(3), i32)
+declare i1 @llvm.nvvm.mbarrier.test.wait.scope.cluster.space.cta(ptr addrspace(3), i64)
+
+; --- try.wait without timelimit ---
+declare i1 @llvm.nvvm.mbarrier.try.wait.scope.cluster.space.cta(ptr addrspace(3), i64)
+declare i1 @llvm.nvvm.mbarrier.try.wait.parity.scope.cluster.space.cta(ptr addrspace(3), i32)
+
+; --- try.wait with timelimit ---
+declare i1 @llvm.nvvm.mbarrier.try.wait.tl.scope.cluster.space.cta(ptr addrspace(3), i64, i32)
+declare i1 @llvm.nvvm.mbarrier.try.wait.parity.tl.scope.cluster.space.cta(ptr addrspace(3), i32, i32)
+
+define void @mbar_test_wait(ptr addrspace(3) %mbar, i64 %state, i32 %parity) {
+; CHECK-PTX64-LABEL: mbar_test_wait(
+; CHECK-PTX64: {
+; CHECK-PTX64-NEXT: .reg .pred %p<3>;
+; CHECK-PTX64-NEXT: .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT: .reg .b64 %rd<3>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT: // %bb.0:
+; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [mbar_test_wait_param_0];
+; CHECK-PTX64-NEXT: ld.param.b64 %rd2, [mbar_test_wait_param_1];
+; CHECK-PTX64-NEXT: mbarrier.test_wait.acquire.cluster.shared.b64 %p1, [%rd1], %rd2;
+; CHECK-PTX64-NEXT: ld.param.b32 %r1, [mbar_test_wait_param_2];
+; CHECK-PTX64-NEXT: mbarrier.test_wait.parity.acquire.cluster.shared.b64 %p2, [%rd1], %r1;
+; CHECK-PTX64-NEXT: ret;
+;
+; CHECK-PTX-SHARED32-LABEL: mbar_test_wait(
+; CHECK-PTX-SHARED32: {
+; CHECK-PTX-SHARED32-NEXT: .reg .pred %p<3>;
+; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-NEXT: .reg .b64 %rd<2>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT: // %bb.0:
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [mbar_test_wait_param_0];
+; CHECK-PTX-SHARED32-NEXT: ld.param.b64 %rd1, [mbar_test_wait_param_1];
+; CHECK-PTX-SHARED32-NEXT: mbarrier.test_wait.acquire.cluster.shared.b64 %p1, [%r1], %rd1;
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [mbar_test_wait_param_2];
+; CHECK-PTX-SHARED32-NEXT: mbarrier.test_wait.parity.acquire.cluster.shared.b64 %p2, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: ret;
+ %ret0 = call i1 @llvm.nvvm.mbarrier.test.wait.scope.cluster.space.cta(ptr addrspace(3) %mbar, i64 %state)
+ %ret1 = call i1 @llvm.nvvm.mbarrier.test.wait.parity.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %parity)
+
+ ret void
+}
+
+define void @mbar_try_wait(ptr addrspace(3) %mbar, i64 %state, i32 %parity) {
+; CHECK-PTX64-LABEL: mbar_try_wait(
+; CHECK-PTX64: {
+; CHECK-PTX64-NEXT: .reg .pred %p<3>;
+; CHECK-PTX64-NEXT: .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT: .reg .b64 %rd<3>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT: // %bb.0:
+; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [mbar_try_wait_param_0];
+; CHECK-PTX64-NEXT: ld.param.b64 %rd2, [mbar_try_wait_param_1];
+; CHECK-PTX64-NEXT: mbarrier.try_wait.acquire.cluster.shared.b64 %p1, [%rd1], %rd2;
+; CHECK-PTX64-NEXT: ld.param.b32 %r1, [mbar_try_wait_param_2];
+; CHECK-PTX64-NEXT: mbarrier.try_wait.parity.acquire.cluster.shared.b64 %p2, [%rd1], %r1;
+; CHECK-PTX64-NEXT: ret;
+;
+; CHECK-PTX-SHARED32-LABEL: mbar_try_wait(
+; CHECK-PTX-SHARED32: {
+; CHECK-PTX-SHARED32-NEXT: .reg .pred %p<3>;
+; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-NEXT: .reg .b64 %rd<2>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT: // %bb.0:
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [mbar_try_wait_param_0];
+; CHECK-PTX-SHARED32-NEXT: ld.param.b64 %rd1, [mbar_try_wait_param_1];
+; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.acquire.cluster.shared.b64 %p1, [%r1], %rd1;
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [mbar_try_wait_param_2];
+; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.parity.acquire.cluster.shared.b64 %p2, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: ret;
+ %ret0 = call i1 @llvm.nvvm.mbarrier.try.wait.scope.cluster.space.cta(ptr addrspace(3) %mbar, i64 %state)
+ %ret1 = call i1 @llvm.nvvm.mbarrier.try.wait.parity.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %parity)
+
+ ret void
+}
+
+define void @mbar_try_wait_tl(ptr addrspace(3) %mbar, i64 %state, i32 %parity, i32 %tl) {
+; CHECK-PTX64-LABEL: mbar_try_wait_tl(
+; CHECK-PTX64: {
+; CHECK-PTX64-NEXT: .reg .pred %p<3>;
+; CHECK-PTX64-NEXT: .reg .b32 %r<3>;
+; CHECK-PTX64-NEXT: .reg .b64 %rd<3>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT: // %bb.0:
+; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [mbar_try_wait_tl_param_0];
+; CHECK-PTX64-NEXT: ld.param.b64 %rd2, [mbar_try_wait_tl_param_1];
+; CHECK-PTX64-NEXT: ld.param.b32 %r1, [mbar_try_wait_tl_param_3];
+; CHECK-PTX64-NEXT: mbarrier.try_wait.acquire.cluster.shared.b64 %p1, [%rd1], %rd2, %r1;
+; CHECK-PTX64-NEXT: ld.param.b32 %r2, [mbar_try_wait_tl_param_2];
+; CHECK-PTX64-NEXT: mbarrier.try_wait.parity.acquire.cluster.shared.b64 %p2, [%rd1], %r2, %r1;
+; CHECK-PTX64-NEXT: ret;
+;
+; CHECK-PTX-SHARED32-LABEL: mbar_try_wait_tl(
+; CHECK-PTX-SHARED32: {
+; CHECK-PTX-SHARED32-NEXT: .reg .pred %p<3>;
+; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<4>;
+; CHECK-PTX-SHARED32-NEXT: .reg .b64 %rd<2>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT: // %bb.0:
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [mbar_try_wait_tl_param_0];
+; CHECK-PTX-SHARED32-NEXT: ld.param.b64 %rd1, [mbar_try_wait_tl_param_1];
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [mbar_try_wait_tl_param_3];
+; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.acquire.cluster.shared.b64 %p1, [%r1], %rd1, %r2;
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r3, [mbar_try_wait_tl_param_2];
+; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.parity.acquire.cluster.shared.b64 %p2, [%r1], %r3, %r2;
+; CHECK-PTX-SHARED32-NEXT: ret;
+ %ret0 = call i1 @llvm.nvvm.mbarrier.try.wait.tl.scope.cluster.space.cta(ptr addrspace(3) %mbar, i64 %state, i32 %tl)
+ %ret1 = call i1 @llvm.nvvm.mbarrier.try.wait.parity.tl.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %parity, i32 %tl)
+
+ ret void
+}
diff --git a/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx86.ll b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx86.ll
new file mode 100644
index 0000000000000..652634b67da98
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx86.ll
@@ -0,0 +1,148 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx86| FileCheck --check-prefixes=CHECK-PTX64 %s
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx86 --nvptx-short-ptr| FileCheck --check-prefixes=CHECK-PTX-SHARED32 %s
+; RUN: %if ptxas-sm_90 && ptxas-isa-8.6 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx86| %ptxas-verify -arch=sm_90 %}
+; RUN: %if ptxas-sm_90 && ptxas-isa-8.6 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx86 --nvptx-short-ptr| %ptxas-verify -arch=sm_90 %}
+
+; --- test.wait ---
+declare i1 @llvm.nvvm.mbarrier.test.wait.parity.relaxed.scope.cta.space.cta(ptr addrspace(3), i32)
+declare i1 @llvm.nvvm.mbarrier.test.wait.parity.relaxed.scope.cluster.space.cta(ptr addrspace(3), i32)
+declare i1 @llvm.nvvm.mbarrier.test.wait.relaxed.scope.cta.space.cta(ptr addrspace(3), i64)
+declare i1 @llvm.nvvm.mbarrier.test.wait.relaxed.scope.cluster.space.cta(ptr addrspace(3), i64)
+
+; --- try.wait without timelimit ---
+declare i1 @llvm.nvvm.mbarrier.try.wait.relaxed.scope.cta.space.cta(ptr addrspace(3), i64)
+declare i1 @llvm.nvvm.mbarrier.try.wait.relaxed.scope.cluster.space.cta(ptr addrspace(3), i64)
+declare i1 @llvm.nvvm.mbarrier.try.wait.parity.relaxed.scope.cta.space.cta(ptr addrspace(3), i32)
+declare i1 @llvm.nvvm.mbarrier.try.wait.parity.relaxed.scope.cluster.space.cta(ptr addrspace(3), i32)
+
+; --- try.wait with timelimit ---
+declare i1 @llvm.nvvm.mbarrier.try.wait.tl.relaxed.scope.cta.space.cta(ptr addrspace(3), i64, i32)
+declare i1 @llvm.nvvm.mbarrier.try.wait.tl.relaxed.scope.cluster.space.cta(ptr addrspace(3), i64, i32)
+declare i1 @llvm.nvvm.mbarrier.try.wait.parity.tl.relaxed.scope.cta.space.cta(ptr addrspace(3), i32, i32)
+declare i1 @llvm.nvvm.mbarrier.try.wait.parity.tl.relaxed.scope.cluster.space.cta(ptr addrspace(3), i32, i32)
+
+define void @mbar_test_wait(ptr addrspace(3) %mbar, i64 %state, i32 %parity) {
+; CHECK-PTX64-LABEL: mbar_test_wait(
+; CHECK-PTX64: {
+; CHECK-PTX64-NEXT: .reg .pred %p<5>;
+; CHECK-PTX64-NEXT: .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT: .reg .b64 %rd<3>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT: // %bb.0:
+; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [mbar_test_wait_param_0];
+; CHECK-PTX64-NEXT: ld.param.b64 %rd2, [mbar_test_wait_param_1];
+; CHECK-PTX64-NEXT: mbarrier.test_wait.relaxed.cta.shared.b64 %p1, [%rd1], %rd2;
+; CHECK-PTX64-NEXT: mbarrier.test_wait.relaxed.cluster.shared.b64 %p2, [%rd1], %rd2;
+; CHECK-PTX64-NEXT: ld.param.b32 %r1, [mbar_test_wait_param_2];
+; CHECK-PTX64-NEXT: mbarrier.test_wait.parity.relaxed.cta.shared.b64 %p3, [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.test_wait.parity.relaxed.cluster.shared.b64 %p4, [%rd1], %r1;
+; CHECK-PTX64-NEXT: ret;
+;
+; CHECK-PTX-SHARED32-LABEL: mbar_test_wait(
+; CHECK-PTX-SHARED32: {
+; CHECK-PTX-SHARED32-NEXT: .reg .pred %p<5>;
+; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-NEXT: .reg .b64 %rd<2>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT: // %bb.0:
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [mbar_test_wait_param_0];
+; CHECK-PTX-SHARED32-NEXT: ld.param.b64 %rd1, [mbar_test_wait_param_1];
+; CHECK-PTX-SHARED32-NEXT: mbarrier.test_wait.relaxed.cta.shared.b64 %p1, [%r1], %rd1;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.test_wait.relaxed.cluster.shared.b64 %p2, [%r1], %rd1;
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [mbar_test_wait_param_2];
+; CHECK-PTX-SHARED32-NEXT: mbarrier.test_wait.parity.relaxed.cta.shared.b64 %p3, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.test_wait.parity.relaxed.cluster.shared.b64 %p4, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: ret;
+ %ret0 = call i1 @llvm.nvvm.mbarrier.test.wait.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i64 %state)
+ %ret1 = call i1 @llvm.nvvm.mbarrier.test.wait.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i64 %state)
+
+ %ret2 = call i1 @llvm.nvvm.mbarrier.test.wait.parity.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %parity)
+ %ret3 = call i1 @llvm.nvvm.mbarrier.test.wait.parity.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %parity)
+
+ ret void
+}
+
+define void @mbar_try_wait(ptr addrspace(3) %mbar, i64 %state, i32 %parity) {
+; CHECK-PTX64-LABEL: mbar_try_wait(
+; CHECK-PTX64: {
+; CHECK-PTX64-NEXT: .reg .pred %p<5>;
+; CHECK-PTX64-NEXT: .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT: .reg .b64 %rd<3>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT: // %bb.0:
+; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [mbar_try_wait_param_0];
+; CHECK-PTX64-NEXT: ld.param.b64 %rd2, [mbar_try_wait_param_1];
+; CHECK-PTX64-NEXT: mbarrier.try_wait.relaxed.cta.shared.b64 %p1, [%rd1], %rd2;
+; CHECK-PTX64-NEXT: mbarrier.try_wait.relaxed.cluster.shared.b64 %p2, [%rd1], %rd2;
+; CHECK-PTX64-NEXT: ld.param.b32 %r1, [mbar_try_wait_param_2];
+; CHECK-PTX64-NEXT: mbarrier.try_wait.parity.relaxed.cta.shared.b64 %p3, [%rd1], %r1;
+; CHECK-PTX64-NEXT: mbarrier.try_wait.parity.relaxed.cluster.shared.b64 %p4, [%rd1], %r1;
+; CHECK-PTX64-NEXT: ret;
+;
+; CHECK-PTX-SHARED32-LABEL: mbar_try_wait(
+; CHECK-PTX-SHARED32: {
+; CHECK-PTX-SHARED32-NEXT: .reg .pred %p<5>;
+; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-NEXT: .reg .b64 %rd<2>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT: // %bb.0:
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [mbar_try_wait_param_0];
+; CHECK-PTX-SHARED32-NEXT: ld.param.b64 %rd1, [mbar_try_wait_param_1];
+; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.relaxed.cta.shared.b64 %p1, [%r1], %rd1;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.relaxed.cluster.shared.b64 %p2, [%r1], %rd1;
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [mbar_try_wait_param_2];
+; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.parity.relaxed.cta.shared.b64 %p3, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.parity.relaxed.cluster.shared.b64 %p4, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT: ret;
+ %ret0 = call i1 @llvm.nvvm.mbarrier.try.wait.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i64 %state)
+ %ret1 = call i1 @llvm.nvvm.mbarrier.try.wait.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i64 %state)
+
+ %ret2 = call i1 @llvm.nvvm.mbarrier.try.wait.parity.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %parity)
+ %ret3 = call i1 @llvm.nvvm.mbarrier.try.wait.parity.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %parity)
+
+ ret void
+}
+
+define void @mbar_try_wait_tl(ptr addrspace(3) %mbar, i64 %state, i32 %parity, i32 %tl) {
+; CHECK-PTX64-LABEL: mbar_try_wait_tl(
+; CHECK-PTX64: {
+; CHECK-PTX64-NEXT: .reg .pred %p<5>;
+; CHECK-PTX64-NEXT: .reg .b32 %r<3>;
+; CHECK-PTX64-NEXT: .reg .b64 %rd<3>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT: // %bb.0:
+; CHECK-PTX64-NEXT: ld.param.b64 %rd1, [mbar_try_wait_tl_param_0];
+; CHECK-PTX64-NEXT: ld.param.b64 %rd2, [mbar_try_wait_tl_param_1];
+; CHECK-PTX64-NEXT: ld.param.b32 %r1, [mbar_try_wait_tl_param_3];
+; CHECK-PTX64-NEXT: mbarrier.try_wait.relaxed.cta.shared.b64 %p1, [%rd1], %rd2, %r1;
+; CHECK-PTX64-NEXT: mbarrier.try_wait.relaxed.cluster.shared.b64 %p2, [%rd1], %rd2, %r1;
+; CHECK-PTX64-NEXT: ld.param.b32 %r2, [mbar_try_wait_tl_param_2];
+; CHECK-PTX64-NEXT: mbarrier.try_wait.parity.relaxed.cta.shared.b64 %p3, [%rd1], %r2, %r1;
+; CHECK-PTX64-NEXT: mbarrier.try_wait.parity.relaxed.cluster.shared.b64 %p4, [%rd1], %r2, %r1;
+; CHECK-PTX64-NEXT: ret;
+;
+; CHECK-PTX-SHARED32-LABEL: mbar_try_wait_tl(
+; CHECK-PTX-SHARED32: {
+; CHECK-PTX-SHARED32-NEXT: .reg .pred %p<5>;
+; CHECK-PTX-SHARED32-NEXT: .reg .b32 %r<4>;
+; CHECK-PTX-SHARED32-NEXT: .reg .b64 %rd<2>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT: // %bb.0:
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r1, [mbar_try_wait_tl_param_0];
+; CHECK-PTX-SHARED32-NEXT: ld.param.b64 %rd1, [mbar_try_wait_tl_param_1];
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r2, [mbar_try_wait_tl_param_3];
+; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.relaxed.cta.shared.b64 %p1, [%r1], %rd1, %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.relaxed.cluster.shared.b64 %p2, [%r1], %rd1, %r2;
+; CHECK-PTX-SHARED32-NEXT: ld.param.b32 %r3, [mbar_try_wait_tl_param_2];
+; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.parity.relaxed.cta.shared.b64 %p3, [%r1], %r3, %r2;
+; CHECK-PTX-SHARED32-NEXT: mbarrier.try_wait.parity.relaxed.cluster.shared.b64 %p4, [%r1], %r3, %r2;
+; CHECK-PTX-SHARED32-NEXT: ret;
+ %ret0 = call i1 @llvm.nvvm.mbarrier.try.wait.tl.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i64 %state, i32 %tl)
+ %ret1 = call i1 @llvm.nvvm.mbarrier.try.wait.tl.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i64 %state, i32 %tl)
+
+ %ret2 = call i1 @llvm.nvvm.mbarrier.try.wait.parity.tl.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %parity, i32 %tl)
+ %ret3 = call i1 @llvm.nvvm.mbarrier.try.wait.parity.tl.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %parity, i32 %tl)
+
+ ret void
+}
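
For context on how the intrinsics exercised by these tests might be consumed, below is a minimal, hypothetical sketch (not part of this patch) of a consumer-side polling loop built on the non-time-limited `try.wait` variant. The function name `@consumer` and the loop structure are illustrative assumptions; the intrinsic signature matches the declarations in the tests above.

```llvm
; Hypothetical example: poll an mbarrier until the phase tracked by
; %state completes. try.wait returns true when the phase has completed,
; and false if the implementation-defined time limit expired first, so
; the loop simply retries on false.
declare i1 @llvm.nvvm.mbarrier.try.wait.scope.cta.space.cta(ptr addrspace(3), i64)

define void @consumer(ptr addrspace(3) %mbar, i64 %state) {
entry:
  br label %poll

poll:
  %done = call i1 @llvm.nvvm.mbarrier.try.wait.scope.cta.space.cta(ptr addrspace(3) %mbar, i64 %state)
  br i1 %done, label %exit, label %poll

exit:
  ret void
}
```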