[llvm] e86a429 - [NVPTX] Add missing mbarrier intrinsics (#164864)

via llvm-commits llvm-commits at lists.llvm.org
Mon Oct 27 01:53:45 PDT 2025


Author: Durgadoss R
Date: 2025-10-27T14:23:40+05:30
New Revision: e86a42940a2c2c58ba5280ae2d54d58140a42875

URL: https://github.com/llvm/llvm-project/commit/e86a42940a2c2c58ba5280ae2d54d58140a42875
DIFF: https://github.com/llvm/llvm-project/commit/e86a42940a2c2c58ba5280ae2d54d58140a42875.diff

LOG: [NVPTX] Add missing mbarrier intrinsics (#164864)

This patch adds a few more mbarrier intrinsics,
completing support for all the mbarrier variants
up to Blackwell architecture.

* Docs are updated in NVPTXUsage.rst.
* lit tests are added for all the variants.
* lit tests are verified with PTXAS from CUDA-12.8 toolkit.

Signed-off-by: Durgadoss R <durgadossr at nvidia.com>

Added: 
    llvm/test/CodeGen/NVPTX/mbarrier_arr.ll
    llvm/test/CodeGen/NVPTX/mbarrier_arr_relaxed.ll
    llvm/test/CodeGen/NVPTX/mbarrier_tx.ll
    llvm/test/CodeGen/NVPTX/mbarrier_wait_sm80_ptx70.ll
    llvm/test/CodeGen/NVPTX/mbarrier_wait_sm80_ptx71.ll
    llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx78.ll
    llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx80.ll
    llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx86.ll

Modified: 
    llvm/docs/NVPTXUsage.rst
    llvm/include/llvm/IR/IntrinsicsNVVM.td
    llvm/lib/Target/NVPTX/NVPTXIntrinsics.td

Removed: 
    


################################################################################
diff  --git a/llvm/docs/NVPTXUsage.rst b/llvm/docs/NVPTXUsage.rst
index e8dceb836f98a..5ad8f9ab07e40 100644
--- a/llvm/docs/NVPTXUsage.rst
+++ b/llvm/docs/NVPTXUsage.rst
@@ -322,6 +322,450 @@ aligned '``@llvm.nvvm.barrier.cta.*``' instruction should only be used if it is
 known that all threads in the CTA evaluate the condition identically, otherwise
 behavior is undefined.
 
+MBarrier family of Intrinsics
+-----------------------------
+
+Overview:
+^^^^^^^^^
+
+An ``mbarrier`` is a barrier created in shared memory that supports:
+
+* Synchronizing any subset of threads within a CTA.
+* One-way synchronization of threads across CTAs of a cluster.
+  Threads can perform ``arrive`` operations, but not ``*_wait``
+  operations, on an mbarrier located in shared::cluster space.
+* Waiting for completion of asynchronous memory operations initiated by a
+  thread and making them visible to other threads.
+
+Unlike the ``bar{.cta}/barrier{.cta}`` instructions, which can access only a
+limited number of barriers per CTA, ``mbarrier`` objects are user-defined and
+are limited only by the total shared memory available.
+
+An mbarrier object is an opaque, 8-byte aligned object in shared
+memory. It keeps track of:
+
+* Current phase of the mbarrier object
+* Count of pending arrivals for the current phase of the mbarrier object
+* Count of expected arrivals for the next phase of the mbarrier object
+* Count of pending asynchronous memory operations (or transactions)
+  tracked by the current phase of the mbarrier object. This is also
+  referred to as ``tx-count``. The unit of ``tx-count`` is specified
+  by the asynchronous memory operation (for example,
+  ``llvm.nvvm.cp.async.bulk.tensor.g2s.*``).
+
+The ``phase`` of an mbarrier object is the number of times the mbarrier
+object has been used to synchronize threads or track asynchronous
+operations. In each phase, threads perform:
+
+* arrive/expect-tx/complete-tx operations to progress the current phase.
+* test_wait/try_wait operations to check for completion of the current phase.
+
+An mbarrier object completes the current phase when:
+
+* The count of the pending arrivals has reached zero AND
+* The tx-count has reached zero.
+
+When an mbarrier object completes the current phase, the following
+actions are performed atomically:
+
+* The mbarrier object transitions to the next phase.
+* The pending arrival count is reinitialized to the expected arrival count.
+
+For more information, refer to the PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-mbarrier>`_.
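+
+The typical lifecycle of an mbarrier object can be sketched with the
+intrinsics documented below. This is an illustrative fragment only; the
+thread count of 32 and the single-thread init/inval discipline are
+assumptions of the example:
+
+.. code-block:: llvm
+
+  define void @cta_sync(ptr addrspace(3) %mbar) {
+  entry:
+    ; One thread initializes the mbarrier for 32 expected arrivals.
+    call void @llvm.nvvm.mbarrier.init.shared(ptr addrspace(3) %mbar, i32 32)
+    ; Each participating thread arrives, capturing the pre-arrival phase...
+    %state = call i64 @llvm.nvvm.mbarrier.arrive.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 1)
+    br label %spin
+  spin:
+    ; ...and polls until that phase completes.
+    %done = call i1 @llvm.nvvm.mbarrier.test.wait.scope.cta.space.cta(ptr addrspace(3) %mbar, i64 %state)
+    br i1 %done, label %exit, label %spin
+  exit:
+    ; Invalidate before the memory is reused for any other purpose.
+    call void @llvm.nvvm.mbarrier.inval.shared(ptr addrspace(3) %mbar)
+    ret void
+  }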
+
+'``llvm.nvvm.mbarrier.init``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare void @llvm.nvvm.mbarrier.init(ptr %addr, i32 %count)
+  declare void @llvm.nvvm.mbarrier.init.shared(ptr addrspace(3) %addr, i32 %count)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.mbarrier.init.*``' intrinsics are used to initialize
+an mbarrier object located at ``addr`` with the value ``count``.
+``count`` is a 32-bit unsigned integer value and must be within
+the range [1...2^20-1]. During initialization:
+
+* The tx-count and the current phase of the mbarrier object are set to 0.
+* The expected and pending arrival counts are set to ``count``.
+
+Semantics:
+""""""""""
+
+The ``.shared`` variant explicitly uses shared memory address space for
+the ``addr`` operand. If the ``addr`` does not fall within the
+shared::cta space, then the behavior of this intrinsic is undefined.
+Performing ``mbarrier.init`` on a memory location that already holds a
+valid mbarrier object results in undefined behavior; invalidate the
+object with ``mbarrier.inval`` before reusing its memory for another
+mbarrier or any other purpose.
+
+'``llvm.nvvm.mbarrier.inval``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare void @llvm.nvvm.mbarrier.inval(ptr %addr)
+  declare void @llvm.nvvm.mbarrier.inval.shared(ptr addrspace(3) %addr)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.mbarrier.inval.*``' intrinsics invalidate the mbarrier
+object at the address specified by ``addr``.
+
+Semantics:
+""""""""""
+
+The ``.shared`` variant explicitly uses shared memory address space for
+the ``addr`` operand. If the ``addr`` does not fall within the
+shared::cta space, then the behavior of this intrinsic is undefined.
+``addr`` must point to an mbarrier object previously initialized with
+``mbarrier.init``; otherwise, the behavior is undefined.
+
+'``llvm.nvvm.mbarrier.expect.tx``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare void @llvm.nvvm.mbarrier.expect.tx.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
+  declare void @llvm.nvvm.mbarrier.expect.tx.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
+  declare void @llvm.nvvm.mbarrier.expect.tx.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
+  declare void @llvm.nvvm.mbarrier.expect.tx.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.mbarrier.expect.tx.*``' intrinsics increase the transaction
+count of the mbarrier object at ``%addr`` by ``%tx_count``. The ``%tx_count``
+is a 32-bit unsigned integer value.
+
+Semantics:
+""""""""""
+
+The ``.space.{cta/cluster}`` suffix indicates the address space in which
+the mbarrier object resides.
+
+The ``.scope.{cta/cluster}`` denotes the set of threads that can directly
+observe the synchronizing effect of the mbarrier operation. When scope is
+"cta", all threads executing in the same CTA (as the current thread) can
+directly observe the effect of the ``expect.tx`` operation. Similarly,
+when scope is "cluster", all threads executing in the same Cluster
+(as the current thread) can directly observe the effect of the operation.
+
+If the ``addr`` does not fall within shared::cta or shared::cluster space,
+then the behavior of this intrinsic is undefined. This intrinsic has
+``relaxed`` semantics and hence does not provide any memory ordering
+or visibility guarantees.
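+
+As an illustrative sketch (the 256-byte transaction size is an
+assumption of the example), a thread that has initiated an asynchronous
+copy tracked by this mbarrier may register the expected data as:
+
+.. code-block:: llvm
+
+  ; Expect 256 bytes worth of transactions on the current phase.
+  call void @llvm.nvvm.mbarrier.expect.tx.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 256)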
+
+'``llvm.nvvm.mbarrier.complete.tx``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare void @llvm.nvvm.mbarrier.complete.tx.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
+  declare void @llvm.nvvm.mbarrier.complete.tx.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
+  declare void @llvm.nvvm.mbarrier.complete.tx.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
+  declare void @llvm.nvvm.mbarrier.complete.tx.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.mbarrier.complete.tx.*``' intrinsics decrease the transaction
+count of the mbarrier object at ``%addr`` by ``%tx_count``. The ``%tx_count``
+is a 32-bit unsigned integer value. As a result of this decrement,
+the mbarrier can potentially complete its current phase and transition
+to the next phase.
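+
+For example, a thread may manually signal completion of a previously
+expected 256-byte transaction (illustrative fragment; in practice the
+transaction count is usually updated by the asynchronous operations
+themselves):
+
+.. code-block:: llvm
+
+  ; Complete 256 bytes worth of transactions on the current phase.
+  call void @llvm.nvvm.mbarrier.complete.tx.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 256)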
+
+Semantics:
+""""""""""
+
+The semantics of these intrinsics are identical to those of the
+``llvm.nvvm.mbarrier.expect.tx.*`` intrinsics described above.
+
+'``llvm.nvvm.mbarrier.arrive``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare i64  @llvm.nvvm.mbarrier.arrive.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %count)
+  declare i64  @llvm.nvvm.mbarrier.arrive.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %count)
+  declare void @llvm.nvvm.mbarrier.arrive.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %count)
+  declare void @llvm.nvvm.mbarrier.arrive.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %count)
+
+  declare i64  @llvm.nvvm.mbarrier.arrive.relaxed.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %count)
+  declare i64  @llvm.nvvm.mbarrier.arrive.relaxed.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %count)
+  declare void @llvm.nvvm.mbarrier.arrive.relaxed.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %count)
+  declare void @llvm.nvvm.mbarrier.arrive.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %count)
+
+Overview:
+"""""""""
+
+The ``@llvm.nvvm.mbarrier.arrive.*`` intrinsics signal the arrival of the
+executing thread or completion of an asynchronous instruction associated with
+an arrive operation on the mbarrier object at ``%addr``. This operation
+decrements the pending arrival count by ``%count``, a 32-bit unsigned integer,
+potentially completing the current phase and triggering a transition to the
+next phase.
+
+Semantics:
+""""""""""
+
+The ``.space.{cta/cluster}`` suffix indicates the address space in which the
+mbarrier object resides. When the mbarrier is in shared::cta space, the
+intrinsics return an opaque 64-bit value capturing the phase of the mbarrier
+object *prior* to this arrive operation. This value can be used with a
+try_wait or test_wait operation to check for the completion of that phase.
+
+The ``.scope.{cta/cluster}`` denotes the set of threads that can directly
+observe the synchronizing effect of the mbarrier operation. When scope is
+"cta", all threads executing in the same CTA (as the current thread) can
+directly observe the effect of the ``arrive`` operation. Similarly,
+when scope is "cluster", all threads executing in the same Cluster
+(as the current thread) can directly observe the effect of the operation.
+
+If the ``addr`` does not fall within shared::cta or shared::cluster space,
+then the behavior of this intrinsic is undefined.
+
+These intrinsics have ``release`` semantics by default. The release semantics
+ensure ordering of operations that occur in program order *before* this arrive
+instruction, making their effects visible to subsequent operations in other
+threads of the CTA (or cluster, depending on scope). Threads performing
+corresponding acquire operations (such as mbarrier.test.wait) synchronize
+with this release. The ``relaxed`` variants of these intrinsics do not
+provide any memory ordering or visibility guarantees.
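+
+For example, a CTA-space arrive returns an opaque state, while an
+arrive on an mbarrier in shared::cluster space returns nothing,
+matching the one-way nature of cross-CTA synchronization (illustrative
+fragment):
+
+.. code-block:: llvm
+
+  ; CTA-space arrive returns an opaque state for a later *_wait.
+  %state = call i64 @llvm.nvvm.mbarrier.arrive.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 1)
+
+  ; Cluster-space arrive returns nothing; threads cannot wait on it.
+  call void @llvm.nvvm.mbarrier.arrive.scope.cluster.space.cluster(ptr addrspace(7) %rmbar, i32 1)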
+
+'``llvm.nvvm.mbarrier.arrive.expect.tx``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare i64  @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
+  declare i64  @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
+  declare void @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
+  declare void @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
+
+  declare i64  @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
+  declare i64  @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
+  declare void @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
+  declare void @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
+
+Overview:
+"""""""""
+
+The ``@llvm.nvvm.mbarrier.arrive.expect.tx.*`` intrinsics are similar to
+the ``@llvm.nvvm.mbarrier.arrive`` intrinsics except that they also
+perform an ``expect-tx`` operation *prior* to the ``arrive`` operation.
+The ``%tx_count`` specifies the transaction count for the ``expect-tx``
+operation and the count for the ``arrive`` operation is assumed to be 1.
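+
+For example (illustrative; the 128-byte transaction size is an
+assumption of the example), a single intrinsic can both register the
+expected data and arrive:
+
+.. code-block:: llvm
+
+  ; expect-tx of 128 bytes followed by an arrive with count 1.
+  %state = call i64 @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 128)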
+
+Semantics:
+""""""""""
+
+The semantics of these intrinsics are identical to those of the
+``llvm.nvvm.mbarrier.arrive.*`` intrinsics described above.
+
+'``llvm.nvvm.mbarrier.arrive.drop``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare i64  @llvm.nvvm.mbarrier.arrive.drop.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %count)
+  declare i64  @llvm.nvvm.mbarrier.arrive.drop.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %count)
+  declare void @llvm.nvvm.mbarrier.arrive.drop.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %count)
+  declare void @llvm.nvvm.mbarrier.arrive.drop.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %count)
+
+  declare i64  @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %count)
+  declare i64  @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %count)
+  declare void @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %count)
+  declare void @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %count)
+
+Overview:
+"""""""""
+
+The ``@llvm.nvvm.mbarrier.arrive.drop.*`` intrinsics decrement the
+expected arrival count of the mbarrier object at ``%addr`` by
+``%count`` and then perform an ``arrive`` operation with ``%count``.
+The ``%count`` is a 32-bit unsigned integer.
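+
+For example, a thread that permanently leaves the synchronization can
+drop its contribution and arrive in one step (illustrative fragment):
+
+.. code-block:: llvm
+
+  ; Reduce the expected arrival count by 1, then arrive with count 1.
+  %state = call i64 @llvm.nvvm.mbarrier.arrive.drop.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 1)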
+
+Semantics:
+""""""""""
+
+The semantics of these intrinsics are identical to those of the
+``llvm.nvvm.mbarrier.arrive.*`` intrinsics described above.
+
+'``llvm.nvvm.mbarrier.arrive.drop.expect.tx``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare i64  @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
+  declare i64  @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
+  declare void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
+  declare void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
+
+  declare i64  @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
+  declare i64  @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %tx_count)
+  declare void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cta.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
+  declare void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %addr, i32 %tx_count)
+
+Overview:
+"""""""""
+
+The ``@llvm.nvvm.mbarrier.arrive.drop.expect.tx.*`` intrinsics perform
+the following operations on the mbarrier located at ``%addr``:
+
+* Perform an ``expect-tx`` operation, i.e., increase the transaction count
+  of the mbarrier by ``%tx_count``, a 32-bit unsigned integer value.
+* Decrement the expected arrival count of the mbarrier by 1.
+* Perform an ``arrive`` operation on the mbarrier with a value of 1.
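+
+As an illustrative sketch (the 64-byte transaction size is an
+assumption of the example):
+
+.. code-block:: llvm
+
+  ; expect-tx of 64 bytes, drop the expected arrival count by 1, arrive.
+  %state = call i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 64)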
+
+Semantics:
+""""""""""
+
+The semantics of these intrinsics are identical to those of the
+``llvm.nvvm.mbarrier.arrive.*`` intrinsics described above.
+
+'``llvm.nvvm.mbarrier.test.wait``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare i1 @llvm.nvvm.mbarrier.test.wait.scope.cta.space.cta(ptr addrspace(3) %addr, i64 %state)
+  declare i1 @llvm.nvvm.mbarrier.test.wait.scope.cluster.space.cta(ptr addrspace(3) %addr, i64 %state)
+  declare i1 @llvm.nvvm.mbarrier.test.wait.parity.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %phase)
+  declare i1 @llvm.nvvm.mbarrier.test.wait.parity.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %phase)
+
+  declare i1 @llvm.nvvm.mbarrier.test.wait.relaxed.scope.cta.space.cta(ptr addrspace(3) %addr, i64 %state)
+  declare i1 @llvm.nvvm.mbarrier.test.wait.relaxed.scope.cluster.space.cta(ptr addrspace(3) %addr, i64 %state)
+  declare i1 @llvm.nvvm.mbarrier.test.wait.parity.relaxed.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %phase)
+  declare i1 @llvm.nvvm.mbarrier.test.wait.parity.relaxed.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %phase)
+
+Overview:
+"""""""""
+
+The ``@llvm.nvvm.mbarrier.test.wait.*`` intrinsics test for the completion
+of the current or the immediately preceding phase of an mbarrier object at
+``%addr``. The test for completion can be done with either the ``state`` or
+the ``phase-parity`` of the mbarrier object.
+
+* When done through the ``i64 %state`` operand, the state must be a
+  value returned by an ``llvm.nvvm.mbarrier.arrive.*`` intrinsic on the
+  *same* mbarrier object.
+* The ``.parity`` variants of these intrinsics test for completion
+  of the phase indicated by the operand ``i32 %phase``, which is
+  the integer parity of either the current phase or the immediately
+  preceding phase of the mbarrier object. An even phase has integer
+  parity 0 and an odd phase has integer parity 1, so the only valid
+  values for the phase-parity are 0 and 1.
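+
+For example, the two forms of the test can be sketched as follows
+(illustrative fragment; ``%state`` is assumed to come from a prior
+arrive on the same mbarrier):
+
+.. code-block:: llvm
+
+  ; State-based test, using the value returned by an earlier arrive.
+  %done.s = call i1 @llvm.nvvm.mbarrier.test.wait.scope.cta.space.cta(ptr addrspace(3) %mbar, i64 %state)
+
+  ; Parity-based test of the even (parity 0) phase.
+  %done.p = call i1 @llvm.nvvm.mbarrier.test.wait.parity.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 0)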
+
+Semantics:
+""""""""""
+
+The ``.scope.{cta/cluster}`` denotes the set of threads that the
+test_wait operation can directly synchronize with.
+
+If the ``addr`` does not fall within the shared::cta space, then the
+behavior of this intrinsic is undefined.
+
+These intrinsics have ``acquire`` semantics by default. This acquire
+pattern establishes memory ordering for operations occurring in program
+order after this ``test_wait`` instruction by making operations from
+other threads in the CTA (or cluster, depending on scope) visible to
+subsequent operations in the current thread. When this wait completes,
+it synchronizes with the corresponding release pattern from the
+``mbarrier.arrive`` operation. The ``relaxed`` variants of these intrinsics
+do not provide any memory ordering or visibility guarantees.
+
+The ``test.wait`` intrinsics are non-blocking and immediately return
+the completion status without suspending the executing thread.
+
+The boolean return value indicates:
+
+* True: The immediately preceding phase has completed
+* False: The current phase is still incomplete
+
+When this wait returns true, the following ordering guarantees hold:
+
+* All memory accesses (except async operations) requested prior to
+  ``mbarrier.arrive`` having release semantics by participating
+  threads of a CTA (or cluster, depending on scope) are visible to
+  the executing thread.
+* All ``cp.async`` operations requested prior to ``cp.async.mbarrier.arrive``
+  by participating threads of a CTA are visible to the executing thread.
+* All ``cp.async.bulk`` operations using the same mbarrier object requested
+  prior to ``mbarrier.arrive`` having release semantics by participating CTA
+  threads are visible to the executing thread.
+* Memory accesses requested after this wait are not visible to memory
+  accesses performed prior to ``mbarrier.arrive`` by other participating
+  threads.
+* No ordering guarantee exists for memory accesses by the same thread
+  between an ``mbarrier.arrive`` and this wait.
+
+'``llvm.nvvm.mbarrier.try.wait``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare i1 @llvm.nvvm.mbarrier.try.wait{.relaxed}.scope.cta.space.cta(ptr addrspace(3) %addr, i64 %state)
+  declare i1 @llvm.nvvm.mbarrier.try.wait{.relaxed}.scope.cluster.space.cta(ptr addrspace(3) %addr, i64 %state)
+
+  declare i1 @llvm.nvvm.mbarrier.try.wait.parity{.relaxed}.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %phase)
+  declare i1 @llvm.nvvm.mbarrier.try.wait.parity{.relaxed}.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %phase)
+
+  declare i1 @llvm.nvvm.mbarrier.try.wait.tl{.relaxed}.scope.cta.space.cta(ptr addrspace(3) %addr, i64 %state, i32 %timelimit)
+  declare i1 @llvm.nvvm.mbarrier.try.wait.tl{.relaxed}.scope.cluster.space.cta(ptr addrspace(3) %addr, i64 %state, i32 %timelimit)
+
+  declare i1 @llvm.nvvm.mbarrier.try.wait.parity.tl{.relaxed}.scope.cta.space.cta(ptr addrspace(3) %addr, i32 %phase, i32 %timelimit)
+  declare i1 @llvm.nvvm.mbarrier.try.wait.parity.tl{.relaxed}.scope.cluster.space.cta(ptr addrspace(3) %addr, i32 %phase, i32 %timelimit)
+
+Overview:
+"""""""""
+
+The ``@llvm.nvvm.mbarrier.try.wait.*`` intrinsics test for the completion of
+the current or immediately preceding phase of an mbarrier object at ``%addr``.
+Unlike the ``test.wait`` intrinsics, which perform a non-blocking test, these
+intrinsics may block the executing thread until the specified phase completes
+or a system-dependent time limit expires. Suspended threads resume execution
+when the phase completes or the time limit elapses. This time limit is
+configurable through the ``.tl`` variants of these intrinsics, where the
+``%timelimit`` operand (an unsigned integer) specifies the limit in
+nanoseconds. Other semantics are identical to those of the ``test.wait``
+intrinsics described above.
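+
+For example, a potentially blocking wait with an explicit time limit
+can be retried until the phase completes (illustrative fragment; the
+1000-nanosecond limit and the prior ``%state`` are assumptions of the
+example):
+
+.. code-block:: llvm
+
+  retry:
+    ; Suspend for up to ~1000 ns per attempt; returns false on timeout.
+    %ok = call i1 @llvm.nvvm.mbarrier.try.wait.tl.scope.cta.space.cta(ptr addrspace(3) %mbar, i64 %state, i32 1000)
+    br i1 %ok, label %done, label %retry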
+
 Electing a thread
 -----------------
 

diff  --git a/llvm/include/llvm/IR/IntrinsicsNVVM.td b/llvm/include/llvm/IR/IntrinsicsNVVM.td
index c9df6c43fd396..719181a09f475 100644
--- a/llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ b/llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -1866,6 +1866,73 @@ let IntrProperties = [IntrConvergent, IntrNoCallback] in {
 def int_nvvm_mbarrier_pending_count : NVVMBuiltin,
     Intrinsic<[llvm_i32_ty], [llvm_i64_ty], [IntrNoMem, IntrConvergent, IntrNoCallback]>;
 
+// mbarrier.{expect_tx/complete_tx}
+foreach op = ["expect_tx", "complete_tx"] in {
+  foreach scope = ["scope_cta", "scope_cluster"] in {
+    foreach space = ["space_cta", "space_cluster"] in {
+      defvar suffix = StrJoin<"_", [op, scope, space]>.ret;
+      defvar mbar_addr_ty = !if(!eq(space, "space_cta"),
+                            llvm_shared_ptr_ty, llvm_shared_cluster_ptr_ty);
+
+      def int_nvvm_mbarrier_ # suffix :
+          Intrinsic<[], [mbar_addr_ty, llvm_i32_ty],
+          [IntrConvergent, IntrArgMemOnly, IntrNoCallback]>;
+    } // space
+  } // scope
+} // op
+
+// mbarrier.arrive and mbarrier.arrive.expect_tx
+// mbarrier.arrive_drop and mbarrier.arrive_drop.expect_tx
+foreach op = ["arrive", "arrive_expect_tx",
+              "arrive_drop", "arrive_drop_expect_tx"] in {
+  foreach scope = ["scope_cta", "scope_cluster"] in {
+    foreach space = ["space_cta", "space_cluster"] in {
+      defvar suffix = StrJoin<"_", [scope, space]>.ret;
+      defvar mbar_addr_ty = !if(!eq(space, "space_cta"),
+                            llvm_shared_ptr_ty, llvm_shared_cluster_ptr_ty);
+      defvar args_ty = [mbar_addr_ty, // mbar_address_ptr
+                        llvm_i32_ty]; // tx-count
+
+      // mbarriers in shared_cluster space cannot return any value.
+      defvar mbar_ret_ty = !if(!eq(space, "space_cta"),
+                              [llvm_i64_ty], []<LLVMType>);
+
+      def int_nvvm_mbarrier_ # op # "_" # suffix:
+          Intrinsic<mbar_ret_ty, args_ty,
+          [IntrConvergent, IntrNoCallback]>;
+      def int_nvvm_mbarrier_ # op # "_relaxed_" # suffix :
+          Intrinsic<mbar_ret_ty, args_ty,
+          [IntrConvergent, IntrArgMemOnly, IntrNoCallback]>;
+    } // space
+  } // scope
+} // op
+
+// mbarrier.{test_wait and try_wait}
+foreach op = ["test_wait", "try_wait"] in {
+  foreach scope = ["scope_cta", "scope_cluster"] in {
+    foreach parity = [true, false] in {
+      foreach time_limit = !if(!eq(op, "try_wait"), [true, false], [false]) in {
+        defvar base_args = [llvm_shared_ptr_ty]; // mbar_ptr
+        defvar parity_args = !if(parity, [llvm_i32_ty], [llvm_i64_ty]);
+        defvar tl_args = !if(time_limit, [llvm_i32_ty], []<LLVMType>);
+        defvar args = !listconcat(base_args, parity_args, tl_args);
+        defvar tmp_op = StrJoin<"_", [op,
+                           !if(parity, "parity", ""),
+                           !if(time_limit, "tl", "")]>.ret;
+        defvar suffix = StrJoin<"_", [scope, "space_cta"]>.ret;
+
+        def int_nvvm_mbarrier_ # tmp_op # "_" # suffix :
+            Intrinsic<[llvm_i1_ty], args,
+            [IntrConvergent, NoCapture<ArgIndex<0>>, IntrNoCallback]>;
+        def int_nvvm_mbarrier_ # tmp_op # "_relaxed_" # suffix :
+            Intrinsic<[llvm_i1_ty], args,
+            [IntrConvergent, NoCapture<ArgIndex<0>>, IntrNoCallback,
+             IntrArgMemOnly, IntrReadMem]>;
+      } // tl
+    } // parity
+  } // scope
+} // op
+
 // Generated within nvvm. Use for ldu on sm_20 or later.  Second arg is the
 // pointer's alignment.
 let IntrProperties = [IntrReadMem, IntrArgMemOnly, IntrNoCallback, IntrWillReturn, NoCapture<ArgIndex<0>>] in {
@@ -3000,4 +3067,4 @@ foreach sp = [0, 1] in {
   }
 }
 
-} // let TargetPrefix = "nvvm"
\ No newline at end of file
+} // let TargetPrefix = "nvvm"

diff  --git a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
index 598735f5972bc..c923f0ec907e7 100644
--- a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -1082,6 +1082,161 @@ let Predicates = [hasPTX<70>, hasSM<80>] in {
             "mbarrier.pending_count.b64",
             [(set i32:$res, (int_nvvm_mbarrier_pending_count i64:$state))]>;
 }
+
+class MBAR_UTIL<string op, string scope,
+                string space = "", string sem = "",
+                bit tl = 0, bit parity = 0> {
+  // The mbarrier instructions in PTX ISA are of the general form:
+  // mbarrier.op.semantics.scope.space.b64 arg1, arg2 ...
+  // where:
+  // op -> arrive, expect_tx, complete_tx, arrive.expect_tx etc.
+  // semantics -> acquire, release, relaxed (default depends on the op)
+  // scope -> cta or cluster (default is cta-scope)
+  // space -> shared::cta or shared::cluster (default is shared::cta)
+  //
+  // The 'semantics' and 'scope' go together. If one is specified,
+  // then the other _must_ be specified. For example:
+  // (A) mbarrier.arrive             <args> (valid, release and cta are default)
+  // (B) mbarrier.arrive.release.cta <args> (valid, sem/scope mentioned explicitly)
+  // (C) mbarrier.arrive.release     <args> (invalid, needs scope)
+  // (D) mbarrier.arrive.cta         <args> (invalid, needs order)
+  //
+  // Wherever possible, we prefer form (A) to (B) since it is available
+  // from early PTX versions. In most cases, explicitly specifying the
+  // scope requires a later version of PTX.
+  string _scope_asm = !cond(
+                      !eq(scope, "scope_cluster") : "cluster",
+                      !eq(scope, "scope_cta") : !if(!empty(sem), "", "cta"),
+                      true : scope);
+  string _space_asm = !cond(
+                      !eq(space, "space_cta") : "shared",
+                      !eq(space, "space_cluster") : "shared::cluster",
+                      true : space);
+
+  string _parity = !if(parity, "parity", "");
+  string asm_str = StrJoin<".", ["mbarrier", op, _parity,
+                                  sem, _scope_asm, _space_asm, "b64"]>.ret;
+
+  string _intr_suffix = StrJoin<"_", [!subst(".", "_", op), _parity,
+                                      !if(tl, "tl", ""),
+                                      sem, scope, space]>.ret;
+  string intr_name = "int_nvvm_mbarrier_" # _intr_suffix;
+
+  // Predicate checks:
+  // These are used only for the "test_wait/try_wait" variants as they
+  // have evolved since sm80 and are complex. The predicates for the
+  // remaining instructions are straightforward and have already been
+  // applied directly.
+  Predicate _sm_pred = !cond(!or(
+                       !eq(op, "try_wait"),
+                       !eq(scope, "scope_cluster"),
+                       !eq(sem, "relaxed")) : hasSM<90>,
+                       true : hasSM<80>);
+  Predicate _ptx_pred = !cond(
+                        !eq(sem, "relaxed") : hasPTX<86>,
+                        !ne(_scope_asm, "") : hasPTX<80>,
+                        !eq(op, "try_wait") : hasPTX<78>,
+                        parity : hasPTX<71>,
+                        true   : hasPTX<70>);
+  list<Predicate> preds = [_ptx_pred, _sm_pred];
+}
+
+foreach op = ["expect_tx", "complete_tx"] in {
+  foreach scope = ["scope_cta", "scope_cluster"] in {
+    foreach space = ["space_cta", "space_cluster"] in {
+      defvar intr = !cast<Intrinsic>(MBAR_UTIL<op, scope, space>.intr_name);
+      defvar suffix = StrJoin<"_", [op, scope, space]>.ret;
+      def mbar_ # suffix : BasicNVPTXInst<(outs), (ins ADDR:$addr, B32:$tx_count),
+                           MBAR_UTIL<op, scope, space, "relaxed">.asm_str,
+                           [(intr addr:$addr, i32:$tx_count)]>,
+                           Requires<[hasPTX<80>, hasSM<90>]>;
+    } // space
+  } // scope
+} // op
+
+multiclass MBAR_ARR_INTR<string op, string scope, string sem,
+                         list<Predicate> pred = []> {
+  // When either of sem or scope is non-default, both have to
+  // be explicitly specified. So, explicitly state that
+  // sem is `release` when scope is `cluster`.
+  defvar asm_sem = !if(!and(!empty(sem), !eq(scope, "scope_cluster")),
+                    "release", sem);
+
+  defvar asm_cta  = MBAR_UTIL<op, scope, "space_cta", asm_sem>.asm_str;
+  defvar intr_cta = !cast<Intrinsic>(MBAR_UTIL<op, scope,
+                                     "space_cta", sem>.intr_name);
+
+  defvar asm_cluster  = MBAR_UTIL<op, scope, "space_cluster", asm_sem>.asm_str;
+  defvar intr_cluster = !cast<Intrinsic>(MBAR_UTIL<op, scope,
+                                         "space_cluster", sem>.intr_name);
+
+  def _CTA : NVPTXInst<(outs B64:$state),
+             (ins ADDR:$addr, B32:$tx_count),
+             asm_cta # " $state, [$addr], $tx_count;",
+             [(set i64:$state, (intr_cta addr:$addr, i32:$tx_count))]>,
+             Requires<pred>;
+  def _CLUSTER : NVPTXInst<(outs),
+                 (ins ADDR:$addr, B32:$tx_count),
+                 asm_cluster # " _, [$addr], $tx_count;",
+                 [(intr_cluster addr:$addr, i32:$tx_count)]>,
+                 Requires<pred>;
+}
+foreach op = ["arrive", "arrive.expect_tx",
+              "arrive_drop", "arrive_drop.expect_tx"] in {
+  foreach scope = ["scope_cta", "scope_cluster"] in {
+    defvar suffix = !subst(".", "_", op) # scope;
+    defm mbar_ # suffix # _release : MBAR_ARR_INTR<op, scope, "", [hasPTX<80>, hasSM<90>]>;
+    defm mbar_ # suffix # _relaxed : MBAR_ARR_INTR<op, scope, "relaxed", [hasPTX<86>, hasSM<90>]>;
+  } // scope
+} // op
+
+multiclass MBAR_WAIT_INTR<string op, string scope, string sem, bit time_limit> {
+  // When either of sem or scope is non-default, both have to
+  // be explicitly specified. So, explicitly state that the
+  // semantics is `acquire` when the scope is `cluster`.
+  defvar asm_sem = !if(!and(!empty(sem), !eq(scope, "scope_cluster")),
+                    "acquire", sem);
+
+  defvar asm_parity  = MBAR_UTIL<op, scope, "space_cta", asm_sem,
+                                 time_limit, 1>.asm_str;
+  defvar pred_parity = MBAR_UTIL<op, scope, "space_cta", asm_sem,
+                                 time_limit, 1>.preds;
+  defvar intr_parity = !cast<Intrinsic>(MBAR_UTIL<op, scope, "space_cta",
+                                        sem, time_limit, 1>.intr_name);
+
+  defvar asm_state  = MBAR_UTIL<op, scope, "space_cta", asm_sem,
+                                time_limit>.asm_str;
+  defvar pred_state = MBAR_UTIL<op, scope, "space_cta", asm_sem,
+                                time_limit>.preds;
+  defvar intr_state = !cast<Intrinsic>(MBAR_UTIL<op, scope, "space_cta",
+                                       sem, time_limit>.intr_name);
+
+  defvar ins_tl_dag = !if(time_limit, (ins B32:$tl), (ins));
+  defvar tl_suffix = !if(time_limit, ", $tl;", ";");
+  defvar intr_state_dag = !con((intr_state addr:$addr, i64:$state),
+                               !if(time_limit, (intr_state i32:$tl), (intr_state)));
+  defvar intr_parity_dag = !con((intr_parity addr:$addr, i32:$phase),
+                               !if(time_limit, (intr_parity i32:$tl), (intr_parity)));
+
+  def _STATE : NVPTXInst<(outs B1:$res), !con((ins ADDR:$addr, B64:$state), ins_tl_dag),
+               asm_state # " $res, [$addr], $state" # tl_suffix,
+               [(set i1:$res, intr_state_dag)]>,
+               Requires<pred_state>;
+  def _PARITY : NVPTXInst<(outs B1:$res), !con((ins ADDR:$addr, B32:$phase), ins_tl_dag),
+                asm_parity # " $res, [$addr], $phase" # tl_suffix,
+                [(set i1:$res, intr_parity_dag)]>,
+                Requires<pred_parity>;
+}
+foreach op = ["test_wait", "try_wait"] in {
+  foreach scope = ["scope_cta", "scope_cluster"] in {
+    foreach time_limit = !if(!eq(op, "try_wait"), [true, false], [false]) in {
+      defvar suffix = StrJoin<"_", [op, scope, !if(time_limit, "tl", "")]>.ret;
+      defm mbar_ # suffix # "_acquire" : MBAR_WAIT_INTR<op, scope, "", time_limit>;
+      defm mbar_ # suffix # "_relaxed" : MBAR_WAIT_INTR<op, scope, "relaxed", time_limit>;
+    } // time_limit
+  } // scope
+} // op
+
 //-----------------------------------
 // Math Functions
 //-----------------------------------

diff --git a/llvm/test/CodeGen/NVPTX/mbarrier_arr.ll b/llvm/test/CodeGen/NVPTX/mbarrier_arr.ll
new file mode 100644
index 0000000000000..c440caaf98aba
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/mbarrier_arr.ll
@@ -0,0 +1,165 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80| FileCheck --check-prefixes=CHECK-PTX64 %s
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80 --nvptx-short-ptr| FileCheck --check-prefixes=CHECK-PTX-SHARED32 %s
+; RUN: %if ptxas-sm_90 && ptxas-isa-8.0 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80| %ptxas-verify -arch=sm_90 %}
+; RUN: %if ptxas-sm_90 && ptxas-isa-8.0 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80 --nvptx-short-ptr| %ptxas-verify -arch=sm_90 %}
+
+; === space_cta (addrspace 3) ===
+define void @test_mbarrier_arrive_scope_cta_space_cta(ptr addrspace(3) %mbar, i32 %tx) {
+; CHECK-PTX64-LABEL: test_mbarrier_arrive_scope_cta_space_cta(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<6>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd1, [test_mbarrier_arrive_scope_cta_space_cta_param_0];
+; CHECK-PTX64-NEXT:    ld.param.b32 %r1, [test_mbarrier_arrive_scope_cta_space_cta_param_1];
+; CHECK-PTX64-NEXT:    mbarrier.arrive.shared.b64 %rd2, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.arrive.expect_tx.shared.b64 %rd3, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.arrive_drop.shared.b64 %rd4, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.arrive_drop.expect_tx.shared.b64 %rd5, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    ret;
+;
+; CHECK-PTX-SHARED32-LABEL: test_mbarrier_arrive_scope_cta_space_cta(
+; CHECK-PTX-SHARED32:       {
+; CHECK-PTX-SHARED32-NEXT:    .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-NEXT:    .reg .b64 %rd<5>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT:  // %bb.0:
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r1, [test_mbarrier_arrive_scope_cta_space_cta_param_0];
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r2, [test_mbarrier_arrive_scope_cta_space_cta_param_1];
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive.shared.b64 %rd1, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive.expect_tx.shared.b64 %rd2, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive_drop.shared.b64 %rd3, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive_drop.expect_tx.shared.b64 %rd4, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    ret;
+  %r1 = call i64 @llvm.nvvm.mbarrier.arrive.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+  %r2 = call i64 @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+  %r3 = call i64 @llvm.nvvm.mbarrier.arrive.drop.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+  %r4 = call i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+  ret void
+}
+
+define void @test_mbarrier_arrive_scope_cluster_space_cta(ptr addrspace(3) %mbar, i32 %tx) {
+; CHECK-PTX64-LABEL: test_mbarrier_arrive_scope_cluster_space_cta(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<6>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd1, [test_mbarrier_arrive_scope_cluster_space_cta_param_0];
+; CHECK-PTX64-NEXT:    ld.param.b32 %r1, [test_mbarrier_arrive_scope_cluster_space_cta_param_1];
+; CHECK-PTX64-NEXT:    mbarrier.arrive.release.cluster.shared.b64 %rd2, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.arrive.expect_tx.release.cluster.shared.b64 %rd3, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.arrive_drop.release.cluster.shared.b64 %rd4, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.arrive_drop.expect_tx.release.cluster.shared.b64 %rd5, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    ret;
+;
+; CHECK-PTX-SHARED32-LABEL: test_mbarrier_arrive_scope_cluster_space_cta(
+; CHECK-PTX-SHARED32:       {
+; CHECK-PTX-SHARED32-NEXT:    .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-NEXT:    .reg .b64 %rd<5>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT:  // %bb.0:
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r1, [test_mbarrier_arrive_scope_cluster_space_cta_param_0];
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r2, [test_mbarrier_arrive_scope_cluster_space_cta_param_1];
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive.release.cluster.shared.b64 %rd1, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive.expect_tx.release.cluster.shared.b64 %rd2, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive_drop.release.cluster.shared.b64 %rd3, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive_drop.expect_tx.release.cluster.shared.b64 %rd4, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    ret;
+  %r1 = call i64 @llvm.nvvm.mbarrier.arrive.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+  %r2 = call i64 @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+  %r3 = call i64 @llvm.nvvm.mbarrier.arrive.drop.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+  %r4 = call i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+  ret void
+}
+
+; === space_cluster (addrspace 7) ===
+define void @test_mbarrier_arrive_scope_cta_space_cluster(ptr addrspace(7) %mbar, i32 %tx) {
+; CHECK-PTX64-LABEL: test_mbarrier_arrive_scope_cta_space_cluster(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd1, [test_mbarrier_arrive_scope_cta_space_cluster_param_0];
+; CHECK-PTX64-NEXT:    ld.param.b32 %r1, [test_mbarrier_arrive_scope_cta_space_cluster_param_1];
+; CHECK-PTX64-NEXT:    mbarrier.arrive.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.arrive.expect_tx.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.arrive_drop.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.arrive_drop.expect_tx.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    ret;
+;
+; CHECK-PTX-SHARED32-LABEL: test_mbarrier_arrive_scope_cta_space_cluster(
+; CHECK-PTX-SHARED32:       {
+; CHECK-PTX-SHARED32-NEXT:    .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT:  // %bb.0:
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r1, [test_mbarrier_arrive_scope_cta_space_cluster_param_0];
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r2, [test_mbarrier_arrive_scope_cta_space_cluster_param_1];
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive.expect_tx.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive_drop.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive_drop.expect_tx.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    ret;
+  call void @llvm.nvvm.mbarrier.arrive.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+  call void @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+  call void @llvm.nvvm.mbarrier.arrive.drop.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+  call void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+  ret void
+}
+
+define void @test_mbarrier_arrive_scope_cluster_space_cluster(ptr addrspace(7) %mbar, i32 %tx) {
+; CHECK-PTX64-LABEL: test_mbarrier_arrive_scope_cluster_space_cluster(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd1, [test_mbarrier_arrive_scope_cluster_space_cluster_param_0];
+; CHECK-PTX64-NEXT:    ld.param.b32 %r1, [test_mbarrier_arrive_scope_cluster_space_cluster_param_1];
+; CHECK-PTX64-NEXT:    mbarrier.arrive.release.cluster.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.arrive.expect_tx.release.cluster.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.arrive_drop.release.cluster.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.arrive_drop.expect_tx.release.cluster.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    ret;
+;
+; CHECK-PTX-SHARED32-LABEL: test_mbarrier_arrive_scope_cluster_space_cluster(
+; CHECK-PTX-SHARED32:       {
+; CHECK-PTX-SHARED32-NEXT:    .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT:  // %bb.0:
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r1, [test_mbarrier_arrive_scope_cluster_space_cluster_param_0];
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r2, [test_mbarrier_arrive_scope_cluster_space_cluster_param_1];
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive.release.cluster.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive.expect_tx.release.cluster.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive_drop.release.cluster.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive_drop.expect_tx.release.cluster.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    ret;
+  call void @llvm.nvvm.mbarrier.arrive.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+  call void @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+  call void @llvm.nvvm.mbarrier.arrive.drop.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+  call void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+  ret void
+}
+
+declare i64 @llvm.nvvm.mbarrier.arrive.scope.cta.space.cta(ptr addrspace(3), i32)
+declare i64 @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cta.space.cta(ptr addrspace(3), i32)
+declare i64 @llvm.nvvm.mbarrier.arrive.drop.scope.cta.space.cta(ptr addrspace(3), i32)
+declare i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cta.space.cta(ptr addrspace(3), i32)
+
+declare i64 @llvm.nvvm.mbarrier.arrive.scope.cluster.space.cta(ptr addrspace(3), i32)
+declare i64 @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cluster.space.cta(ptr addrspace(3), i32)
+declare i64 @llvm.nvvm.mbarrier.arrive.drop.scope.cluster.space.cta(ptr addrspace(3), i32)
+declare i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cluster.space.cta(ptr addrspace(3), i32)
+
+declare void @llvm.nvvm.mbarrier.arrive.scope.cta.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cta.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.arrive.drop.scope.cta.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cta.space.cluster(ptr addrspace(7), i32)
+
+declare void @llvm.nvvm.mbarrier.arrive.scope.cluster.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.arrive.expect.tx.scope.cluster.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.arrive.drop.scope.cluster.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.scope.cluster.space.cluster(ptr addrspace(7), i32)

diff --git a/llvm/test/CodeGen/NVPTX/mbarrier_arr_relaxed.ll b/llvm/test/CodeGen/NVPTX/mbarrier_arr_relaxed.ll
new file mode 100644
index 0000000000000..e4d2aa21f7def
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/mbarrier_arr_relaxed.ll
@@ -0,0 +1,165 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx86| FileCheck --check-prefixes=CHECK-PTX64 %s
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx86 --nvptx-short-ptr| FileCheck --check-prefixes=CHECK-PTX-SHARED32 %s
+; RUN: %if ptxas-sm_90 && ptxas-isa-8.6 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx86| %ptxas-verify -arch=sm_90 %}
+; RUN: %if ptxas-sm_90 && ptxas-isa-8.6 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx86 --nvptx-short-ptr| %ptxas-verify -arch=sm_90 %}
+
+; === space_cta (addrspace 3) ===
+define void @test_mbarrier_arrive_relaxed_scope_cta_space_cta(ptr addrspace(3) %mbar, i32 %tx) {
+; CHECK-PTX64-LABEL: test_mbarrier_arrive_relaxed_scope_cta_space_cta(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<6>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd1, [test_mbarrier_arrive_relaxed_scope_cta_space_cta_param_0];
+; CHECK-PTX64-NEXT:    ld.param.b32 %r1, [test_mbarrier_arrive_relaxed_scope_cta_space_cta_param_1];
+; CHECK-PTX64-NEXT:    mbarrier.arrive.relaxed.cta.shared.b64 %rd2, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.arrive.expect_tx.relaxed.cta.shared.b64 %rd3, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.arrive_drop.relaxed.cta.shared.b64 %rd4, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.arrive_drop.expect_tx.relaxed.cta.shared.b64 %rd5, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    ret;
+;
+; CHECK-PTX-SHARED32-LABEL: test_mbarrier_arrive_relaxed_scope_cta_space_cta(
+; CHECK-PTX-SHARED32:       {
+; CHECK-PTX-SHARED32-NEXT:    .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-NEXT:    .reg .b64 %rd<5>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT:  // %bb.0:
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r1, [test_mbarrier_arrive_relaxed_scope_cta_space_cta_param_0];
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r2, [test_mbarrier_arrive_relaxed_scope_cta_space_cta_param_1];
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive.relaxed.cta.shared.b64 %rd1, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive.expect_tx.relaxed.cta.shared.b64 %rd2, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive_drop.relaxed.cta.shared.b64 %rd3, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive_drop.expect_tx.relaxed.cta.shared.b64 %rd4, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    ret;
+  %r1 = call i64 @llvm.nvvm.mbarrier.arrive.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+  %r2 = call i64 @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+  %r3 = call i64 @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+  %r4 = call i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+  ret void
+}
+
+define void @test_mbarrier_arrive_relaxed_scope_cluster_space_cta(ptr addrspace(3) %mbar, i32 %tx) {
+; CHECK-PTX64-LABEL: test_mbarrier_arrive_relaxed_scope_cluster_space_cta(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<6>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd1, [test_mbarrier_arrive_relaxed_scope_cluster_space_cta_param_0];
+; CHECK-PTX64-NEXT:    ld.param.b32 %r1, [test_mbarrier_arrive_relaxed_scope_cluster_space_cta_param_1];
+; CHECK-PTX64-NEXT:    mbarrier.arrive.relaxed.cluster.shared.b64 %rd2, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.arrive.expect_tx.relaxed.cluster.shared.b64 %rd3, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.arrive_drop.relaxed.cluster.shared.b64 %rd4, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.arrive_drop.expect_tx.relaxed.cluster.shared.b64 %rd5, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    ret;
+;
+; CHECK-PTX-SHARED32-LABEL: test_mbarrier_arrive_relaxed_scope_cluster_space_cta(
+; CHECK-PTX-SHARED32:       {
+; CHECK-PTX-SHARED32-NEXT:    .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-NEXT:    .reg .b64 %rd<5>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT:  // %bb.0:
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r1, [test_mbarrier_arrive_relaxed_scope_cluster_space_cta_param_0];
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r2, [test_mbarrier_arrive_relaxed_scope_cluster_space_cta_param_1];
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive.relaxed.cluster.shared.b64 %rd1, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive.expect_tx.relaxed.cluster.shared.b64 %rd2, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive_drop.relaxed.cluster.shared.b64 %rd3, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive_drop.expect_tx.relaxed.cluster.shared.b64 %rd4, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    ret;
+  %r1 = call i64 @llvm.nvvm.mbarrier.arrive.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+  %r2 = call i64 @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+  %r3 = call i64 @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+  %r4 = call i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx)
+  ret void
+}
+
+; === space_cluster (addrspace 7) ===
+define void @test_mbarrier_arrive_relaxed_scope_cta_space_cluster(ptr addrspace(7) %mbar, i32 %tx) {
+; CHECK-PTX64-LABEL: test_mbarrier_arrive_relaxed_scope_cta_space_cluster(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd1, [test_mbarrier_arrive_relaxed_scope_cta_space_cluster_param_0];
+; CHECK-PTX64-NEXT:    ld.param.b32 %r1, [test_mbarrier_arrive_relaxed_scope_cta_space_cluster_param_1];
+; CHECK-PTX64-NEXT:    mbarrier.arrive.relaxed.cta.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.arrive.expect_tx.relaxed.cta.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.arrive_drop.relaxed.cta.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.arrive_drop.expect_tx.relaxed.cta.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    ret;
+;
+; CHECK-PTX-SHARED32-LABEL: test_mbarrier_arrive_relaxed_scope_cta_space_cluster(
+; CHECK-PTX-SHARED32:       {
+; CHECK-PTX-SHARED32-NEXT:    .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT:  // %bb.0:
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r1, [test_mbarrier_arrive_relaxed_scope_cta_space_cluster_param_0];
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r2, [test_mbarrier_arrive_relaxed_scope_cta_space_cluster_param_1];
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive.relaxed.cta.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive.expect_tx.relaxed.cta.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive_drop.relaxed.cta.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive_drop.expect_tx.relaxed.cta.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    ret;
+  call void @llvm.nvvm.mbarrier.arrive.relaxed.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+  call void @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+  call void @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+  call void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+  ret void
+}
+
+define void @test_mbarrier_arrive_relaxed_scope_cluster_space_cluster(ptr addrspace(7) %mbar, i32 %tx) {
+; CHECK-PTX64-LABEL: test_mbarrier_arrive_relaxed_scope_cluster_space_cluster(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd1, [test_mbarrier_arrive_relaxed_scope_cluster_space_cluster_param_0];
+; CHECK-PTX64-NEXT:    ld.param.b32 %r1, [test_mbarrier_arrive_relaxed_scope_cluster_space_cluster_param_1];
+; CHECK-PTX64-NEXT:    mbarrier.arrive.relaxed.cluster.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.arrive.expect_tx.relaxed.cluster.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.arrive_drop.relaxed.cluster.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.arrive_drop.expect_tx.relaxed.cluster.shared::cluster.b64 _, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    ret;
+;
+; CHECK-PTX-SHARED32-LABEL: test_mbarrier_arrive_relaxed_scope_cluster_space_cluster(
+; CHECK-PTX-SHARED32:       {
+; CHECK-PTX-SHARED32-NEXT:    .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT:  // %bb.0:
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r1, [test_mbarrier_arrive_relaxed_scope_cluster_space_cluster_param_0];
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r2, [test_mbarrier_arrive_relaxed_scope_cluster_space_cluster_param_1];
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive.relaxed.cluster.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive.expect_tx.relaxed.cluster.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive_drop.relaxed.cluster.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.arrive_drop.expect_tx.relaxed.cluster.shared::cluster.b64 _, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    ret;
+  call void @llvm.nvvm.mbarrier.arrive.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+  call void @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+  call void @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+  call void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx)
+  ret void
+}
+
+declare i64 @llvm.nvvm.mbarrier.arrive.relaxed.scope.cta.space.cta(ptr addrspace(3), i32)
+declare i64 @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cta.space.cta(ptr addrspace(3), i32)
+declare i64 @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cta.space.cta(ptr addrspace(3), i32)
+declare i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cta.space.cta(ptr addrspace(3), i32)
+
+declare i64 @llvm.nvvm.mbarrier.arrive.relaxed.scope.cluster.space.cta(ptr addrspace(3), i32)
+declare i64 @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cluster.space.cta(ptr addrspace(3), i32)
+declare i64 @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cluster.space.cta(ptr addrspace(3), i32)
+declare i64 @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cluster.space.cta(ptr addrspace(3), i32)
+
+declare void @llvm.nvvm.mbarrier.arrive.relaxed.scope.cta.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cta.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cta.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cta.space.cluster(ptr addrspace(7), i32)
+
+declare void @llvm.nvvm.mbarrier.arrive.relaxed.scope.cluster.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.arrive.expect.tx.relaxed.scope.cluster.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.arrive.drop.relaxed.scope.cluster.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.arrive.drop.expect.tx.relaxed.scope.cluster.space.cluster(ptr addrspace(7), i32)

diff --git a/llvm/test/CodeGen/NVPTX/mbarrier_tx.ll b/llvm/test/CodeGen/NVPTX/mbarrier_tx.ll
new file mode 100644
index 0000000000000..441ade3351206
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/mbarrier_tx.ll
@@ -0,0 +1,87 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80| FileCheck --check-prefixes=CHECK-PTX64 %s
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80 --nvptx-short-ptr| FileCheck --check-prefixes=CHECK-PTX-SHARED32 %s
+; RUN: %if ptxas-sm_90 && ptxas-isa-8.0 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80| %ptxas-verify -arch=sm_90 %}
+; RUN: %if ptxas-sm_90 && ptxas-isa-8.0 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80 --nvptx-short-ptr| %ptxas-verify -arch=sm_90 %}
+
+declare void @llvm.nvvm.mbarrier.expect.tx.scope.cta.space.cta(ptr addrspace(3), i32)
+declare void @llvm.nvvm.mbarrier.expect.tx.scope.cluster.space.cta(ptr addrspace(3), i32)
+declare void @llvm.nvvm.mbarrier.complete.tx.scope.cta.space.cta(ptr addrspace(3), i32)
+declare void @llvm.nvvm.mbarrier.complete.tx.scope.cluster.space.cta(ptr addrspace(3), i32)
+
+declare void @llvm.nvvm.mbarrier.expect.tx.scope.cta.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.expect.tx.scope.cluster.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.complete.tx.scope.cta.space.cluster(ptr addrspace(7), i32)
+declare void @llvm.nvvm.mbarrier.complete.tx.scope.cluster.space.cluster(ptr addrspace(7), i32)
+
+define void @test_mbarrier_tx_space_cta(ptr addrspace(3) %mbar, i32 %tx_count) {
+; CHECK-PTX64-LABEL: test_mbarrier_tx_space_cta(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd1, [test_mbarrier_tx_space_cta_param_0];
+; CHECK-PTX64-NEXT:    ld.param.b32 %r1, [test_mbarrier_tx_space_cta_param_1];
+; CHECK-PTX64-NEXT:    mbarrier.expect_tx.relaxed.cta.shared.b64 [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.expect_tx.relaxed.cluster.shared.b64 [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.complete_tx.relaxed.cta.shared.b64 [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.complete_tx.relaxed.cluster.shared.b64 [%rd1], %r1;
+; CHECK-PTX64-NEXT:    ret;
+;
+; CHECK-PTX-SHARED32-LABEL: test_mbarrier_tx_space_cta(
+; CHECK-PTX-SHARED32:       {
+; CHECK-PTX-SHARED32-NEXT:    .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT:  // %bb.0:
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r1, [test_mbarrier_tx_space_cta_param_0];
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r2, [test_mbarrier_tx_space_cta_param_1];
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.expect_tx.relaxed.cta.shared.b64 [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.expect_tx.relaxed.cluster.shared.b64 [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.complete_tx.relaxed.cta.shared.b64 [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.complete_tx.relaxed.cluster.shared.b64 [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    ret;
+  tail call void @llvm.nvvm.mbarrier.expect.tx.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx_count)
+  tail call void @llvm.nvvm.mbarrier.expect.tx.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx_count)
+
+  tail call void @llvm.nvvm.mbarrier.complete.tx.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %tx_count)
+  tail call void @llvm.nvvm.mbarrier.complete.tx.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %tx_count)
+
+  ret void
+}
+
+define void @test_mbarrier_tx_space_cluster(ptr addrspace(7) %mbar, i32 %tx_count) {
+; CHECK-PTX64-LABEL: test_mbarrier_tx_space_cluster(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd1, [test_mbarrier_tx_space_cluster_param_0];
+; CHECK-PTX64-NEXT:    ld.param.b32 %r1, [test_mbarrier_tx_space_cluster_param_1];
+; CHECK-PTX64-NEXT:    mbarrier.expect_tx.relaxed.cta.shared::cluster.b64 [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.expect_tx.relaxed.cluster.shared::cluster.b64 [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.complete_tx.relaxed.cta.shared::cluster.b64 [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.complete_tx.relaxed.cluster.shared::cluster.b64 [%rd1], %r1;
+; CHECK-PTX64-NEXT:    ret;
+;
+; CHECK-PTX-SHARED32-LABEL: test_mbarrier_tx_space_cluster(
+; CHECK-PTX-SHARED32:       {
+; CHECK-PTX-SHARED32-NEXT:    .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT:  // %bb.0:
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r1, [test_mbarrier_tx_space_cluster_param_0];
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r2, [test_mbarrier_tx_space_cluster_param_1];
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.expect_tx.relaxed.cta.shared::cluster.b64 [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.expect_tx.relaxed.cluster.shared::cluster.b64 [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.complete_tx.relaxed.cta.shared::cluster.b64 [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.complete_tx.relaxed.cluster.shared::cluster.b64 [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    ret;
+  tail call void @llvm.nvvm.mbarrier.expect.tx.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx_count)
+  tail call void @llvm.nvvm.mbarrier.expect.tx.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx_count)
+
+  tail call void @llvm.nvvm.mbarrier.complete.tx.scope.cta.space.cluster(ptr addrspace(7) %mbar, i32 %tx_count)
+  tail call void @llvm.nvvm.mbarrier.complete.tx.scope.cluster.space.cluster(ptr addrspace(7) %mbar, i32 %tx_count)
+
+  ret void
+}

diff --git a/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm80_ptx70.ll b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm80_ptx70.ll
new file mode 100644
index 0000000000000..5130ae2bfea67
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm80_ptx70.ll
@@ -0,0 +1,35 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx70| FileCheck --check-prefixes=CHECK-PTX64 %s
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx70 --nvptx-short-ptr| FileCheck --check-prefixes=CHECK-PTX-SHARED32 %s
+; RUN: %if ptxas-sm_80 && ptxas-isa-7.0 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx70| %ptxas-verify -arch=sm_80 %}
+; RUN: %if ptxas-sm_80 && ptxas-isa-7.0 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx70 --nvptx-short-ptr| %ptxas-verify -arch=sm_80 %}
+
+declare i1 @llvm.nvvm.mbarrier.test.wait.scope.cta.space.cta(ptr addrspace(3), i64)
+
+define void @mbar_test_wait(ptr addrspace(3) %mbar, i64 %state) {
+; CHECK-PTX64-LABEL: mbar_test_wait(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .pred %p<2>;
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<3>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd1, [mbar_test_wait_param_0];
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd2, [mbar_test_wait_param_1];
+; CHECK-PTX64-NEXT:    mbarrier.test_wait.shared.b64 %p1, [%rd1], %rd2;
+; CHECK-PTX64-NEXT:    ret;
+;
+; CHECK-PTX-SHARED32-LABEL: mbar_test_wait(
+; CHECK-PTX-SHARED32:       {
+; CHECK-PTX-SHARED32-NEXT:    .reg .pred %p<2>;
+; CHECK-PTX-SHARED32-NEXT:    .reg .b32 %r<2>;
+; CHECK-PTX-SHARED32-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT:  // %bb.0:
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r1, [mbar_test_wait_param_0];
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b64 %rd1, [mbar_test_wait_param_1];
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.test_wait.shared.b64 %p1, [%r1], %rd1;
+; CHECK-PTX-SHARED32-NEXT:    ret;
+  %ret0 = call i1 @llvm.nvvm.mbarrier.test.wait.scope.cta.space.cta(ptr addrspace(3) %mbar, i64 %state)
+
+  ret void
+}

diff  --git a/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm80_ptx71.ll b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm80_ptx71.ll
new file mode 100644
index 0000000000000..9327e7908cabd
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm80_ptx71.ll
@@ -0,0 +1,36 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx71| FileCheck --check-prefixes=CHECK-PTX64 %s
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx71 --nvptx-short-ptr| FileCheck --check-prefixes=CHECK-PTX-SHARED32 %s
+; RUN: %if ptxas-sm_80 && ptxas-isa-7.1 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx71| %ptxas-verify -arch=sm_80 %}
+; RUN: %if ptxas-sm_80 && ptxas-isa-7.1 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_80 -mattr=+ptx71 --nvptx-short-ptr| %ptxas-verify -arch=sm_80 %}
+
+; --- test.wait.parity ---
+declare i1 @llvm.nvvm.mbarrier.test.wait.parity.scope.cta.space.cta(ptr addrspace(3), i32)
+
+define void @mbar_test_wait(ptr addrspace(3) %mbar, i32 %parity) {
+; CHECK-PTX64-LABEL: mbar_test_wait(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .pred %p<2>;
+; CHECK-PTX64-NEXT:    .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd1, [mbar_test_wait_param_0];
+; CHECK-PTX64-NEXT:    ld.param.b32 %r1, [mbar_test_wait_param_1];
+; CHECK-PTX64-NEXT:    mbarrier.test_wait.parity.shared.b64 %p1, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    ret;
+;
+; CHECK-PTX-SHARED32-LABEL: mbar_test_wait(
+; CHECK-PTX-SHARED32:       {
+; CHECK-PTX-SHARED32-NEXT:    .reg .pred %p<2>;
+; CHECK-PTX-SHARED32-NEXT:    .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT:  // %bb.0:
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r1, [mbar_test_wait_param_0];
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r2, [mbar_test_wait_param_1];
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.test_wait.parity.shared.b64 %p1, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    ret;
+  %ret0 = call i1 @llvm.nvvm.mbarrier.test.wait.parity.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %parity)
+
+  ret void
+}

diff  --git a/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx78.ll b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx78.ll
new file mode 100644
index 0000000000000..9b19ad5f26026
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx78.ll
@@ -0,0 +1,83 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx78| FileCheck --check-prefixes=CHECK-PTX64 %s
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx78 --nvptx-short-ptr| FileCheck --check-prefixes=CHECK-PTX-SHARED32 %s
+; RUN: %if ptxas-sm_90 && ptxas-isa-7.8 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx78| %ptxas-verify -arch=sm_90 %}
+; RUN: %if ptxas-sm_90 && ptxas-isa-7.8 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx78 --nvptx-short-ptr| %ptxas-verify -arch=sm_90 %}
+
+; --- try.wait without timelimit ---
+declare i1 @llvm.nvvm.mbarrier.try.wait.scope.cta.space.cta(ptr addrspace(3), i64)
+declare i1 @llvm.nvvm.mbarrier.try.wait.parity.scope.cta.space.cta(ptr addrspace(3), i32)
+
+; --- try.wait with timelimit ---
+declare i1 @llvm.nvvm.mbarrier.try.wait.tl.scope.cta.space.cta(ptr addrspace(3), i64, i32)
+declare i1 @llvm.nvvm.mbarrier.try.wait.parity.tl.scope.cta.space.cta(ptr addrspace(3), i32, i32)
+
+define void @mbar_try_wait(ptr addrspace(3) %mbar, i64 %state, i32 %parity) {
+; CHECK-PTX64-LABEL: mbar_try_wait(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .pred %p<3>;
+; CHECK-PTX64-NEXT:    .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<3>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd1, [mbar_try_wait_param_0];
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd2, [mbar_try_wait_param_1];
+; CHECK-PTX64-NEXT:    mbarrier.try_wait.shared.b64 %p1, [%rd1], %rd2;
+; CHECK-PTX64-NEXT:    ld.param.b32 %r1, [mbar_try_wait_param_2];
+; CHECK-PTX64-NEXT:    mbarrier.try_wait.parity.shared.b64 %p2, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    ret;
+;
+; CHECK-PTX-SHARED32-LABEL: mbar_try_wait(
+; CHECK-PTX-SHARED32:       {
+; CHECK-PTX-SHARED32-NEXT:    .reg .pred %p<3>;
+; CHECK-PTX-SHARED32-NEXT:    .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT:  // %bb.0:
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r1, [mbar_try_wait_param_0];
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b64 %rd1, [mbar_try_wait_param_1];
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.try_wait.shared.b64 %p1, [%r1], %rd1;
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r2, [mbar_try_wait_param_2];
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.try_wait.parity.shared.b64 %p2, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    ret;
+  %ret0 = call i1 @llvm.nvvm.mbarrier.try.wait.scope.cta.space.cta(ptr addrspace(3) %mbar, i64 %state)
+  %ret1 = call i1 @llvm.nvvm.mbarrier.try.wait.parity.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %parity)
+
+  ret void
+}
+
+define void @mbar_try_wait_tl(ptr addrspace(3) %mbar, i64 %state, i32 %parity, i32 %tl) {
+; CHECK-PTX64-LABEL: mbar_try_wait_tl(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .pred %p<3>;
+; CHECK-PTX64-NEXT:    .reg .b32 %r<3>;
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<3>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd1, [mbar_try_wait_tl_param_0];
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd2, [mbar_try_wait_tl_param_1];
+; CHECK-PTX64-NEXT:    ld.param.b32 %r1, [mbar_try_wait_tl_param_3];
+; CHECK-PTX64-NEXT:    mbarrier.try_wait.shared.b64 %p1, [%rd1], %rd2, %r1;
+; CHECK-PTX64-NEXT:    ld.param.b32 %r2, [mbar_try_wait_tl_param_2];
+; CHECK-PTX64-NEXT:    mbarrier.try_wait.parity.shared.b64 %p2, [%rd1], %r2, %r1;
+; CHECK-PTX64-NEXT:    ret;
+;
+; CHECK-PTX-SHARED32-LABEL: mbar_try_wait_tl(
+; CHECK-PTX-SHARED32:       {
+; CHECK-PTX-SHARED32-NEXT:    .reg .pred %p<3>;
+; CHECK-PTX-SHARED32-NEXT:    .reg .b32 %r<4>;
+; CHECK-PTX-SHARED32-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT:  // %bb.0:
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r1, [mbar_try_wait_tl_param_0];
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b64 %rd1, [mbar_try_wait_tl_param_1];
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r2, [mbar_try_wait_tl_param_3];
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.try_wait.shared.b64 %p1, [%r1], %rd1, %r2;
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r3, [mbar_try_wait_tl_param_2];
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.try_wait.parity.shared.b64 %p2, [%r1], %r3, %r2;
+; CHECK-PTX-SHARED32-NEXT:    ret;
+  %ret0 = call i1 @llvm.nvvm.mbarrier.try.wait.tl.scope.cta.space.cta(ptr addrspace(3) %mbar, i64 %state, i32 %tl)
+  %ret1 = call i1 @llvm.nvvm.mbarrier.try.wait.parity.tl.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %parity, i32 %tl)
+
+  ret void
+}

diff  --git a/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx80.ll b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx80.ll
new file mode 100644
index 0000000000000..034953ddb3072
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx80.ll
@@ -0,0 +1,123 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80| FileCheck --check-prefixes=CHECK-PTX64 %s
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80 --nvptx-short-ptr| FileCheck --check-prefixes=CHECK-PTX-SHARED32 %s
+; RUN: %if ptxas-sm_90 && ptxas-isa-8.0 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80| %ptxas-verify -arch=sm_90 %}
+; RUN: %if ptxas-sm_90 && ptxas-isa-8.0 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx80 --nvptx-short-ptr| %ptxas-verify -arch=sm_90 %}
+
+; With sm_90 and ptx80, the cluster-scope variants become available.
+
+; --- test.wait ---
+declare i1 @llvm.nvvm.mbarrier.test.wait.parity.scope.cluster.space.cta(ptr addrspace(3), i32)
+declare i1 @llvm.nvvm.mbarrier.test.wait.scope.cluster.space.cta(ptr addrspace(3), i64)
+
+; --- try.wait without timelimit ---
+declare i1 @llvm.nvvm.mbarrier.try.wait.scope.cluster.space.cta(ptr addrspace(3), i64)
+declare i1 @llvm.nvvm.mbarrier.try.wait.parity.scope.cluster.space.cta(ptr addrspace(3), i32)
+
+; --- try.wait with timelimit ---
+declare i1 @llvm.nvvm.mbarrier.try.wait.tl.scope.cluster.space.cta(ptr addrspace(3), i64, i32)
+declare i1 @llvm.nvvm.mbarrier.try.wait.parity.tl.scope.cluster.space.cta(ptr addrspace(3), i32, i32)
+
+define void @mbar_test_wait(ptr addrspace(3) %mbar, i64 %state, i32 %parity) {
+; CHECK-PTX64-LABEL: mbar_test_wait(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .pred %p<3>;
+; CHECK-PTX64-NEXT:    .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<3>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd1, [mbar_test_wait_param_0];
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd2, [mbar_test_wait_param_1];
+; CHECK-PTX64-NEXT:    mbarrier.test_wait.acquire.cluster.shared.b64 %p1, [%rd1], %rd2;
+; CHECK-PTX64-NEXT:    ld.param.b32 %r1, [mbar_test_wait_param_2];
+; CHECK-PTX64-NEXT:    mbarrier.test_wait.parity.acquire.cluster.shared.b64 %p2, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    ret;
+;
+; CHECK-PTX-SHARED32-LABEL: mbar_test_wait(
+; CHECK-PTX-SHARED32:       {
+; CHECK-PTX-SHARED32-NEXT:    .reg .pred %p<3>;
+; CHECK-PTX-SHARED32-NEXT:    .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT:  // %bb.0:
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r1, [mbar_test_wait_param_0];
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b64 %rd1, [mbar_test_wait_param_1];
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.test_wait.acquire.cluster.shared.b64 %p1, [%r1], %rd1;
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r2, [mbar_test_wait_param_2];
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.test_wait.parity.acquire.cluster.shared.b64 %p2, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    ret;
+  %ret0 = call i1 @llvm.nvvm.mbarrier.test.wait.scope.cluster.space.cta(ptr addrspace(3) %mbar, i64 %state)
+  %ret1 = call i1 @llvm.nvvm.mbarrier.test.wait.parity.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %parity)
+
+  ret void
+}
+
+define void @mbar_try_wait(ptr addrspace(3) %mbar, i64 %state, i32 %parity) {
+; CHECK-PTX64-LABEL: mbar_try_wait(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .pred %p<3>;
+; CHECK-PTX64-NEXT:    .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<3>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd1, [mbar_try_wait_param_0];
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd2, [mbar_try_wait_param_1];
+; CHECK-PTX64-NEXT:    mbarrier.try_wait.acquire.cluster.shared.b64 %p1, [%rd1], %rd2;
+; CHECK-PTX64-NEXT:    ld.param.b32 %r1, [mbar_try_wait_param_2];
+; CHECK-PTX64-NEXT:    mbarrier.try_wait.parity.acquire.cluster.shared.b64 %p2, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    ret;
+;
+; CHECK-PTX-SHARED32-LABEL: mbar_try_wait(
+; CHECK-PTX-SHARED32:       {
+; CHECK-PTX-SHARED32-NEXT:    .reg .pred %p<3>;
+; CHECK-PTX-SHARED32-NEXT:    .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT:  // %bb.0:
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r1, [mbar_try_wait_param_0];
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b64 %rd1, [mbar_try_wait_param_1];
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.try_wait.acquire.cluster.shared.b64 %p1, [%r1], %rd1;
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r2, [mbar_try_wait_param_2];
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.try_wait.parity.acquire.cluster.shared.b64 %p2, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    ret;
+  %ret0 = call i1 @llvm.nvvm.mbarrier.try.wait.scope.cluster.space.cta(ptr addrspace(3) %mbar, i64 %state)
+  %ret1 = call i1 @llvm.nvvm.mbarrier.try.wait.parity.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %parity)
+
+  ret void
+}
+
+define void @mbar_try_wait_tl(ptr addrspace(3) %mbar, i64 %state, i32 %parity, i32 %tl) {
+; CHECK-PTX64-LABEL: mbar_try_wait_tl(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .pred %p<3>;
+; CHECK-PTX64-NEXT:    .reg .b32 %r<3>;
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<3>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd1, [mbar_try_wait_tl_param_0];
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd2, [mbar_try_wait_tl_param_1];
+; CHECK-PTX64-NEXT:    ld.param.b32 %r1, [mbar_try_wait_tl_param_3];
+; CHECK-PTX64-NEXT:    mbarrier.try_wait.acquire.cluster.shared.b64 %p1, [%rd1], %rd2, %r1;
+; CHECK-PTX64-NEXT:    ld.param.b32 %r2, [mbar_try_wait_tl_param_2];
+; CHECK-PTX64-NEXT:    mbarrier.try_wait.parity.acquire.cluster.shared.b64 %p2, [%rd1], %r2, %r1;
+; CHECK-PTX64-NEXT:    ret;
+;
+; CHECK-PTX-SHARED32-LABEL: mbar_try_wait_tl(
+; CHECK-PTX-SHARED32:       {
+; CHECK-PTX-SHARED32-NEXT:    .reg .pred %p<3>;
+; CHECK-PTX-SHARED32-NEXT:    .reg .b32 %r<4>;
+; CHECK-PTX-SHARED32-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT:  // %bb.0:
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r1, [mbar_try_wait_tl_param_0];
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b64 %rd1, [mbar_try_wait_tl_param_1];
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r2, [mbar_try_wait_tl_param_3];
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.try_wait.acquire.cluster.shared.b64 %p1, [%r1], %rd1, %r2;
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r3, [mbar_try_wait_tl_param_2];
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.try_wait.parity.acquire.cluster.shared.b64 %p2, [%r1], %r3, %r2;
+; CHECK-PTX-SHARED32-NEXT:    ret;
+  %ret0 = call i1 @llvm.nvvm.mbarrier.try.wait.tl.scope.cluster.space.cta(ptr addrspace(3) %mbar, i64 %state, i32 %tl)
+  %ret1 = call i1 @llvm.nvvm.mbarrier.try.wait.parity.tl.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %parity, i32 %tl)
+
+  ret void
+}

diff  --git a/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx86.ll b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx86.ll
new file mode 100644
index 0000000000000..652634b67da98
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/mbarrier_wait_sm90_ptx86.ll
@@ -0,0 +1,148 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx86| FileCheck --check-prefixes=CHECK-PTX64 %s
+; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx86 --nvptx-short-ptr| FileCheck --check-prefixes=CHECK-PTX-SHARED32 %s
+; RUN: %if ptxas-sm_90 && ptxas-isa-8.6 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx86| %ptxas-verify -arch=sm_90 %}
+; RUN: %if ptxas-sm_90 && ptxas-isa-8.6 %{ llc < %s -mtriple=nvptx64 -mcpu=sm_90 -mattr=+ptx86 --nvptx-short-ptr| %ptxas-verify -arch=sm_90 %}
+
+; --- test.wait ---
+declare i1 @llvm.nvvm.mbarrier.test.wait.parity.relaxed.scope.cta.space.cta(ptr addrspace(3), i32)
+declare i1 @llvm.nvvm.mbarrier.test.wait.parity.relaxed.scope.cluster.space.cta(ptr addrspace(3), i32)
+declare i1 @llvm.nvvm.mbarrier.test.wait.relaxed.scope.cta.space.cta(ptr addrspace(3), i64)
+declare i1 @llvm.nvvm.mbarrier.test.wait.relaxed.scope.cluster.space.cta(ptr addrspace(3), i64)
+
+; --- try.wait without timelimit ---
+declare i1 @llvm.nvvm.mbarrier.try.wait.relaxed.scope.cta.space.cta(ptr addrspace(3), i64)
+declare i1 @llvm.nvvm.mbarrier.try.wait.relaxed.scope.cluster.space.cta(ptr addrspace(3), i64)
+declare i1 @llvm.nvvm.mbarrier.try.wait.parity.relaxed.scope.cta.space.cta(ptr addrspace(3), i32)
+declare i1 @llvm.nvvm.mbarrier.try.wait.parity.relaxed.scope.cluster.space.cta(ptr addrspace(3), i32)
+
+; --- try.wait with timelimit ---
+declare i1 @llvm.nvvm.mbarrier.try.wait.tl.relaxed.scope.cta.space.cta(ptr addrspace(3), i64, i32)
+declare i1 @llvm.nvvm.mbarrier.try.wait.tl.relaxed.scope.cluster.space.cta(ptr addrspace(3), i64, i32)
+declare i1 @llvm.nvvm.mbarrier.try.wait.parity.tl.relaxed.scope.cta.space.cta(ptr addrspace(3), i32, i32)
+declare i1 @llvm.nvvm.mbarrier.try.wait.parity.tl.relaxed.scope.cluster.space.cta(ptr addrspace(3), i32, i32)
+
+define void @mbar_test_wait(ptr addrspace(3) %mbar, i64 %state, i32 %parity) {
+; CHECK-PTX64-LABEL: mbar_test_wait(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .pred %p<5>;
+; CHECK-PTX64-NEXT:    .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<3>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd1, [mbar_test_wait_param_0];
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd2, [mbar_test_wait_param_1];
+; CHECK-PTX64-NEXT:    mbarrier.test_wait.relaxed.cta.shared.b64 %p1, [%rd1], %rd2;
+; CHECK-PTX64-NEXT:    mbarrier.test_wait.relaxed.cluster.shared.b64 %p2, [%rd1], %rd2;
+; CHECK-PTX64-NEXT:    ld.param.b32 %r1, [mbar_test_wait_param_2];
+; CHECK-PTX64-NEXT:    mbarrier.test_wait.parity.relaxed.cta.shared.b64 %p3, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.test_wait.parity.relaxed.cluster.shared.b64 %p4, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    ret;
+;
+; CHECK-PTX-SHARED32-LABEL: mbar_test_wait(
+; CHECK-PTX-SHARED32:       {
+; CHECK-PTX-SHARED32-NEXT:    .reg .pred %p<5>;
+; CHECK-PTX-SHARED32-NEXT:    .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT:  // %bb.0:
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r1, [mbar_test_wait_param_0];
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b64 %rd1, [mbar_test_wait_param_1];
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.test_wait.relaxed.cta.shared.b64 %p1, [%r1], %rd1;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.test_wait.relaxed.cluster.shared.b64 %p2, [%r1], %rd1;
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r2, [mbar_test_wait_param_2];
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.test_wait.parity.relaxed.cta.shared.b64 %p3, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.test_wait.parity.relaxed.cluster.shared.b64 %p4, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    ret;
+  %ret0 = call i1 @llvm.nvvm.mbarrier.test.wait.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i64 %state)
+  %ret1 = call i1 @llvm.nvvm.mbarrier.test.wait.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i64 %state)
+
+  %ret2 = call i1 @llvm.nvvm.mbarrier.test.wait.parity.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %parity)
+  %ret3 = call i1 @llvm.nvvm.mbarrier.test.wait.parity.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %parity)
+
+  ret void
+}
+
+define void @mbar_try_wait(ptr addrspace(3) %mbar, i64 %state, i32 %parity) {
+; CHECK-PTX64-LABEL: mbar_try_wait(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .pred %p<5>;
+; CHECK-PTX64-NEXT:    .reg .b32 %r<2>;
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<3>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd1, [mbar_try_wait_param_0];
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd2, [mbar_try_wait_param_1];
+; CHECK-PTX64-NEXT:    mbarrier.try_wait.relaxed.cta.shared.b64 %p1, [%rd1], %rd2;
+; CHECK-PTX64-NEXT:    mbarrier.try_wait.relaxed.cluster.shared.b64 %p2, [%rd1], %rd2;
+; CHECK-PTX64-NEXT:    ld.param.b32 %r1, [mbar_try_wait_param_2];
+; CHECK-PTX64-NEXT:    mbarrier.try_wait.parity.relaxed.cta.shared.b64 %p3, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    mbarrier.try_wait.parity.relaxed.cluster.shared.b64 %p4, [%rd1], %r1;
+; CHECK-PTX64-NEXT:    ret;
+;
+; CHECK-PTX-SHARED32-LABEL: mbar_try_wait(
+; CHECK-PTX-SHARED32:       {
+; CHECK-PTX-SHARED32-NEXT:    .reg .pred %p<5>;
+; CHECK-PTX-SHARED32-NEXT:    .reg .b32 %r<3>;
+; CHECK-PTX-SHARED32-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT:  // %bb.0:
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r1, [mbar_try_wait_param_0];
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b64 %rd1, [mbar_try_wait_param_1];
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.try_wait.relaxed.cta.shared.b64 %p1, [%r1], %rd1;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.try_wait.relaxed.cluster.shared.b64 %p2, [%r1], %rd1;
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r2, [mbar_try_wait_param_2];
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.try_wait.parity.relaxed.cta.shared.b64 %p3, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.try_wait.parity.relaxed.cluster.shared.b64 %p4, [%r1], %r2;
+; CHECK-PTX-SHARED32-NEXT:    ret;
+  %ret0 = call i1 @llvm.nvvm.mbarrier.try.wait.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i64 %state)
+  %ret1 = call i1 @llvm.nvvm.mbarrier.try.wait.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i64 %state)
+
+  %ret2 = call i1 @llvm.nvvm.mbarrier.try.wait.parity.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %parity)
+  %ret3 = call i1 @llvm.nvvm.mbarrier.try.wait.parity.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %parity)
+
+  ret void
+}
+
+define void @mbar_try_wait_tl(ptr addrspace(3) %mbar, i64 %state, i32 %parity, i32 %tl) {
+; CHECK-PTX64-LABEL: mbar_try_wait_tl(
+; CHECK-PTX64:       {
+; CHECK-PTX64-NEXT:    .reg .pred %p<5>;
+; CHECK-PTX64-NEXT:    .reg .b32 %r<3>;
+; CHECK-PTX64-NEXT:    .reg .b64 %rd<3>;
+; CHECK-PTX64-EMPTY:
+; CHECK-PTX64-NEXT:  // %bb.0:
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd1, [mbar_try_wait_tl_param_0];
+; CHECK-PTX64-NEXT:    ld.param.b64 %rd2, [mbar_try_wait_tl_param_1];
+; CHECK-PTX64-NEXT:    ld.param.b32 %r1, [mbar_try_wait_tl_param_3];
+; CHECK-PTX64-NEXT:    mbarrier.try_wait.relaxed.cta.shared.b64 %p1, [%rd1], %rd2, %r1;
+; CHECK-PTX64-NEXT:    mbarrier.try_wait.relaxed.cluster.shared.b64 %p2, [%rd1], %rd2, %r1;
+; CHECK-PTX64-NEXT:    ld.param.b32 %r2, [mbar_try_wait_tl_param_2];
+; CHECK-PTX64-NEXT:    mbarrier.try_wait.parity.relaxed.cta.shared.b64 %p3, [%rd1], %r2, %r1;
+; CHECK-PTX64-NEXT:    mbarrier.try_wait.parity.relaxed.cluster.shared.b64 %p4, [%rd1], %r2, %r1;
+; CHECK-PTX64-NEXT:    ret;
+;
+; CHECK-PTX-SHARED32-LABEL: mbar_try_wait_tl(
+; CHECK-PTX-SHARED32:       {
+; CHECK-PTX-SHARED32-NEXT:    .reg .pred %p<5>;
+; CHECK-PTX-SHARED32-NEXT:    .reg .b32 %r<4>;
+; CHECK-PTX-SHARED32-NEXT:    .reg .b64 %rd<2>;
+; CHECK-PTX-SHARED32-EMPTY:
+; CHECK-PTX-SHARED32-NEXT:  // %bb.0:
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r1, [mbar_try_wait_tl_param_0];
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b64 %rd1, [mbar_try_wait_tl_param_1];
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r2, [mbar_try_wait_tl_param_3];
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.try_wait.relaxed.cta.shared.b64 %p1, [%r1], %rd1, %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.try_wait.relaxed.cluster.shared.b64 %p2, [%r1], %rd1, %r2;
+; CHECK-PTX-SHARED32-NEXT:    ld.param.b32 %r3, [mbar_try_wait_tl_param_2];
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.try_wait.parity.relaxed.cta.shared.b64 %p3, [%r1], %r3, %r2;
+; CHECK-PTX-SHARED32-NEXT:    mbarrier.try_wait.parity.relaxed.cluster.shared.b64 %p4, [%r1], %r3, %r2;
+; CHECK-PTX-SHARED32-NEXT:    ret;
+  %ret0 = call i1 @llvm.nvvm.mbarrier.try.wait.tl.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i64 %state, i32 %tl)
+  %ret1 = call i1 @llvm.nvvm.mbarrier.try.wait.tl.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i64 %state, i32 %tl)
+
+  %ret2 = call i1 @llvm.nvvm.mbarrier.try.wait.parity.tl.relaxed.scope.cta.space.cta(ptr addrspace(3) %mbar, i32 %parity, i32 %tl)
+  %ret3 = call i1 @llvm.nvvm.mbarrier.try.wait.parity.tl.relaxed.scope.cluster.space.cta(ptr addrspace(3) %mbar, i32 %parity, i32 %tl)
+
+  ret void
+}

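The tests above discard the `i1` results for brevity; in real code the boolean returned by the `try.wait` intrinsics drives a polling loop. A minimal usage sketch (hypothetical, not part of this patch — function name and control flow are illustrative only):

```llvm
; Spin until the mbarrier phase completes. The intrinsic returns true
; once the phase tracked by %state has completed; until then, retry.
declare i1 @llvm.nvvm.mbarrier.try.wait.scope.cta.space.cta(ptr addrspace(3), i64)

define void @spin_until_done(ptr addrspace(3) %mbar, i64 %state) {
entry:
  br label %poll
poll:
  %done = call i1 @llvm.nvvm.mbarrier.try.wait.scope.cta.space.cta(ptr addrspace(3) %mbar, i64 %state)
  br i1 %done, label %exit, label %poll
exit:
  ret void
}
```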
More information about the llvm-commits mailing list