[llvm] [NVPTX] Add TMA bulk tensor reduction intrinsics (PR #116854)
Durgadoss R via llvm-commits
llvm-commits at lists.llvm.org
Tue Nov 19 10:27:27 PST 2024
https://github.com/durga4github created https://github.com/llvm/llvm-project/pull/116854
This patch adds NVVM intrinsics and NVPTX codegen for:
* cp.async.bulk.tensor.reduce.1D -> 5D variants, supporting both Tile and Im2Col modes.
* These intrinsics optionally support cache_hints as indicated by the boolean flag argument.
* Lit tests are added for all combinations of these intrinsics in cp-async-bulk-tensor-reduce.ll.
* The generated PTX is verified with a 12.3 ptxas executable.
* Added docs for these intrinsics in NVPTXUsage.rst file.
PTX Spec reference:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor
>From 51207bc32ec87efbee1d616afe8da364ad2f968a Mon Sep 17 00:00:00 2001
From: Durgadoss R <durgadossr at nvidia.com>
Date: Mon, 18 Nov 2024 17:45:08 +0530
Subject: [PATCH] [NVPTX] Add TMA bulk tensor reduction intrinsics
This patch adds NVVM intrinsics and NVPTX codegen for:
* cp.async.bulk.tensor.reduce.1D -> 5D variants,
supporting both Tile and Im2Col modes.
* These intrinsics optionally support cache_hints as
indicated by the boolean flag argument.
* Lit tests are added for all combinations of these
intrinsics in cp-async-bulk-tensor-reduce.ll.
* The generated PTX is verified with a 12.3 ptxas executable.
* Added docs for these intrinsics in NVPTXUsage.rst file.
PTX Spec reference:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor
Signed-off-by: Durgadoss R <durgadossr at nvidia.com>
---
llvm/docs/NVPTXUsage.rst | 78 ++++
llvm/include/llvm/IR/IntrinsicsNVVM.td | 22 +
llvm/include/llvm/IR/NVVMIntrinsicFlags.h | 37 ++
.../NVPTX/MCTargetDesc/NVPTXInstPrinter.cpp | 37 ++
.../NVPTX/MCTargetDesc/NVPTXInstPrinter.h | 2 +
llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp | 69 ++-
llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h | 2 +
llvm/lib/Target/NVPTX/NVPTXIntrinsics.td | 36 +-
.../NVPTX/cp-async-bulk-tensor-reduce.ll | 426 ++++++++++++++++++
9 files changed, 693 insertions(+), 16 deletions(-)
create mode 100644 llvm/include/llvm/IR/NVVMIntrinsicFlags.h
create mode 100644 llvm/test/CodeGen/NVPTX/cp-async-bulk-tensor-reduce.ll
diff --git a/llvm/docs/NVPTXUsage.rst b/llvm/docs/NVPTXUsage.rst
index 2152de9709dc6e..ed29d87edbad8e 100644
--- a/llvm/docs/NVPTXUsage.rst
+++ b/llvm/docs/NVPTXUsage.rst
@@ -663,6 +663,84 @@ the same functionality as described in the ``tile`` mode intrinsics above.
For more information, refer PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-prefetch-tensor>`_.
+'``llvm.nvvm.cp.async.bulk.tensor.reduce.tile.[1-5]d``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+ declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch, i8 %flag_red_op)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.2d(..., i32 %d0, i32 %d1, ...)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.3d(..., i32 %d0, i32 %d1, i32 %d2, ...)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.cp.async.bulk.tensor.reduce.tile.[1-5]d``' intrinsics
+correspond to the ``cp.reduce.async.bulk.tensor.[1-5]d.*`` set of PTX instructions.
+These instructions initiate an asynchronous reduction operation of tensor data
+in global memory with tensor data in shared::cta memory, using ``tile`` mode.
+The dimension of the tensor data ranges from 1d to 5d with the coordinates
+specified by the ``i32 %d0 ... i32 %d4`` arguments.
+
+* The last two arguments to these intrinsics are flags.
+ These flag arguments must be compile-time constants. The backend
+ looks through these flags and lowers the intrinsics appropriately.
+
+* The Nth argument (denoted by ``i8 flag_red_op``) indicates the
+ kind of reduction operation performed. The argument must be in
+ the range [0, 7], representing the following reduction operations:
+
+ ========== =============
+ Enum Value Reduction Op
+ ========== =============
+ ``0`` ADD
+ ``1`` MIN
+ ``2`` MAX
+ ``3`` INC
+ ``4`` DEC
+ ``5`` AND
+ ``6`` OR
+ ``7`` XOR
+ ========== =============
+
+* The [N-1]th argument (denoted by ``i1 flag_ch``) when set, indicates
+ the presence of a valid cache_hint (``i64 %ch``) and generates the
+ ``.L2::cache_hint`` variant of the PTX instruction.
+
+For more information, refer PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor>`_.
+
+'``llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.[1-5]d``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+ declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.3d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 %flag_ch, i8 %flag_red_op)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.[1-5]d``' intrinsics
+correspond to the ``cp.reduce.async.bulk.tensor.[1-5]d.*`` set of PTX instructions.
+These instructions initiate an asynchronous reduction operation of tensor data
+in global memory with tensor data in shared::cta memory, using ``im2col`` mode.
+In this mode, the tensor has to be at least three-dimensional.
+The last two arguments of these intrinsics are compile-time flags,
+with the same functionality as described in the ``tile`` mode intrinsics above.
+
+For more information, refer PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor>`_.
+
Other Intrinsics
----------------
diff --git a/llvm/include/llvm/IR/IntrinsicsNVVM.td b/llvm/include/llvm/IR/IntrinsicsNVVM.td
index 115fcee0b04f22..b55b71a9418baa 100644
--- a/llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ b/llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -635,6 +635,26 @@ class CP_ASYNC_BULK_TENSOR_PREFETCH_INTR<int dim, string mode> {
ImmArg<ArgIndex<FlagsStartIdx>>];
}
+class CP_ASYNC_BULK_TENSOR_REDUCE_INTR<int dim, string mode> {
+ string Name = "int_nvvm_cp_async_bulk_tensor_reduce_" # mode # "_" # dim # "d";
+
+ list<LLVMType> TensorDimsTy = !listsplat(llvm_i32_ty, dim);
+ list<LLVMType> ArgsTy = !listconcat(
+ [llvm_shared_ptr_ty, // src_smem_ptr
+ llvm_ptr_ty], // tensormap_ptr
+ TensorDimsTy, // actual tensor dims
+ [llvm_i64_ty, // cache_hint
+ llvm_i1_ty, // Flag for cache_hint
+ llvm_i8_ty] // Flag for Reduction Op
+ );
+ int FlagsStartIdx = !add(dim, 3);
+ list<IntrinsicProperty> IntrProp = [IntrConvergent,
+ ReadOnly<ArgIndex<0>>, ReadOnly<ArgIndex<1>>,
+ NoCapture<ArgIndex<0>>, NoCapture<ArgIndex<1>>,
+ ImmArg<ArgIndex<FlagsStartIdx>>,
+ ImmArg<ArgIndex<!add(FlagsStartIdx, 1)>>];
+}
+
let TargetPrefix = "nvvm" in {
def int_nvvm_prmt : ClangBuiltin<"__nvvm_prmt">,
DefaultAttrsIntrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],
@@ -4926,6 +4946,8 @@ foreach dim = [1, 2, 3, 4, 5] in {
def s2g.Name : DefaultAttrsIntrinsic<[], s2g.ArgsTy, s2g.IntrProp>;
foreach prefetch = [CP_ASYNC_BULK_TENSOR_PREFETCH_INTR<dim, mode>] in
def prefetch.Name : DefaultAttrsIntrinsic<[], prefetch.ArgsTy, prefetch.IntrProp>;
+ foreach reduce = [CP_ASYNC_BULK_TENSOR_REDUCE_INTR<dim, mode>] in
+ def reduce.Name : DefaultAttrsIntrinsic<[], reduce.ArgsTy, reduce.IntrProp>;
}
}
diff --git a/llvm/include/llvm/IR/NVVMIntrinsicFlags.h b/llvm/include/llvm/IR/NVVMIntrinsicFlags.h
new file mode 100644
index 00000000000000..c82c4044e03cfa
--- /dev/null
+++ b/llvm/include/llvm/IR/NVVMIntrinsicFlags.h
@@ -0,0 +1,37 @@
+//===--- NVVMIntrinsicFlags.h -----------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+/// \file
+/// This file contains the definitions of the enumerations and flags
+/// associated with NVVM Intrinsics.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_SUPPORT_NVVMINTRINSICFLAGS_H
+#define LLVM_SUPPORT_NVVMINTRINSICFLAGS_H
+
+namespace llvm {
+namespace nvvm {
+
+// Reduction Ops supported with TMA Copy from Shared
+// to Global Memory for the "cp.reduce.async.bulk.tensor.*"
+// family of PTX instructions.
+enum class TMAReductionOp : uint8_t {
+ ADD = 0,
+ MIN = 1,
+ MAX = 2,
+ INC = 3,
+ DEC = 4,
+ AND = 5,
+ OR = 6,
+ XOR = 7,
+};
+
+} // namespace nvvm
+} // namespace llvm
+#endif // LLVM_SUPPORT_NVVMINTRINSICFLAGS_H
diff --git a/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.cpp b/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.cpp
index 7af3f76249d61d..7c80736adf3e8e 100644
--- a/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.cpp
+++ b/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.cpp
@@ -14,6 +14,7 @@
#include "NVPTX.h"
#include "NVPTXUtilities.h"
#include "llvm/ADT/StringRef.h"
+#include "llvm/IR/NVVMIntrinsicFlags.h"
#include "llvm/MC/MCExpr.h"
#include "llvm/MC/MCInst.h"
#include "llvm/MC/MCInstrInfo.h"
@@ -416,3 +417,39 @@ void NVPTXInstPrinter::printPrmtMode(const MCInst *MI, int OpNum,
return;
}
}
+
+void NVPTXInstPrinter::printTmaReductionMode(const MCInst *MI, int OpNum,
+ raw_ostream &O,
+ const char *Modifier) {
+ const MCOperand &MO = MI->getOperand(OpNum);
+
+ switch (static_cast<nvvm::TMAReductionOp>(MO.getImm())) {
+ case nvvm::TMAReductionOp::ADD:
+ O << ".add";
+ return;
+ case nvvm::TMAReductionOp::MIN:
+ O << ".min";
+ return;
+ case nvvm::TMAReductionOp::MAX:
+ O << ".max";
+ return;
+ case nvvm::TMAReductionOp::INC:
+ O << ".inc";
+ return;
+ case nvvm::TMAReductionOp::DEC:
+ O << ".dec";
+ return;
+ case nvvm::TMAReductionOp::AND:
+ O << ".and";
+ return;
+ case nvvm::TMAReductionOp::OR:
+ O << ".or";
+ return;
+ case nvvm::TMAReductionOp::XOR:
+ O << ".xor";
+ return;
+ default:
+ llvm_unreachable(
+ "Invalid Reduction Op in printCpAsyncBulkTensorReductionMode");
+ }
+}
diff --git a/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.h b/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.h
index 2ce40bd6e8b973..2b19386ef17fe5 100644
--- a/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.h
+++ b/llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXInstPrinter.h
@@ -54,6 +54,8 @@ class NVPTXInstPrinter : public MCInstPrinter {
raw_ostream &O, const char *Modifier = nullptr);
void printPrmtMode(const MCInst *MI, int OpNum, raw_ostream &O,
const char *Modifier = nullptr);
+ void printTmaReductionMode(const MCInst *MI, int OpNum, raw_ostream &O,
+ const char *Modifier = nullptr);
};
}
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
index 173c37cfd8c8f7..15172a8d06fddb 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
@@ -4157,9 +4157,9 @@ bool NVPTXScopes::empty() const { return Scopes.size() == 0; }
? NVPTX::CP_ASYNC_BULK_TENSOR_##dir##_##dim##_SHARED32_##mode##suffix \
: NVPTX::CP_ASYNC_BULK_TENSOR_##dir##_##dim##_##mode##suffix)
-#define GET_CP_ASYNC_BULK_TENSOR_OPCODE_S2G(dim, mode) \
- (IsCacheHint ? (CP_ASYNC_BULK_TENSOR_OPCODE(S2G, dim, mode, _CH)) \
- : (CP_ASYNC_BULK_TENSOR_OPCODE(S2G, dim, mode, )))
+#define GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(op, dim, mode) \
+ (IsCacheHint ? (CP_ASYNC_BULK_TENSOR_OPCODE(op, dim, mode, _CH)) \
+ : (CP_ASYNC_BULK_TENSOR_OPCODE(op, dim, mode, )))
#define GET_CP_ASYNC_BULK_TENSOR_OPCODE_G2S(dim, mode) \
[&]() -> auto { \
@@ -4177,15 +4177,19 @@ bool NVPTXScopes::empty() const { return Scopes.size() == 0; }
: NVPTX::CP_ASYNC_BULK_TENSOR_PREFETCH_##dim##_##mode)
static unsigned GetCpAsyncBulkTensorS2GOpcode(size_t Dim, bool IsShared32,
- bool IsCacheHint, bool IsIm2Col) {
+ bool IsCacheHint, bool IsIm2Col,
+ bool IsReduce = false) {
if (IsIm2Col) {
switch (Dim) {
case 3:
- return GET_CP_ASYNC_BULK_TENSOR_OPCODE_S2G(3D, IM2COL);
+ return IsReduce ? GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(RED, 3D, IM2COL)
+ : GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(S2G, 3D, IM2COL);
case 4:
- return GET_CP_ASYNC_BULK_TENSOR_OPCODE_S2G(4D, IM2COL);
+ return IsReduce ? GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(RED, 4D, IM2COL)
+ : GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(S2G, 4D, IM2COL);
case 5:
- return GET_CP_ASYNC_BULK_TENSOR_OPCODE_S2G(5D, IM2COL);
+ return IsReduce ? GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(RED, 5D, IM2COL)
+ : GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(S2G, 5D, IM2COL);
default:
llvm_unreachable("Invalid Dimension in im2col mode for "
"GetCpAsyncBulkTensorS2GOpcode.");
@@ -4193,15 +4197,20 @@ static unsigned GetCpAsyncBulkTensorS2GOpcode(size_t Dim, bool IsShared32,
} else {
switch (Dim) {
case 1:
- return GET_CP_ASYNC_BULK_TENSOR_OPCODE_S2G(1D, TILE);
+ return IsReduce ? GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(RED, 1D, TILE)
+ : GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(S2G, 1D, TILE);
case 2:
- return GET_CP_ASYNC_BULK_TENSOR_OPCODE_S2G(2D, TILE);
+ return IsReduce ? GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(RED, 2D, TILE)
+ : GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(S2G, 2D, TILE);
case 3:
- return GET_CP_ASYNC_BULK_TENSOR_OPCODE_S2G(3D, TILE);
+ return IsReduce ? GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(RED, 3D, TILE)
+ : GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(S2G, 3D, TILE);
case 4:
- return GET_CP_ASYNC_BULK_TENSOR_OPCODE_S2G(4D, TILE);
+ return IsReduce ? GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(RED, 4D, TILE)
+ : GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(S2G, 4D, TILE);
case 5:
- return GET_CP_ASYNC_BULK_TENSOR_OPCODE_S2G(5D, TILE);
+ return IsReduce ? GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(RED, 5D, TILE)
+ : GET_CP_ASYNC_BULK_TENSOR_OPCODE_CH(S2G, 5D, TILE);
default:
llvm_unreachable(
"Invalid Dimension in tile mode for GetCpAsyncBulkTensorS2GOpcode.");
@@ -4377,6 +4386,30 @@ void NVPTXDAGToDAGISel::SelectCpAsyncBulkTensorPrefetchCommon(SDNode *N,
ReplaceNode(N, CurDAG->getMachineNode(Opcode, DL, N->getVTList(), Ops));
}
+void NVPTXDAGToDAGISel::SelectCpAsyncBulkTensorReduceCommon(SDNode *N,
+ bool IsIm2Col) {
+ // We have {Chain, Intrinsic-ID} followed by the actual intrisic args:
+ // src, dst, dims{d0...dN}, cache_hint, cache_hint_flag, reduction_kind_flag
+ // NumOperands = {Chain, IID} + {Actual intrinsic args}
+ // = {2} + {5 + dims}
+ size_t NumOps = N->getNumOperands();
+ size_t NumDims = NumOps - 7;
+ unsigned ReductionKind = N->getConstantOperandVal(NumOps - 1);
+ bool IsCacheHint = N->getConstantOperandVal(NumOps - 2) == 1;
+ size_t NumArgs = NumDims + (IsCacheHint ? 3 : 2); // src, dst, cache_hint
+
+ SDLoc DL(N);
+ SmallVector<SDValue, 12> Ops(N->ops().slice(2, NumArgs));
+ Ops.push_back(getI32Imm(ReductionKind, DL)); // Reduction Op
+ Ops.push_back(N->getOperand(0)); // Chain operand
+
+ bool IsShared32 =
+ CurDAG->getDataLayout().getPointerSizeInBits(ADDRESS_SPACE_SHARED) == 32;
+ unsigned Opcode = GetCpAsyncBulkTensorS2GOpcode(
+ NumDims, IsShared32, IsCacheHint, IsIm2Col, /*IsReduce=*/true);
+ ReplaceNode(N, CurDAG->getMachineNode(Opcode, DL, N->getVTList(), Ops));
+}
+
bool NVPTXDAGToDAGISel::tryIntrinsicVoid(SDNode *N) {
unsigned IID = N->getConstantOperandVal(1);
switch (IID) {
@@ -4418,5 +4451,17 @@ bool NVPTXDAGToDAGISel::tryIntrinsicVoid(SDNode *N) {
case Intrinsic::nvvm_cp_async_bulk_tensor_prefetch_im2col_5d:
SelectCpAsyncBulkTensorPrefetchCommon(N, /*IsIm2Col=*/true);
return true;
+ case Intrinsic::nvvm_cp_async_bulk_tensor_reduce_tile_1d:
+ case Intrinsic::nvvm_cp_async_bulk_tensor_reduce_tile_2d:
+ case Intrinsic::nvvm_cp_async_bulk_tensor_reduce_tile_3d:
+ case Intrinsic::nvvm_cp_async_bulk_tensor_reduce_tile_4d:
+ case Intrinsic::nvvm_cp_async_bulk_tensor_reduce_tile_5d:
+ SelectCpAsyncBulkTensorReduceCommon(N);
+ return true;
+ case Intrinsic::nvvm_cp_async_bulk_tensor_reduce_im2col_3d:
+ case Intrinsic::nvvm_cp_async_bulk_tensor_reduce_im2col_4d:
+ case Intrinsic::nvvm_cp_async_bulk_tensor_reduce_im2col_5d:
+ SelectCpAsyncBulkTensorReduceCommon(N, /*IsIm2Col=*/true);
+ return true;
}
}
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
index d6c80a31b7463d..0a79428fcec2d8 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
+++ b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h
@@ -95,6 +95,8 @@ class LLVM_LIBRARY_VISIBILITY NVPTXDAGToDAGISel : public SelectionDAGISel {
void SelectCpAsyncBulkTensorG2SCommon(SDNode *N, bool IsIm2Col = false);
void SelectCpAsyncBulkTensorS2GCommon(SDNode *N, bool IsIm2Col = false);
void SelectCpAsyncBulkTensorPrefetchCommon(SDNode *N, bool IsIm2Col = false);
+ void SelectCpAsyncBulkTensorReduceCommon(SDNode *N, bool IsIm2Col = false);
+
inline SDValue getI32Imm(unsigned Imm, const SDLoc &DL) {
return CurDAG->getTargetConstant(Imm, DL, MVT::i32);
}
diff --git a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
index 5878940812f62b..103c92b608dfba 100644
--- a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -564,17 +564,19 @@ foreach dim = [1, 2, 3, 4, 5] in {
}
// From Shared to Global memory (S2G)
-class S2G_STRINGS<int dim, string mode, bit ch, bit is_shared32 = 0> {
- string prefix = "cp.async.bulk.tensor";
+class S2G_STRINGS<int dim, string mode, bit ch,
+ bit is_shared32 = 0, bit is_reduce = 0> {
string dir = "global.shared::cta";
string completion = "bulk_group";
- string inst_name = prefix
+ string inst_name = !if(is_reduce, "cp.reduce", "cp")
+ # ".async.bulk.tensor"
# "." # dim # "d"
# "." # dir
# "." # mode
# "." # completion
# !if(ch, ".L2::cache_hint", "");
- string intr_name = "CP_ASYNC_BULK_TENSOR_S2G_"
+ string intr_name = "CP_ASYNC_BULK_TENSOR_"
+ # !if(is_reduce, "RED_", "S2G_")
# dim # "D"
# !if(is_shared32, "_SHARED32", "")
# !if(!eq(mode, "tile"), "_TILE", "_IM2COL");
@@ -596,11 +598,37 @@ multiclass CP_ASYNC_BULK_TENSOR_S2G_INTR<int dim, bit shared32, string mode> {
Requires<[hasPTX<80>, hasSM<90>]>;
}
+def TMAReductionFlags : Operand<i32> {
+ let PrintMethod = "printTmaReductionMode";
+}
+
+// TMA Copy from Shared to Global memory with Reduction
+multiclass CP_ASYNC_BULK_TENSOR_REDUCE_INTR<int dim, bit shared32, string mode> {
+ defvar dims_dag = !dag(ins, !listsplat(Int32Regs, dim), !foreach(i, !range(dim), "d" # i));
+ defvar dims_str = !interleave(!foreach(i, !range(dim), "$d" # i), ", ");
+ defvar asm_str = " [$tmap, {{" # dims_str # "}}], [$src]";
+ defvar rc = !if(shared32, Int32Regs, Int64Regs);
+
+ defvar prefix = "cp.reduce.async.bulk.tensor" # "." # dim # "d" # ".global.shared::cta";
+ defvar suffix = "." # mode # ".bulk_group";
+
+ def "": NVPTXInst<(outs),
+ !con((ins rc:$src, Int64Regs:$tmap), dims_dag, (ins TMAReductionFlags:$red_op)),
+ !strconcat(prefix, "${red_op}", suffix, asm_str, ";"), []>,
+ Requires<[hasPTX<80>, hasSM<90>]>;
+ def _CH: NVPTXInst<(outs),
+ !con((ins rc:$src, Int64Regs:$tmap), dims_dag, (ins Int64Regs:$ch, TMAReductionFlags:$red_op)),
+ !strconcat(prefix, "${red_op}", suffix, ".L2::cache_hint", asm_str, ", $ch;"), []>,
+ Requires<[hasPTX<80>, hasSM<90>]>;
+}
+
foreach dim = [1, 2, 3, 4, 5] in {
foreach shared32 = [true, false] in {
foreach mode = !if(!ge(dim, 3), ["tile", "im2col_no_offs"], ["tile"]) in {
defm S2G_STRINGS<dim, mode, 0, shared32>.intr_name :
CP_ASYNC_BULK_TENSOR_S2G_INTR<dim, shared32, mode>;
+ defm S2G_STRINGS<dim, mode, 0, shared32, 1>.intr_name :
+ CP_ASYNC_BULK_TENSOR_REDUCE_INTR<dim, shared32, mode>;
}
}
}
diff --git a/llvm/test/CodeGen/NVPTX/cp-async-bulk-tensor-reduce.ll b/llvm/test/CodeGen/NVPTX/cp-async-bulk-tensor-reduce.ll
new file mode 100644
index 00000000000000..28713109e742fa
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/cp-async-bulk-tensor-reduce.ll
@@ -0,0 +1,426 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx80| FileCheck --check-prefixes=CHECK-PTX %s
+; RUN: %if ptxas-12.3 %{ llc < %s -march=nvptx64 -mcpu=sm_90 -mattr=+ptx80| %ptxas-verify -arch=sm_90 %}
+
+target triple = "nvptx64-nvidia-cuda"
+
+declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.1d(ptr addrspace(3) %s, ptr %tm, i32 %d0, i64 %ch, i1 %flag_ch, i8 %flag_red);
+declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.2d(ptr addrspace(3) %s, ptr %tm, i32 %d0, i32 %d1, i64 %ch, i1 %flag_ch, i8 %flag_red);
+declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.3d(ptr addrspace(3) %s, ptr %tm, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 %flag_ch, i8 %flag_red);
+declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.4d(ptr addrspace(3) %s, ptr %tm, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 %ch, i1 %flag_ch, i8 %flag_red);
+declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.5d(ptr addrspace(3) %s, ptr %tm, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 %ch, i1 %flag_ch, i8 %flag_red);
+
+declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.3d(ptr addrspace(3) %s, ptr %tm, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 %flag_ch, i8 %flag_red);
+declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.4d(ptr addrspace(3) %s, ptr %tm, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 %ch, i1 %flag_ch, i8 %flag_red);
+declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.5d(ptr addrspace(3) %s, ptr %tm, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 %ch, i1 %flag_ch, i8 %flag_red);
+
+; CHECK-LABEL: cp_async_bulk_tensor_reduce_tile_1d
+define void @cp_async_bulk_tensor_reduce_tile_1d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i64 %ch) {
+; CHECK-PTX-LABEL: cp_async_bulk_tensor_reduce_tile_1d(
+; CHECK-PTX: {
+; CHECK-PTX-NEXT: .reg .b32 %r<2>;
+; CHECK-PTX-NEXT: .reg .b64 %rd<4>;
+; CHECK-PTX-EMPTY:
+; CHECK-PTX-NEXT: // %bb.0:
+; CHECK-PTX-NEXT: ld.param.u64 %rd1, [cp_async_bulk_tensor_reduce_tile_1d_param_0];
+; CHECK-PTX-NEXT: ld.param.u64 %rd2, [cp_async_bulk_tensor_reduce_tile_1d_param_1];
+; CHECK-PTX-NEXT: ld.param.u32 %r1, [cp_async_bulk_tensor_reduce_tile_1d_param_2];
+; CHECK-PTX-NEXT: ld.param.u64 %rd3, [cp_async_bulk_tensor_reduce_tile_1d_param_3];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.1d.global.shared::cta.add.tile.bulk_group.L2::cache_hint [%rd2, {%r1}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.1d.global.shared::cta.min.tile.bulk_group.L2::cache_hint [%rd2, {%r1}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.1d.global.shared::cta.max.tile.bulk_group.L2::cache_hint [%rd2, {%r1}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.1d.global.shared::cta.inc.tile.bulk_group.L2::cache_hint [%rd2, {%r1}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.1d.global.shared::cta.dec.tile.bulk_group.L2::cache_hint [%rd2, {%r1}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.1d.global.shared::cta.and.tile.bulk_group.L2::cache_hint [%rd2, {%r1}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.1d.global.shared::cta.or.tile.bulk_group.L2::cache_hint [%rd2, {%r1}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.1d.global.shared::cta.xor.tile.bulk_group.L2::cache_hint [%rd2, {%r1}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.1d.global.shared::cta.add.tile.bulk_group [%rd2, {%r1}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.1d.global.shared::cta.min.tile.bulk_group [%rd2, {%r1}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.1d.global.shared::cta.max.tile.bulk_group [%rd2, {%r1}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.1d.global.shared::cta.inc.tile.bulk_group [%rd2, {%r1}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.1d.global.shared::cta.dec.tile.bulk_group [%rd2, {%r1}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.1d.global.shared::cta.and.tile.bulk_group [%rd2, {%r1}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.1d.global.shared::cta.or.tile.bulk_group [%rd2, {%r1}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.1d.global.shared::cta.xor.tile.bulk_group [%rd2, {%r1}], [%rd1];
+; CHECK-PTX-NEXT: ret;
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.1d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i64 %ch, i1 1, i8 0)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.1d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i64 %ch, i1 1, i8 1)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.1d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i64 %ch, i1 1, i8 2)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.1d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i64 %ch, i1 1, i8 3)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.1d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i64 %ch, i1 1, i8 4)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.1d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i64 %ch, i1 1, i8 5)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.1d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i64 %ch, i1 1, i8 6)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.1d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i64 %ch, i1 1, i8 7)
+
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.1d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i64 undef, i1 0, i8 0)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.1d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i64 undef, i1 0, i8 1)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.1d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i64 undef, i1 0, i8 2)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.1d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i64 undef, i1 0, i8 3)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.1d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i64 undef, i1 0, i8 4)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.1d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i64 undef, i1 0, i8 5)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.1d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i64 undef, i1 0, i8 6)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.1d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i64 undef, i1 0, i8 7)
+ ret void
+}
+
+; CHECK-LABEL: cp_async_bulk_tensor_reduce_tile_2d
+define void @cp_async_bulk_tensor_reduce_tile_2d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i64 %ch) {
+; CHECK-PTX-LABEL: cp_async_bulk_tensor_reduce_tile_2d(
+; CHECK-PTX: {
+; CHECK-PTX-NEXT: .reg .b32 %r<3>;
+; CHECK-PTX-NEXT: .reg .b64 %rd<4>;
+; CHECK-PTX-EMPTY:
+; CHECK-PTX-NEXT: // %bb.0:
+; CHECK-PTX-NEXT: ld.param.u64 %rd1, [cp_async_bulk_tensor_reduce_tile_2d_param_0];
+; CHECK-PTX-NEXT: ld.param.u64 %rd2, [cp_async_bulk_tensor_reduce_tile_2d_param_1];
+; CHECK-PTX-NEXT: ld.param.u32 %r1, [cp_async_bulk_tensor_reduce_tile_2d_param_2];
+; CHECK-PTX-NEXT: ld.param.u32 %r2, [cp_async_bulk_tensor_reduce_tile_2d_param_3];
+; CHECK-PTX-NEXT: ld.param.u64 %rd3, [cp_async_bulk_tensor_reduce_tile_2d_param_4];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.2d.global.shared::cta.add.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.2d.global.shared::cta.min.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.2d.global.shared::cta.max.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.2d.global.shared::cta.inc.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.2d.global.shared::cta.dec.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.2d.global.shared::cta.and.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.2d.global.shared::cta.or.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.2d.global.shared::cta.xor.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.2d.global.shared::cta.add.tile.bulk_group [%rd2, {%r1, %r2}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.2d.global.shared::cta.min.tile.bulk_group [%rd2, {%r1, %r2}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.2d.global.shared::cta.max.tile.bulk_group [%rd2, {%r1, %r2}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.2d.global.shared::cta.inc.tile.bulk_group [%rd2, {%r1, %r2}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.2d.global.shared::cta.dec.tile.bulk_group [%rd2, {%r1, %r2}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.2d.global.shared::cta.and.tile.bulk_group [%rd2, {%r1, %r2}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.2d.global.shared::cta.or.tile.bulk_group [%rd2, {%r1, %r2}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.2d.global.shared::cta.xor.tile.bulk_group [%rd2, {%r1, %r2}], [%rd1];
+; CHECK-PTX-NEXT: ret;
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.2d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i64 %ch, i1 1, i8 0)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.2d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i64 %ch, i1 1, i8 1)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.2d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i64 %ch, i1 1, i8 2)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.2d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i64 %ch, i1 1, i8 3)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.2d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i64 %ch, i1 1, i8 4)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.2d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i64 %ch, i1 1, i8 5)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.2d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i64 %ch, i1 1, i8 6)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.2d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i64 %ch, i1 1, i8 7)
+
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.2d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i64 undef, i1 0, i8 0)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.2d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i64 undef, i1 0, i8 1)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.2d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i64 undef, i1 0, i8 2)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.2d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i64 undef, i1 0, i8 3)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.2d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i64 undef, i1 0, i8 4)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.2d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i64 undef, i1 0, i8 5)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.2d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i64 undef, i1 0, i8 6)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.2d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i64 undef, i1 0, i8 7)
+ ret void
+}
+
+; CHECK-LABEL: cp_async_bulk_tensor_reduce_tile_3d
+define void @cp_async_bulk_tensor_reduce_tile_3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 %ch) {
+; CHECK-PTX-LABEL: cp_async_bulk_tensor_reduce_tile_3d(
+; CHECK-PTX: {
+; CHECK-PTX-NEXT: .reg .b32 %r<4>;
+; CHECK-PTX-NEXT: .reg .b64 %rd<4>;
+; CHECK-PTX-EMPTY:
+; CHECK-PTX-NEXT: // %bb.0:
+; CHECK-PTX-NEXT: ld.param.u64 %rd1, [cp_async_bulk_tensor_reduce_tile_3d_param_0];
+; CHECK-PTX-NEXT: ld.param.u64 %rd2, [cp_async_bulk_tensor_reduce_tile_3d_param_1];
+; CHECK-PTX-NEXT: ld.param.u32 %r1, [cp_async_bulk_tensor_reduce_tile_3d_param_2];
+; CHECK-PTX-NEXT: ld.param.u32 %r2, [cp_async_bulk_tensor_reduce_tile_3d_param_3];
+; CHECK-PTX-NEXT: ld.param.u32 %r3, [cp_async_bulk_tensor_reduce_tile_3d_param_4];
+; CHECK-PTX-NEXT: ld.param.u64 %rd3, [cp_async_bulk_tensor_reduce_tile_3d_param_5];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.add.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.min.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.max.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.inc.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.dec.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.and.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.or.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.xor.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.add.tile.bulk_group [%rd2, {%r1, %r2, %r3}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.min.tile.bulk_group [%rd2, {%r1, %r2, %r3}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.max.tile.bulk_group [%rd2, {%r1, %r2, %r3}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.inc.tile.bulk_group [%rd2, {%r1, %r2, %r3}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.dec.tile.bulk_group [%rd2, {%r1, %r2, %r3}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.and.tile.bulk_group [%rd2, {%r1, %r2, %r3}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.or.tile.bulk_group [%rd2, {%r1, %r2, %r3}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.xor.tile.bulk_group [%rd2, {%r1, %r2, %r3}], [%rd1];
+; CHECK-PTX-NEXT: ret;
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 1, i8 0)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 1, i8 1)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 1, i8 2)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 1, i8 3)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 1, i8 4)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 1, i8 5)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 1, i8 6)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 1, i8 7)
+
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 undef, i1 0, i8 0)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 undef, i1 0, i8 1)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 undef, i1 0, i8 2)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 undef, i1 0, i8 3)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 undef, i1 0, i8 4)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 undef, i1 0, i8 5)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 undef, i1 0, i8 6)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 undef, i1 0, i8 7)
+ ret void
+}
+
+; CHECK-LABEL: cp_async_bulk_tensor_reduce_tile_4d
+define void @cp_async_bulk_tensor_reduce_tile_4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 %ch) {
+; CHECK-PTX-LABEL: cp_async_bulk_tensor_reduce_tile_4d(
+; CHECK-PTX: {
+; CHECK-PTX-NEXT: .reg .b32 %r<5>;
+; CHECK-PTX-NEXT: .reg .b64 %rd<4>;
+; CHECK-PTX-EMPTY:
+; CHECK-PTX-NEXT: // %bb.0:
+; CHECK-PTX-NEXT: ld.param.u64 %rd1, [cp_async_bulk_tensor_reduce_tile_4d_param_0];
+; CHECK-PTX-NEXT: ld.param.u64 %rd2, [cp_async_bulk_tensor_reduce_tile_4d_param_1];
+; CHECK-PTX-NEXT: ld.param.u32 %r1, [cp_async_bulk_tensor_reduce_tile_4d_param_2];
+; CHECK-PTX-NEXT: ld.param.u32 %r2, [cp_async_bulk_tensor_reduce_tile_4d_param_3];
+; CHECK-PTX-NEXT: ld.param.u32 %r3, [cp_async_bulk_tensor_reduce_tile_4d_param_4];
+; CHECK-PTX-NEXT: ld.param.u32 %r4, [cp_async_bulk_tensor_reduce_tile_4d_param_5];
+; CHECK-PTX-NEXT: ld.param.u64 %rd3, [cp_async_bulk_tensor_reduce_tile_4d_param_6];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.add.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.min.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.max.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.inc.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.dec.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.and.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.or.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.xor.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.add.tile.bulk_group [%rd2, {%r1, %r2, %r3, %r4}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.min.tile.bulk_group [%rd2, {%r1, %r2, %r3, %r4}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.max.tile.bulk_group [%rd2, {%r1, %r2, %r3, %r4}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.inc.tile.bulk_group [%rd2, {%r1, %r2, %r3, %r4}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.dec.tile.bulk_group [%rd2, {%r1, %r2, %r3, %r4}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.and.tile.bulk_group [%rd2, {%r1, %r2, %r3, %r4}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.or.tile.bulk_group [%rd2, {%r1, %r2, %r3, %r4}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.xor.tile.bulk_group [%rd2, {%r1, %r2, %r3, %r4}], [%rd1];
+; CHECK-PTX-NEXT: ret;
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 %ch, i1 1, i8 0)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 %ch, i1 1, i8 1)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 %ch, i1 1, i8 2)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 %ch, i1 1, i8 3)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 %ch, i1 1, i8 4)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 %ch, i1 1, i8 5)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 %ch, i1 1, i8 6)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 %ch, i1 1, i8 7)
+
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 undef, i1 0, i8 0)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 undef, i1 0, i8 1)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 undef, i1 0, i8 2)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 undef, i1 0, i8 3)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 undef, i1 0, i8 4)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 undef, i1 0, i8 5)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 undef, i1 0, i8 6)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 undef, i1 0, i8 7)
+ ret void
+}
+
+; CHECK-LABEL: cp_async_bulk_tensor_reduce_tile_5d
+define void @cp_async_bulk_tensor_reduce_tile_5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 %ch) {
+; CHECK-PTX-LABEL: cp_async_bulk_tensor_reduce_tile_5d(
+; CHECK-PTX: {
+; CHECK-PTX-NEXT: .reg .b32 %r<6>;
+; CHECK-PTX-NEXT: .reg .b64 %rd<4>;
+; CHECK-PTX-EMPTY:
+; CHECK-PTX-NEXT: // %bb.0:
+; CHECK-PTX-NEXT: ld.param.u64 %rd1, [cp_async_bulk_tensor_reduce_tile_5d_param_0];
+; CHECK-PTX-NEXT: ld.param.u64 %rd2, [cp_async_bulk_tensor_reduce_tile_5d_param_1];
+; CHECK-PTX-NEXT: ld.param.u32 %r1, [cp_async_bulk_tensor_reduce_tile_5d_param_2];
+; CHECK-PTX-NEXT: ld.param.u32 %r2, [cp_async_bulk_tensor_reduce_tile_5d_param_3];
+; CHECK-PTX-NEXT: ld.param.u32 %r3, [cp_async_bulk_tensor_reduce_tile_5d_param_4];
+; CHECK-PTX-NEXT: ld.param.u32 %r4, [cp_async_bulk_tensor_reduce_tile_5d_param_5];
+; CHECK-PTX-NEXT: ld.param.u32 %r5, [cp_async_bulk_tensor_reduce_tile_5d_param_6];
+; CHECK-PTX-NEXT: ld.param.u64 %rd3, [cp_async_bulk_tensor_reduce_tile_5d_param_7];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.add.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.min.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.max.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.inc.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.dec.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.and.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.or.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.xor.tile.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.add.tile.bulk_group [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.min.tile.bulk_group [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.max.tile.bulk_group [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.inc.tile.bulk_group [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.dec.tile.bulk_group [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.and.tile.bulk_group [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.or.tile.bulk_group [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.xor.tile.bulk_group [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1];
+; CHECK-PTX-NEXT: ret;
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 %ch, i1 1, i8 0)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 %ch, i1 1, i8 1)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 %ch, i1 1, i8 2)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 %ch, i1 1, i8 3)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 %ch, i1 1, i8 4)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 %ch, i1 1, i8 5)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 %ch, i1 1, i8 6)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 %ch, i1 1, i8 7)
+
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 undef, i1 0, i8 0)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 undef, i1 0, i8 1)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 undef, i1 0, i8 2)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 undef, i1 0, i8 3)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 undef, i1 0, i8 4)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 undef, i1 0, i8 5)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 undef, i1 0, i8 6)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 undef, i1 0, i8 7)
+ ret void
+}
+
+; CHECK-LABEL: cp_async_bulk_tensor_reduce_im2col_3d
+define void @cp_async_bulk_tensor_reduce_im2col_3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 %ch) {
+; CHECK-PTX-LABEL: cp_async_bulk_tensor_reduce_im2col_3d(
+; CHECK-PTX: {
+; CHECK-PTX-NEXT: .reg .b32 %r<4>;
+; CHECK-PTX-NEXT: .reg .b64 %rd<4>;
+; CHECK-PTX-EMPTY:
+; CHECK-PTX-NEXT: // %bb.0:
+; CHECK-PTX-NEXT: ld.param.u64 %rd1, [cp_async_bulk_tensor_reduce_im2col_3d_param_0];
+; CHECK-PTX-NEXT: ld.param.u64 %rd2, [cp_async_bulk_tensor_reduce_im2col_3d_param_1];
+; CHECK-PTX-NEXT: ld.param.u32 %r1, [cp_async_bulk_tensor_reduce_im2col_3d_param_2];
+; CHECK-PTX-NEXT: ld.param.u32 %r2, [cp_async_bulk_tensor_reduce_im2col_3d_param_3];
+; CHECK-PTX-NEXT: ld.param.u32 %r3, [cp_async_bulk_tensor_reduce_im2col_3d_param_4];
+; CHECK-PTX-NEXT: ld.param.u64 %rd3, [cp_async_bulk_tensor_reduce_im2col_3d_param_5];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.add.im2col_no_offs.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.min.im2col_no_offs.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.max.im2col_no_offs.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.inc.im2col_no_offs.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.dec.im2col_no_offs.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.and.im2col_no_offs.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.or.im2col_no_offs.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.xor.im2col_no_offs.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.add.im2col_no_offs.bulk_group [%rd2, {%r1, %r2, %r3}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.min.im2col_no_offs.bulk_group [%rd2, {%r1, %r2, %r3}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.max.im2col_no_offs.bulk_group [%rd2, {%r1, %r2, %r3}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.inc.im2col_no_offs.bulk_group [%rd2, {%r1, %r2, %r3}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.dec.im2col_no_offs.bulk_group [%rd2, {%r1, %r2, %r3}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.and.im2col_no_offs.bulk_group [%rd2, {%r1, %r2, %r3}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.or.im2col_no_offs.bulk_group [%rd2, {%r1, %r2, %r3}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.3d.global.shared::cta.xor.im2col_no_offs.bulk_group [%rd2, {%r1, %r2, %r3}], [%rd1];
+; CHECK-PTX-NEXT: ret;
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 1, i8 0)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 1, i8 1)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 1, i8 2)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 1, i8 3)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 1, i8 4)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 1, i8 5)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 1, i8 6)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 1, i8 7)
+
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 undef, i1 0, i8 0)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 undef, i1 0, i8 1)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 undef, i1 0, i8 2)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 undef, i1 0, i8 3)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 undef, i1 0, i8 4)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 undef, i1 0, i8 5)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 undef, i1 0, i8 6)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.3d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i64 undef, i1 0, i8 7)
+ ret void
+}
+
+; CHECK-LABEL: cp_async_bulk_tensor_reduce_im2col_4d
+define void @cp_async_bulk_tensor_reduce_im2col_4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 %ch) {
+; CHECK-PTX-LABEL: cp_async_bulk_tensor_reduce_im2col_4d(
+; CHECK-PTX: {
+; CHECK-PTX-NEXT: .reg .b32 %r<5>;
+; CHECK-PTX-NEXT: .reg .b64 %rd<4>;
+; CHECK-PTX-EMPTY:
+; CHECK-PTX-NEXT: // %bb.0:
+; CHECK-PTX-NEXT: ld.param.u64 %rd1, [cp_async_bulk_tensor_reduce_im2col_4d_param_0];
+; CHECK-PTX-NEXT: ld.param.u64 %rd2, [cp_async_bulk_tensor_reduce_im2col_4d_param_1];
+; CHECK-PTX-NEXT: ld.param.u32 %r1, [cp_async_bulk_tensor_reduce_im2col_4d_param_2];
+; CHECK-PTX-NEXT: ld.param.u32 %r2, [cp_async_bulk_tensor_reduce_im2col_4d_param_3];
+; CHECK-PTX-NEXT: ld.param.u32 %r3, [cp_async_bulk_tensor_reduce_im2col_4d_param_4];
+; CHECK-PTX-NEXT: ld.param.u32 %r4, [cp_async_bulk_tensor_reduce_im2col_4d_param_5];
+; CHECK-PTX-NEXT: ld.param.u64 %rd3, [cp_async_bulk_tensor_reduce_im2col_4d_param_6];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.add.im2col_no_offs.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.min.im2col_no_offs.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.max.im2col_no_offs.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.inc.im2col_no_offs.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.dec.im2col_no_offs.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.and.im2col_no_offs.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.or.im2col_no_offs.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.xor.im2col_no_offs.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.add.im2col_no_offs.bulk_group [%rd2, {%r1, %r2, %r3, %r4}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.min.im2col_no_offs.bulk_group [%rd2, {%r1, %r2, %r3, %r4}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.max.im2col_no_offs.bulk_group [%rd2, {%r1, %r2, %r3, %r4}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.inc.im2col_no_offs.bulk_group [%rd2, {%r1, %r2, %r3, %r4}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.dec.im2col_no_offs.bulk_group [%rd2, {%r1, %r2, %r3, %r4}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.and.im2col_no_offs.bulk_group [%rd2, {%r1, %r2, %r3, %r4}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.or.im2col_no_offs.bulk_group [%rd2, {%r1, %r2, %r3, %r4}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.4d.global.shared::cta.xor.im2col_no_offs.bulk_group [%rd2, {%r1, %r2, %r3, %r4}], [%rd1];
+; CHECK-PTX-NEXT: ret;
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 %ch, i1 1, i8 0)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 %ch, i1 1, i8 1)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 %ch, i1 1, i8 2)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 %ch, i1 1, i8 3)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 %ch, i1 1, i8 4)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 %ch, i1 1, i8 5)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 %ch, i1 1, i8 6)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 %ch, i1 1, i8 7)
+
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 undef, i1 0, i8 0)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 undef, i1 0, i8 1)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 undef, i1 0, i8 2)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 undef, i1 0, i8 3)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 undef, i1 0, i8 4)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 undef, i1 0, i8 5)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 undef, i1 0, i8 6)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.4d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i64 undef, i1 0, i8 7)
+ ret void
+}
+
+; CHECK-LABEL: cp_async_bulk_tensor_reduce_im2col_5d
+define void @cp_async_bulk_tensor_reduce_im2col_5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 %ch) {
+; CHECK-PTX-LABEL: cp_async_bulk_tensor_reduce_im2col_5d(
+; CHECK-PTX: {
+; CHECK-PTX-NEXT: .reg .b32 %r<6>;
+; CHECK-PTX-NEXT: .reg .b64 %rd<4>;
+; CHECK-PTX-EMPTY:
+; CHECK-PTX-NEXT: // %bb.0:
+; CHECK-PTX-NEXT: ld.param.u64 %rd1, [cp_async_bulk_tensor_reduce_im2col_5d_param_0];
+; CHECK-PTX-NEXT: ld.param.u64 %rd2, [cp_async_bulk_tensor_reduce_im2col_5d_param_1];
+; CHECK-PTX-NEXT: ld.param.u32 %r1, [cp_async_bulk_tensor_reduce_im2col_5d_param_2];
+; CHECK-PTX-NEXT: ld.param.u32 %r2, [cp_async_bulk_tensor_reduce_im2col_5d_param_3];
+; CHECK-PTX-NEXT: ld.param.u32 %r3, [cp_async_bulk_tensor_reduce_im2col_5d_param_4];
+; CHECK-PTX-NEXT: ld.param.u32 %r4, [cp_async_bulk_tensor_reduce_im2col_5d_param_5];
+; CHECK-PTX-NEXT: ld.param.u32 %r5, [cp_async_bulk_tensor_reduce_im2col_5d_param_6];
+; CHECK-PTX-NEXT: ld.param.u64 %rd3, [cp_async_bulk_tensor_reduce_im2col_5d_param_7];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.add.im2col_no_offs.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.min.im2col_no_offs.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.max.im2col_no_offs.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.inc.im2col_no_offs.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.dec.im2col_no_offs.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.and.im2col_no_offs.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.or.im2col_no_offs.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.xor.im2col_no_offs.bulk_group.L2::cache_hint [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1], %rd3;
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.add.im2col_no_offs.bulk_group [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.min.im2col_no_offs.bulk_group [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.max.im2col_no_offs.bulk_group [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.inc.im2col_no_offs.bulk_group [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.dec.im2col_no_offs.bulk_group [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.and.im2col_no_offs.bulk_group [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.or.im2col_no_offs.bulk_group [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1];
+; CHECK-PTX-NEXT: cp.reduce.async.bulk.tensor.5d.global.shared::cta.xor.im2col_no_offs.bulk_group [%rd2, {%r1, %r2, %r3, %r4, %r5}], [%rd1];
+; CHECK-PTX-NEXT: ret;
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 %ch, i1 1, i8 0)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 %ch, i1 1, i8 1)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 %ch, i1 1, i8 2)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 %ch, i1 1, i8 3)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 %ch, i1 1, i8 4)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 %ch, i1 1, i8 5)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 %ch, i1 1, i8 6)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 %ch, i1 1, i8 7)
+
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 undef, i1 0, i8 0)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 undef, i1 0, i8 1)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 undef, i1 0, i8 2)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 undef, i1 0, i8 3)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 undef, i1 0, i8 4)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 undef, i1 0, i8 5)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 undef, i1 0, i8 6)
+ tail call void @llvm.nvvm.cp.async.bulk.tensor.reduce.im2col.5d(ptr addrspace(3) %src, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i64 undef, i1 0, i8 7)
+ ret void
+}
More information about the llvm-commits
mailing list