[llvm] [NVPTX] Add TMA bulk tensor copy intrinsics (PR #96083)

Mon Jul 22 05:31:52 PDT 2024

================
@@ -0,0 +1,40 @@
+//===--- NVVMIntrinsicFlags.h -----------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+/// \file
+/// This file contains the definitions of the enumerations and flags
+/// associated with NVVM Intrinsics.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_SUPPORT_NVVMINTRINSICFLAGS_H
+#define LLVM_SUPPORT_NVVMINTRINSICFLAGS_H
+
+#include <stdint.h>
+
+namespace llvm {
+namespace nvvm {
+
+enum class CpAsyncBulkTensorLoadMode {
+  TILE = 0,
+  IM2COL = 1,
+};
+
+typedef union {
+  int V;
+  struct {
+    unsigned CacheHint : 1;
+    unsigned MultiCast : 1;
+    unsigned LoadMode : 3; // CpAsyncBulkTensorLoadMode
+    unsigned reserved : 27;
+  } U;
+} CpAsyncBulkTensorFlags;
----------------
durga4github wrote:

The value of this union is used as the 'flag' operand.
It is the first i32 operand of the intrinsic and must be a compile-time constant.
It encodes the modifiers and load modes, which the backend then
inspects to select the corresponding instructions. The flag is not
passed to PTX; it is used only from IR to ISel.

Using the 'flag' lets us have one intrinsic (per dim) here
instead of the 8 intrinsics (2 (ch) x 2 (mc) x 2 (mode)) required otherwise.
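
As a rough illustration of how the three modifiers fold into one i32, here is a minimal sketch that packs and unpacks the flag with explicit shifts. The bit positions (bit 0 = CacheHint, bit 1 = MultiCast, bits 2..4 = LoadMode) assume the usual low-to-high bit-field layout of the union above, which is implementation-defined; the constants and the helpers encodeFlags/decodeLoadMode are hypothetical names and not part of the patch.

```cpp
#include <cassert>
#include <cstdint>

// Adapted from the enum in the patch (underlying type added for the sketch).
enum class CpAsyncBulkTensorLoadMode : uint32_t { TILE = 0, IM2COL = 1 };

// Assumed bit positions (hypothetical constants, not part of the patch):
// bit 0 = CacheHint, bit 1 = MultiCast, bits 2..4 = LoadMode.
constexpr uint32_t CacheHintBit = 1u << 0;
constexpr uint32_t MultiCastBit = 1u << 1;
constexpr uint32_t LoadModeShift = 2;
constexpr uint32_t LoadModeMask = 0x7;

// Packs the modifiers into the value passed as the i32 flag operand.
constexpr uint32_t encodeFlags(bool CacheHint, bool MultiCast,
                               CpAsyncBulkTensorLoadMode Mode) {
  return (CacheHint ? CacheHintBit : 0u) | (MultiCast ? MultiCastBit : 0u) |
         ((static_cast<uint32_t>(Mode) & LoadModeMask) << LoadModeShift);
}

// Extracts the load mode again, as a backend-side helper might do.
inline CpAsyncBulkTensorLoadMode decodeLoadMode(uint32_t Flags) {
  uint32_t Raw = (Flags >> LoadModeShift) & LoadModeMask;
  assert(Raw <= 1 && "only TILE and IM2COL are defined");
  return static_cast<CpAsyncBulkTensorLoadMode>(Raw);
}
```

With two values per modifier, the encoded flag takes 2 x 2 x 2 = 8 distinct values, one for each intrinsic that would otherwise be needed.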

For example, in the G2S case:
* The optional cache_hint is always passed to the intrinsic as an i64 operand.
  If the cache_hint bit is set, the i64 operand is used and the intrinsic
  is lowered to the _CH variant in the backend. When the cache_hint bit is
  not set, the corresponding i64 operand is ignored during codegen (so it
  can be undef).
* Similarly, depending on the multicast and cache_hint bits, the
  intrinsic lowers to one of four variants:
  neither _MC nor _CH (default), only _MC, only _CH, or both (_MC_CH).
* The same approach is extended for the load-modes too.
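
The selection described in the list above can be sketched roughly as follows. This is not the actual ISel code: selectVariant and usesCacheHintOperand are hypothetical helpers, the returned strings are illustrative labels rather than real opcode names, and the bit positions assume CacheHint in bit 0, MultiCast in bit 1 and LoadMode in bits 2..4.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Assumed flag layout (hypothetical constants for this sketch).
constexpr uint32_t FlagCH = 1u << 0;       // cache_hint bit
constexpr uint32_t FlagMC = 1u << 1;       // multicast bit
constexpr uint32_t FlagModeShift = 2;      // LoadMode occupies bits 2..4
constexpr uint32_t FlagModeMask = 0x7;

// Returns an illustrative label for the variant a given flag would select.
std::string selectVariant(uint32_t Flags) {
  uint32_t Mode = (Flags >> FlagModeShift) & FlagModeMask;
  assert(Mode <= 1 && "only TILE (0) and IM2COL (1) are defined");
  std::string Name = (Mode == 1) ? "IM2COL" : "TILE";
  if ((Flags & FlagMC) != 0)
    Name += "_MC"; // multicast requested
  if ((Flags & FlagCH) != 0)
    Name += "_CH"; // cache hint requested
  return Name;
}

// The i64 cache_hint operand is only read when the cache_hint bit is set;
// otherwise it is ignored and may be undef.
bool usesCacheHintOperand(uint32_t Flags) { return (Flags & FlagCH) != 0; }
```

For instance, a flag value of 3 (both low bits set, tile mode) would map to the "TILE_MC_CH" label here, while a value of 0 maps to the default "TILE" variant and leaves the cache_hint operand unused.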

Additional modifiers in future architectures can be accommodated
by extending the flag bits instead of doubling the number of
intrinsics for each modifier.

I hope this clarifies the behavior of the flags. I would be
happy to discuss more details offline. (I have pinged you
on Discord.)

https://github.com/llvm/llvm-project/pull/96083